# Description:
#    A script to reformat a plain text document, which contains a set of
#    URLs and descriptions of those URLs, into HTML. The resulting HTML
#    also contains a form which allows the user to make comments about
#    the source document. This script can be used in conjunction with the
#    'add-comment' script, which processes the form values and updates the
#    text file upon which the HTML file is based.
#
#    The script recognises some special structures within the plain text
#    document. For examples of text files which use these 'special' structures
#    look at the file http://www.ella-associates.org/alexis-info/docs/resources.html
#    and follow some of the 'plain text' links. These plain text files are the
#    'source' files which have been used, in conjunction with scripts such as
#    the present one, to create the HTML files.
#
# Special Text File Formats Recognized By This Script:
#
#    The '=' character, when it is the first non-whitespace character on a
#    line, indicates that all the following text on the line should be
#    formatted as a 'heading' or 'page title'.
#
#    The '*' character indicates that the following white-space
#    delimited text should be formatted as an HTML hyperlink, with the
#    text content of the hyperlink being the URL itself.
#
#    This script also accepts the format:
#      [Beginning Of Line][spaces]*[spaces]The Document Title|Url-Or-Path/to/Html/File|Url-Or-Path/To/Text/File|
#
#    This script also accepts the format (all on one line):
#      [Beginning Of Line][spaces]*[spaces]The Document Title|Url-Or-Path/to/Html/File|Url-Or-Path/To/Text/File|
#      Url-Or-Path/To/Pdf/File|
#
#    This script also accepts the format:
#      [Beginning Of Line][spaces]*[spaces]The Document Title|Url-Or-Path/to/Base/FileName||||
#    An example of this format would be
#      * An Interesting Analysis|/alexis-info/docs/the-ramble||||
#    This example assumes that there are files
#      /alexis-info/docs/the-ramble.html
#      /alexis-info/docs/the-ramble.txt
#      /alexis-info/docs/the-ramble.pdf
#
#    This format is useful when all the different 'versions' (that is, document formats)
#    have the same base name and directory location, but have the appropriate file name
#    extension for their document type. The script will automatically generate links
#    to each of these document formats in the order: html, text, pdf
#
#    The script also accepts the format (all on one line)
#      [Beginning Of Line][spaces]*
#      [spaces]The Document Title|Url-Or-Path/to/Base/FileName|extension|extension|extension|
#    Where 'extension' is any file name extension
#    An example of this format would be
#      * An Interesting Analysis|/alexis-info/docs/the-ramble|txt|html|doc|
#    This example assumes that there are files
#      /alexis-info/docs/the-ramble.html
#      /alexis-info/docs/the-ramble.txt
#      /alexis-info/docs/the-ramble.doc
#
#    For the sake of the 'readability' of the text file, this format is preferred to the
#    previous one. Both of these formats can also be used with two file name extensions
#    instead of three.
#
#    This script also accepts the format:
#      [Beginning Of Line][spaces]*[spaces]The Document Title|Url-Or-Path/to/Base/FileName|||
#    This produces the same results as the four-bar format above except that no link to
#    an Adobe 'pdf' file is created.
#
#    The script also accepts the format (all on one line)
#      [Beginning Of Line][spaces]*
#      [spaces]The Document Title|Url-Or-Path/to/Base/FileName|extension|extension|
#    Where 'extension' is any file name extension
#    An example of this format would be
#      * An Interesting Analysis|/alexis-info/docs/the-ramble|txt|doc|
#    This example assumes that there are files
#      /alexis-info/docs/the-ramble.txt
#      /alexis-info/docs/the-ramble.doc
#
#    This script also accepts the format (all on one line):
#      [Beginning Of Line][spaces]*[spaces]
#      The Document/Link Title|Url-Or-Path/to/File|
#
#    The script also accepts the format:
#      [Beginning Of Line][spaces]http://blah
#
#    The script will also format blocks of text between the strings -->> and --<<
#    (where they are the first string on the line) as an HTML <pre> block.
# 
#    This filter script also ignores lines starting with a '#' character. That is,
#    those lines will not be rendered into HTML.
# 
#    Please see the file /var/www/alexis-info/docs/resources.txt for an
#    example of a file which utilizes some of the formats described above.
#
# Example:
#    ./linkdoc2html-forum.sh some-list-of-urls.txt > output-file.html
#     
# Parameters:
#
#   textFileName
#     The name of the text file which is to be transformed from text into html
#   [notran]
#     If the second parameter is the string 'notran' then the javascript links
#     to the google automatic language translation engine will NOT be inserted
#     into the HTML page. This is useful, for example, when the HTML page is 
#     going to be located within a 'password-protected' directory, because
#     the Google translation engine will not be able to access the page, and
#     therefore the translation links will not work.
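#   [notoc]
#     If the third parameter is the string 'notoc' then the table of contents
#     will NOT be inserted into the HTML page.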
#   [forum-processor-url]
#      This parameter specifies the location of the CGI script which will process
#      the comments entered by a web-visitor in the HTML form which is generated
#      by the present script. The CGI script which is currently used is called
#      'add-comment' and it should probably be located in whichever directory is the
#      CGI directory for your web-server. However, since the CGI script 'add-comment'
#      is actually written in the Bash shell language, it will not be completely
#      straightforward to get the script to run on a Microsoft Windows computer.
#      Nor should it be particularly difficult, however. The process involves finding
#      a Unix shell emulator which will run on Microsoft Windows and which the web
#      server is capable of accessing via its CGI mechanism. These Unix shell
#      emulators are quite common ('cygwin' is one example, although cygwin may actually
#      be more complicated to configure in this case than other less capable
#      Bash shell emulators).
#
#      Also, since all this script does is transform a text stream in reasonably
#      straightforward ways, it should not be difficult to port the script to 
#      any language which supports 'regular expressions' of which there are many.
#      
#   [output-language]
#      This is the language in which various messages will appear in the generated
#      HTML pages. For example the messages which instruct the web-visitor how
#      to make a comment or other contribution to the page can be in English or 
#      Spanish. Currently only English and Spanish are supported: the value
#      'ENGLISH' (the default) selects English and any other value selects Spanish.
#
# Dependencies:
#
#   The following unix utilities seem to be used (this may not be a complete list):
#     cat, echo, expand, sed, nl, pwd, dirname, tr ?
#
#   /var/www/utils/iso2html.sed
#      This 'iso2html.sed' file is a script written in the 'sed' language which
#      transforms 'special' characters, such as European accented characters,
#      into their equivalent HTML entities, which look approximately like this:
#      &aacute; for an acute-accented 'a' character. This script was found at the
#      site 'sed.sourceforge.net', which also contains many other very useful
#      sed scripts, such as one which 'capitalizes' words, in the sense that the
#      first letter of each word is made into a capital (upper case) letter and all
#      the remaining characters are made lower case. Also at this site is a good
#      HTML hyper-link extractor.
#
#      The present script assumes that this 'iso2html.sed' script is in the exact
#      location as specified above. This present script will probably not work 
#      at all if this script is not present in the correct location.
#
# Notes:
#   The idea of this script is to allow the text file to be as free of 'mark-up'
#   as is possible. This can allow the simple maintenance of the text file, although
#   the precision and utility of a system such as XML is not available. 
#   It should be possible to modify this script to produce XML instead of HTML
#
#   This script has been successfully run on the Debian Linux bash shell as well
#   as the Redhat Linux Bash Shell.
#   It is possible that it would also run on a Microsoft Windows bash shell,
#   such as the Cygwin Bash shell.
#   
#   There is a GPL perl program called text2html which performs a similar task
#   to this script.
#
#   The HTML produced by this script is NOT friendly to Lynx, the text browser,
#   because it uses an HTML table to create a 'left margin' for the document.
#   A style sheet should be used instead.
#
#   The code which used 'mawk' or 'awk' or 'gawk' in order to number certain lines
#   which matched a regular expression has been removed and replaced with code
#   which uses the 'nl' program. For some reason 'nl' places empty lines in between
#   every line in the file when it uses a regular expression to number lines. These
#   'empty' lines actually contain a series of spaces and nothing else.
#
#   For this reason, some extra 'sed' lines are necessary in order to get rid of these
#   unwanted blank lines.
#
#  See Also:
#    txtdoc2html.sh,
#      This was designed to transform a document which is in a similar style to
#      these notes here into HTML with hyper-links etc
#    diary2html.sh,
#      Transform a set of entries which are labelled according to dates (not necessarily
#      sequential or valid) into an HTML file with a 'table of contents' consisting of
#      each of the dates referenced
#    plaintext2html.sh
#      Transform a document which contains section headings and URLs into HTML. The 
#      resulting HTML may have a hyperlinked table of contents.
#    plaintext2pdf.sh, 
#      This uses the 'htmldoc' program to create a pdf version of the original text 
#      file including a table of contents.
#    plaintext2html-forum.sh, 
#      This transforms a plain text file into HTML with URLs hyperlinked and with the 
#      capability for the web-visitors to make contributions to the 'source' text file
#      through an HTML form and using the 'add-comment' CGI script.
#    linkdoc2html.sh
#      This transforms a list of URLs and their descriptions into hyper-linked HTML
#    linkdoc2html-index.sh
#    linkdoc2html-forum.sh
#    resume2html.sh
#    glossary2xml.sh
#    script-summary.txt
#      This file contains short descriptions of what each script does. Also consult the
#      actual scripts themselves which contain detailed descriptions of their operation
#      at the head/ beginning of the file. The 'script-summary' file also contains some semi-philosophical
#      discussions of the mentality behind these scripts and, for instance, why DocBook
#      wasn't used instead.
#  Author:
#   m.j.bishop
# 
# Bugs and Ideas:
#   In some cases a URL will contain the characters " or ' even though they probably
#   shouldn't. If the URL contains one of these characters then the sed scripts below
#   will either break or not hyperlink the URL properly.
#   In particular the part of this script which turns the links into HTML hyperlinks
#   should be examined and fixed in some way. The quote characters could probably be
#   'url encoded' in some manner to fix this problem, before sed gets to work on them.
#
#   I would also like to 'uncapitalize' the Section Headings so that they can be more
#   readable in the 'table of contents'. I think it is established that all upper case
#   letters are more difficult to read than lower-case or mixed case.
#   I can achieve this using a sed script at http://sed.sourceforge.net
#
#   Also, perhaps multiple levels (or at least two levels) of Section Headings should
#   be supported, since with very long tables of contents the readability degrades.
#
#

 if [ "$1" = "" ]
 then
   echo "usage: $0 textFileName [notran] [notoc] [forum-processor-url] [output-language]"
   sed -n "/^[ ]*#/p" "$0"
   exit 1
 fi


 #-- The section below creates the table of contents for the linkdoc.
 #-- This line is designed to only number lines which match a pattern
 #-- In theory 'nl -bpPATTERN' should also do this, but it insisted on
 #-- 'double-spacing' the output
 #-- Also the expressions below try and get rid of things like "can't" and "won't"
 #-- because I want to apply some formatting to the content of quotes, and these
 #-- things will get in my way.
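
 #-- A minimal sketch of the contraction clean-up described above, kept as a
 #-- stand-alone filter so that it can be tried in isolation. The function name
 #-- 'strip_contraction_quotes' is hypothetical and nothing in this script calls
 #-- it; the sed expression is the same one which appears elsewhere in this file.
 strip_contraction_quotes() {
   #-- Two or more letters, then 'n', a quote character and 't':
   #-- the quote character is dropped and the rest is kept
   sed "s/\([a-zA-Z]\{2,\}\)n[\"']t/\1nt/g"
 }
 #-- Usage: echo "I can't and won't" | strip_contraction_quotes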

 #-- This is the pattern which determines what sort of lines will
 #-- be interpreted as 'section headings'. I cannot use this for the 'awk' line
 #-- because awk does not seem to accept the notation \{n,\}
 
 sHeadingPattern='[ A-Z0-9.\/\\]*[A-Z]\{3,\}[ A-Z0-9.\/\\]*'
 sLanguage="ENGLISH" 

 if [ "$5" != "" ]
 then
   sLanguage="$5"
 fi
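
 #-- A minimal sketch for trying the 'section heading' pattern on a single line
 #-- of text. The helper name 'is_section_heading' is hypothetical and the script
 #-- does not call it; it carries its own copy of the pattern and assumes a
 #-- 'grep' which supports the -q option.
 is_section_heading() {
   sPattern='[ A-Z0-9.\/\\]*[A-Z]\{3,\}[ A-Z0-9.\/\\]*'
   #-- LC_ALL=C keeps the range A-Z limited to upper case ASCII letters
   printf '%s\n' "$1" | LC_ALL=C grep -q "^$sPattern\$"
 }
 #-- Usage: is_section_heading "SEE ALSO" && echo "this line is a heading"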

#-- I have disabled the code below because in a cgi environment, this script doesn't
#-- seem to have permission to create a file.
#-- This is a real gotcha. If the file $1.temp already exists and is not writable
#-- by 'other' then the 'add-comment' script falls over because it can't successfully
#-- call this script. This problem only arises in a CGI environment where the
#-- Web server does not have root permissions. If the $1.temp file cannot be
#-- created then this script won't work. One solution is to manually give
#-- write permission to 'other'.
#-- This script (and the 'add-comment' script) will succeed the FIRST time in
#-- a cgi environment if the $1.temp file does not exist at all. This is 
#-- because if the file does not exist then the Web Server has sufficient
#-- permissions to create it. HOWEVER, the second time and afterwards this
#-- script and the 'add-comment' script will FAIL because when the web
#-- Server creates the $1.temp file the first time it creates it without 
#-- write permission for 'other'. That is to say, the Web Server essentially
#-- is able to create a file which it is not allowed to subsequently 
#-- modify (nor re-create). Actually this whole second part may not be true.
#-- The web server creates the file as 'mbishop' and probably can't write to
#-- it.
#--
#-- There are, no doubt, various solutions to this problem, including giving the
#-- web server sufficient permissions to recreate the file, etc. However the
#-- simplest solution is just to not use $1.temp. It is/was only used in three
#-- places. Removing it may or may not slow the script down. I don't know.

# cat $1 | expand | \
#   mawk '/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/{ii++; print ii $0}!/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/' | \
#   sed "s/\([a-zA-Z]\{2,\}\)n[\"']t/\1nt/g" > $1.temp
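
 #-- One possible alternative to '$1.temp', sketched here and not used by the
 #-- script: let 'mktemp' create a private temporary file in a directory which
 #-- the current user is allowed to write to. This assumes that the 'mktemp'
 #-- utility is installed; the function name 'make_temp_copy' is hypothetical.
 make_temp_copy() {
   sTempFile=$(mktemp) || return 1
   #-- Copy the tab-expanded source text into the temporary file
   expand "$1" > "$sTempFile" || return 1
   #-- Print the name of the temporary file so that the caller can use and remove it
   echo "$sTempFile"
 }
 #-- Usage: sTemp=$(make_temp_copy "$1") && cat "$sTemp" && rm -f "$sTemp"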
   
 echo ""
 echo ""
 echo " "
 echo " "
 echo " "
 echo "        "
 
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 #-- The Google automatic translation links below, are sometimes disabled because they will
 #-- not work from within a password protected directory, since Google does not
 #-- have permission to view that directory.
 if [ "$2" != "notran" ]
 then
   echo "
"
   if [ $sLanguage = "ENGLISH" ]
   then
     echo "See this page in (approximate):"
     echo "Español|"
     echo "Français|"
     echo "Italiano|"
     echo "Deutsch|"
     echo "Português"
   else
     echo "Vea esta página en (aproximado):"
     echo "English"
   fi
   echo "
"
 fi

 #---- The file below contains a colorized table of the links
 #---- cat /var/www/utils/translator-bar.html

 #-- Put the page heading before the table of contents
 #--
 expand $1 | \
   sed "/^[ ]*=[ ]*[^=].*/!d" | \
   sed "s/\([a-zA-Z]\{2,\}\)n[\"']t/\1nt/g" | \
   sed -e "s/</\&lt;/g" -e "s/>/\&gt;/g" | \
   sed "s/^[ ]*=[ ]*\([^=].*\)/<center><h2>\1<\/h2><\/center>/gi"

 #- The line below is not 'lynx friendly'; a style sheet
 #- should be used instead.
 echo "
"
 echo "
"
 if [ $sLanguage = "ENGLISH" ]
 then
   echo " [make a comment about (or add to) this document]"
 else
   echo " [haz un comentario sobre este documento]"
 fi
 echo "
"

 #-- Insert the table of contents
 if [ "$3" != "notoc" ]
 then
   #-- This is probably faster than the code in 'plaintext2html.sh' because
   #-- no files have to be written.
   echo "
"
   #expand $1 | \
   #  mawk '/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/{ii++; print ii $0}!/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/' | \
   expand $1 | \
     sed "s/^[ ]*$//g" | \
     nl -s" " -bp'^[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z]+[ A-Z0-9.\/\\:]*$' | \
     sed "/^[ ]\+$/d" | \
     sed "s/^[ ]*\([1-9][0-9]*\) /\1/g" | \
     sed "s/\([a-zA-Z]\{2,\}\)n[\"']t/\1nt/g" | \
     sed "/^[0-9]\{1,\}$sHeadingPattern$/!d" | \
     sed "s/^\([0-9]\{1,\}\)\($sHeadingPattern\)$/
\1. \2<\/a>/g"
   echo "
"
   # cat plain-text-toc.temp
 fi

 # We need a line to convert from UTF-8 to iso-8859 so that the sed script iso2html.sed will work
 expand $1 | \
   #-- This old 'awk' code was causing problems.
   #mawk '/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/{ii++; print ii $0}!/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/' | \
   sed "s/^[ ]*$//g" | \
   #-- Number all lines that are 'section headings'
   nl -s" " -bp'^[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z]+[ A-Z0-9.\/\\:]*$' | \
   #-- Get rid of the 'blank' lines which nl puts into the output
   sed "/^[ ]\+$/d" | \
   #-- Reformat the numbered section headings
   sed "s/^[ ]*\([1-9][0-9]*\) /\1/g" | \
   #-- Get rid of contraction apostrophes (like in don't, can't, isn't etc). This is not really required
   #-- I have disabled this because it seems silly
   #--sed "s/\([a-zA-Z]\{2,\}\)n[\"']t/\1nt/g" | \
   #-- Delete all comment lines (beginning with a hash symbol)
   sed "/^[ ]*#/d" | \
   #-- Delete the page title because it has already been output
   sed "/^[ ]*=[ ]*\([^=].*\)$/d" | \
   #-- Encode special characters '<>' as HTML entities
   sed -e "s/</\&lt;/g" -e "s/>/\&gt;/g" | \
   #-- Encode accented and other 'special' characters as HTML entities
   sed -f /var/www/utils/iso2html.sed | \
   #-- Do a trick to get the '-->>' and '--<<' blocks of text to work
   sed -e "s/^[ ]*\-\-&gt;&gt;/<pre>/g" -e "s/^[ ]*\-\-&lt;&lt;/<\/pre>/g" | \
   #-- Make each 'section heading' into an HTML anchor to work with the 'Table of Contents'
   sed "s/^\([0-9]\{1,\}\)\($sHeadingPattern\)$/\1. \2<\/a><\/tt><\/strong> [TOC]<\/a>/g" | \
   #-- Hyperlink URL style pieces of text
   sed "s/^[ ]*\(http:\/\/[^ ]\{3,\}\)/\1<\/a>/gi" | \
   #-- Hyperlink email addresses with a 'mailto:' link
   sed "/
/!s/\([^ ]\{2,\}@[^ \"']\{2,\}\)/\1<\/a>/g" | \
   #-- Example of Format Below: * My Title|/my/path/to/file||||
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)||||/\1<\/b> (Formats:<\/em> html<\/a> | text<\/a> | pdf<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|html|txt|xml|pdf|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|/\1<\/b> (Formats:<\/em> \3<\/a> | \4<\/a> | \5<\/a> | \6<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|html|txt|pdf|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|/\1<\/b> (Formats:<\/em> \3<\/a> | \4<\/a> | \5<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|||
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|||/\1<\/b> (Formats:<\/em> html<\/a> | text<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|pdf|html|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|/\1<\/b> (Formats:<\/em> \3<\/a> | \4<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/full/path/to/htmlfile|/full/path/to/text/file|/full/path/to/pdffile|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([^|]*\)|\([^|]*\)|/\1<\/b> (Formats:<\/em> html<\/a> | text<\/a> | pdf<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/full/path/to/htmlfile|/full/path/to/text/file|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([^|]*\)|/\1<\/b> (Formats:<\/em> html<\/a> | text<\/a>)/gi" | \
   #-- Trick to make 'txt' links into 'text' links for readability
   sed "s/>txt<\/a>/>text<\/a>/gi" | \
   #-- Example of Format Below: * My Title|/full/path/to/any-old-file|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|/\1<\/b>(\2<\/a>)/gi" | \
   #-- Example of Format Below: * /full/path/to/any-old-file
   sed "s/^[ ]*\*[ ]*\([^ ]\{2,\}\)/\1<\/a>/gi" | \
   #-- Format comments added by web-users
   sed "s/^\([ ]*added[ ]\{0,4\}by:\)\([^,]\{1,\}\)\,[ ]*on[ ]*\(.*\)/\1<\/em> \2<\/tt> on \3<\/em><\/u>/gi" | \
   #-- Turn spaces into non-breaking-spaces unless they are between 'pre' tags
   sed "/
/!s/[ ]\{2\}/\ \ /g" | \
   #-- Turn line breaks into 
tags unles they are between 'pre' tags sed "/
/!s/^/
/g"

 echo "
"
 echo "
"

 #-- Define the cgi program which will handle the adding of
 #-- comments to a particular text file.
 if [ "$4" != "" ]
 then
   sProcessorUrl=$4
 else
   #-- It would be possible to replace the Domain Name below with
   #-- an IP address, which would mean that the script would still
   #-- work even if the DNS configuration failed. I am not sure if this
   #-- is really a good idea or not.
   #sProcessorUrl="http://www.ella-associates.org/cgi-bin/add-comment"
   sProcessorUrl="http://63.105.73.195/cgi-bin/add-comment"
 fi

 #-- There is a problem in that I need to find the full path
 #-- name of the $1 variable, but I don't know how to do this. This
 #-- is necessary because the target processor is not in the same
 #-- directory as the source document (the text file)
 #-- For the time being I have used the remedy of seeing if the path
 #-- is relative or absolute. The slightly dodgy path generating code below
 #-- appears to be working. There is almost certainly a much easier way
 #-- of doing it
 sRelativePath=$(dirname $1)
 sFirstCharacter=$(echo $sRelativePath | sed "s/^\(.\).*$/\1/g")

 if [ "$sRelativePath" = "." ]
 then
   sFullPathName="$(pwd)/$1"
 elif [ "$sFirstCharacter" = "." ]
 then
   sFullPathName="$(pwd)/$1"
 elif [ "$sFirstCharacter" = "/" ]
 then
   sFullPathName="$1"
 else
   sFullPathName="$(pwd)/$1"
 fi

 # echo $sFullPathName
 echo "

BACK TO THE TABLE OF CONTENTS
"

 # The if/then below is an attempt to slightly 'internationalize' this script.
 if [ $sLanguage = "ENGLISH" ]
 then
   echo " If you wish, you may add a comment, suggestion or other contribution which will appear at the end of this document.
Any input you make is greatly appreciated."
 else
   echo " Si usted desea, usted puede agregar un comentario, sugerencia o otra contribución que aparecerá al final de este documento
Cualquier comentario que usted haga se aprecia grandemente. "
 fi
 echo "


"
 if [ $sLanguage = "ENGLISH" ]
 then
   echo "Your Comment (or other contribution to this document)
"
 else
   echo "Su comentario (o otra contribución a este documento)
"
 fi
 echo "

"
 if [ $sLanguage = "ENGLISH" ]
 then
   echo "Your Name [OPTIONAL BUT NICE]
"
 else
   echo "Tu nombre [OPCIONAL PERO AGRADABLE]
"
 fi
 echo "

"
 if [ $sLanguage = "ENGLISH" ]
 then
   echo ""
 else
   echo ""
 fi
 echo "
"

 if [ "$2" != "notran" ]
 then
   echo "
"
   if [ $sLanguage = "ENGLISH" ]
   then
     echo "See this page in (approximate):"
     echo "Español|"
     echo "Français|"
     echo "Italiano|"
     echo "Deutsch|"
     echo "Português"
   else
     echo "Vea esta página en (aproximado):"
     echo "English"
   fi
   echo "
"
 fi

 echo ""
 echo ""