# Description:
#   A script to reformat a plain text file document which contains
#   no particular format. The script also generates an HTML form which
#   allows the reader to add a comment or other text to the document.
#   The script recognises some special structures
#   within the plain text document. For example:
#
#   Where the first non-whitespace character on a line is '=' then
#   all the following text on the line should be formatted as a
#   'heading'.
#   If the first non-whitespace character is '*' then
#   the following text should be hyperlinked.
#   Also, url style strings should be recognised and given
#   a hyperlink token in front of them, such as '[*]'. I prefer this to underlining
#   the entire url, because I find that the underlining tends to interfere with
#   the readability of the text. Some people would say, "use style-sheets" but to
#   them I would reply that the 'heraldic' visual pattern of the underlined hyperlink
#   is imprinted in many internet users' brains, and to change that 'iconography' can
#   lead to unnecessary confusion.
#
#   Lines which consist of only capital letters and numbers (with at least three
#   consecutive capital letters), are interpreted as headings, and constitute the
#   automatically generated table of contents.
#
#   This script, like the linkdoc2html.sh script also accepts the format
#     * Document Title|Html-Url-Or-Path|Text-Url-Or-Path|
#   The script will render this into an emphasised 'document title' with
#   hyper-links to the different formats for the document.
#
#   The script will also format blocks of text between the strings -->> and --<<
#   (where they are the first string on the line) as an HTML <pre> block
#
#   The script also formats lines starting in 'added by:' to make those
#   lines stand out from the rest of the text. This is a 'courtesy' to the
#   '/cgi-bin/add-comment' script which adds this line to a text file
#   when it inserts a user provided comment in the text file.
#
# Examples:
#   ./plaintext2html-forum.sh mjb-work.txt notran > mjb-work.html
#     This command line, executed in some kind of a bash shell, will
#     transform a plain text file which isn't in any particular format,
#     into an HTML file (that is it will create a new HTML file and
#     leave the original text file unchanged) and will not display the
#     automatic translation links to Google. Also an HTML table of
#     contents (with one entry for each heading) will be inserted in the
#     HTML document.
#
#   ./plaintext2html-forum.sh mjb-work.txt notran notoc > mjb-work.html
#     The text file will be transformed into HTML but no table of contents
#     will be inserted nor any translation links.
#
#   ./plaintext2html-forum.sh mjb-work.txt tran notoc > mjb-work.html
#     If translation links are desired but no table of contents, use a
#     command line similar to the one above. The string 'tran' could be anything
#     as long as it's not 'notran'. This slightly dodgy 'feature' is owing to the
#     fact that I am not using any 'getopt' style option parsing.
#
#   ./plaintext2html-forum.sh stuff.txt notran toc "http://63.105.73.195/cgi-bin/add-comment"
#     This transforms the file stuff.txt omitting translation links, inserting a 
#     hyperlinked table of contents, and setting the target for the comment form
#     to the URL specified in the last parameter.
#
#
# Parameters:
#   textFileName  [required]
#     The name of the text file which is to be transformed from text into html
#   notran        [optional]
#     If the second parameter is the string 'notran' then the javascript links
#     to the google automatic language translation engine will NOT be inserted
#     into the HTML page. This is useful, for example, when the HTML page is 
#     going to be located within a 'password-protected' directory, because
#     the Google translation engine will not be able to access the page, and
#     therefore the translation links will not work.
#   notoc         [optional]
#     If the third parameter is the string "notoc", then no HTML table of
#     contents will be generated.
#   forumProcessorUrl           [optional]
#     This parameter indicates where the cgi script which processes the
#     comment form is located. If it is omitted, currently the url will default to
#     http://63.105.73.195/cgi-bin/add-comment
#   output-language
#     Still to be implemented
#     This is the language in which the message on the generated HTML page
#     will appear. For example messages next to the comment boxes and the 
#     translation links.
#   path-to-style-sheet
#     Still to be implemented
#     This is the full path (relative to the Web Server Document Root)
#     to the style sheet which is to be used by the generated HTML page
#     
#
# Notes:
#   Because of the table used to create a left margin for the table of contents
#   and for the body of the text, this HTML is NOT friendly to 'lynx' which
#   does not support HTML tables. A CSS style-sheet command should be used 
#   instead of the tables.
#
#   This script should also transform quotes into &quot;, & into &amp;, etc.
#   The script appears to be working reasonably well in conjunction with
#   the 'add-comment' cgi script.
#
#   It would be nice to make some kind of 'sub' table of contents for
#   any comments which are present in a document.
#   
#   The translation links won't work from within the 'output' generated
#   by the 'add-comment' script
#
#   This script has had problems with 'gawk' and different versions of awk. For this reason
#   the 'gawk' or 'awk' code has been removed and replaced with code using the 'nl'
#   program. This program, when used with the -bp option, double-spaces its output
#   with lines containing only spaces. Therefore some extra 'sed' lines are necessary
#   to remove these blank lines.
#
# See Also:
#   diary2html.sh, 
#     Turns a 'diary' style text file into HTML
#   linkdoc2html.sh,
#     Turns a text file which has a list of URL links and descriptions into HTML
#   linkdoc2html-index.sh
#     As above but also adds an HTML 'table of contents' for possible 'section headings'
#   linkdoc2html-forum.sh
#     Turns a text file with a URL list into an HTML file which has the capability
#     to be contributed to by a web-visitor (using cgi-scripts)
#   plaintext2pdf.sh,
#     Turns a text file into a pdf file with an optional table of contents
#   plaintext2html-simple.sh
#     As below, but doesn't use certain 'bash' tricks
#   plaintext2html.sh
#     Turns a text file with possible section headings and urls into an HTML file
#   glossary2xml.sh
#     Turn a text file which is a sort of 'glossary' into a dodgy xml file
#   alphabetize-glossary.sh
#     Re-arranges a text file which contains a series of definitions of 'items' or 'terms'
#     so that the items are ordered alphabetically.
#   add-comment
#     a cgi-script which can be used in conjunction with some of the
#     scripts above to add content specified by web-visitors to a web page
#   script-summary.txt
#     contains more short descriptions of scripts and what they do.
# Author:
#   m.j.bishop
#
# Bugs and Ideas
#   See the file linkdoc2html-forum.sh for the beginnings of an attempt to internationalize
#   the output of this script, in the sense that the messages which appear on the 
#   HTML page should be capable of being in various languages, depending on what language
#   the source file is in.
#
#   Add an output-language parameter to this script
#   Also, it would be good to add a 'style-sheet' parameter which would allow
#   this script to change the name or location of the style-sheet which is used
#   by the generated HTML file.
#
#   At the moment the script uses special 'stylesheet classes' for particular
#   elements, such as the <pre> element, although this is probably not really
#   necessary; the style should be attached to the <pre> element itself rather
#   than to a CSS class of the pre element as in <pre class="...">
#   The second method is probably only necessary when there is more than one
#   type of style which you wish to apply to a particular HTML element in the
#   same document.
#
#   In Netscape Navigator 4.61, if the style-sheet does not exist at all
#   then the browser is unable to display anything at all. 
#
#   The script could also check if there are translations of the current 
#   HTML or text file, using the standard naming convention of name.file-type.language-code
#   An example of this naming convention is  stuff.html.es  which should
#   be an HTML file which contains Spanish language content. This present
#   script could check for files which have the same name as the source
#   file but which have a different language code extension, and could 
#   therefore automatically add a link to the translated file (in addition,
#   perhaps to the Google translation links). The script would only
#   check in the current directory for these 'translated' files.
#
# Dependencies:
#   iso2html.sed
#   various Unix tools, a Bash shell
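
#-- A minimal sketch (not called anywhere in this script) of the entity
#-- encoding of quotes and & which the Notes above say this script should
#-- also perform, together with < and >. The function name
#-- 'html_escape_sketch' is hypothetical. The & is encoded first, otherwise
#-- the & of the other entities would be encoded a second time.
html_escape_sketch()
{
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g'
}
#-- Usage: html_escape_sketch < stuff.txt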
 
 if [ "$1" = "" ]
 then
   echo "usage: $0  textFileName [notran] [notoc] [forum-processor-url]"
   #-- Print every comment line of this script as the help text
   sed -n "/^[ ]*#/p" "$0"
   exit 1
 fi
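
 #-- A minimal sketch (not called anywhere in this script) of how the
 #-- positional words 'notran' and 'notoc' could become real options by
 #-- using the bash 'getopts' builtin. Everything here is hypothetical:
 #-- the function name, the flags -T (no translation links), -C (no table
 #-- of contents) and -u (forum processor url), and the variables
 #-- bTranslate, bToc, sUrl and sTextFile.
 parse_options_sketch()
 {
   local OPTIND=1 OPTARG sOption
   bTranslate=yes; bToc=yes; sUrl=""
   while getopts "TCu:" sOption
   do
     case "$sOption" in
       T) bTranslate=no ;;
       C) bToc=no ;;
       u) sUrl="$OPTARG" ;;
       *) return 1 ;;
     esac
   done
   #-- What is left after the options is the name of the text file
   shift $((OPTIND - 1))
   sTextFile="$1"
 }
 #-- Usage: parse_options_sketch -T -u "http://63.105.73.195/cgi-bin/add-comment" stuff.txt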

 #-- The section below creates the table of contents for the diary.
 #-- This line is designed to only number lines which match a pattern
 #-- In theory 'nl -bpPATTERN' should also do this, but it insisted on
 #-- 'double-spacing' the output
 #-- Also the expressions below try and get rid of things like "can't" and "won't"
 #-- because I want to apply some formatting to the content of quotes, and these
 #-- things will get in my way.

 #-- This is the pattern which determines what sort of lines will
 #-- be interpreted as 'section headings'. I cannot use this for the 'awk' line
 #-- because awk does not seem to accept the notation \{n,\}
 
 sHeadingPattern='[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z][A-Z]*[ A-Z0-9.\/\\:]*'
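
 #-- A minimal sketch (not called anywhere in this script) of how a single
 #-- line could be tested against the heading pattern above. The function
 #-- name 'is_section_heading_sketch' is hypothetical, and the pattern is
 #-- repeated inside the function only so that the sketch is self-contained.
 #-- The use of LC_ALL=C is an assumption: it stops [A-Z] from matching
 #-- lower case letters in some locales.
 is_section_heading_sketch()
 {
   local sPattern='[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z][A-Z]*[ A-Z0-9.\/\\:]*'
   printf '%s\n' "$1" | LC_ALL=C grep -q "^$sPattern\$"
 }
 #-- Usage: is_section_heading_sketch "2. INSTALLATION NOTES" && echo "heading"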
 
#-- I have disabled the code below because in a cgi environment, this script doesn't
#-- seem to have permission to create a file. It depends on who originally owns
#-- the $1.temp file. 

#-- Gnu awk (gawk) seems to assume that pattern matching should be case insensitive
#-- by default. The Begin clause below attempts to correct that. Although the script 
#-- says 'mawk' this is just a symbolic link to 'gawk'

#-->-->
#-- I am having all sorts of problems with GNU awk. For some reason it returns lower case
#-- lines, even when the regular expression dictates upper case lines. 
#-- One solution to the problem is to use 'nl' instead. For example the line below
#-- almost does the trick
# expand $1 | \
#   sed "s/^[ ]*$//g" | \
#   nl -s" " -bp'^[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z]+[ A-Z0-9.\/\\:]*$' | \
#   sed  "/^[ ]\+$/d" | \
#   sed "s/^[ ]*\([1-9][0-9]*\) /\1/g" | \
 
 # The trouble-some gawk line
 #  gawk 'BEGIN{IGNORECASE=0}/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/{ii++; print ii $0}!/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/' | \
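
 #-- A minimal sketch (not called anywhere in this script) of a shorter way
 #-- to build the full path name of the text file. The path generating code
 #-- near the end of this script notes that there is almost certainly an
 #-- easier way of doing it. The function name 'full_path_sketch' is
 #-- hypothetical. Like that code, it assumes that any path which does not
 #-- begin with '/' is relative to the current directory.
 full_path_sketch()
 {
   case "$1" in
     /*) printf '%s\n' "$1" ;;
     *)  printf '%s\n' "$(pwd)/$1" ;;
   esac
 }
 #-- Usage: sFullPathName=$(full_path_sketch "$1")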

 echo ""
 echo ""
 echo " "
 echo " "
 echo " "
 echo "        "
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 echo ""
 if [ "$2" != "notran" ]
 then
   echo "
"
   echo "See this page in (approximate):"
   echo "Español|"
   echo "Français|"
   echo "Italiano|"
   echo "Deutsch|"
   echo "Português"
   echo "
"
 fi

 #-- Put the page heading before the table of contents
 #--
 cat $1 | \
   sed "/^[ ]*=[ ]*[^=].*/!d" | \
   sed -e "s/</\&lt;/g" -e "s/>/\&gt;/g" | \
   sed "s/^[ ]*=[ ]*\([^=].*\)/<center><h2>\1<\/h2><\/center>/gi"

 #- This line below is not 'lynx friendly' as style sheet
 #- should be used instead.
 echo "
"
 echo "
[make a comment about (or add to) this document] | [Alexis Documentation Home]
"

 #-- Insert the table of contents
 if [ "$3" != "notoc" ]
 then
   #-- I put all this code here to get around a problem which arises
   #-- when this script is used from a cgi script (lack of write permissions)
   #-- The problem is not quite so simple as lack of write permissions since
   #-- this script reads and writes a file called $1.temp successfully
   #--
   #-- This is probably faster than the code in 'plaintext2html.sh' because
   #-- no files have to be written.
   echo "
"
   expand $1 | \
     sed "s/^[ ]*$//g" | \
     nl -s" " -bp'^[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z]+[ A-Z0-9.\/\\:]*$' | \
     sed "/^[ ]\+$/d" | \
     sed "s/^[ ]*\([1-9][0-9]*\) /\1/g" | \
     sed "s/\([a-zA-Z]\{2,\}\)n[\"']t/\1nt/g" | \
     sed "/^[0-9]\{1,\}$sHeadingPattern$/!d" | \
     sed "s/^\([0-9]\{1,\}\)\($sHeadingPattern\)$/\1. \2<\/a>/g"
   echo "
"
 fi

 #-- Transform the text to HTML, insert anchors
 #-- Also delete the heading line which has already been inserted in the HTML
 #-- But, the line will also delete lines beginning in == or === etc, which
 #-- may not be desirable.
 #-- The line below was designed to make the contents of quotes look different
 #-- but I think that it made the text less readable
 #--
 #--  sed "s/\(['\"]\)[^'\"]\{1,\}\1/&<\/tt>/g" | \
 #--
 #-- I have disabled the line which turns * beginning lines into hyperlinks
 #-- since this was not desirable for the netbeans documentation
 #-- The version of SED on RedHat linux does not like the syntax "\{,4\}" but "\{0,4\}"
 #-- is ok.
 #
 expand $1 | \
  sed "s/^[ ]*$//g" | \
  #-- Number all lines that are 'section headings'
  nl -s" " -bp'^[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z]+[ A-Z0-9.\/\\:]*$' | \
  #-- Get rid of the 'blank' lines which nl puts into the output
  sed "/^[ ]\+$/d" | \
  #-- Reformat the numbered section headings
  sed "s/^[ ]*\([1-9][0-9]*\) /\1/g" | \
  #-- Delete the page title because it's already been output
  sed "/^[ ]*=[ ]*\([^=].*\)$/d" | \
  #-- Encode special characters '<>&' as HTML entities
  sed -e "s/</\&lt;/g" -e "s/>/\&gt;/g" | \
  #-- Do a trick to get the '-->>' and '--<<' blocks of text to work
  sed -e "s/^[ ]*\-\-\&gt;\&gt;/<pre>/g" -e "s/^[ ]*\-\-\&lt;\&lt;/<\/pre>/g" | \
  #-- Make each 'section heading' into an HTML anchor to work with the 'Table of Contents'
  sed "s/^\([0-9]\{1,\}\)\([ A-Z0-9.\/\\:]*[A-Z]\{3,\}[ A-Z0-9.\/\\:]*\)$/\1. \2<\/a><\/strong> [TOC]<\/a>/g" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|html|txt|xml|pdf|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|/\1<\/b> (Formats:<\/em> \3<\/a> | \4<\/a> | \5<\/a> | \6<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|html|txt|pdf|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|/\1<\/b> (Formats:<\/em> \3<\/a> | \4<\/a> | \5<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|||
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|||/\1<\/b> (Formats:<\/em> html<\/a> | text<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/my/path/to/file|pdf|html|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([a-zA-Z]\{1,8\}\)|\([a-zA-Z]\{1,8\}\)|/\1<\/b> (Formats:<\/em> \3<\/a> | \4<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/full/path/to/htmlfile|/full/path/to/text/file|/full/path/to/pdffile|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([^|]*\)|\([^|]*\)|/\1<\/b> (Formats:<\/em> html<\/a> | text<\/a> | pdf<\/a>)/gi" | \
   #-- Example of Format Below: * My Title|/full/path/to/htmlfile|/full/path/to/text/file|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|\([^|]*\)|/\1<\/b> (Formats:<\/em> html<\/a> | text<\/a>)/gi" | \
   #-- Trick to make 'txt' links into 'text' links for readability
   sed "s/>txt<\/a>/>text<\/a>/gi" | \
   #-- Example of Format Below: * My Title|/full/path/to/any-old-file|
   sed "s/^[ ]*\*[ ]*\([^|]*\)|\([^|]*\)|/\1<\/b>(\2<\/a>)/gi" | \
   #-- Example of Format Below: * /full/path/to/any-old-file
   sed "s/^[ ]*\*[ ]*\([^ ]\{2,\}\)/\1<\/a>/gi" | \
  #-- Hyperlink URLs beginning with http, except between <pre> tags
  sed "/<pre>/!s/[^\"]\(http:\/\/[-a-z\%0-9\~\\\/\"\'\.\@]\{3,\}\)/ [*]<\/a>\1<\/tt>/gi" | \
  #-- Hyperlink URLs beginning with http at the start of a line, except between <pre> tags
  sed "/<pre>/!s/^\(http:\/\/[-a-z\%0-9\~\\\/\"\'\.\@]\{3,\}\)/ [*]<\/a>\1<\/tt>/gi" | \
  #-- Hyperlink email addresses with a 'mailto:' link
  sed "/<pre>/!s/\([^ ]\{2,\}@[^ \"']\{2,\}\)/\1<\/a>/g" | \
  #-- Hyperlink URLs beginning with 'www.'
  sed "/<pre>/!s/[^a-zA-Z\/]\(www\.[-a-z\%0-9\~\\\/\"\'\.\@]\{2,\}\)/ [*]<\/a>\1<\/tt>/gi" | \
  #-- Format comments added by web-users
   sed "s/^\([ ]*added[ ]\{0,4\}by:\)\([^,]\{1,\}\)\,[ ]*on[ ]*\(.*\)/\1<\/em> \2<\/tt> on \3<\/em><\/u>/gi" | \
  #-- Turn spaces into non-breaking-spaces unless they are between 'pre' tags
  sed "/<pre>/!s/[ ]\{2\}/\&nbsp;\&nbsp;/g" | \
  #-- Turn line breaks into <br> tags unless they are between 'pre' tags
  sed "/<pre>/!s/^/<br>/g"

 echo "
"
 echo "
"

 #-- Define the cgi program which will handle the adding of
 #-- comments to a particular text file.
 if [ "$4" != "" ]
 then
   sProcessorUrl=$4
 else
   #-- It would be possible to replace the Domain Name below with
   #-- an IP address, which would mean that the script would still
   #-- work even if the DNS configuration failed. I am not sure if this
   #-- is really a good idea or not.
   #sProcessorUrl="http://www.ella-associates.org/cgi-bin/add-comment"
   sProcessorUrl="http://63.105.73.195/cgi-bin/add-comment"
 fi

 #-- There is a problem in that I need to find the full path
 #-- name of the $1 variable, but I don't know how to do this. This
 #-- is necessary because the target processor is not in the same
 #-- directory as the source document (the text file)
 #-- For the time being I have used the remedy of seeing if the path
 #-- is relative or absolute. The slightly dodgy path generating code below
 #-- appears to be working. There is almost certainly a much easier way
 #-- of doing it
 sRelativePath=$(dirname "$1")
 sFirstCharacter=$(echo $sRelativePath | sed "s/^\(.\).*$/\1/g")

 if [ "$sRelativePath" = "." ]
 then
   sFullPathName="$(pwd)/$1"
 elif [ "$sFirstCharacter" = "." ]
 then
   sFullPathName="$(pwd)/$1"
 elif [ "$sFirstCharacter" = "/" ]
 then
   sFullPathName="$1"
 else
   sFullPathName="$(pwd)/$1"
 fi

 # echo $sFullPathName

 echo "

BACK TO THE TABLE OF CONTENTS
If you wish, you may add a comment, suggestion or other contribution which will appear at the end of this document.
Any input you make is greatly appreciated.


Your Comment (or other contribution to this document)


Your Name [OPTIONAL BUT NICE]


"

 if [ "$2" != "notran" ]
 then
   echo "
"
   echo "See this page in (approximate):"
   echo "Español|"
   echo "Français|"
   echo "Italiano|"
   echo "Deutsch|"
   echo "Português"
   echo "
"
 fi

 echo ""
 echo ""

 #rm -f $1.temp
 #rm -f plain-text-toc.temp