# Description:
# A script to reformat a plain text document which is in no
# particular format. The script recognises some special structures
# within the plain text document. For example:
#
# Where the first non-whitespace character on a line is '=' then
# all the following text on the line should be formatted as a
# 'heading'. And if the first non-whitespace character is '*' then
# the following text should be hyperlinked. Also, url style strings
# should be recognised and given a hyperlink token in front of them,
# such as '[*]'. I prefer this to underlining the entire url, because
# I find that the underlining tends to interfere with the readability
# of the text. Some people would say, "use style-sheets" but to them
# I would reply that the 'heraldic' visual pattern of the underlined
# hyperlink is imprinted in many internet users' brains, and to change
# that 'iconography' can lead to unnecessary confusion.
#
# Lines which consist of only capital letters and numbers (with at
# least a few capital letters) are interpreted as headings, and
# constitute the automatically generated table of contents.
#
# This script, like the linkdoc2html.sh script, also accepts the format
# * Document Title|Html-Url-Or-Path|Text-Url-Or-Path|
# The script will render this into an emphasised 'document title' with
# hyper-links to the different formats for the document.
#
# The script will also format blocks of text between the strings -->> and --<<
# (where they are the first string on the line) as an HTML block
#
# Examples:
# ./plaintext2html.sh mjb-work.txt notran > mjb-work.html
# This command line, executed in some kind of a bash shell, will
# transform a plain text file which isn't in any particular format,
# into an HTML file (that is it will create a new HTML file and
# leave the original text file unchanged) and will not display the
# automatic translation links to Google. Also an HTML table of
# contents (with one entry for each heading) will be inserted in the
# HTML document.
#
# ./plaintext2html.sh mjb-work.txt notran notoc > mjb-work.html
# The text file will be transformed into HTML but no table of contents
# will be inserted nor any translation links.
#
# ./plaintext2html.sh mjb-work.txt blah notoc > mjb-work.html
# If translation links are desired but no table of contents, use a
# command line similar to above. The string 'blah' could be anything
# as long as it's not 'notran'. This slightly dodgy 'feature' is owing to the
# fact that I am not using any 'getopt' style option parsing.
#
# Parameters:
# textFileName [required]
# The name of the text file which is to be transformed from text into html
# notran [optional]
# If the second parameter is the string 'notran' then the javascript links
# to the google automatic language translation engine will NOT be inserted
# into the HTML page. This is useful, for example, when the HTML page is
# going to be located within a 'password-protected' directory, because
# the Google translation engine will not be able to access the page, and
# therefore the translation links will not work.
# notoc [optional]
# If the third parameter is the string "notoc", then no HTML table of
# contents will be generated.
#
# Notes:
# This script contains an improved url detection regular expression, better than that
# in say txtdoc2html.sh. But the url pattern matcher still has a problem when
# somebody puts a full stop after a url. It thinks that that dot is part of the
# url. There is possibly no reason why you couldn't just use the 'diary2html.sh'
# filter script, instead of this one. The Html generated is somewhat dodgy but
# attempts to avoid some of the more heinous html sins, such as tags
#
# Interestingly, almost all the functionality which this script provides,
# that is, making an html table of contents, could equally be achieved using
# the 'htmldoc' program with a line similar to
# htmldoc -f output.html --book --no-title theTextFile.txt
# This assumes that the html file contains heading tags in the correct order.
# The script 'plaintext2html-simple.sh' performs the same functions as this
# script but using only literal regular expressions, instead of variables.
# I have an inkling that the 'variable expansion' capabilities of MS Windows
# Bash Shell emulators is not that fantastic, although Cygwin is probably
# the exception. Therefore, I have kept this simpler version of the program.
#
# Because of the table used to create a left margin for the table of contents
# and for the body of the text, this html is NOT friendly to 'lynx' which
# does not support HTML tables. A CSS style sheet command should be used
# instead of the tables.
#
# This script should also transform quotes into &quot;, & into &amp;, etc.
# See Also:
# diary2html.sh, linkdoc2html.sh, plaintext2pdf.sh, plaintext2html-simple.sh
# Author:
# m.j.bishop
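#-- Sketch (not part of the original script): one possible way to keep a
#-- trailing full stop out of a detected url, the problem described in the
#-- Notes above. The function name 'mark_urls' and the simplified url
#-- pattern are assumptions for illustration only; the url expression used
#-- by this script is more elaborate, and the '[*]' token here is plain
#-- text rather than a hyperlink.
mark_urls() {
  # Match http:// or https:// up to the next space, but insist that the
  # last matched character is not sentence punctuation, so that the dot
  # in "example.org." is left outside the url.
  sed 's|\(https\{0,1\}://[^ ]*[^ .,;:!?)]\)|[*] \1|g'
}
#-- Possible use: echo 'See http://example.org/a.html.' | mark_urls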
if [ "$1" = "" ]
then
echo "usage: $0 textFileName [notran] [notoc] [forum]"
sed -n "/^[ ]*#/p" "$0"
exit 1;
fi
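#-- Sketch (not part of the original script): the options are positional,
#-- so 'notran' is only recognised as the second parameter and 'notoc'
#-- only as the third. A helper such as the hypothetical 'has_flag' below
#-- would accept either word in any position after the file name, without
#-- needing 'getopt' style parsing.
has_flag() {
  # $1 is the flag to look for; the remaining arguments are searched.
  sFlag=$1
  shift
  for sArg in "$@"
  do
    [ "$sArg" = "$sFlag" ] && return 0
  done
  return 1
}
#-- Possible use: if has_flag notoc "$@"; then echo "no contents"; fi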
#-- The section below creates the table of contents for the diary.
#-- This line is designed to only number lines which match a pattern
#-- In theory 'nl -bpPATTERN' should also do this, but it insisted on
#-- 'double-spacing' the output
#-- Also the expressions below try and get rid of things like "can't" and "won't"
#-- because I want to apply some formatting to the content of quotes, and these
#-- things will get in my way.
#-- This is the pattern which determines what sort of lines will
#-- be interpreted as 'section headings'. I cannot use this for the 'awk' line
#-- because awk does not seem to accept the notation \{n,\}
sHeadingPattern='[ A-Z0-9.\/\\]*[A-Z]\{3,\}[ A-Z0-9.\/\\]*'
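#-- Sketch (not part of the original script): a quick way to check which
#-- lines the heading pattern accepts. The function name 'is_heading_line'
#-- is an assumption; sDemoPattern is a copy of sHeadingPattern above.
#-- LC_ALL=C is used so that the range A-Z cannot match lower-case
#-- letters in locales with a mixed-case collation order.
is_heading_line() {
  sDemoPattern='[ A-Z0-9.\/\\]*[A-Z]\{3,\}[ A-Z0-9.\/\\]*'
  printf '%s\n' "$1" | LC_ALL=C grep -q "^$sDemoPattern\$"
}
#-- is_heading_line "2. GETTING STARTED" succeeds (capitals, digits, dots);
#-- is_heading_line "Getting started" fails (lower-case letters), and
#-- is_heading_line "AB 12" fails (fewer than three capitals in a row).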
expand "$1" | \
mawk '/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/{ii++; print ii $0}!/^[ A-Z0-9.\/\\]*[A-Z]+[ A-Z0-9.\/\\]*$/' | \
sed "s/\([a-zA-Z]\{2,\}\)n[\"']t/\1nt/g" > "$1.temp"
(echo ""; \
cat $1.temp | \
sed "/^[0-9]\{1,\}$sHeadingPattern$/!d" | \
sed "s/^\([0-9]\{1,\}\)\($sHeadingPattern\)$/
\1. \2<\/a>/g"; \
echo "
";) > plain-text-toc.temp
echo ""
echo ""
echo " "
echo " "
echo " "
echo " "
echo ""
echo ""
echo ""
echo ""
echo ""
echo ""
echo ""
echo ""
echo ""
if [ "$2" != "notran" ]
then
echo ""
echo "See this page in (approximate):"
echo "Español|"
echo "Français|"
echo "Italiano|"
echo "Deutsch|"
echo "Português"
echo " "
fi
#-- Put the page heading before the table of contents
#--
cat $1.temp | \
sed "/^[ ]*=[ ]*[^=].*/!d" | \
sed -e "s/</\&lt;/g" -e "s/>/\&gt;/g" | \
sed "s/^[ ]*=[ ]*\([^=].*\)/<center><h2>\1<\/h2><\/center>/gi"
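#-- Sketch (not part of the original script): the header notes say that
#-- quotes and ampersands should be escaped as well as angle brackets.
#-- A hypothetical 'escape_html' filter could handle all four. The '&'
#-- must be replaced first, so that the entities produced by the later
#-- expressions are not escaped a second time.
escape_html() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}
#-- Possible use: escape_html < "$1.temp"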
echo "
"
if [ "$2" != "notran" ]
then
echo ""
echo "See this page in (approximate):"
echo "Español|"
echo "Français|"
echo "Italiano|"
echo "Deutsch|"
echo "Português"
echo " "
fi
echo ""
echo ""
rm -f "$1.temp"
rm -f plain-text-toc.temp