# Description:
#   A script to reformat a plain text document which has no particular
#   format into DocBook style XML. Information about DocBook XML can be found at
#   http://www.docbook.org/
#
#   The script recognises some special structures within the plain text document. For
#   example:
#
#   Where the first non-whitespace character on a line is '=' then all the following
#   text on the line should be formatted as a 'heading'. If the first non-whitespace
#   character is '*' then the following text should be hyperlinked. Also, url style
#   strings should be recognised and given the appropriate DocBook link tag (this
#   script uses <ulink>)
#
#   Lines which consist of only capital letters and numbers (with at least three
#   consecutive capital letters) are interpreted as section headings, and constitute the
#   automatically generated table of contents.
#
#   This script, like the linkdoc2html.sh script, also accepts the format
#     * Document Title|Html-Url-Or-Path|Text-Url-Or-Path|
#   The script will render this into an emphasised 'document title' with links to the
#   different formats for the document.
#
#   The script will also format blocks of text between the strings -->> and --<<
#   (where they are the first string on the line) as a DocBook XML <screen>
#   block
#
#
# Examples:
#   ./plaintext2docbook.sh mjb-work.txt > mjb-work.xml
#     This command line, executed in some kind of a bash shell, will transform a
#     plain text file which isn't in any particular format into a DocBook XML file
#     (that is, it will create a new XML file and leave the original text file
#     unchanged)
#
#
# Parameters:
#   textFileName [required]
#     The name of the text file which is to be transformed from text into XML
#
# Notes:
#   This script was derived from the script plaintext2html-forum.sh. This script expects
#   the same type of plain text structuring as that script, but transforms to
#   DocBook XML instead of HTML
#
#   This script should also transform quotes into &quot;, & into &amp; etc
#   The script appears to be working reasonably well in conjunction with
#   the 'add-comment' cgi script.
#
#   This script has had problems with 'gawk' and different versions of awk. For this reason
#   the 'gawk' or 'awk' code has been removed and replaced with code using the 'nl'
#   program. This program, when used with the -bp option, double spaces the object file
#   with lines containing only spaces.
#   Therefore some extra 'sed' lines are necessary
#   to remove these blank lines
#
# See Also:
#   diary2html.sh
#     Turns a 'diary' style text file into HTML
#   linkdoc2html.sh
#     Turns a text file which has a list of URL links and descriptions into HTML
#   linkdoc2html-index.sh
#     As above but also adds an HTML 'table of contents' for possible 'section headings'
#   linkdoc2html-forum.sh
#     Turns a text file with a URL list into an HTML file which has the capability
#     to be contributed to by a web-visitor (using cgi-scripts)
#   plaintext2pdf.sh
#     Turns a text file into a pdf file with an optional table of contents
#   plaintext2html-simple.sh
#     As below, but doesn't use certain 'bash' tricks
#   plaintext2html.sh
#     Turns a text file with possible section headings and urls into an HTML file
#   glossary2xml.sh
#     Turns a text file which is a sort of 'glossary' into a dodgy xml file
#   alphabetize-glossary.sh
#     Re-arranges a text file which contains a series of definitions of 'items' or 'terms'
#     so that the items are ordered alphabetically.
#   add-comment
#     a cgi-script which can be used in conjunction with some of the
#     scripts above to add content specified by web-visitors to a web page
#   script-summary.txt
#     contains more short descriptions of scripts and what they do.
#
# Author:
#   m.j.bishop
#
# Bugs and Ideas:
#
#   Because the American Redhat Server uses UTF-8 encoding of text files I
#   need to convert text files into iso8859 if I want to use the iso2html.sed
#   script. This script is useful for encoding european accented characters as
#   'entities', which are things that look like &amp; for & or &ntilde; for ñ
#
#   This script may not always produce well formed XML. Whether it does
#   may depend on the structure of the plain text file
#
#   The process of XMLizing the text is not completely straightforward, especially
#   using tools such as SED.
#   Since XML is a strict format some thought will have
#   to go into this script and it may be necessary in some cases to hand edit
#   the resulting text file.
#
#   The script could also check if there are translations of the current
#   HTML or text file, using the standard naming convention of name.file-type.language-code
#   An example of this naming convention is stuff.txt.es which should
#   be an XML file which contains Spanish language content. This present
#   script could check for files which have the same name as the source
#   file but which have a different language code extension, and could
#   therefore automatically add a link to the translated file.
#   The script would only check in the current directory for these 'translated' files.
#
#   I should be able to use Xerces on the command line to check if the resulting
#   XML is valid and to provide a warning message if it is not
#
# Dependencies:
#   iso2html.sed
#   various Unix tools, a Bash shell. This may not be an exhaustive list
#   echo, cat, sed
#

if [ "$1" = "" ]
then
  echo "usage: $0 textFileName"
  sed -n "/^[ ]*#[^\-]/p" "$0"
  exit 1;
fi

sHeadingPattern='[ A-Z0-9.\/\\:]*[A-Z][A-Z][A-Z][A-Z]*[ A-Z0-9.\/\\:]*'

#-- Gnu awk (gawk) seems to assume that pattern matching should be case insensitive
#-- by default. The Begin clause below attempts to correct that. Although the script
#-- says 'mawk' this is just a symbolic link to 'gawk'

echo ""
echo "
"
#echo ""
#echo ""
#echo ""
#echo ""
echo " "

#-- Put the page heading before the table of contents
#--
cat "$1" | \
sed "/^[ ]*=[ ]*[^=].*/!d" | \
sed -e "s/</\&lt;/g" -e "s/>/\&gt;/g" | \
sed "s/^[ ]*=[ ]*\([^=].*\)/<title>\1<\/title>/gi"

echo "
 <authorgroup>
  <author>
   <firstname></firstname>
   <surname></surname>
  </author>
  <author>
   <firstname></firstname>
   <surname></surname>
  </author>
 </authorgroup>"
echo "<corpname>Ella Associates</corpname>"
echo "
 <revhistory>
  <revision>
   <revnumber>0.1</revnumber>
   <date>$(date)</date>
   <revremark>Document generated</revremark>
  </revision>
 </revhistory>"
echo "</articleinfo>"

#-- Transform the text to DocBook XML, insert ulinks etc
#-- Also delete the heading line which has already been inserted in the HTML
#-- But, the line will also delete lines beginning in == or === etc, which may not be desirable.
#--
#-- I have disabled the line which turns * beginning lines into hyperlinks
#-- since this was not desirable for the netbeans documentation
#-- The version of SED on RedHat linux does not like the syntax "\{,4\}" but "\{0,4\}"
#-- is ok.

# echo "<abstract><para></para></abstract>"

#cat $sFile |
# sed "/^[ ]*\*/{ \
# s/^[ ]*\*\(.*\)/<glossary-item value = \"\1\">/g;x;G; \
# s/^<glossary-item value = \"[^\n]*/<\/glossary-item>/;}"

# The line below surrounds paragraphs with the Docbook para tag.
# I don't really know how it works but I got the idea from http://sed.sourceforge.net/
#
# sed -e '/./{H;$!d;}' -e 'x;/<sect1>/!s/^\(.*\)$/<para>\1<\/para>/;'

expand "$1" | \
sed "s/^[ ]*$//g" | \
#-- try to differentiate apostrophes from single quotes
#-- Encode special characters '<>&' as XML entities. Apparently XML only understands
#-- five entities natively, the ones below and &apos; I use single quotes as apostrophes
#-- but I will try to differentiate when I do the transformation
#-- Ampersands have to come first. But this causes trouble if the Ampersand is
#-- within a URL such as being part of a query string.
#-- So I have disabled it
#-- because most docs don't contain free ampersands. Forget that. If the ampersands
#-- aren't transformed then they also cause problems because XML parsers expect
#-- a semi-colon after an & character. This seems a general problem with url query strings
#-- in XML
sed -e "s/&/\&amp;/g" | \
#-- Do the other entities
sed -e "s/</\&lt;/g" -e "s/>/\&gt;/g" | \
#-- insert a blank line above and below every section heading. This is to
#-- ensure that other XML tags don't overlap the <sect1> tags
sed '/^[ A-Z0-9.\/\\:]*[A-Z]\{3,\}[ A-Z0-9.\/\\:]*$/{x;p;x;G;}' | \
#-- Reduce consecutive empty lines to one empty line
cat -s | \
#-- Delete the page title because it's already been output
sed "/^[ ]*=[ ]*\([^=].*\)$/d" | \
#-- Do a trick to get the '-->>' and '--<<' blocks of text to work. The markers
#-- have already been entity-encoded by the step above, so match the encoded form
sed -e "s/^[ ]*--&gt;&gt;/<screen>/g" -e "s/^[ ]*--&lt;&lt;/<\/screen>/g" | \
#-- Make each 'section heading' into a docbook section with title
sed "s/^\([ A-Z0-9.\/\\:]*[A-Z]\{3,\}[ A-Z0-9.\/\\:]*\)$/<sect1 lang=\"en\" id=\"\1\"><title>\1<\/title>/g" | \
#-- Add end section tags
sed "/<sect1 lang=\"[^\"]*\" id=\"[^\"]*\"><title>/{x;/./s/^.*$/<\/sect1>/;G;}" | \
#-- Mark-up email addresses.
sed "/<screen>/,/<\/screen>/!s/\([-a-z0-9.]\{2,\}@[^ \"']\{2,\}\)/<email>\1<\/email>/gi" | \
#-- Surround paragraphs with the <para> tag. I don't really know exactly how this sed snippet works.
#-- (As far as I can tell: outside <screen> blocks, non-blank lines are collected in the
#-- hold space; at each blank line, or at the last line, the collected paragraph is swapped
#-- back into the pattern space and wrapped in <para> unless it contains a <sect1 tag.)
sed -e '/<screen>/,/<\/screen>/!{/./{H;$!d;};x;/<sect1/!s/^\(.*\)$/<para>\1<\/para>/;}' | \
#-- link URLs beginning with http, except between <screen> tags
sed "/<screen>/,/<\/screen>/!s/\([^\"]\)\(http:\/\/[-a-z\%0-9\~\\\/\"\'\.\@_\=\:\&?;]\{3,\}\)/\1<ulink url=\"\2\">\2<\/ulink>/gi" | \
#-- link URLs beginning with http at the start of a line, except between <screen> tags
sed "/<screen>/,/<\/screen>/!s/^\(http:\/\/[-a-z\%0-9\~\\\/\"\'\.\@_\=\:\&?;]\{3,\}\)/<ulink url=\"\1\"> \1<\/ulink>/gi" | \
#-- Markup the 'Alexis' keyword as a software program
#sed "/<screen>/,/<\/screen>/!s|\([^a-z/]\)alexis\([^a-z/]\)|\1<application>Alexis</application>\2|gi" | \
#-- when 'alexis' starts the line
#sed "/<screen>/,/<\/screen>/!s|^alexis\([^a-z/]\)|<application>Alexis</application>\1|gi" | \
#-- when 'alexis' ends line
#sed "/<screen>/,/<\/screen>/!s|\([^a-z/]\)alexis$|\1<application>Alexis</application>|gi" | \
#-- Mysql is an application
#sed "/<screen>/,/<\/screen>/!s|\<mysql\>|<application>MySql</application>|gi" | \
#-- Java is an application
#sed "/<screen>/,/<\/screen>/!s|\<java\>|<application>Java</application>|gi" | \
#-- I am an author
#sed "/<screen>/,/<\/screen>/!s|\<mjb\>|<authorinitials>mjb</authorinitials>|gi" | \
#-- Markup the text on keyboard keys
#-- The substitutions which use the \| alternation take '#' as the delimiter, because with
#-- '|' as the delimiter sed reads \| as a literal '|' and the alternation never matches
sed "/<screen>/,/<\/screen>/!s#\(press\|type\) \(the \)\?[\"\[']\?enter[]\"']\?#\1 \2<keycap>Enter</keycap>#gi" | \
#-- Deal with text about the 'escape' key
sed "/<screen>/,/<\/screen>/!s#\(press\|type\) \(the \)\?[\"\[']\?esc\(ape\)\?[]\"']\?#\1 \2<keycap>Esc</keycap>#gi" | \
#-- Deal with text about the 'control' key
sed "/<screen>/,/<\/screen>/!s#\(press\|type\) \(the \)\?[\"\[']\?co\?n\?tro\?l[]\"']\?#\1 \2<keycap>Control</keycap>#gi" | \
#-- 'shift' key
sed "/<screen>/,/<\/screen>/!s#\(press\|type\) \(the \)\?[\"\[']\?shift[]\"']\?#\1 \2<keycap>Shift</keycap>#gi" | \
#-- 'alt' key
sed "/<screen>/,/<\/screen>/!s#\(press\|type\) \(the \)\?[\"\[']\?alt[]\"']\?#\1 \2<keycap>Alt</keycap>#gi" | \
#-- An example of the format below: click on the 'ok'
sed "/<screen>/,/<\/screen>/!s#\(press\|click\) \(on \)\?\(the \)\?[\"\[']\?ok[]\"']\?#\1 \3<guibutton>OK</guibutton>#gi" | \
#-- An example of the format below: press 'cancel'
sed "/<screen>/,/<\/screen>/!s#\(press\|click\) \(on \)\?\(the \)\?[\"\[']\?cancel[]\"']\?#\1 \3<guibutton>Cancel</guibutton>#gi" | \
#-- An example of the format below: click on the "start" button
sed "/<screen>/,/<\/screen>/!s#\(press\|click\) \(on \)\?\(the \)\?[\"\[']\?start[]\"']\?#\1 \3<guibutton>Start</guibutton>#gi" | \
#-- An example of the following: type in the box labelled 'edit text'
sed "/<screen>/,/<\/screen>/!s#\(box\|button\) \(which is \)\?labelled,\? [\"']\([^'\"]\+\)[\"']#\1 \2labelled <guilabel>\3</guilabel>#gi" | \
#-- An example of the following format: click on the "Mail Servers" item
sed "/<screen>/,/<\/screen>/!s|click \(on \)\?the [\"\[']\([^]'\"]\+\)[]\"'] \(menu[- ]\?\)\?item|click \1the <guimenuitem>\2</guimenuitem> \3item|gi" | \
#-- link URLs beginning with 'www.'
sed "/<screen>/,/<\/screen>/!s/\([^a-zA-Z\/]\)\(www\.[-a-z\%0-9\~\\\/\"\'\.\@_&:?;]\{2,\}\)/\1<ulink url=\"http:\/\/\2\">\2<\/ulink>/gi" | \
#-- 'Entitize' apostrophes and single quotes
sed -e "s/n't/n\&apos;t/gi" -e "s/\([a-z]\)'s/\1\&apos;s/gi" -e "s/'/\&quot;/g"

echo "</sect1>"
echo "</article>"
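
#-- Illustrative sketch only: this function is not part of the original script and is
#-- not called by the pipeline above. It isolates the entity-encoding order described
#-- in the comments: '&' has to be encoded first, otherwise the ampersands introduced
#-- by '&lt;' and '&gt;' would themselves be encoded a second time. The function name
#-- 'xml_escape' is hypothetical.
xml_escape() {
  #-- Reads text on standard input and writes the entity-encoded text to standard output
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
#-- Possible usage: printf '%s\n' 'a < b & c' | xml_escape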