# Description:
#
#   The following command lines were used to reformat the NetBeans
#   (www.netbeans.org) standard documentation for the xml module into
#   a plain text version, with all the html files concatenated in the
#   same order that they are referred to in the table of contents. In
#   addition, the 'see also' section of each html page has been
#   removed. These command lines were run on an ms-windows 2000 computer
#   using the 'unxutils' unix shell and tools, which are hosted on
#   SourceForge.
#
#   The script also uses an ms-windows 'lynx' port, which is located at
#   http://www.jim.spath.com/lynx_win32/
#   Lynx (a text only browser) is required in order to convert html
#   into plain text with its '-dump' command line option. The unix
#   program 'html2text' will also do this, but I have not been able to
#   find an ms-windows port of this program. The dos program 'htmStrip'
#   also does this, but does not seem to have good support for 'batch'
#   processing. For this version of lynx on ms-windows you will have to
#   change your 'path' environment variable and add an environment
#   variable called 'lynx_cfg' which points to the lynx.cfg file.
#
#   The result of these transformations is a plain text manual of 235 A4
#   pages using an 8 point font in Wordpad, or 298 pages with a 10 point font.
#
# Notes:
#   This script should also write 'section' or 'subject' headings when it
#   concatenates the various index files. This would allow these 'subject'
#   headings to form part of the table of contents for an output format
#   such as Adobe pdf. Otherwise, the table of contents for the users guide
#   is too long (approximately 400 items).
#
#   This script was initially written and run on an MS Windows laptop using
#   the Cygwin shell, but I think it should run on unix too.
#
# See Also
#   netbeans-guide2html.sh, netbeans-guide2pdf.sh
#   plaintext2html.sh, plaintex2html.sh
# Url:
#   http://www.ella-associates.org/alexis-info/utils/
#
#   Author: m.j.bishop
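
#-- A minimal sketch of the lynx environment setup described above, assuming
#-- the lynx port was unpacked into a single directory that also holds
#-- lynx.cfg. The function name 'setup_lynx_env' and its directory argument
#-- are illustrative only and are not part of the original script. The
#-- variable name is written in upper case here; ms-windows treats
#-- environment variable names as case insensitive.
setup_lynx_env() {
  lynx_dir="$1"                      # hypothetical install directory
  PATH="$PATH:$lynx_dir"             # let the shell find the lynx executable
  LYNX_CFG="$lynx_dir/lynx.cfg"      # tell lynx where its configuration lives
  export PATH LYNX_CFG
}
#-- Example use (the path is a placeholder): setup_lynx_env "/c/lynx"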

#-- Some pseudo code, which depends on which module's documentation
#-- you would like to reformat
#-- The jar documentation files are stored in
#--   [Netbeans Installation Dir]\modules\docs
 
# jar xf [module-name]
# cd org\netbeans\modules\xml\core\docs
# cd org\netbeans\modules\usersguide
#-- for the Tomcat documentation the html files are extracted
#-- to the following directory:
#--   [Nb Install Dir]\modules\docs\org\netbeans\modules\tomcat\tomcat40\docs\tomcat4


#-- for the Tomcat docs, the xml files are called
#-- 'tomcat-toc.xml' and 'tomcatMap.jhm'
#-- for the users guide the map file is 'Map.jhm'
#-- there does not appear to be a complete table of contents file
#--

grep 'target=' ide-toc.xml | sed -e 's/.*target="//' -e 's/".*$//' > tocmap.txt

#-- This uses the xml map file to find the corresponding html files
#-- in the 'html' directory 
#-- Don't get rid of the leading directory name (eg 'html') below
#--
#-- The line below looks for double quotes (") in the sed part of 
#-- the command line. The ms-windows command shell does not seem to
#-- be able to do this, since it doesn't recognise the single quote (')
#-- as a string delimiter. 
#-- Probably need to get rid of references to 'pending.html' which
#-- indicates that no documentation is available. Also a number of files
#-- occur twice in the output. 'uniq' alone will not solve this because
#-- it only removes adjacent duplicates, so awk is used below to drop
#-- repeated lines wherever they occur while keeping the original order.
#-- This command takes approximately 20 seconds to complete
#-- on my win2000 laptop.
while IFS= read -r f
do
  grep "target=\"$f\"" Map.jhm
done < tocmap.txt | sed 's/.*url="\([^"]*\)".*/\1/' | \
       expand | sed '/^[ ]*<!--/d' | \
       awk '!seen[$0]++' > newtoc.txt
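
#-- A possible way to drop the 'pending.html' placeholder pages mentioned
#-- above, sketched as a filter that reads a url list on standard input.
#-- The function name 'drop_pending' is illustrative and is not used
#-- elsewhere in this script; it could be added to the pipeline above
#-- just before the list is written to newtoc.txt.
drop_pending() {
  grep -v 'pending\.html'    # keep every line that does not mention pending.html
}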

#cd html
#-- don't do a 'cd' but use the directory reference in the
#-- '-map.xml' file. Otherwise for the 'usersguide' documentation, you
#-- would have to 'cd' into a large number of directories.
#--
#-- or cd tomcat4
#-- The command below took about 50 seconds on my win2000 laptop

#-- The output is saved to all.txt, which the clean-up step further down reads.
for f in $(cat newtoc.txt); do lynx -dump -nolist "$f"; done > all.txt
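
#-- The Notes at the top of this file suggest writing a heading for each
#-- page as the files are concatenated. A rough sketch of that idea follows.
#-- The function name 'dump_with_headings', the 'HTML_DUMP' variable and the
#-- heading layout are all assumptions, not part of the original script.
#-- Each heading is simply the html file name without directory or extension.
dump_with_headings() {
  # $1 is a file listing one html path per line (for example newtoc.txt).
  # The list is read on file descriptor 3 so the dump command keeps its stdin.
  while IFS= read -r page <&3; do
    [ -n "$page" ] || continue                 # skip empty lines
    title=$(basename "$page" .html)            # eg 'html/some_page.html' -> 'some_page'
    printf '\n== %s ==\n\n' "$title"
    ${HTML_DUMP:-lynx -dump -nolist} "$page"   # unquoted so the options split into words
  done 3< "$1"
}
#-- Example use: dump_with_headings newtoc.txt > all.txt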

#-- Microsoft Windows 2000 contains a program called 'expand' which
#-- interferes with the unix utility of the same name. To run this script
#-- on MS Windows you may have to rename the ms 'expand' program.

# E:\Program Files\NetBeans IDE 3.4\modules\docs\org\netbeans\modules\xml\core\docs\html>
#-- maybe I should leave the see also section in??

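#-- Convert tabs to spaces, delete the splash and legal notice lines, then
#-- delete each 'See also' block up to the next blank line or ruled line.
#-- Inside a sed bracket expression a backslash is literal, so the hyphen
#-- is written first ([-_]) in order to match both '-' and '_' rules.
expand all.txt | \
   sed -e '/\[splash\]/d' -e '/Legal Notices/d' | \
   sed '/^[ ]*See also[ ]*$/,/^[ ]*[-_]*[ ]*$/d' | \
   sed 's/^[ ]*$//' | tr -s "\n" > all-clean.txt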

# In order to convert this text output to 
# HTML or PDF see the scripts 'netbeans-guide2html.sh' and 'netbeans-guide2pdf.sh'


