
                      Web EXtractor
             


General Info
------------
Wex (Web EXtractor) is a Java *application* that I developed to recursively retrieve Web subtrees from Web sites and store them locally. A subtree consists of HTML documents and the related files referenced in their HTML anchor tags.

This early beta version of Wex saves only files referenced by HTTP URLs that point to the site where the specified Web tree starts, and omits URLs that reference locations outside the original site. Wex also skips URLs that reference documents located "higher" than the root of the specified Web tree.
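
The two rules above (same site, not above the root) can be sketched roughly as follows. This is only an illustration written against the modern java.net.URI class, not code taken from Wex; the class name ScopeFilter and the method inScope are hypothetical, and comparing sites by host name alone is an assumption.

```java
import java.net.URI;

// Hypothetical helper, not part of the Wex sources: decides whether a link
// found in a page belongs to the Web subtree being retrieved.
class ScopeFilter {
    private final URI root;       // URL where the subtree starts
    private final String rootDir; // root path up to and including the last "/"

    ScopeFilter(String rootUrl) {
        this.root = URI.create(rootUrl).normalize();
        String path = root.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        this.rootDir = path.substring(0, path.lastIndexOf('/') + 1);
    }

    // baseUrl is the page the link was found on, href is the raw link text.
    boolean inScope(String baseUrl, String href) {
        URI link;
        try {
            link = URI.create(baseUrl).resolve(href).normalize();
        } catch (IllegalArgumentException e) {
            return false; // malformed link, skip it
        }
        // Rule 1: only HTTP URLs on the site where the tree starts
        // (assumption: "same site" means same host name).
        if (!"http".equalsIgnoreCase(link.getScheme())) {
            return false;
        }
        String host = link.getHost();
        if (host == null || !host.equalsIgnoreCase(root.getHost())) {
            return false;
        }
        // Rule 2: nothing located "higher" than the root of the tree.
        String path = link.getPath();
        return path != null && path.startsWith(rootDir);
    }
}
```

With a root of http://example.com/docs/index.html, a link "guide/a.html" found on that page is accepted, while "../other.html" and any link to another host are rejected.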
Wex can also be used in "trace" mode to show the Web subtree in its text window without saving files to the local disk.
To work more productively, Wex runs an additional thread for each location it saves, while parsing HTML in a separate thread. By default the number of additional threads is set to 7; it can be changed from the Wex command line.
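
As a rough illustration of this arrangement, the sketch below lets one thread (the parser) hand locations to a bounded group of saver threads. It swaps in a fixed-size thread pool from java.util.concurrent, which did not exist in JDK 1.0, so it is not how Wex itself is written. The class name SaverPool and its methods are hypothetical, and the save step is a placeholder that only records the URL.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a parser thread submits locations, saver threads store them.
class SaverPool {
    static final int DEFAULT_SAVERS = 7; // default number of additional threads

    private final ExecutorService savers;
    private final List<String> saved = new CopyOnWriteArrayList<>();

    SaverPool(int threads) {
        this.savers = Executors.newFixedThreadPool(threads);
    }

    // Called by the parsing thread for every location it decides to save.
    void submit(String url) {
        savers.execute(() -> save(url));
    }

    // Placeholder for the real work (fetch the URL, write it to local disk).
    private void save(String url) {
        saved.add(url);
    }

    // Stops accepting work, waits for pending saves, returns what was saved.
    List<String> shutdownAndGetSaved() {
        savers.shutdown();
        try {
            savers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return saved;
    }
}
```

Bounding the number of saver threads keeps the number of simultaneous connections to the server under control, while the parser is free to keep discovering links.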
To parse HTML, Wex uses an HTML parser library prototype developed for testing the beta release of HotJava by <a href="http://java.sun.com/people/avh/">* Arthur van Hoff *</a> and modified by <a href="mailto:dima@paragraph.com">* me * </a> to work with the JDK 1.0 release of Java.

Running Wex
-----------
To run Wex, unzip the archive file into any directory of your choice. All files and directories that Wex needs will then be in the Wex subdirectory of that directory, for example :

~MyDir/Wex/

This directory will contain :

WEXmake          - Wex makefile (unix)
HTMLmake         - HTML library makefile (unix)
wex.mak          - Wex makefile (NT)
HTML.mak         - HTML library makefile (NT)
wex              - directory with Wex classes
html             - directory with HTML library classes
dtds             - directory with DTD files
Sun              - directory with original Readme and other files 
                   from Sun about HTML Parser library prototype.

Now, provided that you have JDK 1.0 installed properly on your computer, you can run Wex from this "top" Wex distribution directory as follows :

java wex.Wex [-t] [-l number] [-out local_root_directory]

Wex options :
             -t trace only, don't save URLs
             -l number of additional threads to save URLs [7 - default]
             -out local_root_directory to start saving URLs [./wextmp - default]
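
For illustration, the options and defaults listed above could be read from the command line roughly like this. This is a minimal sketch, not the parsing code Wex actually uses. The class name WexOptions and its fields are hypothetical, and arguments that are not one of the three listed options are simply collected, since the usage line above does not describe them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the three options listed above.
class WexOptions {
    boolean traceOnly = false;    // -t
    int saverThreads = 7;         // -l number
    String outDir = "./wextmp";   // -out local_root_directory
    List<String> rest = new ArrayList<>(); // anything else, kept as given

    static WexOptions parse(String[] args) {
        WexOptions o = new WexOptions();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-t":
                    o.traceOnly = true;
                    break;
                case "-l":
                    o.saverThreads = Integer.parseInt(value(args, ++i, "-l"));
                    break;
                case "-out":
                    o.outDir = value(args, ++i, "-out");
                    break;
                default:
                    o.rest.add(args[i]);
            }
        }
        return o;
    }

    // Returns the value following an option, or fails if it is missing.
    private static String value(String[] args, int i, String option) {
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing value for " + option);
        }
        return args[i];
    }
}
```

Calling WexOptions.parse with no arguments yields the defaults: saving enabled, 7 additional threads, output under ./wextmp.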

!!! Important !!! : To run Wex this way, the 'html', 'wex' and 'dtds' directories must be in the same directory. Otherwise you should modify your CLASSPATH environment variable to point to the classes of the 'html' and 'wex' packages. In any case, the 'dtds' and 'html' directories must be subdirectories of the same directory.

Bugs and limitations
--------------------

Some HTTPD servers output HTML with URLs that point to directories but have no trailing "/", which does not follow the HTML spec. There seems to be no "clean" way to determine whether such a URL refers to a file or to a directory. Wex assumes that names without a trailing "/" refer to files, not directories.
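
The assumption can be written down as a one-line test on the URL path. The sketch below is illustrative only: the class name LocationKind and the method isDirectory are hypothetical, and treating a URL with an empty path (the bare site root) as a directory is an extra assumption not stated above.

```java
import java.net.URI;

// Hypothetical helper: classifies a URL by the trailing "/" rule only.
class LocationKind {
    static boolean isDirectory(String url) {
        String path = URI.create(url).getPath();
        // Assumption: an empty path means the site root, i.e. a directory.
        if (path == null || path.isEmpty()) {
            return true;
        }
        // No trailing "/" means the name is taken to be a file.
        return path.endsWith("/");
    }
}
```

Under this rule http://example.com/docs/ is a directory, while http://example.com/docs is treated as a file even if the server would actually serve a directory listing for it.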
The HTML Parser prototype prints a lot of error messages to standard output, because it parses HTML according to the HTML 2.0 standard and most existing HTML does not conform to it. Still, it does its job for Wex just fine. For details about the HTML parser, see Arthur van Hoff's readme file in the Sun directory.

Downloading
-----------
The full Wex source code, as well as a new version of the HTML parser library
that Wex uses, can be found at
<a href="http://aldan.paragraph.com/JavaApp/JavaApp.htm">JavaLab</a>

Keep in Touch
-------------
I plan to continue Wex development and add some useful features in the next release. Your comments and ideas are always welcome!

Dmitri Kondratiev
dima@paragraph.com
http://aldan.paragraph.com/

