| demerzal v0.17 - releasing what matters |
| Software |
NetCrawler.py - a script I wrote while messing around, trying to show the merits of developing something in Python as opposed to Java.
What does it do? It is a command line app that runs on either Windows or Linux / Unix. It takes two parameters: a URL and a directory. The goal for this project
was to download an HTML page and then parse it for all of the relevant content. Once the page is downloaded, the script stores the contents of the document in
a structured XML document. It does not download images or the like; other tools already do that. What I was going for was something that
could serve as the front end to some kind of web crawler: get the page, parse out the relevant content, and then store it in a format that any application can read easily.
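
To make the parse-and-store step concrete, here is a minimal sketch in present-day Python. It is not the NetCrawler.py source, just one way the idea could look. The element names (page, title, links, link, text) and the names PageParser and page_to_xml are made up for illustration, since the script has no formal DTD. The sketch takes the HTML as a string and leaves the actual download out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class PageParser(HTMLParser):
    """Collects the title, visible text and link targets of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text = []
        self._in_title = False
        self._skip = 0  # depth inside <script> / <style>, whose data is not page text

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())


def page_to_xml(url, html):
    """Return an XML string describing one page (hypothetical layout)."""
    parser = PageParser()
    parser.feed(html)
    parser.close()

    page = ET.Element("page", {"url": url})
    ET.SubElement(page, "title").text = parser.title.strip()
    links = ET.SubElement(page, "links")
    for href in parser.links:
        ET.SubElement(links, "link").text = href
    ET.SubElement(page, "text").text = " ".join(parser.text)
    return ET.tostring(page, encoding="unicode")
```

Calling page_to_xml("http://example.com/", some_html) returns a single XML string that could be written to disk as-is; any other program can then read the page's title, links and text without touching the HTML again.
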
I do not have a formal DTD and the code documentation is pretty bad, but feel free to take it and use it as you will. By the way, it will also try to download all of the relative links
within the domain. Example:
Bash-2.01$ ./NetCrawler.py http://www.persiankitty.com /home/pornaddict
The above command will try to download and process all of the HTML files on said domain (a very good one to test with). It will create a directory named www.persiankitty.com
in /home/pornaddict and then proceed to create XML documents representing the content of each page. When it is done, the app also prints out all of the NEW URLs that it finds
while processing each page. Look through the code to see how it works; the design is fairly modular, so you can lift whatever code you need. Some things that I have used it for?
Can't tell ya that!
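
For a feel of the link handling described above, here is a rough sketch, again not taken from NetCrawler.py. It assumes that the "new" URLs are the ones pointing outside the starting host, which is only a guess at what the script reports. The function names split_links and xml_path and the file naming scheme are invented for the example; the only detail taken from the description is that output goes into a directory named after the host.

```python
import os
from urllib.parse import urljoin, urlparse, urldefrag


def split_links(page_url, hrefs):
    """Resolve hrefs against page_url.

    Returns (same_host, other_host): two lists of absolute URLs,
    de-duplicated, with #fragments dropped and non-HTTP schemes ignored.
    """
    host = urlparse(page_url).netloc
    same, other = [], []
    for href in hrefs:
        absolute = urldefrag(urljoin(page_url, href)).url
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, ftp: and friends
        bucket = same if parts.netloc == host else other
        if absolute not in bucket:
            bucket.append(absolute)
    return same, other


def xml_path(out_dir, page_url):
    """Pick an output file under out_dir/<host>/ (assumed naming scheme)."""
    parts = urlparse(page_url)
    name = parts.path.strip("/").replace("/", "_") or "index"
    return os.path.join(out_dir, parts.netloc, name + ".xml")
```

In a crawl loop, the first list would be queued for download and the second would be printed at the end; xml_path("/home/user", "http://www.example.com/a/b.html") puts that page's XML under /home/user/www.example.com/.
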
[download] [Bugs] [BSD License]
| Copyright 1999-2001 Demerzal All Rights Reserved |