Larry L's Blog
all about facebook from a technical standpoint
Entry for August 26, 2007-supplemental +

A.3. Beautiful Soup


Beautiful Soup is a Python parser for HTML and XML documents. It is designed to work with poorly written web pages. It is used in this book to create datasets from web sites that do not have APIs, and to find all the text on pages for indexing. The home page for this library is http://www.crummy.com/software/BeautifulSoup.
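To make "finding all the text on a page for indexing" concrete, here is a minimal sketch. It assumes the later Beautiful Soup 4 package (imported as `bs4`) running on Python 3, which is not the version described in this entry. The sample markup and the `page_words` helper are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the single-file module

# Deliberately sloppy markup: the <p> and <b> tags are never closed.
MESSY_HTML = "<h1>Demo</h1><p>First <b>bold paragraph<p>Second paragraph"

def page_words(html):
    """Return the lowercase words found in the text of an HTML page."""
    # "html.parser" selects Python's built-in parser, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")
    # get_text() gathers every text node; the separator keeps adjacent nodes apart.
    text = soup.get_text(separator=" ")
    return text.lower().split()

words = page_words(MESSY_HTML)
print(words)
```

The unclosed tags do not stop the parse, which is the property that matters when scraping pages you do not control. A real indexer would probably also strip punctuation and drop very common words.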



A.3.1. Installation on All Platforms


Beautiful Soup is available as a single-file source download. Near the bottom of the home page, there is a link to download BeautifulSoup.py. Download this file and put it in either your working directory or your Python/Lib directory, so that Python can find it when you import it.
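One way to confirm that the file landed somewhere Python can see it is to ask the import system directly. The sketch below assumes Python 3.4 or later (for `importlib.util.find_spec`), and the `find_beautiful_soup` helper is a hypothetical name. It only checks that a module can be located, not that it imports cleanly on your Python version.

```python
import importlib.util

def find_beautiful_soup():
    """Return the name of the first Beautiful Soup module on the import path, or None."""
    # "BeautifulSoup" is the single-file module described above;
    # "bs4" is the package name used by the later Beautiful Soup 4 release.
    for name in ("BeautifulSoup", "bs4"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None

print(find_beautiful_soup())
```

If this prints `None`, the file is in neither the working directory nor any directory on `sys.path`.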



A.3.2. Simple Usage Example


This example parses the HTML of the Google home page, and shows how to extract elements from the DOM and search for links.


>>> from BeautifulSoup import BeautifulSoup
>>> from urllib import urlopen
>>> soup=BeautifulSoup(urlopen('http://google.com'))
>>> soup.head.title
<title>Google</title>
>>> links=soup('a')
>>> len(links)
21
>>> links[0]
<a href="...">iGoogle</a>
>>> links[0].contents[0]
u'iGoogle'
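The session above uses the Python 2 module and a live page, so its output depends on what Google serves that day. As a reproducible variant, here is a sketch of the same steps on an inline HTML string. It assumes the later Beautiful Soup 4 package (`bs4`) on Python 3; the sample page and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the single-file module

# Stand-in for a downloaded page, so the example needs no network access.
SAMPLE_HTML = """
<html><head><title>Example Page</title></head>
<body>
<a href="/first">First link</a>
<a href="/second">Second link</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_tag = soup.head.title          # walk the tree by tag name
links = soup("a")                    # calling the soup is shorthand for soup.find_all("a")
first_text = links[0].contents[0]    # first child of the first link: its text node
hrefs = [a["href"] for a in links]   # attributes are read like dictionary keys

print(title_tag)
print(len(links))
print(first_text)
print(hrefs)
```

To run this against a live page on Python 3, the markup would come from `urllib.request.urlopen(url).read()` rather than `urllib.urlopen`.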


A more extensive set of examples is available at http://www.crummy.com/software/BeautifulSoup/documentation.


2007-08-26 17:56:50 GMT
 