Larry L's Blog

Entry for August 26, 2007-supplemental

A.3. Beautiful Soup

Beautiful Soup is a Python parser for HTML and XML documents. It is designed to work with poorly written web pages. It is used in this book to create datasets from web sites that do not have APIs, and to find all the text on pages for indexing. The home page for this library is http://www.crummy.com/software/BeautifulSoup.

A.3.1. Installation on All Platforms

Beautiful Soup is available as a single-file source download. Near the bottom of the home page, there is a link to download BeautifulSoup.py. Download this file and put it in either your working directory or your Python/Lib directory, so that Python can find it on its import path.
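One way to confirm that the file landed somewhere Python will look is to ask the import system whether it can locate the module. The sketch below is only an illustration and assumes Python 3.4 or later for importlib.util.find_spec; the helper name is_on_import_path is made up for this example. It checks only that a module can be located, not that it imports cleanly under your Python version.

```python
import importlib.util


def is_on_import_path(module_name):
    """Return True if Python can locate a top-level module with this name.

    Hypothetical helper for illustration: it only locates the module,
    it does not import or execute it.
    """
    return importlib.util.find_spec(module_name) is not None


# "BeautifulSoup" is the module name provided by the BeautifulSoup.py file.
if is_on_import_path("BeautifulSoup"):
    print("BeautifulSoup.py is on the import path")
else:
    print("BeautifulSoup.py was not found; check the directory you copied it to")
```

If the second message appears, the file is most likely in a directory that is neither the current working directory nor on sys.path.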

A.3.2. Simple Usage Example

This example parses the HTML of the Google home page, and shows how to extract elements from the DOM and search for links.

 >>> from BeautifulSoup import BeautifulSoup
 >>> from urllib import urlopen
 >>> soup=BeautifulSoup(urlopen('http://google.com'))
 >>> soup.head.title
 <title>Google</title>
 >>> links=soup('a')
 >>> len(links)
 21
 >>> links[0]
 <a href="...">iGoogle</a>
 >>> links[0].contents[0]
 u'iGoogle'
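The session above depends on the live Google home page and on the Python 2 era BeautifulSoup module, so its exact output will differ today. As a rough, offline sketch of the same steps, the following assumes the later Beautiful Soup 4 package (imported as bs4, typically installed with "pip install beautifulsoup4") running on Python 3. The inline HTML string and the variable names are invented for the example.

```python
# Assumption: Beautiful Soup 4 is installed and imported as bs4.
from bs4 import BeautifulSoup

# Deliberately sloppy markup (unclosed <p> and <li> tags), invented for this sketch.
html = """
<html><head><title>Example page</title></head>
<body>
<p>Some links:
<ul>
<li><a href="/first">First link</a>
<li><a href="/second">Second link</a>
</ul>
</body></html>
"""

# "html.parser" selects Python's built-in parser, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.head.title.string          # text inside the <title> element
links = soup("a")                            # calling the soup is shorthand for find_all("a")
first_link_text = links[0].contents[0]       # first child node of the first <a>
hrefs = [a["href"] for a in links]           # href attribute of every link
page_text = soup.get_text(" ", strip=True)   # all text on the page, e.g. for indexing

print(title_text)
print(len(links))
print(first_link_text)
print(hrefs)
print(page_text)
```

Parsing a string instead of a URL keeps the example repeatable. To work on a real page, the fetched HTML would be passed to BeautifulSoup in place of the html string.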

A more extensive set of examples is available at http://www.crummy.com/software/BeautifulSoup/documentation.

2007-08-26 17:56:50 GMT
