Entry for August 27, 2007

Chapter 3. Discovering Groups -umm, going backwards here, but's that's ok...who doesn't love cs?
Chapter 2 discussed ways to find things that are closely related, so, for example, you could find someone who shares your taste in movies. This chapter expands on those ideas and introduces data clustering, a method for discovering and visualizing groups of things, people, or ideas that are all closely related. In this chapter, you'll learn: how to prepare data from a variety of sources; two different clustering algorithms; more on distance metrics; simple graphical visualization code for viewing the generated groups; and finally, a method for projecting very complicated datasets into two dimensions.
-----------------------------------------------------------------------------
3.2.2. Counting the Words in a Feed
Almost all blogs can be read online or via their RSS feeds. An RSS feed is a simple XML document that contains information about the blog and all the entries. The first step in generating word counts for each blog is to parse these feeds. Fortunately, there is an excellent module for doing this called Universal Feed Parser, which you can download from http://www.feedparser.org.
This module makes it easy to get the title, links, and entries from any RSS or Atom feed. The next step is to create a function that will extract all the words from a feed. Create a new file called generatefeedvector.py and insert the following code:
import feedparser
import re
# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
# Parse the feed
d=feedparser.parse(url)
wc={}
# Loop over all the entries
for e in d.entries:
if 'summary' in e: summary=e.summary
else: summary=e.description
# Extract a list of words
words=getwords(e.title+' '+summary)
for word in words:
wc.setdefault(word,0)
wc[word]+=1
return d.feed.title,wc
--------------------------------------------------------------
This is a pain since I can't copy indentation ****
_______________________________________
RSS and Atom feeds always have a title and a list of entries. Each entry usually has either a summary or description tag that contains the actual text of the entries. The getwordcounts function passes this summary to getwords, which strips out all of the HTML and splits the words by nonalphabetical characters, returning them as a list. Add getwords to generatefeedvector.py:
def getwords(html):
# Remove all the HTML tags
txt=re.compile(r'<[^>]+>').sub('',html)
# Split words by all non-alpha characters
words=re.compile(r'[^A-Z^a-z]+').split(txt)
# Convert to lowercase
return [word.lower( ) for word in words if word!='']
___________________________________________________
apcount={}
wordcounts={}
for feedurl in file('feedlist.txt'):
title,wc=getwordcounts(feedurl)
wordcounts[title]=wc
for word,count in wc.items( ):
apcount.setdefault(word,0)
if count>1:
apcount[word]+=1
______________________________________________
is this helping any?
______________________________________________
The final step is to use the list of words and the list of blogs to create a text file containing a big matrix of all the word counts for each of the blogs:
out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items( ):
out.write(blog)
for word in wordlist:
if word in wc: out.write('\t%d' % wc[word])
else: out.write('\t0')
out.write('\n')
To generate the word count file, run generatefeedvector.py from the command line:
c:\code\blogcluster>python generatefeedvector.py
___________________________________________________
An alternative method of clustering is K-means clustering. This type of algorithm is quite different from hierarchical clustering because it is told in advance how many distinct clusters to generate. The algorithm will determine the size of the clusters based on the structure of the data.
K-means clustering begins with k randomly placed centroids (points in space that represent the center of the cluster), and assigns every item to the nearest one. After the assignment, the centroids are moved to the average location of all the nodes assigned to them, and the assignments are redone. This process repeats until the assignments stop changing. Figure 3-5 shows this process in action for five items and two clusters.
import random
def kcluster(rows,distance=pearson,k=4):
# Determine the minimum and maximum values for each point
ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
for i in range(len(rows[0]))]
# Create k randomly placed centroids
clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0]
for i in range(len(rows[0]))] for j in range(k)]
lastmatches=None
for t in range(100):
print 'Iteration %d' % t
bestmatches=[[] for i in range(k)]
# Find which centroid is the closest for each row
for j in range(len(rows)):
row=rows[j]
bestmatch=0
for i in range(k):
d=distance(clusters[i],row)
if d
bestmatches[bestmatch].append(j)
# If the results are the same as last time, this is complete
if bestmatches==lastmatches: break
lastmatches=bestmatches
# Move the centroids to the average of their members
f or i in range(k):
avgs=[0.0]*len(rows[0])
if len(bestmatches[i])>0:
for rowid in bestmatches[i]:
for m in range(len(rows[rowid])):
avgs[m]+=rows[rowid][m]
for j in range(len(avgs)):
avgs[j]/=len(bestmatches[i])
clusters[i]=avgs
return bestmatches
___________________________________________________
>> reload(clusters)
>> kclust=clusters.kcluster(data,k=10)
Iteration 0
...
>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman',
'ProBlogger Blog Tips', "Seth's Blog"]
>> [rownames[r] for r in k[1]]
etc..
kclust now contains an list of IDs for each cluster. Try the clustering with different values of k and see how it affects the results.
____________________________________________________
mind you, i'm NOT a cs major!
____________________________________________________
he clustering algorithms in this chapter have been demonstrated using a stylized visualization of data in two dimensions, with the difference between the various items indicated by how far apart they are in the diagram. Since most real-life examples of items you would want to cluster have more than two numbers, you can't just take the data as-is and plot it in two dimensions. However, to understand the relationship between the various items, it would be very useful to see them charted on a page with closer distances indicating similarity.
This section will introduce a technique called multidimensional scaling, which will be used to find a two-dimensional representation of the dataset. The algorithm takes the difference between every pair of items and tries to make a chart in which the distances between the items match those differences. To do this, the algorithm first calculates the target distances between all the items. In the blog dataset, Pearson correlation was used to compare the items. An example of this is shown in Table 3-2.
Table 3-2. Sample distance matrix A B C D
A 0.0 0.2 0.8 0.7
B 0.2 0.0 0.9 0.8
C 0.8 0.9 0.0 0.1
D 0.7 0.8 0.1 0.0
Next, all the items (blogs, in this case) are placed randomly on the two-dimensional chart, as shown in Figure 3-7.
sorry
The current distances between all the items are calculated using the actual distance (the sum of the differences of the squares), as shown in Figure 3-8.
sorry
Every node is moved according to the combination of all the other nodes pushing and pulling on it. Each time this happens, the difference between the current distances and the target distances gets a bit smaller. This procedure is repeated many times until the total amount of error cannot be reduced by moving the items any more.
The function for doing this takes the data vector and returns one with only two columns, the X and Y coordinates of the items on the two-dimensional chart. Add this function to clusters.py:
[a bit long-skip]
To run this algorithm, call scaledown to get the two-dimensional dataset and then call draw2d to plot it:
>> reload(clusters)
>> blognames,words,data=clusters.readfile('blogdata.txt')
>> coords=clusters.scaledown(data)
...
>> clusters.draw2d(coords,blognames,jpeg='blogs2d.jpg')
________________________________________________________
enough clusters?.......................................ok! here's some facebook stuff..
__________________________________________________________
.9. Matching on Facebook
Facebook is a popular social networking site that was originally for college students but eventually opened up to a larger audience. Like other social networking sites, it allows users to make profiles, enter demographic information about themselves, and connect to their friends on the site. Facebook also includes an API that lets you query information about people and find out if two people are friends or not. By doing this, you can build a set similar to the matchmaker dataset using real people.
As of this writing, Facebook has remained very committed to privacy, so you can only view the profiles of people who are your friends. The API applies the same rules, requiring a user to log in and only allowing queries, so unfortunately, you'll only be able to work through this section if you have a Facebook account and have connected to at least 20 people.
________________________________________________________
To start, create a new file called facebook.py, import some modules you'll need, and set up some constants:
import urllib,md5,webbrowser,time
from xml.dom.minidom import parseString
apikey="Your API Key"
secret="Your Secret Key"
FacebookSecureURL = "https://api.facebook.com/restserver.php"
There are two convenience methods to add: getsinglevalue gets the next value from inside a named node, and callid returns a number based on the system time.
def getsinglevalue(node,tag):
nl=node.getElementsByTagName(tag)
if len(nl)>0:
tagNode=nl[0]
if tagNode.hasChildNodes( ):
return tagNode.firstChild.nodeValue
return ''
def callid( ):
return str(int(time.time( )*10))
Some of the Facebook calls require that you send a sequence number, which can be any number as long as it is higher than the last sequence number you used. Using the system time is an easy way to get consistently higher numbers.
______________________________________________________
9.9.2. Creating a Session
The procedure for creating a session on Facebook is actually intended for you to create an application for other people to use without ever learning their login information. This is accomplished in several steps:
Use the Facebook API to request a token.
Send the user a URL to the Facebook login page, with the token in the URL.
Wait until the user has logged in.
Request a session from the Facebook API using the token.
Because several variables are used by all the calls, it's better to wrap the Facebook functionality in a class. Create a new class in facebook.py called fbsession and add an __init__ method that carries out the steps listed above:
class fbsession:
def __init_ _(self):
self.session_secret=None
self.session_key=None
self.token=self.createtoken( )
webbrowser.open(self.getlogin( ))
print "Press enter after logging in:",
raw_input( )
self.getsession( )
---------------------------------------------
There are several methods used by __init__ that have to be added to the class for it to work. The first thing you'll need is a way to send requests to the Facebook API. The sendrequest method opens a connection to Facebook and posts the arguments. An XML file is returned and parsed using the minidom parser. Add this method to your class:
def sendrequest(self, args):
args['api_key'] = apikey
args['sig'] = self.makehash(args)
post_data = urllib.urlencode(args)
url = FacebookURL + "?" + post_data
data=urllib.urlopen(url).read( )
return parseString(data)
The line shown in bold generates the signature of the request. This is accomplished with the makehash method, which joins all the arguments together in a string and hashes them with the secret key. You'll see shortly that the secret key changes once you get a session, so the method checks to see if you already have a session secret key. Add makehash to your class:
def makehash(self,args):
hasher = md5.new(''.join([x + '=' + args[x] for x in sorted(args.keys( ))]))
if self.session_secret: hasher.update(self.session_secret)
else: hasher.update(secret)
return hasher.hexdigest( )
_________________________________________________________
Now you're ready to write some actual Facebook API calls. we'll do this later. bye