Larry L's Blog

Entry for August 27, 2007

Chapter 3. Discovering Groups -umm, going backwards here, but's that's ok...who doesn't love cs?

Chapter 2 discussed ways to find things that are closely related, so, for example, you could find someone who shares your taste in movies. This chapter expands on those ideas and introduces data clustering, a method for discovering and visualizing groups of things, people, or ideas that are all closely related. In this chapter, you'll learn: how to prepare data from a variety of sources; two different clustering algorithms; more on distance metrics; simple graphical visualization code for viewing the generated groups; and finally, a method for projecting very complicated datasets into two dimensions.

-----------------------------------------------------------------------------

3.2.2. Counting the Words in a Feed

Almost all blogs can be read online or via their RSS feeds. An RSS feed is a simple XML document that contains information about the blog and all the entries. The first step in generating word counts for each blog is to parse these feeds. Fortunately, there is an excellent module for doing this called Universal Feed Parser, which you can download from http://www.feedparser.org.

This module makes it easy to get the title, links, and entries from any RSS or Atom feed. The next step is to create a function that will extract all the words from a feed. Create a new file called generatefeedvector.py and insert the following code:

import feedparser

import re

# Returns title and dictionary of word counts for an RSS feed

def getwordcounts(url):

# Parse the feed

d=feedparser.parse(url)

wc={}

# Loop over all the entries

for e in d.entries:

if 'summary' in e: summary=e.summary

else: summary=e.description

# Extract a list of words

words=getwords(e.title+' '+summary)

for word in words:

wc.setdefault(word,0)

wc[word]+=1

return d.feed.title,wc

--------------------------------------------------------------

This is a pain since I can't copy indentation ****

_______________________________________

RSS and Atom feeds always have a title and a list of entries. Each entry usually has either a summary or description tag that contains the actual text of the entries. The getwordcounts function passes this summary to getwords, which strips out all of the HTML and splits the words by nonalphabetical characters, returning them as a list. Add getwords to generatefeedvector.py:

def getwords(html):

# Remove all the HTML tags

txt=re.compile(r'<[^>]+>').sub('',html)

# Split words by all non-alpha characters

words=re.compile(r'[^A-Z^a-z]+').split(txt)

# Convert to lowercase

return [word.lower( ) for word in words if word!='']

___________________________________________________

apcount={}

wordcounts={}

for feedurl in file('feedlist.txt'):

title,wc=getwordcounts(feedurl)

wordcounts[title]=wc

for word,count in wc.items( ):

apcount.setdefault(word,0)

if count>1:

apcount[word]+=1

______________________________________________

is this helping any?

______________________________________________

The final step is to use the list of words and the list of blogs to create a text file containing a big matrix of all the word counts for each of the blogs:

out=file('blogdata.txt','w')

out.write('Blog')

for word in wordlist: out.write('\t%s' % word)

out.write('\n')

for blog,wc in wordcounts.items( ):

out.write(blog)

for word in wordlist:

if word in wc: out.write('\t%d' % wc[word])

else: out.write('\t0')

out.write('\n')

To generate the word count file, run generatefeedvector.py from the command line:

c:\code\blogcluster>python generatefeedvector.py

___________________________________________________

An alternative method of clustering is K-means clustering. This type of algorithm is quite different from hierarchical clustering because it is told in advance how many distinct clusters to generate. The algorithm will determine the size of the clusters based on the structure of the data.

K-means clustering begins with k randomly placed centroids (points in space that represent the center of the cluster), and assigns every item to the nearest one. After the assignment, the centroids are moved to the average location of all the nodes assigned to them, and the assignments are redone. This process repeats until the assignments stop changing. Figure 3-5 shows this process in action for five items and two clusters.

import random

def kcluster(rows,distance=pearson,k=4):

# Determine the minimum and maximum values for each point

ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))

for i in range(len(rows[0]))]

# Create k randomly placed centroids

clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0]

for i in range(len(rows[0]))] for j in range(k)]

lastmatches=None

for t in range(100):

print 'Iteration %d' % t

bestmatches=[[] for i in range(k)]

# Find which centroid is the closest for each row

for j in range(len(rows)):

row=rows[j]

bestmatch=0

for i in range(k):

d=distance(clusters[i],row)

if d

bestmatches[bestmatch].append(j)

# If the results are the same as last time, this is complete

if bestmatches==lastmatches: break

lastmatches=bestmatches

# Move the centroids to the average of their members

f or i in range(k):

avgs=[0.0]*len(rows[0])

if len(bestmatches[i])>0:

for rowid in bestmatches[i]:

for m in range(len(rows[rowid])):

avgs[m]+=rows[rowid][m]

for j in range(len(avgs)):

avgs[j]/=len(bestmatches[i])

clusters[i]=avgs

return bestmatches

___________________________________________________

>> reload(clusters)

>> kclust=clusters.kcluster(data,k=10)

Iteration 0

...

>> [rownames[r] for r in k[0]]

['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman',

'ProBlogger Blog Tips', "Seth's Blog"]

>> [rownames[r] for r in k[1]]

etc..

kclust now contains an list of IDs for each cluster. Try the clustering with different values of k and see how it affects the results.

____________________________________________________

mind you, i'm NOT a cs major!

____________________________________________________

he clustering algorithms in this chapter have been demonstrated using a stylized visualization of data in two dimensions, with the difference between the various items indicated by how far apart they are in the diagram. Since most real-life examples of items you would want to cluster have more than two numbers, you can't just take the data as-is and plot it in two dimensions. However, to understand the relationship between the various items, it would be very useful to see them charted on a page with closer distances indicating similarity.

This section will introduce a technique called multidimensional scaling, which will be used to find a two-dimensional representation of the dataset. The algorithm takes the difference between every pair of items and tries to make a chart in which the distances between the items match those differences. To do this, the algorithm first calculates the target distances between all the items. In the blog dataset, Pearson correlation was used to compare the items. An example of this is shown in Table 3-2.

Table 3-2. Sample distance matrix A B C D

A 0.0 0.2 0.8 0.7

B 0.2 0.0 0.9 0.8

C 0.8 0.9 0.0 0.1

D 0.7 0.8 0.1 0.0

Next, all the items (blogs, in this case) are placed randomly on the two-dimensional chart, as shown in Figure 3-7.

sorry

The current distances between all the items are calculated using the actual distance (the sum of the differences of the squares), as shown in Figure 3-8.

sorry

Every node is moved according to the combination of all the other nodes pushing and pulling on it. Each time this happens, the difference between the current distances and the target distances gets a bit smaller. This procedure is repeated many times until the total amount of error cannot be reduced by moving the items any more.

The function for doing this takes the data vector and returns one with only two columns, the X and Y coordinates of the items on the two-dimensional chart. Add this function to clusters.py:

[a bit long-skip]

To run this algorithm, call scaledown to get the two-dimensional dataset and then call draw2d to plot it:

>> reload(clusters)

>> blognames,words,data=clusters.readfile('blogdata.txt')

>> coords=clusters.scaledown(data)

...

>> clusters.draw2d(coords,blognames,jpeg='blogs2d.jpg')

________________________________________________________

enough clusters?.......................................ok! here's some facebook stuff..

__________________________________________________________

.9. Matching on Facebook

Facebook is a popular social networking site that was originally for college students but eventually opened up to a larger audience. Like other social networking sites, it allows users to make profiles, enter demographic information about themselves, and connect to their friends on the site. Facebook also includes an API that lets you query information about people and find out if two people are friends or not. By doing this, you can build a set similar to the matchmaker dataset using real people.

As of this writing, Facebook has remained very committed to privacy, so you can only view the profiles of people who are your friends. The API applies the same rules, requiring a user to log in and only allowing queries, so unfortunately, you'll only be able to work through this section if you have a Facebook account and have connected to at least 20 people.

________________________________________________________

To start, create a new file called facebook.py, import some modules you'll need, and set up some constants:

import urllib,md5,webbrowser,time

from xml.dom.minidom import parseString

apikey="Your API Key"

secret="Your Secret Key"

FacebookSecureURL = "https://api.facebook.com/restserver.php"

There are two convenience methods to add: getsinglevalue gets the next value from inside a named node, and callid returns a number based on the system time.

def getsinglevalue(node,tag):

nl=node.getElementsByTagName(tag)

if len(nl)>0:

tagNode=nl[0]

if tagNode.hasChildNodes( ):

return tagNode.firstChild.nodeValue

return ''

def callid( ):

return str(int(time.time( )*10))

Some of the Facebook calls require that you send a sequence number, which can be any number as long as it is higher than the last sequence number you used. Using the system time is an easy way to get consistently higher numbers.

______________________________________________________

9.9.2. Creating a Session

The procedure for creating a session on Facebook is actually intended for you to create an application for other people to use without ever learning their login information. This is accomplished in several steps:

Use the Facebook API to request a token.

Send the user a URL to the Facebook login page, with the token in the URL.

Wait until the user has logged in.

Request a session from the Facebook API using the token.

Because several variables are used by all the calls, it's better to wrap the Facebook functionality in a class. Create a new class in facebook.py called fbsession and add an __init__ method that carries out the steps listed above:

class fbsession:

def __init_ _(self):

self.session_secret=None

self.session_key=None

self.token=self.createtoken( )

webbrowser.open(self.getlogin( ))

print "Press enter after logging in:",

raw_input( )

self.getsession( )

---------------------------------------------

There are several methods used by __init__ that have to be added to the class for it to work. The first thing you'll need is a way to send requests to the Facebook API. The sendrequest method opens a connection to Facebook and posts the arguments. An XML file is returned and parsed using the minidom parser. Add this method to your class:

def sendrequest(self, args):

args['api_key'] = apikey

args['sig'] = self.makehash(args)

post_data = urllib.urlencode(args)

url = FacebookURL + "?" + post_data

data=urllib.urlopen(url).read( )

return parseString(data)

The line shown in bold generates the signature of the request. This is accomplished with the makehash method, which joins all the arguments together in a string and hashes them with the secret key. You'll see shortly that the secret key changes once you get a session, so the method checks to see if you already have a session secret key. Add makehash to your class:

def makehash(self,args):

hasher = md5.new(''.join([x + '=' + args[x] for x in sorted(args.keys( ))]))

if self.session_secret: hasher.update(self.session_secret)

else: hasher.update(secret)

return hasher.hexdigest( )

_________________________________________________________

Now you're ready to write some actual Facebook API calls. we'll do this later. bye

2007-08-27 05:35:13 GMT

Site Home