Larry L's Blog

Entry for August 26, 2007-supplemental

2.6.1. The del.icio.us API

Data from del.icio.us is made available through an API that returns data in XML format. To make things even easier for you, there is a Python API that you can download from http://code.google.com/p/pydelicious/source or http://oreilly.com/catalog/9780596529321.

To work through the example in this section, you'll need to download the latest version of this library and put it in your Python library path. (See Appendix A for more information on installing this library.)

This library has several simple calls to get links that people have submitted. For example, to get a list of recent popular posts about programming, you can use the get_popular call:

>>> import pydelicious
>>> pydelicious.get_popular(tag='programming')
[{'count': '', 'extended': '', 'hash': '', 'description': u'How To Write
Unmaintainable Code', 'tags': '', 'href': u'http://thc.segfault.net/root/phun/
unmaintain.html', 'user': u'dorsia', 'dt': u'2006-08-19T09:48:56Z'}, {'count': '',
'extended': '', 'hash': '', 'description': u'Threading in C#', 'tags': '', 'href':
u'http://www.albahari.com/threading/', 'user': u'mmihale', 'dt': u'2006-05-17T18:
09:24Z'},
...etc...

You can see that it returns a list of dictionaries, each one containing a URL, a description, and the user who posted it. Since you are working from live data, your results will look different from the examples. There are two other calls you'll be using: get_urlposts, which returns all the posts for a given URL, and get_userposts, which returns all the posts for a given user. The data for these calls is returned in the same way.
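Because the result is an ordinary list of dictionaries, it can be processed with plain Python. The following is a minimal sketch (written for Python 3, so without the u'' prefixes) that groups links by the user who posted them. The records are copied from the sample output above and abridged; links_by_user is a hypothetical helper name, not part of pydelicious.

```python
# Sample records copied from the output shown above, abridged to the
# fields this sketch uses. Live results will differ.
posts = [
    {'description': 'How To Write Unmaintainable Code',
     'href': 'http://thc.segfault.net/root/phun/unmaintain.html',
     'user': 'dorsia', 'dt': '2006-08-19T09:48:56Z'},
    {'description': 'Threading in C#',
     'href': 'http://www.albahari.com/threading/',
     'user': 'mmihale', 'dt': '2006-05-17T18:09:24Z'},
]

def links_by_user(posts):
    """Group posted URLs by the user who submitted them (hypothetical helper)."""
    result = {}
    for post in posts:
        result.setdefault(post['user'], []).append(post['href'])
    return result

by_user = links_by_user(posts)
for user, hrefs in sorted(by_user.items()):
    print(user, hrefs)
```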

-----------------------------------------------------------------------------

Audioscrobbler. Take a look at http://www.audioscrobbler.net, a dataset containing music preferences for a large set of users. Use their web services API to get a set of data for building a music recommendation system.

------------------------------------------------------------------------------

This section will show you how to cluster the blogs dataset to generate a hierarchy of blogs, which, if successful, will group them thematically. First, you'll need a method to load in the data file. Create a file called clusters.py and add this function to it:

def readfile(filename):
  # open() works in both Python 2 and 3; file() exists only in Python 2
  lines=[line for line in open(filename)]

  # First line is the column titles
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]
  for line in lines[1:]:
    p=line.strip().split('\t')
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])
  return rownames,colnames,data

This function reads the top row into the list of column names and reads the leftmost column into a list of row names, then puts all the data into a big list where every item in the list is the data for that row. The count for any cell can be referenced by its row and column in data, which also corresponds to the indices of the rownames and colnames lists.
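To make the row and column indexing concrete, here is a small sketch that runs the same parsing logic on a made-up two-blog table. It is written for Python 3, and the sample blog names, words, and counts are invented for illustration only.

```python
import os
import tempfile

def readfile(filename):
    # Same logic as the readfile function above, written for Python 3
    with open(filename) as f:
        lines = [line for line in f]
    colnames = lines[0].strip().split('\t')[1:]
    rownames, data = [], []
    for line in lines[1:]:
        p = line.strip().split('\t')
        rownames.append(p[0])
        data.append([float(x) for x in p[1:]])
    return rownames, colnames, data

# Made-up sample: two blogs, three words, tab-separated
sample = "Blog\tchina\tkids\tmusic\nBlog A\t0\t3\t3\nBlog B\t2\t0\t1\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

rownames, colnames, data = readfile(path)
os.remove(path)

# data[i][j] is the count of word colnames[j] in blog rownames[i]
i, j = rownames.index('Blog B'), colnames.index('music')
print(rownames[i], colnames[j], data[i][j])   # prints: Blog B music 1.0
```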

The next step is to define closeness. We discussed this in Chapter 2, using Euclidean distance and Pearson correlation as examples of ways to determine how similar two movie critics are. In the present example, some blogs contain more entries or much longer entries than others, and will thus contain more words overall. The Pearson correlation will correct for this, since it really tries to determine how well two sets of data fit onto a straight line. The Pearson correlation code for this module will take two lists of numbers and return their correlation score:

from math import sqrt
def pearson(v1,v2):
  # Simple sums
  sum1=sum(v1)
  sum2=sum(v2)

  # Sums of the squares
  sum1Sq=sum([pow(v,2) for v in v1])
  sum2Sq=sum([pow(v,2) for v in v2])

  # Sum of the products
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/len(v1))
  den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
  if den==0: return 0

  return 1.0-num/den

Remember that the Pearson correlation is 1.0 when two items match perfectly, and is close to 0.0 when there's no relationship at all. The final line of the code returns 1.0 minus the Pearson correlation to create a smaller distance between items that are more similar.
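A quick check shows why this measure suits blogs of different lengths. The sketch below restates the same formula in Python 3 under the assumed name pearson_distance and applies it to made-up word counts: a blog with the same proportions but ten times the words ends up at distance 0, while a reversed pattern ends up at distance 2.

```python
from math import sqrt

def pearson_distance(v1, v2):
    # 1.0 minus the Pearson correlation, following the pearson function above
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(v * v for v in v1)
    sum2_sq = sum(v * v for v in v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))
    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return 1.0 - num / den

# Made-up word counts for illustration
short_blog = [1.0, 2.0, 3.0, 4.0]
long_blog = [10.0, 20.0, 30.0, 40.0]   # same proportions, ten times the words
opposite = [4.0, 3.0, 2.0, 1.0]        # reversed usage pattern

d_same = pearson_distance(short_blog, long_blog)
d_opp = pearson_distance(short_blog, opposite)
print(d_same, d_opp)
```

So the distance ranges from 0 (perfectly correlated) up to 2 (perfectly anti-correlated), with values near 1 meaning no relationship.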

Each cluster in a hierarchical clustering algorithm is either a point in the tree with two branches, or an endpoint associated with an actual row from the dataset (in this case, a blog). Each cluster also contains data about its location, which is either the row data for the endpoints or the merged data from its two branches for other node types. You can create a class called bicluster that has all of these properties, which you'll use to represent the hierarchical tree. Create the cluster type as a class in clusters.py:

class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance

The algorithm for hierarchical clustering begins by creating a group of clusters that are just the original items.
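As a rough sketch of how such an algorithm might proceed (an assumption on my part, not necessarily the hcluster function called later in this entry): start with one cluster per row, repeatedly find the closest pair, and merge it into a new cluster whose vector is the average of the two. Merged clusters get negative ids here so that branches can be told apart from endpoints, and pairwise distances are cached so each pair is computed only once. The code is Python 3 and redefines bicluster and pearson so that it runs on its own; the sample rows are invented.

```python
from math import sqrt

class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.vec = vec
        self.id = id
        self.distance = distance

def pearson(v1, v2):
    # 1.0 minus the Pearson correlation, as in the pearson function above
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(v * v for v in v1)
    sum2_sq = sum(v * v for v in v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))
    num = p_sum - sum1 * sum2 / n
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return 1.0 - num / den

def hcluster(rows, distance=pearson):
    cache = {}      # (id, id) -> distance, so each pair is computed once
    next_id = -1    # assumption: merged clusters get negative ids
    # Start with one cluster per original row
    clust = [bicluster(rows[i], id=i) for i in range(len(rows))]

    while len(clust) > 1:
        closest, best = None, None
        # Find the pair of current clusters with the smallest distance
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                key = (clust[i].id, clust[j].id)
                if key not in cache:
                    cache[key] = distance(clust[i].vec, clust[j].vec)
                if closest is None or cache[key] < closest:
                    closest, best = cache[key], (i, j)
        i, j = best
        # The merged cluster's vector is the average of its two branches
        merged_vec = [(a + b) / 2.0
                      for a, b in zip(clust[i].vec, clust[j].vec)]
        merged = bicluster(merged_vec, left=clust[i], right=clust[j],
                           distance=closest, id=next_id)
        next_id -= 1
        # Delete j first because j > i, then add the merged cluster
        del clust[j]
        del clust[i]
        clust.append(merged)
    return clust[0]

# Made-up rows: the first two are almost perfectly correlated
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.1], [3.0, 1.0, 0.0]]
tree = hcluster(rows)
```

With these rows, the first two are merged first and the third joins at the root. Every pass still scans all pairs, so this sketch is only practical for small datasets.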

------------------------------------------------------------------

To run the hierarchical clustering, start up a Python session, load in the file, and call hcluster on the data:

$ python
>>> import clusters
>>> blognames,words,data=clusters.readfile('blogdata.txt')
>>> clust=clusters.hcluster(data)

This may take a few minutes to run. Storing the distances increases the speed significantly, but it's still necessary for the algorithm to calculate the correlation between every pair of blogs. This process can be made faster by using an external library to calculate the distances.

To view your results, you can create a simple function that traverses the clustering tree recursively and prints it like a filesystem hierarchy. Add the function printclust to clusters.py:

def printclust(clust,labels=None,n=0):
  # indent to make a hierarchy layout
  for i in range(n): print ' ',
  if clust.id<0:
    # negative id means that this is a branch
    print '-'
  else:
    # non-negative id means that this is an endpoint
    if labels==None: print clust.id
    else: print labels[clust.id]

  # now print the left and right branches
  if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
  if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)

The output from this doesn't look very fancy and it's a little hard to read with a large dataset like the blog list, but it does give a good overall sense of whether clustering is working. In the next section, we'll look at creating a graphical version that is much easier to read and is drawn to scale to show the overall spread of each cluster.
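To see the shape of that output on a small scale, here is a sketch in Python 3 that walks a hand-built three-leaf tree in the same way. It uses a hypothetical variant called format_clust that returns the indented lines instead of printing them, and the blog labels and vectors are made up.

```python
class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.vec = vec
        self.id = id
        self.distance = distance

def format_clust(clust, labels=None, n=0):
    """Return the tree as indented lines (hypothetical variant of printclust)."""
    indent = '  ' * n
    if clust.id < 0:
        # negative id means that this is a branch
        lines = [indent + '-']
    else:
        # non-negative id means that this is an endpoint
        name = str(clust.id) if labels is None else labels[clust.id]
        lines = [indent + name]
    # recurse into the left branch, then the right branch
    if clust.left is not None:
        lines += format_clust(clust.left, labels, n + 1)
    if clust.right is not None:
        lines += format_clust(clust.right, labels, n + 1)
    return lines

# Hand-built tree: A and B are merged first, then C joins at the root
leaf_a = bicluster([1.0, 0.0], id=0)
leaf_b = bicluster([0.9, 0.1], id=1)
leaf_c = bicluster([0.0, 1.0], id=2)
inner = bicluster([0.95, 0.05], left=leaf_a, right=leaf_b, id=-1)
root = bicluster([0.475, 0.525], left=inner, right=leaf_c, id=-2)

lines = format_clust(root, labels=['Blog A', 'Blog B', 'Blog C'])
print('\n'.join(lines))
```

Each dash marks a merge, and items indented beneath the same dash belong to the same cluster.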

In your Python session, call this function on the clusters you just built:

>>> reload(clusters)
>>> clusters.printclust(clust,labels=blognames)

The output listing will contain all 100 blogs and will thus be quite long. Here's an example of a cluster ...

--------------------------------------------------------------------------

This will generate a file called blogclust.jpg with the dendrogram. The dendrogram should look similar to the one shown in Figure 3-3. If you like, you can change the height and width settings to make it easier to print or less cluttered.

Figure 3-3. Dendrogram showing blog clusters

---------------------------------------------------------------------------------

have fun with your dendos, girls...

2007-08-26 17:26:07 GMT
