Larry L's Blog

Entry for August 26, 2007

The spam-filtering method you'll see in this book is based on the appearance of words or phrases without any regard to what they mean or to sentence structures. Although it's theoretically possible to build an algorithm that would take grammar into account, this is rarely done in practice because the effort required would be disproportionately large compared to the improvement in the algorithm. Understanding the meaning of words or their relevance to a person's life would require far more information than spam filters, in their current incarnation, can access.

---------------------------------------------------------------------------------------

rediction markets are also a form of collective intelligence. One of the most well known of these is the Hollywood Stock Exchange (http://hsx.com ), where people trade stocks on movies and movie stars. You can buy or sell a stock at the current price knowing that its ultimate value will be one millionth of the movie's actual opening box office number. Because the price is set by trading behavior, the value is not chosen by any one individual but by the behavior of the group, and the current price can be seen as the whole group's prediction of box office numbers for the movie. The predictions made by the Hollywood Stock Exchange are routinely better than those made by individual experts.

Some dating sites, such as eHarmony, use

-------------------------------------------------------------------------------------

You've probably come across recommendation engines before when using an online shopping site like Amazon. Amazon tracks the purchasing habits of all its shoppers, and when you log onto the site, it uses this information to suggest products you might like. Amazon can even suggest movies you might like, even if you've only bought books from it before. Some online concert ticket agencies will look at the history of shows you've seen before and alert you to upcoming shows that might be of interest. Sites like http://reddit.com let you vote on links to other web sites and then use your votes to suggest other links you might find interesting.

collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours. It looks at other things they like and combines them to create a ranked list of suggestions. There are several different ways of deciding which people are similar and combining their choices to make a list; this chapter will cover a few of these.

The term collaborative filtering was first used by David Goldberg at Xerox PARC in 1992 in a paper called "Using collaborative filtering to weave an information tapestry." He designed a system called Tapestry that allowed people to annotate documents as either interesting or uninteresting and used this information to filter documents for other people.

There are now hundreds of web sites that employ some sort of collaborative filtering algorithm for movies, music, books, dating, shopping, other web sites, podcasts, articles, and even jokes.

-----------------------------------------------------------------------------------

The first thing you need is a way to represent different people and their preferences. In Python, a very simple way to do this is to use a nested dictionary. If you'd like to work through the example in this section, create a file called recommendations.py, and insert the following code to create the dataset:

Code View: Scroll / Show All

# A dictionary of movie critics and their ratings of a small

# set of movies

critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,

 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,

 'The Night Listener': 3.0},

'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,

 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,

 'You, Me and Dupree': 3.5},

'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,

 'Superman Returns': 3.5, 'The Night Listener': 4.0},

'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,

 'The Night Listener': 4.5, 'Superman Returns': 4.0,

 'You, Me and Dupree': 2.5},

'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,

 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,

 'You, Me and Dupree': 2.0},

'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,

 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},

'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}



---------------------------------------------------

[euclidan chart shown] -oooooooh

To calculate the distance between Toby and LaSalle in the chart, take the difference in each axis, square them and add them together, then take the square root of the sum. In Python, you can use the pow(n,2) function to square a number and take the square root with the sqrt function:


>> from math import sqrt

>> sqrt(pow(5-4,2)+pow(4-1,2))

3.1622776601683795




This formula calculates the distance, which will be smaller for people who are more similar. However, you need a function that gives higher values for people who are similar. This can be done by adding 1 to the function (so you don't get a division-by-zero error) and inverting it:


>> 1/(1+sqrt(pow(5-4,2)+pow(4-1,2)))

0.2402530733520421
------------------------------------------------------
a diffrent correlation method
 
# Returns the Pearson correlation coefficient for p1 and p2

def sim_pearson(prefs,p1,p2):

  # Get the list of mutually rated items

  si={}

  for item in prefs[p1]:

    if item in prefs[p2]: si[item]=1



  # Find the number of elements

  n=len(si)



  # if they are no ratings in common, return 0

  if n==0: return 0



  # Add up all the preferences

  sum1=sum([prefs[p1][it] for it in si])

  sum2=sum([prefs[p2][it] for it in si])



  # Sum up the squares

  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])

  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])



  # Sum up the products

  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])



  # Calculate Pearson score

  num=pSum-(sum1*sum2/n)

  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))

  if den==0: return 0

  r=num/den



  return r



---------------------------------------------------------

2007-08-26 17:00:19 GMT

Site Home