preface
The increasing number of people contributing to the Internet, either deliberately or incidentally, has created a huge set of data that gives us millions of potential insights into user experience, marketing, personal tastes, and human behavior in general. This book provides an introduction to the emerging field of collective intelligence. It covers ways to get hold of interesting datasets from many web sites you've probably heard of, ideas on how to collect data from users of your own applications, and many different ways to analyze and understand the data once you've found it.
This book's goal is to take you beyond simple database-backed applications and teach you how to write smarter programs to take advantage of the information you and others collect every day.
Multiparadigm
Python supports object-oriented, procedural, and functional styles of programming. Machine-learning algorithms vary greatly, and the clearest way to implement one may use a different paradigm than another. Sometimes it's useful to pass around functions as parameters and other times to capture state in an object. Python supports both approaches.
List and dictionary constructors
Python has a good set of primitive types and two that are used heavily throughout this book are list and dictionary. A list is an ordered list of any type of value, and it is constructed with square brackets:
number_list=[1,2,3,4]
string_list=['a', 'b', 'c', 'd']
mixed_list=['a', 3, 'c', 8]
A dictionary is an unordered set of key/value pairs, similar to a hash map in other languages. It is constructed with curly braces:
ages={'John':24,'Sarah':28,'Mike':31}
The elements of lists and dictionaries can be accessed using square brackets after the list name:
string_list[2] # returns 'b'
ages['Sarah'] # returns 28
Significant Whitespace
Unlike most languages, Python actually uses the indentation of the code to define code blocks. Consider this snippet:
if x==1:
print 'x is 1'
print 'Still in if block'
print 'outside if block'
The interpreter knows that the first two print statements are executed when x is 1 because the code is indented. Indentation can be any number of spaces, as long as it is consistent. This book uses two spaces for indentation. When entering the code you'll need to be careful to copy the indentation correctly.
List comprehensions
A list comprehension is a convenient way of converting one list to another by filtering and applying functions to it. A list comprehension is written as:
[expression for variable in list]
or:
[expression for variable in list if condition]
For example, the following code:
l1=[1,2,3,4,5,6,7,8,9]
4]
would print this list:
[50,60,70,80,90]
List comprehensions are used frequently in this book because they are an extremely concise way to apply a function to an entire list or to remove bad items. The other manner in which they are often used is with the dict constructor:
l1=[1,2,3,4,5,6,7,8,9]
timesten=dict([(v,v*10) for v in l1])
This code will create a dictionary with the original list being the keys and each item multiplied by 10 as the value:
{1:10,2:20,3:30,4:40,5:50,6:60,7:70,8:80,9:90}
-------------------------------------------------------------------------------------------------------------------------
pen APIs
The algorithms for synthesizing collective intelligence require data from many users. In addition to machine-learning algorithms, this book discusses a number of Open Web APIs (application programming interfaces). These are ways that companies allow you to freely access data from their web sites by means of a specified protocol; you can then write programs that download and process the data. This data usually consists of contributions from the site's users, which can be mined for new insights. In some cases, there is a Python library available to access these APIs; if not, it's pretty straightforward to create your own interface to access the data using Python's built-in libraries for downloading data and parsing XML.
Here are some of the web sites with open APIs that you'll see in this book:
del.icio.us
A social bookmarking application whose open API lets you download links by tag or from a specific user.
Kayak
A travel site with an API for conducting searches for flights and hotels from within your own programs.
eBay
An online auction site with an API that allows you to query items that are currently for sale.
Hot or Not
A rating and dating site with an API to search for people and get their ratings and demographic information.
Akismet
An API for collaborative spam filtering.
A huge number of potential applications can be built by processing data from a single source, by combining data from multiple sources, and even by combining external information with input from your own users. The ability to harness data created by people in a variety of ways on different sites is a principle element of creating collective intelligence. A good starting point for finding more web sites with open APIs is ProgrammableWeb (http://www.programmableweb.com).
Overview of the Chapters
Every algorithm in the book is motivated by a realistic problem that can, I hope, be easily understood by all readers. I have tried to avoid problems that require a great deal of domain knowledge, and I have focused on problems that, while complex, are easy for most people to relate to.
Chapter 1
Explains the concepts behind machine learning, how it is applied in many different fields, and how it can be used to draw new conclusions from data gathered from many different people.
Chapter 2
Introduces the collaborative filtering techniques used by many online retailers to recommend products or media. The chapter includes a section on recommending links to people from a social bookmarking site, and building a move recommendation system from the MovieLens dataset.
Chapter 3
Builds on some of the ideas in Chapter 2 and introduces two different methods of clustering, which automatically detect groups of similar items in a large dataset. This chapter demonstrates the use of clustering to find groups on a set of popular weblogs and on people's desires from a social networking web site.
Chapter 4
Describes the various parts of a search engine including the crawler, indexer, and query engine. It covers the PageRank algorithm for scoring pages based on inbound links and shows you how to create a neural network that learns which keywords are associated with different results.
Chapter 5
Introduces algorithms for optimization, which are designed to search millions of possible solutions to a problem and choose the best one. The wide variety of uses for these algorithms is demonstrated with examples that find the best flights for a group of people traveling to the same location, find the best way of matching students to dorms, and lay out a network with the minimum number of crossed lines.
Chapter 6
Demonstrates Bayesian filtering, which is used in many free and commercial spam filters for automatically classifying documents based on the type of words and other features that appear in the document. This is applied to a set of RSS search results to demonstrate automatic classification of the entries.
Chapter 7
Introduces decision trees as a method not only of making predictions, but also of modeling the way the decisions are made. The first decision tree is built with hypothetical data from server logs and is used to predict whether or not a user is likely to become a premium subscriber. The other examples use data from real web sites to model real estate prices and "hotness."
Chapter 8
Approaches the problem of predicting numerical values rather than classifications using k-nearest neighbors techniques, and applies the optimization algorithms from Chapter 5. These methods are used in conjunction with the eBay API to build a system for predicting eventual auction prices for items based on a set of properties.
Chapter 9
Shows how support-vector machines can be used to match people in online dating sites or when searching for professional contacts. Support-vector machines are a fairly advanced technique and this chapter compares them to other methods.
Chapter 10
Introduces a relatively new technique called non-negative matrix factorization, which is used to find the independent features in a dataset. In many datasets the items are constructed as a composite of different features that we don't know in advance; the idea here is to detect these features. This technique is demonstrated on a set of news articles, where the stories themselves are used to detect themes, one or more of which may apply to a given story.
Chapter 11
Introduces genetic programming, a very sophisticated set of techniques that goes beyond optimization and actually builds algorithms using evolutionary ideas to solve a particular problem. This is demonstrated by a simple game in which the computer is initially a poor player that improves its skill by improving its own code the more the game is played.
Chapter 12
Reviews all the machine-learning and statistical algorithms described in the book and compares them to a set of artificial problems. This will help you understand how they work and visualize the way that each of them divides data.
Appendix A
Gives information on third-party libraries used in the book, such as where to find them and how to install them.
Appendix B
Contains formulae, descriptions, and code for many of the mathematical concepts introduced throughout the book.
Exercises at the end of each chapter give ideas of ways to extend the algorithms and make them more powerful.
Conventions
The following typographical conventions are used in this book:
Plain text
Indicates menu titles, menu options, menu buttons, and keyboard accelerators (such as Alt and Ctrl).
Italic
Indicates new terms, URLs, email addresses, filenames, file extensions, pathnames, directories, and Unix utilities.
Constant width
Indicates commands, options, switches, variables, attributes, keys, functions, types, classes, namespaces, methods, modules, properties, parameters, values, objects, events, event handlers, XML tags, HTML tags, macros, the contents of files, or the output from commands.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values.
steve