A Plan For Spam
I think it's possible to stop spam, and
that content-based filters are the way to do it. The Achilles heel of the
spammers is their message. They can circumvent any other barrier you set up.
They have so far, at least. But they have to deliver their message, whatever it
is. If we can write software that recognizes their messages, there is no way
they can get around that.
To the recipient, spam is easily recognizable. If you hired someone to read your
mail and discard the spam, they would have little trouble doing it. How much do
we have to do, short of AI, to automate this process?
I think we will be able to solve the problem with fairly simple algorithms. In
fact, I've found that you can filter present-day spam acceptably well using
nothing more than a Bayesian combination of the spam probabilities of individual
words. Using a slightly tweaked (as described below) Bayesian filter, we now
miss less than 5 per 1000 spams, with 0 false positives.
The statistical approach is not usually the first one people try when they write
spam filters. Most hackers' first instinct is to try to write software that
recognizes individual properties of spam. You look at spams and you think, the
gall of these guys to try sending me mail that begins "Dear Friend" or
has a subject line that's all uppercase and ends in eight exclamation points. I
can filter out that stuff with about one line of code.
And so you do, and in the beginning it works. A few simple rules will take a big
bite out of your incoming spam. Merely looking for the word "click"
will catch 79.7% of the emails in my spam corpus, with only 1.2% false
positives.
I spent about six months writing software that looked for individual spam
features before I tried the statistical approach. What I found was that
recognizing that last few percent of spams got very hard, and that as I made the
filters stricter I got more false positives.
False positives are innocent emails that get mistakenly identified as spams. For
most users, missing legitimate email is an order of magnitude worse than
receiving spam, so a filter that yields false positives is like an acne cure
that carries a risk of death to the patient.
The more spam a user gets, the less likely he'll be to notice one innocent mail
sitting in his spam folder. And strangely enough, the better your spam filters
get, the more dangerous false positives become, because when the filters are
really good, users will be more likely to ignore everything they catch.
I don't know why I avoided trying the statistical approach for so long. I think
it was because I got addicted to trying to identify spam features myself, as if
I were playing some kind of competitive game with the spammers. (Nonhackers
don't often realize this, but most hackers are very competitive.) When I did try
statistical analysis, I found immediately that it was much cleverer than I had
been. It discovered, of course, that terms like "virtumundo" and
"teens" were good indicators of spam. But it also discovered that
"per" and "FL" and "ff0000" are good indicators of
spam. In fact, "ff0000" (html for bright red) turns out to be as good
an indicator of spam as any pornographic term.
_ _ _
Here's a sketch of how I do statistical filtering. I start with one corpus of
spam and one of nonspam mail. At the moment each one has about 4000 messages in
it. I scan the entire text, including headers and embedded html and javascript,
of each message in each corpus. I currently consider alphanumeric characters,
dashes, apostrophes, and dollar signs to be part of tokens, and everything else
to be a token separator. (There is probably room for improvement here.) I ignore
tokens that are all digits, and I also ignore html comments, not even
considering them as token separators.
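A minimal Python sketch of the tokenizing rules just described. The function and constant names are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits; case is left untouched here.

```python
import re

# Characters treated as part of a token: letters, digits, dashes,
# apostrophes, and dollar signs. Everything else separates tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

# HTML comments are deleted outright rather than treated as separators,
# so "vi<!-- x -->agra" is read as the single token "viagra".
HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def tokenize(text):
    """Return the tokens of one message, headers and markup included."""
    text = HTML_COMMENT_RE.sub("", text)
    tokens = TOKEN_RE.findall(text)
    # Drop tokens made only of digits.
    return [t for t in tokens if not t.isdigit()]
```

Under these rules "$100", "don't", "opt-in", and "ff0000" each survive as one token, while a bare "2002" is discarded.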
I count the number of times each token (ignoring case, currently) occurs in each
corpus. At this stage I end up with two large hash tables, one for each corpus,
mapping tokens to number of occurrences.
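As a rough illustration, the counting step might look like this in Python, with a `Counter` standing in for each hash table. The function name and the toy messages are made up for the example; each message is assumed to be already split into tokens.

```python
from collections import Counter


def count_tokens(messages):
    """Count token occurrences across a whole corpus.

    `messages` is an iterable of token lists, one list per message.
    The corpus is treated as one long stream, so a token that appears
    three times in a single message adds three to its count.
    """
    counts = Counter()
    for tokens in messages:
        counts.update(t.lower() for t in tokens)  # ignore case
    return counts


# Toy corpora, purely illustrative.
good = count_tokens([["Lisp", "macro", "lisp"], ["though", "tonight"]])
bad = count_tokens([["Click", "click", "FREE"], ["free", "click"]])
```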
Next I create a third hash table, this time mapping each token to the
probability that an email containing it is a spam, which I calculate as follows
[1]:
I explained this as code to show a couple of important details. I want to bias
the probabilities slightly to avoid false positives, and by trial and error I've
found that a good way to do it is to double all the numbers in good.
This helps to distinguish between words that occasionally do occur in legitimate
email and words that almost never do. I only consider words that occur more than
five times in total (actually, because of the doubling, occurring three times in
nonspam mail would be enough). And then there is the question of what
probability to assign to words that occur in one corpus but not the other. Again
by trial and error I chose .01 and .99. There may be room for tuning here, but
as the corpus grows such tuning will happen automatically anyway.
The especially observant will notice that while I consider each corpus to be a
single long stream of text for purposes of counting occurrences, I use the
number of emails in each, rather than their combined length, as the divisor in
calculating spam probabilities. This adds another slight bias to protect against
false positives.
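Taken together, the details above suggest a calculation along the following lines. This is a Python sketch reconstructed from the prose, not the original listing: the function name and the names `ngood` and `nbad` (message counts for each corpus) are assumptions, as are the exact cutoff (read here as "more than five in total") and the capping of each ratio at 1.

```python
def spam_probability(word, good, bad, ngood, nbad):
    """Estimate the probability that a message containing `word` is spam.

    good, bad   -- dicts mapping token -> occurrence count in each corpus
    ngood, nbad -- number of messages (not tokens) in each corpus
    Returns None for words seen too rarely to be trusted.
    """
    g = 2 * good.get(word, 0)  # double the nonspam count (bias against false positives)
    b = bad.get(word, 0)
    if g + b <= 5:             # consider only words seen more than five times in total
        return None
    # Divide by message counts. A word can occur several times per message,
    # so each ratio is capped at 1 (an assumption of this sketch).
    g_ratio = min(1.0, g / ngood)
    b_ratio = min(1.0, b / nbad)
    p = b_ratio / (g_ratio + b_ratio)
    # Words seen in only one corpus end up at the clamps .01 and .99.
    return max(0.01, min(0.99, p))
```

With 100 messages in each corpus, a word seen 20 times in spam and once in nonspam mail comes out near .91, while a word seen 8 times in spam and never in nonspam mail is clamped to .99.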
When new mail arrives, it is scanned into tokens, and the most interesting
fifteen tokens, where interesting is measured by how far their spam probability
is from a neutral .5, are used to calculate the probability that the mail is
spam. If probs is a list of the fifteen individual probabilities, you calculate
the combined probability thus:
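In Python, a sketch of that combination might read as follows. It uses the usual product form for combining independent probabilities under Bayes' Rule; the function names are hypothetical, and `probs` is assumed to contain no exact 0 or 1 (clamping to .01 and .99 guarantees that).

```python
from math import prod


def most_interesting(token_probs, n=15):
    """Pick the n probabilities farthest from a neutral 0.5."""
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combine(probs):
    """Combine individual spam probabilities, treating them as independent."""
    spam = prod(probs)                    # evidence that the mail is spam
    ham = prod(1.0 - p for p in probs)    # evidence that it is not
    return spam / (spam + ham)
```

For instance, two tokens with probabilities .97 and .99 combine to roughly .9997, and a token at exactly .5 leaves the result unchanged.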
There are examples of this algorithm being applied to actual emails in an
appendix at the end.
I treat mail as spam if the algorithm above gives it a probability of more than
.9 of being spam. But in practice it would not matter much where I put this
threshold, because few probabilities end up in the middle of the range.
One great advantage of the statistical approach is that you don't have to read
so many spams. Over the past six months, I've read literally thousands of spams,
and it is really kind of demoralizing. Norbert Wiener said if you compete with
slaves you become a slave, and there is something similarly degrading about
competing with spammers. To recognize individual spam features you have to try
to get into the mind of the spammer, and frankly I want to spend as little time
inside the minds of spammers as possible.
But the real advantage of the Bayesian approach, of course, is that you know
what you're measuring. Feature-recognizing filters like SpamAssassin assign a
spam "score" to email. The Bayesian approach assigns an actual
probability. The problem with a "score" is that no one knows what it
means. The user doesn't know what it means, but worse still, neither does the
developer of the filter. How many points should an email get for having
the word "sex" in it? A probability can of course be mistaken, but
there is little ambiguity about what it means, or how evidence should be
combined to calculate it. Based on my corpus, "sex" indicates a .97
probability of the containing email being a spam, whereas "sexy"
indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an
email containing both words would, in the (unlikely) absence of any other
evidence, have a 99.97% chance of being a spam.
Because it is measuring probabilities, the Bayesian approach considers all the
evidence in the email, both good and bad. Words that occur disproportionately rarely
in spam (like "though" or "tonight" or
"apparently") contribute as much to decreasing the probability as bad
words like "unsubscribe" and "opt-in" do to increasing it.
So an otherwise innocent email that happens to include the word "sex"
is not going to get tagged as spam.
Ideally, of course, the probabilities should be calculated individually for each
user. I get a lot of email containing the word "Lisp", and (so far) no
spam that does. So a word like that is effectively a kind of password for
sending mail to me. In my earlier spam-filtering software, the user could set up
a list of such words and mail containing them would automatically get past the
filters. On my list I put words like "Lisp" and also my zipcode, so
that (otherwise rather spammy-sounding) receipts from online orders would get
through. I thought I was being very clever, but I found that the Bayesian filter
did the same thing for me, and moreover discovered a lot of words I hadn't
thought of.
When I said at the start that our filters let through less than 5 spams per 1000
with 0 false positives, I'm talking about filtering my mail based on a corpus of
my mail. But these numbers are not misleading, because that is the approach I'm
advocating: filter each user's mail based on the spam and nonspam mail he
receives. Essentially, each user should have two delete buttons, ordinary delete
and delete-as-spam. Anything deleted as spam goes into the spam corpus, and
everything else goes into the nonspam corpus.
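One way to picture the two buttons is as two methods that feed a per-user pair of corpora. The class below is a hypothetical Python sketch; its names are invented for illustration, and it assumes each message has already been tokenized.

```python
from collections import Counter


class UserCorpora:
    """Per-user token counts, updated as the user disposes of mail."""

    def __init__(self):
        self.good = Counter()  # tokens from mail not marked as spam
        self.bad = Counter()   # tokens from mail deleted as spam
        self.ngood = 0         # number of nonspam messages seen
        self.nbad = 0          # number of spam messages seen

    def mark_nonspam(self, tokens):
        """Ordinary delete (or keep): the message joins the nonspam corpus."""
        self.good.update(t.lower() for t in tokens)
        self.ngood += 1

    def mark_spam(self, tokens):
        """Delete-as-spam: the message joins the spam corpus."""
        self.bad.update(t.lower() for t in tokens)
        self.nbad += 1
```

Because the counts and message totals are kept per user, the per-word probabilities can be recomputed from that user's own mail whenever the corpora change.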
You could start users with a seed filter, but ultimately each user should have
his own per-word probabilities based on the actual mail he receives. This (a)
makes the filters more effective, (b) lets each user decide their own precise
definition of spam, and (c) perhaps best of all makes it hard for spammers to
tune mails to get through the filters. If a lot of the brain of the filter is in
the individual databases, then merely tuning spams to get through the seed
filters won't guarantee anything about how well they'll get through individual
users' varying and much more trained filters.
Content-based spam filtering is often combined with a whitelist, a list of
senders whose mail can be accepted with no filtering. One easy way to build such
a whitelist is to keep a list of every address the user has ever sent mail to.
If a mail reader has a delete-as-spam button then you could also add the from
address of every email the user has deleted as ordinary trash.
I'm an advocate of whitelists, but more as a way to save computation than as a
way to improve filtering. I used to think that whitelists would make filtering
easier, because you'd only have to filter email from people you'd never heard
from, and someone sending you mail for the first time is constrained by
convention in what they can say to you. Someone you already know might send you
an email talking about sex, but someone sending you mail for the first time
would not be likely to. The problem is, people can have more than one email
address, so a new from-address doesn't guarantee that the sender is writing to
you for the first time. It is not unusual for an old friend (especially if he is
a hacker) to suddenly send you an email with a new from-address, so you can't
risk false positives by filtering mail from unknown addresses especially
stringently.
In a sense, though, my filters do themselves embody a kind of whitelist (and
blacklist) because they are based on entire messages, including the headers. So
to that extent they "know" the email addresses of trusted senders and
even the routes by which mail gets from them to me. And they know the same about
spam, including the server names, mailer versions, and protocols.
_ _ _
If I thought that I could keep up current rates of spam filtering, I would
consider this problem solved. But it doesn't mean much to be able to filter out
most present-day spam, because spam evolves. Indeed, most antispam
techniques so far have been like pesticides that do nothing more than create
a new, resistant strain of bugs.
I'm more hopeful about Bayesian filters, because they evolve with the spam. So
as spammers start using "c0ck" instead of "cock" to evade
simple-minded spam filters based on individual words, Bayesian filters
automatically notice. Indeed, "c0ck" is far more damning evidence than
"cock", and Bayesian filters know precisely how much more.
Still, anyone who proposes a plan for spam filtering has to be able to answer
the question: if the spammers knew exactly what you were doing, how well could
they get past you? For example, I think that if checksum-based spam filtering
becomes a serious obstacle, the spammers will just switch to mad-lib techniques
for generating message bodies.
To beat Bayesian filters, it would not be enough for spammers to make their
emails unique or to stop using individual naughty words. They'd have to make
their mails indistinguishable from your ordinary mail. And this I think would
severely constrain them. Spam is mostly sales pitches, so unless your regular
mail is all sales pitches, spams will inevitably have a different character. And
the spammers would also, of course, have to change (and keep changing) their
whole infrastructure, because otherwise the headers would look as bad to the
Bayesian filters as ever, no matter what they did to the message body. I don't
know enough about the infrastructure that spammers use to know how hard it would
be to make the headers look innocent, but my guess is that it would be even
harder than making the message look innocent.
Assuming they could solve the problem of the headers, the spam of the future
will probably look something like this:
Spammers range from businesses running so-called opt-in lists who don't even try
to conceal their identities, to guys who hijack mail servers to send out spams
promoting porn sites. If we use filtering to whittle their options down to mails
like the one above, that should pretty much put the spammers on the
"legitimate" end of the spectrum out of business; they feel obliged by
various state laws to include boilerplate about why their spam is not spam, and
how to cancel your "subscription," and that kind of text is easy to
recognize.
(I used to think it was naive to believe that stricter laws would decrease spam.
Now I think that while stricter laws may not decrease the amount of spam that
spammers send, they can certainly help filters to decrease the amount of
spam that recipients actually see.)
All along the spectrum, if you restrict the sales pitches spammers can make, you
will inevitably tend to put them out of business. That word business is
an important one to remember. The spammers are businessmen. They send spam
because it works. It works because although the response rate is abominably low
(at best 15 per million, vs 3000 per million for a catalog mailing), the cost,
to them, is practically nothing. The cost is enormous for the recipients, about
5 man-weeks for each million recipients who spend a second to delete the spam,
but the spammer doesn't have to pay that.
Sending spam does cost the spammer something, though. [2] So the lower we can
get the response rate-- whether by filtering, or by using filters to force
spammers to dilute their pitches-- the fewer businesses will find it worth their
while to send spam.
The reason the spammers use the kinds of sales pitches that they do is to
increase response rates. This is possibly even more disgusting than getting
inside the mind of a spammer, but let's take a quick look inside
the mind of someone who responds to a spam. This person is either
astonishingly credulous or deeply in denial about their sexual interests. In
either case, repulsive or idiotic as the spam seems to us, it is exciting to
them. The spammers wouldn't say these things if they didn't sound exciting. And
"thought you should check out the following" is just not going to have
nearly the pull with the spam recipient as the kinds of things that spammers say
now. Result: if it can't contain exciting sales pitches, spam becomes less
effective as a marketing vehicle, and fewer businesses want to use it.
That is the big win in the end. I started writing spam filtering software
because I didn't want to have to look at the stuff anymore. But if we get good
enough at filtering out spam, it will stop working, and the spammers will
actually stop sending it.
_ _ _
Of all the approaches to fighting spam, from software to laws, I believe
Bayesian filtering will be the single most effective. But I also think that the
more different kinds of antispam efforts we undertake, the better, because any
measure that constrains spammers will tend to make filtering easier. And even
within the world of content-based filtering, I think it will be a good thing if
there are many different kinds of software being used simultaneously. The more
different filters there are, the harder it will be for spammers to tune spams to
get through them.
Appendix: Examples of Filtering
Here is an example of a spam that arrived while I was writing this article. The
fifteen most interesting words in this spam are:
Unfortunately that makes this email a boring example of the use of Bayes' Rule.
To see an interesting variety of probabilities we have to look at this
actually quite atypical spam.
The fifteen most interesting words in this spam, with their probabilities, are:
"Madam" is obviously from spams beginning "Dear Sir or
Madam." They're not very common, but the word "madam" never
occurs in my legitimate email, and it's all about the ratio.
"Republic" scores high because it often shows up in Nigerian scam
emails, and also occurs once or twice in spams referring to Korea and South
Africa. You might say that it's an accident that it thus helps identify this
spam. But I've found when examining spam probabilities that there are a lot of
these accidents, and they have an uncanny tendency to push things in the right
direction rather than the wrong one. In this case, it is not entirely a
coincidence that the word "Republic" occurs in Nigerian scam emails
and this spam. There is a whole class of dubious business propositions involving
less developed countries, and these in turn are more likely to have names that
specify explicitly (because they aren't) that they are republics.[3]
On the other hand, "enter" is a genuine miss. It occurs mostly in
unsubscribe instructions, but here is used in a completely innocent way.
Fortunately the statistical approach is fairly robust, and can tolerate quite a
lot of misses before the results start to be thrown off.
For comparison, here is an example of that rare bird, a spam that gets through
the filters. Why?
Because by sheer chance it happens to be loaded with words that occur in my
actual email:
Second, I think filtering based on word pairs (see below) might well catch this
one: "cost effective", "setup fee", "money back"
-- pretty incriminating stuff. And of course if they continued to spam me (or a
network I was part of), "Hostex" itself would be recognized as a spam
term.
Finally, here is an innocent email. Its fifteen most interesting words are as
follows:
It's interesting that "describe" rates as so thoroughly innocent. It
hasn't occurred in a single one of my 4000 spams. The data turns out to be full
of such surprises. One of the things you learn when you analyze spam texts is
how narrow a subset of the language spammers operate in. It's that fact,
together with the equally characteristic vocabulary of any individual user's
mail, that makes Bayesian filtering a good bet.
Appendix: More Ideas
One idea that I haven't tried yet is to filter based on word pairs, or even
triples, rather than individual words. This should yield a much sharper estimate
of the probability. For example, in my current database, the word
"offers" has a probability of .96. If you based the probabilities on
word pairs, you'd end up with "special offers" and "valuable
offers" having probabilities of .99 and, say, "approach offers"
(as in "this approach offers") having a probability of .1 or less.
The reason I haven't done this is that filtering based on individual words
already works so well. But it does mean that there is room to tighten the
filters if spam gets harder to detect. (Curiously, a filter based on word pairs
would be in effect a Markov-chaining text generator running in reverse.)
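A sketch of what pair-based tokens could look like in Python. The function name is hypothetical and the approach is untested here; the pairs would simply be counted and scored like single words.

```python
def word_pairs(tokens):
    """Return each pair of adjacent tokens, joined with a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


pairs = word_pairs(["this", "approach", "offers"])
```

Here `pairs` holds "this approach" and "approach offers", so "offers" is scored in context rather than on its own.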
Specific spam features (e.g. not seeing the recipient's address in the to:
field) do of course have value in recognizing spam. They can be considered in
this algorithm by treating them as virtual words. I'll probably do this in
future versions, at least for a handful of the most egregious spam indicators.
Feature-recognizing spam filters are right in many details; what they lack is an
overall discipline for combining evidence.
Recognizing nonspam features may be more important than recognizing spam
features. False positives are such a worry that they demand extraordinary
measures. I will probably in future versions add a second level of testing
designed specifically to avoid false positives. If a mail triggers this second
level of filters it will be accepted even if its spam probability is above the
threshold.
I don't expect this second level of filtering to be Bayesian. It will inevitably
be not only ad hoc, but based on guesses, because the number of false positives
will not tend to be large enough to notice patterns. (It is just as well,
anyway, if a backup system doesn't rely on the same technology as the primary
system.)
Another thing I may try in the future is to focus extra attention on specific
parts of the email. For example, about 95% of current spam includes the url of a
site they want you to visit. (The remaining 5% want you to call a phone number,
reply by email or to a US mail address, or in a few cases to buy a certain
stock.) The url is in such cases practically enough by itself to determine
whether the email is spam.
Domain names differ from the rest of the text in a (non-German) email in that
they often consist of several words stuck together. Though computationally
expensive in the general case, it might be worth trying to decompose them. If a
filter has never seen the token "xxxporn" before it will have an
individual spam probability of .4, whereas "xxx" and "porn"
individually have probabilities (in my corpus) of .9889 and .99 respectively,
and a combined probability of .9998.
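A rough Python sketch of such a decomposition: it splits an unseen token into pieces that are already in the probability table, trying longer prefixes first and backtracking when it gets stuck. The function name and the two-entry table are illustrative assumptions; a real table would be far larger, and without memoization the search can be slow on long tokens.

```python
def decompose(token, known):
    """Split `token` into a sequence of known tokens, or return None.

    `known` is a set of tokens already in the probability table.
    """
    if token == "":
        return []
    for end in range(len(token), 0, -1):  # longest prefix first
        head = token[:end]
        if head in known:
            rest = decompose(token[end:], known)
            if rest is not None:
                return [head] + rest
    return None  # no way to cover the whole token


# Probabilities quoted in the text for the two pieces.
probs = {"xxx": 0.9889, "porn": 0.99}
parts = decompose("xxxporn", set(probs))

# Combine the pieces with the same product rule used for whole messages.
spam, ham = 1.0, 1.0
for part in parts:
    spam *= probs[part]
    ham *= 1.0 - probs[part]
combined = spam / (spam + ham)
```

With these two values `combined` comes out just under .9999, far from the .4 the unseen token would otherwise receive.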
I expect decomposing domain names to become more important as spammers are
gradually forced to stop using incriminating words in the text of their
messages. (A url with an ip address is of course an extremely incriminating
sign, except in the mail of a few sysadmins.)
It might be a good idea to have a cooperatively maintained list of urls promoted
by spammers. We'd need a trust metric of the type studied by Raph Levien to
prevent malicious or incompetent submissions, but if we had such a thing it
would provide a boost to any filtering software. It would also be a convenient
basis for boycotts.
Another way to test dubious urls would be to send out a crawler to look at the
site before the user looked at the email mentioning it. You could use a Bayesian
filter to rate the site just as you would an email, and whatever was found on
the site could be included in calculating the probability of the email being a
spam. A url that led to a redirect would of course be especially suspicious.
One cooperative project that I think really would be a good idea would be to
accumulate a giant corpus of spam. A large, clean corpus is the key to making
Bayesian filtering work well. Bayesian filters could actually use the corpus as
input. But such a corpus would be useful for other kinds of filters too, because
it could be used to test them.
Creating such a corpus poses some technical problems. We'd need trust metrics to
prevent malicious or incompetent submissions, of course. We'd also need ways of
erasing personal information (not just to-addresses and ccs, but also e.g. the
arguments to unsubscribe urls, which often encode the to-address) from mails in
the corpus. If anyone wants to take on this project, it would be a good thing
for the world.