09/19/2009

Spam is so irritating.

I run my own mail server– actually, I run mail for hundreds of people– and as a result I have to do everything I can to block spam to them, without compromising legitimate messages. So I have three layers of protection:

First, there’s “greylisting”, which tells the mail server sending the message that the receiving server (mine) is temporarily unavailable. It’s kind of a clumsy trick, but it works because most spam software attempts to send a message only once before giving up. A bona fide mail server, on the other hand, will wait a few minutes and re-send the message. The second time, my server lets it through and remembers the sender, so the next time the mail gets through immediately. Although this seems really simple, it’s terrifically effective and probably blocks at least half of the inbound spam.
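To make the greylisting idea concrete, here is a minimal sketch in Python. It is not what my mail server actually runs. The `Greylist` class, the five-minute delay, and the choice to key on the (client IP, sender, recipient) triplet are all assumptions made for illustration.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient). Hypothetical helper."""

    def __init__(self, min_delay=300, now=time.time):
        self.min_delay = min_delay  # seconds a new sender must wait before a retry counts
        self.now = now              # injectable clock, which makes the class easy to test
        self.first_seen = {}        # triplet -> time of the first delivery attempt
        self.passed = set()         # triplets that have already retried successfully

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply: '250 ...' to accept, '451 ...' to defer."""
        triplet = (client_ip, sender, recipient)
        if triplet in self.passed:
            # Known sender: let the mail through immediately.
            return "250 OK"
        t = self.now()
        first = self.first_seen.setdefault(triplet, t)
        if t - first >= self.min_delay:
            # The sender waited and retried, so remember it for next time.
            self.passed.add(triplet)
            return "250 OK"
        # First attempt (or a retry that came too soon): pretend to be busy.
        return "451 Temporarily unavailable, try again later"
```

A sender that never retries simply never gets past the `451` reply, which is the whole trick.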

Then there’s “blacklisting”, which checks the source of the message (every mail server includes information about itself in the message header) against a list of known spammers. There are hundreds of thousands of known spam servers on these lists, so if an incoming message matches one, it’s stopped dead.
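The blacklist check boils down to a lookup. Here is a rough sketch, assuming the list is a local set of IP addresses and networks; real setups often query a remote list instead. The `BLOCKLIST` entries and the `is_blacklisted` name are made up, and the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical blocklist entries (reserved documentation ranges, not real spammers).
BLOCKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),      # a whole network
    ipaddress.ip_network("198.51.100.17/32"),  # a single server
]


def is_blacklisted(client_ip, blocklist=BLOCKLIST):
    """Return True if the sending server's IP falls inside any listed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in blocklist)
```

For example, `is_blacklisted("192.0.2.55")` matches the first entry, while `is_blacklisted("203.0.113.9")` matches nothing and the message moves on to the next layer.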

Finally there’s a heuristic content filter that actually “reads” the incoming message and looks for keywords (“Viagra”, “enhancement”, “free mortgage quote”, that sort of thing). It also checks for suspicious headers, lots of images with little text, and other traits typical of spam. If there’s enough funny stuff going on, the filter deletes the message.
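A heuristic filter like this is basically a list of rules with points attached. The sketch below shows the shape of it. The rules, weights, and threshold are invented for illustration and aren't taken from any real filter, and the all-caps subject check is just a stand-in for “suspicious headers”.

```python
import re

# Hypothetical rules: (pattern, points). Weights are invented for illustration.
RULES = [
    (re.compile(r"viagra", re.I), 2.5),
    (re.compile(r"enhancement", re.I), 1.5),
    (re.compile(r"free mortgage quote", re.I), 2.5),
]
THRESHOLD = 5.0  # assumed cutoff: at or above this, the message is treated as spam


def score_message(subject, body, image_count=0):
    """Add up points for every rule the message trips."""
    text = f"{subject}\n{body}"
    score = sum(points for pattern, points in RULES if pattern.search(text))
    if image_count >= 3 and len(body.split()) < 20:
        score += 2.0  # lots of images with little text
    if subject.isupper():
        score += 1.0  # shouting subject line, a stand-in for a suspicious header
    return score


def is_spam(subject, body, image_count=0):
    """True when enough funny stuff is going on."""
    return score_message(subject, body, image_count) >= THRESHOLD
```

A message has to trip several rules before it is dropped, so a legitimate email that happens to mention a mortgage quote still gets through.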

So all of that happens before the mail even gets to the recipient’s mailbox. I find that it’s pretty effective, probably blocking 90% of the incoming spam. Of course a little bit still trickles through. The problem is that spam is such a flood– possibly accounting for more than 90% of mail traffic on the entire internet– that even a trickle means my customers are getting a handful of spam messages every day.

Since I’ve had some of my email addresses for a decade, there’s been plenty of time for them to show up on various spam lists. As a result I probably receive more spam than the average user. So I put another filter in place on my own mailbox, and that filter is the most awesome of all. I “train” it by sending it examples of spam messages that got past, as well as “ham” (legitimate) messages that I want to receive. It remembers words and phrases from each type of message, and over time it “learns” what I consider to be spam versus what I tend to want to receive. Amazing stuff, really. The downside is that it’s a personalized spam filter: if I used the same rules for one of my customers’ mailboxes, they might fail catastrophically. My customers probably don’t get quite the same mix of web programming, Linux user group, ultimate frisbee, and Facebook invitations that I do.

So here’s my latest spam vs ham database:

[fixed:The information shown below is an analysis of your spam database.

Histogram
score   count  pct  histogram
0.00    31884 38.67 ####################################
0.05      133  0.16 #
0.10      182  0.22 #
0.15      265  0.32 #
0.20      331  0.40 #
0.25      321  0.39 #
0.30      316  0.38 #
0.35      524  0.64 #
0.40      501  0.61 #
0.45      256  0.31 #
0.50      540  0.66 #
0.55      288  0.35 #
0.60      770  0.93 #
0.65      360  0.44 #
0.70      232  0.28 #
0.75     1440  1.75 ##
0.80      330  0.40 #
0.85      671  0.81 #
0.90      468  0.57 #
0.95    42629 51.71 ################################################
tot     82441
hapaxes:  ham   20710 (25.12%), spam   34438 (41.77%)
   pure:  ham   31808 (38.58%), spam   42439 (51.48%)]

The database now holds over 82,000 distinct words gleaned from the messages I’ve sent it to chew on. Of those, almost 43,000 are “100% spam”, meaning they only appear in spam messages (at least as I categorize them). And about 32,000 words are pure ham, meaning they’re terms that my spam just doesn’t contain. Then there’s the fuzzy area between, where words sometimes appear in spam and sometimes in ham.

But the filter is pretty smart, so when a new message comes in, it looks at all of the words, compares them to its dictionary, and assigns a score to the message. If the score is greater than some threshold I define, the message is probably spam and it’s dropped. Good riddance.
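Here is a minimal sketch of the train-then-score idea, written as a simple naive-Bayes-style word filter. It isn't necessarily how my filter combines scores internally. The `WordFilter` class, its tokenizer, and the smoothing are assumptions for illustration.

```python
import math
import re
from collections import Counter


class WordFilter:
    """Toy trainable filter (hypothetical): counts words seen in spam and ham."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Distinct lowercase words; a real filter also looks at headers and phrases.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        """Record one example message; label is 'spam' or 'ham'."""
        self.messages[label] += 1
        self.counts[label].update(self.tokenize(text))

    def word_score(self, word):
        """Estimate how spammy a word is, smoothed so it is never exactly 0 or 1."""
        s = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
        h = (self.counts["ham"][word] + 1) / (self.messages["ham"] + 2)
        return s / (s + h)

    def score(self, text):
        """Combine the word scores in log-odds space; returns a value between 0 and 1."""
        log_odds = 0.0
        for word in self.tokenize(text):
            p = self.word_score(word)
            log_odds += math.log(p) - math.log(1 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # keep math.exp from overflowing
        return 1 / (1 + math.exp(-log_odds))
```

Usage looks like this: call `train(text, "spam")` or `train(text, "ham")` for each example, then drop any new message whose `score(text)` is above your chosen threshold (say 0.9, though that number is my own pick here).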

After training this puppy, I was amazed at how effective it is. I went from probably a few hundred spam messages a day (ugh) to maybe half a dozen. Sweet!

Out of curiosity, I checked the performance of this filter over the past three months. From June 18 to today, I’ve received 50,208 email messages. Remember this number is after the initial three spam filters have been applied– in reality there have probably been close to half a million messages sent to my mailboxes in that period. Yikes.

Of those that were handled by the filter, 26,299 were spam. Doing the math over those 93 days, that’s roughly 280 junk messages per day. And it means 23,909 legitimate messages were sent, or about 260 messages per day.
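The per-day averages depend on how many days you count. Here is a quick check of the arithmetic, assuming the window runs from June 18 through the date of this post and that the first three filters block about 90% of what arrives, as estimated earlier. The variable names are just for illustration.

```python
from datetime import date

total = 50_208      # messages that reached the trained filter
spam = 26_299       # of those, flagged as spam
ham = total - spam  # legitimate messages

# Assumption: "today" is the post date, September 19, 2009.
days = (date(2009, 9, 19) - date(2009, 6, 18)).days

spam_per_day = spam / days
ham_per_day = ham / days

# Assumption: the upstream filters stop roughly 90% of inbound mail.
upstream_block_rate = 0.90
estimated_inbound = round(total / (1 - upstream_block_rate))

print(f"{days} days: {spam_per_day:.0f} spam/day, {ham_per_day:.0f} ham/day")
print(f"estimated inbound before filtering: {estimated_inbound:,}")
```

The last line is where the “close to half a million” estimate comes from: if only a tenth of the mail survives the first three filters, 50,208 survivors imply about ten times that many messages sent.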

Wow, that’s a lot of email. I sure feel loved.