Monday, August 27, 2007

Why do we still have spam?

I'm asking literally. I'm pretty confident some form of spam will always be with us. Snail mail has been around for centuries and we still get junk mail. I'm more wondering why we have the quantity and content of spam we do.

While spam email and physical junk mail are certainly kin to one another, spam has a quality all its own. I haven't taken measurements, but the sheep/goat ratio in my physical mailbox is considerably higher than in my main email inbox, and most of the junk mail is stuff I might plausibly be interested in. At worst, it's at least clearly related to something I signed up for a while ago.

Most spam is just dreck, and there's an awful lot of it. No matter how many chances I get to plow my fee from transferring the fortune of an ex-member of the Nigerian government into the European lottery and use the winnings to buy replica watches and pharmaceuticals, I'm just not going to click on the link and go for it. Sorry, guys.

Why is there such a difference in quality and quantity between the two? The obvious answer is economics. It costs something to send me an envelope. It costs nearly nothing to send me an email. I'm guessing the main cost is the inconvenience of occasionally getting busted and shut down, and even that is obviously not a great cost except possibly in the most egregious cases. So that rules out one possible solution. No one wants to pay to send legitimate email, so sending spam is always going to be cheap, too.

That more or less leaves technical fixes, and it's fascinating to watch the Darwinian arms race play out. Dumb rules don't work at all. That's why we see all the creative misspellings. It's interesting, too, that the "this doesn't have enough actual text" rule bit the dust pretty quickly. Embedded images still seem to be popular, even though > 90% of the time they're garbage. It's that other < 10% that's the problem. Bayesian smart rules seem to work a bit better, but the spammers have been getting smarter. For a while T-bird's filters would get a good majority of spam, but that no longer seems to be the case, even with constant training.

One of these days I'll look up the mechanics behind that filter. As I understand it, it's based on words or phrases, but the same approach ought to work for other kinds of feature extraction. Instead of a hard "not enough text means spam" or "too many misspellings" dumb rule, image/text ratio and misspellings could be features that, together with, say, "replica" and "watch" in the subject line, might be likely to trigger the filter. I also wonder if it's worth trying to make word matching fuzzy, so that "replica" and "rep lica" and "repl1ca" would all count the same. I'm guessing those approaches have been tried by now (or don't work as well as one might expect) and what we have is probably about as good as it's going to get.
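For the curious, here is a rough Python sketch of the idea in the paragraph above: fuzzy subject-line matching and an image/text ratio, both fed as features into a small Bayesian scorer. To be clear, this is not how T-bird's filter works (I haven't looked). The substitution table, the 10x image threshold, the feature names and the TinyBayes class are all made up for illustration, and the scorer is a simplified naive Bayes that only looks at features that are present.

```python
import math
import re
from collections import Counter

# Hypothetical look-alike substitutions; a real table would be much longer.
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

def squash(text):
    """Lowercase, undo digit-for-letter swaps, and drop everything but letters."""
    return re.sub(r"[^a-z]", "", text.lower().translate(LEET))

def features(subject, text_chars, image_bytes):
    """Turn a message into a set of feature names (all names are assumptions)."""
    feats = set()
    flat = squash(subject)
    for word in ("replica", "watch"):
        if word in flat:  # matches "replica", "rep lica" and "repl1ca" alike
            feats.add("subj:" + word)
    if image_bytes > 10 * max(text_chars, 1):  # arbitrary threshold
        feats.add("mostly-image")
    return feats

class TinyBayes:
    """Simplified naive Bayes over feature sets, with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, feats, label):
        self.counts[label].update(feats)
        self.totals[label] += 1

    def spam_probability(self, feats):
        # Start from the smoothed prior odds, then multiply in each feature.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for f in feats:
            p_spam = (self.counts["spam"][f] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][f] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

No single feature condemns a message here; each one just nudges the odds. The obvious weakness is that squashing out spaces and punctuation will also find "watch" inside perfectly innocent run-together text, which may be one reason this sort of thing doesn't work as well as one might expect.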

To my knowledge, I've never used a central-database-driven approach, where everyone reports any spam that reaches them to a shared database and the mail client checks incoming mail to see if someone else has already flagged it. I assume, though, that this is why you see random blocks of text at the bottom of spam messages and little noisy dots in images. Everyone's copy of the spam message is unique, presumably preventing a match. This sort of random noise is actually not unlike the "type the word you see" bot blocker. In both cases our visual system cares less about the noise than a computer does and it's a hard problem to make the computer do better.
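A small Python sketch of why the noise works, assuming the shared database stores an exact fingerprint (here a SHA-256 hash) of each reported message. I don't know what real services actually store, so treat that as a guess. The second half shows one possible way to be more tolerant, comparing overlapping three-word runs instead of the whole message. The function names and sample text are invented.

```python
import hashlib

def fingerprint(body):
    """Exact-match fingerprint: any change to the body changes the digest."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def shingles(body, n=3):
    """The set of overlapping n-word runs in the message."""
    words = body.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b):
    """Jaccard overlap of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

reported = "Buy replica watches at amazing prices today only click here now"
variant = reported + " zebra quantum marmalade"  # spammer's random tail

print(fingerprint(reported) == fingerprint(variant))  # exact match fails
print(similarity(reported, variant))                  # but most shingles agree
```

Of course the spammer's next move is to add more noise, or to sprinkle it through the message instead of tacking it on the end, so this just moves the arms race along a step.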

My most effective spam blocker so far is the simple whitelist, and I'm irked that it took me as long as it did to try it. Quite a while ago I'd noticed that once I'd set up filters shunting each category of real mail into its own folder, the residue in my inbox was nearly pure spam. Of course, I could never remember which folder particular pieces of real mail had gone to, but that's for another post.

Unfortunately, "nearly pure" meant it was a hassle to find the now-rare real mail that didn't happen to match one of my real mail filters. A year or so ago I finally took the next logical step and added a saved search called "not in address book" to my inbox in T-bird. Voilà! The contents of that folder were very pure spam. I could then spend a few seconds every so often scanning them to make sure of their purity, gun the whole bunch and anything left in the inbox was good. Still a bit of a hassle, but it at least brought the spam-scanning down to around the time spent actually reading and understanding real subject lines.
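The saved search itself is just a setting in the mail client, but the logic amounts to something like the following Python sketch. It assumes the messages are available as raw text and the address book is a plain list of addresses; the function name is mine, not T-bird's.

```python
from email import message_from_string
from email.utils import parseaddr

def not_in_address_book(raw_messages, address_book):
    """Return the parsed messages whose From address is not a known contact."""
    known = {addr.lower() for addr in address_book}
    suspects = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        # parseaddr splits "Alice <alice@example.com>" into (name, address).
        _, sender = parseaddr(msg.get("From", ""))
        if sender.lower() not in known:
            suspects.append(msg)
    return suspects
```

Note that this only looks at what the From line claims, and the From line is easy to forge.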

Further, if anything legit made it into "not in address book", it was generally because I'd just asked for it and I knew to look for it. Finally, the "not in address book" folder stayed empty, because it was constantly cleared out, so there was less of a chance of losing an email from a long-lost acquaintance. (I use past tense above because I haven't brought the saved search over to my new setup since I wanted to see how the Bayesian filters were doing these days. Suffice to say the saved search will be back soon. [Ed. Note: And so it was. And it still worked just fine.])

This gives me some hope that we may yet get spam mostly under control. When I look at my solicited email, there's still a fair bit I don't really want to read, but it's much like my snail mailbox. The stuff I don't want is at least plausible and it's easily ignored.

Further, it's possible to tighten up the whitelist approach securely, using digital signatures. Perhaps some day enough people will be using signatures that I can change my "not in address book" search to "unsigned". Ideally, email-driven "opt in" schemes would have an extra step whereby the vendor (say, my bank) gives me a key that it will use to sign the "click on this link to register" message it's about to send. My long-lost acquaintance would go through my social networking site to do the initial handshake, and conversely accepting an invitation would involve an exchange of keys.
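To make the "unsigned" search concrete, here is a Python sketch with one big caveat: Python's standard library doesn't do public-key signatures, so I've swapped in HMAC-SHA256 with a shared secret as a stand-in. That fits the "my bank gives me a key" case reasonably well, but a real setup would more likely use proper public-key signatures (S/MIME or OpenPGP, say). The keyring layout and function names are invented for illustration.

```python
import hashlib
import hmac

def sign(key, body):
    """Stand-in 'signature': an HMAC-SHA256 tag over the message body."""
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()

def verified_sender(body, tag, keyring):
    """Return the name of the keyring entry whose key produced tag, else None."""
    for name, key in keyring.items():
        # compare_digest avoids leaking information through timing differences.
        if hmac.compare_digest(sign(key, body), tag):
            return name
    return None

# The key I was given when I signed up (hypothetical).
keyring = {"my bank": b"key-handed-over-at-signup"}

body = "Click on this link to register"
tag = sign(keyring["my bank"], body)

print(verified_sender(body, tag, keyring))                 # my bank
print(verified_sender(body + " (edited)", tag, keyring))   # None
```

Anything that comes back None would land in the "unsigned" folder, and unlike a From line, the tag can't be produced by someone who doesn't hold the key.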

If you buy that, and it's certainly the kind of scenario I've heard mentioned, then my original question reduces to "Why doesn't everyone use digital signatures?" An interesting question, that one.

1 comment:

David Hull said...

Note to self: now it's Gmail ... but we also tolerate the occasional lost-to-spam-filter message