Date: 2005-04-26 00:51:00
Tags: spam
new twist in the fight against spam

It seems that spammers have introduced a new twist on an old tactic. When Paul Graham's article "A Plan for Spam" introduced Bayesian filtering principles to the antispam world, spammers were quick to react to this new threat. Since their spam was now being scored on its full content (and not just by naive keyword matching), they started including snippets of legitimate text along with their spam messages. This legitimate text, since it wasn't part of their marketing campaign, was typically displayed in an impossibly small font or in invisible (i.e., white on white) colors.

Anyway, I recall seeing text pulled from such works as Moby Dick, Ulysses, and various works of Shakespeare. It didn't matter what the text was, as long as it didn't look very much like spam. As far as I can tell, there are at least two goals involved here:

  1. By including a lot of non-spam text, the message looks a little more like a legitimate message, and so sneaks through a slightly higher percentage of spam filters.
  2. Bayesian filters learn patterns from the messages you receive and mark as spam. When you mark a message as spam, each word in the entire message essentially gets a count in the "spam" column. Including a lot of non-spam text means that a lot of non-spam words end up with higher counts in the "spam" column. This has the longer-term effect of decreasing the trustworthiness of the Bayesian filter data, because the filter may start to mark legitimate messages as spam. If this happens a lot, users may turn off the Bayesian part of the filter.
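
The second goal can be illustrated with a minimal sketch in Python. This is not the code of any particular filter: the class name TokenCounts, the whitespace tokenizer, and the neutral 0.5 value for unseen words are all assumptions made for illustration. It only shows how marking a padded spam message raises the "spam" count of an innocent word.

```python
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


class TokenCounts:
    """Hypothetical per-word counts for the "spam" and "non-spam" columns."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        # Count each distinct word once per message.
        tokens = set(tokenize(text))
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def spam_probability(self, token):
        # Fraction of spam messages vs. non-spam messages containing the word.
        s = self.spam[token] / max(self.n_spam, 1)
        h = self.ham[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # never seen: treated as neutral (an assumption)
        return s / (s + h)


counts = TokenCounts()
counts.train("meeting agenda for the whale project", is_spam=False)
counts.train("buy cheap pills now", is_spam=True)
before = counts.spam_probability("whale")

# A spam message padded with innocent-looking text gets marked as spam.
counts.train("buy cheap pills now call me ishmael whale", is_spam=True)
after = counts.spam_probability("whale")

print(before, after)  # the innocent word now looks more spam-like
```

In this toy data set, "whale" starts out appearing only in legitimate mail and ends up with a noticeably higher spam probability after a single padded message is marked as spam.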

Recently, several people who have journals (cetan, leroy_brown242, Amy) have received messages from other Internet users wondering why some of their journal text was included in a spam message. Obviously, the journal authors don't have anything to do with the sending of the spam. It seems that the spammers are now scraping text off the Internet instead of using text from the classics.

Perhaps this approach is intended to more closely match the kind of text that people normally receive in email. Because the text is written by today's Internet users and not 19th century authors, the vocabulary will be better suited to confuse spam filters.

This new technique is surprising and annoying to those users whose text is used in spam. Most recipients of the spam will either not see the message at all, or not see the small/obscured text, or just ignore it. The few who do look at the whole message and google for key words or phrases to find the original author's journal seem skilled enough at that point not to accuse the user whose journal text was used.

Fortunately, Bayesian filtering techniques are just one weapon in the fight against spam. With blacklists, SPF, virus scanners, and the battery of tests provided by SpamAssassin, I now get, on average, about 5 spam messages in my inbox per day. Since my mail server receives about 1000 spam messages per day, that's less than a 1% miss rate on my spam filters.
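
As a quick check of that arithmetic, here is a tiny Python sketch using the approximate figures above. The function name miss_rate is just an illustrative choice.

```python
def miss_rate(missed_per_day, spam_per_day):
    # Fraction of incoming spam that still reaches the inbox.
    return missed_per_day / spam_per_day


rate = miss_rate(5, 1000)
print(f"{rate:.1%}")  # 0.5%
```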

I use Thunderbird, and its anti-spam self-learning filtering works really well. But lately I've been bombarded by more spam than ever before. I love to be accessible, so I don't believe in hiding my e-mail address from the world or changing it. Maybe I should upgrade my antispam tactics and get SpamAssassin?

There should be tougher laws against spam and/or they should be enforced more.
SpamAssassin is normally a unix server-side product, but since it is open source I believe others have repackaged it in a form usable by Outlook or Thunderbird. I don't really know anything about those products, though.
A few jurisdictions (Washington state comes to mind) have enacted legislation making spam illegal and they've been able to arrest a few people, but the problem is that this only serves to move the spammers somewhere else. As it is, a lot of spam comes from overseas (China's really bad from what I understand).
What is your rate of false positives (non-spam erroneously classed as spam)?
I check occasionally, but there is such a huge amount of spam to sort through that false positives are hard to find. I find the occasional message, perhaps one every month or two, so that would be about 1 in 50,000. Hard to say for sure, though.
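
A rough sanity check of that estimate, as a Python sketch. It assumes the figure of about 1000 spam messages per day from the post, and it counts false positives per message in the spam folder rather than per legitimate message received, which is how the estimate above is phrased.

```python
SPAM_PER_DAY = 1000  # approximate figure from the post


def one_in_n(days_between_false_positives, per_day=SPAM_PER_DAY):
    # Messages that land in the spam folder between two observed false positives.
    return days_between_false_positives * per_day


low = one_in_n(30)   # one false positive a month
high = one_in_n(60)  # one false positive every two months
print(low, high)     # 30000 60000
```

So "about 1 in 50,000" sits inside the range that those rough numbers imply.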
My ability to sort went up when I added the spam score to the subject of the spam messages and sorted my spam folder by subject. I don't find many false positives with scores above 10 or so.
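
A minimal Python sketch of that review approach follows. The "[SPAM 12.3]" subject format and the function names are assumptions for illustration only; the actual tag depends on how the filter is configured. Parsing the score as a number avoids the problem that a plain text sort would put "10.2" before "4.1".

```python
import re

# Hypothetical subject tag of the form "[SPAM 12.3] original subject".
SCORE_RE = re.compile(r"\[SPAM (\d+(?:\.\d+)?)\]")


def score_from_subject(subject):
    # Return the numeric score embedded in the subject, or None if absent.
    match = SCORE_RE.search(subject)
    return float(match.group(1)) if match else None


def review_order(subjects):
    # Lowest scores first: that is where false positives are most likely.
    scored = [(score_from_subject(s), s) for s in subjects]
    scored = [(score, s) for score, s in scored if score is not None]
    return [s for score, s in sorted(scored, key=lambda pair: pair[0])]


folder = [
    "[SPAM 23.7] cheap pills",
    "[SPAM 5.2] lunch on friday?",
    "[SPAM 10.4] you have won",
]
ordered = review_order(folder)
print(ordered[0])  # [SPAM 5.2] lunch on friday?
```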
Maybe one could successfully prosecute spammers using that technique for unauthorized redistribution of copyrighted content, even if they are not found to be violating any other spam legislation? :) Of course, identifying the spammer to initiate the lawsuit against may still be the difficult part.
That would certainly be an interesting precedent to set.
It seems that a lot of the text is not copied verbatim, but is instead scrambled up with other text from the same or different journals. I wonder how well copyright protects the author in cases where the copied text is scrambled nonsense.
I wonder if Gmail's antispam tools are more effective than those in Thunderbird.
I forward a copy of almost all my (pre-filtered) mail to my Gmail account, so I can measure the performance of Gmail's spam filtering too. Until recently, it was quite aggressive and did not let very much spam through, but also had a lot of false positives. Every day I go through my Gmail spam folder and unmark the messages that it has misclassified as spam.

A few weeks ago, I think they relaxed the spam criteria a bit because I am getting a lot fewer false positives, and more spam arriving in my regular inbox. This is probably also on the order of 10 per day, but I don't keep accurate statistics.
I really like Yahoo's spam catching mechanism. It is so simple and yet works very well so far for me.
How is it simple? Does it tell you what kinds of things it's actually checking for?
Disclaimer: I'm taking a stab at what I think goes on behind the scenes.

It appears as if Yahoo does a comparison of incoming email to detect duplicates (or maybe near duplicates). It must flag these with a unique identifier to group them. Users can mark email as "Spam". It is a voting system to determine whether the email is spam or not. After a certain vote percentage has been reached, it automatically moves all of those emails into your spam folder (which is set for autodelete after a certain amount of time).
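
The guess above could be sketched roughly like this in Python. To be clear, this is not Yahoo's mechanism: the VoteTracker class, the 50% threshold, and the minimum recipient count are invented for illustration, and the fingerprint only catches messages that are identical after whitespace and case are normalized. Real near-duplicate detection would need something fuzzier.

```python
import hashlib
from collections import defaultdict


def fingerprint(body):
    # Group identical messages: normalize case and whitespace, then hash.
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class VoteTracker:
    """Hypothetical vote counter keyed by message fingerprint."""

    def __init__(self, threshold=0.5, min_recipients=3):
        self.threshold = threshold            # assumed vote fraction
        self.min_recipients = min_recipients  # assumed minimum group size
        self.recipients = defaultdict(set)
        self.spam_votes = defaultdict(set)

    def deliver(self, user, body):
        fp = fingerprint(body)
        self.recipients[fp].add(user)
        return fp

    def vote_spam(self, user, fp):
        # Only users who actually received the message may vote on it.
        if user in self.recipients[fp]:
            self.spam_votes[fp].add(user)

    def is_spam(self, fp):
        received = len(self.recipients[fp])
        if received < self.min_recipients:
            return False
        return len(self.spam_votes[fp]) / received >= self.threshold


tracker = VoteTracker()
for user in ("ann", "bob", "cy", "dee"):
    fp = tracker.deliver(user, "Buy   CHEAP pills\nnow")

tracker.vote_spam("ann", fp)
one_vote = tracker.is_spam(fp)   # 1 of 4 recipients: below the threshold
tracker.vote_spam("bob", fp)
two_votes = tracker.is_spam(fp)  # 2 of 4 recipients: threshold reached
print(one_vote, two_votes)       # False True
```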

There are other techniques they use too which seem to revolve around a history of spam from a given email account. A friend of mine sent an email to me once and it was mistakenly marked as spam but that very rarely happens.

At any rate, out of the 500 or so spam in my spam folder, only about 2 or 3 slipped by and I just voted them spam so hopefully the other people never have to see them.
Yes, there are a lot of things that a large email receiver such as Yahoo or Gmail or Hotmail or AOL can do by comparing incoming email across a wide range of user accounts. The mechanisms that Yahoo uses to determine what is spam and what isn't are much more sophisticated than just a voting system (a guy I used to work with moved on to work with Yahoo's anti-spam team).

There are dangers with this approach, though. Because it seems that a lot of junk messages are sent with my name in the From field, Gmail thinks I'm a spammer. There has still been no satisfactory resolution to that problem, and I don't expect to get one.
Have you looked at getting one of those email certificates that certifies you're not a spammer?
How would that help? I already publish SPF, DomainKeys, and sign my outgoing messages with an S/MIME certificate. What sort of certificate are you thinking of?
I don't think you have to worry too much about the random text decreasing the effectiveness of Bayesian filters.

It will increase the spam counts of common words, but those words will also tend to appear in valid emails, so their spam score will be relatively low. The terms that only appear in the spams will still be there, and will still score higher. And since the algorithm only looks at the first 10 or so highest-scoring words, the spams will still get flagged. The highest scores on normal mail, even those words that get included in random spam, will still be low enough that it won't trigger anything.
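
A rough Python sketch of that argument follows. Graham's article combines the fifteen "most interesting" tokens, meaning those whose spam probability is farthest from a neutral 0.5; the function names and the example probabilities below are made up for illustration.

```python
from math import prod


def clamp(p, low=0.01, high=0.99):
    # Keep probabilities away from 0 and 1 so no single word is decisive.
    return min(max(p, low), high)


def most_interesting(probs, n=15):
    # "Interesting" means far from the neutral value 0.5.
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combined_probability(probs, n=15):
    top = [clamp(p) for p in most_interesting(probs, n)]
    spam = prod(top)
    ham = prod(1 - p for p in top)
    return spam / (spam + ham)


# Hypothetical message: 5 strongly spammy words plus 20 mildly innocent
# padding words scraped from somebody's journal.
message = [0.99] * 5 + [0.3] * 20
result = combined_probability(message)
print(result > 0.9)  # True
```

With these made-up numbers the five spammy words dominate the padding. The picture would change if the padding words were strongly associated with legitimate mail for a particular recipient, which is presumably what scraping present-day journal text is meant to achieve.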

Or so I understand. It's been a while since I read the original article, and I'm too slackful today to go reread it. I think Graham addressed this.
Gmail has been great, as have Spam Bully and SpamBayes.
Greg Hewgill <>