Last week I started looking at ways to cut down on false positives in our spam filters. I’ve only seen two in my own mailbox this year, but of course everyone gets different kinds of email. I’ve been trolling the server logs for low-scoring “spam,” looking for anything that might be legit, particularly messages the Bayes subsystem has already identified as ham but whose negative score isn’t enough to counteract the points assigned by other rules. (Unfortunately, it’s hard to tell when all you’ve got is the sender, subject, and list of spam rules.)
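For the curious, the hunt looks something like this Python sketch. It isn’t the script I actually ran: it assumes the caught mail sits in an mbox file, that each message carries SpamAssassin’s usual X-Spam-Status header (score= in newer releases, hits= in older ones), and that a ham-leaning Bayes verdict shows up as a rule name like BAYES_00 or BAYES_01; those details vary between versions.

```python
#!/usr/bin/env python3
"""Hunt for borderline false positives in a folder of caught "spam".

Only a sketch of the idea. Assumes an mbox of quarantined mail whose messages
carry a header like:

    X-Spam-Status: Yes, score=6.4 required=5.0 tests=DCC_CHECK,HTML_MESSAGE,...

(older SpamAssassin releases write hits= instead of score=). The Bayes rule
names below are guesses at the ham-leaning buckets; they differ by version.
"""
import mailbox
import re

HAMMY_BAYES = {"BAYES_00", "BAYES_01", "BAYES_05", "BAYES_10"}

def borderline(path, margin=2.0):
    """Yield messages flagged as spam by only a small margin where Bayes said ham."""
    for msg in mailbox.mbox(path):
        status = str(msg.get("X-Spam-Status", ""))
        score_m = re.search(r"(?:score|hits)=(-?[\d.]+)", status)
        req_m = re.search(r"required=(-?[\d.]+)", status)
        tests_m = re.search(r"tests=([A-Z0-9_,\s]+)", status)
        if not (score_m and req_m and tests_m):
            continue
        score, required = float(score_m.group(1)), float(req_m.group(1))
        tests = {t.strip() for t in tests_m.group(1).split(",") if t.strip()}
        # Flagged, but only just -- and the Bayes subsystem leaned toward ham.
        if required <= score < required + margin and tests & HAMMY_BAYES:
            yield score, msg.get("From", "?"), msg.get("Subject", "?"), sorted(tests)

if __name__ == "__main__":
    for score, sender, subject, tests in borderline("caught-spam.mbox"):
        print(f"{score:4.1f}  {sender}  {subject}")
        print("      " + ", ".join(tests))
```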
One item I noticed was a copy of the Microsoft Technet Flash newsletter. I thought this was odd, since I’d gotten a copy of the same newsletter and it hadn’t been labeled as spam. In fact, it turned out that my copy had scored only 0.3 points, while the other hit 6.4! (5 points is the threshold for probable spam.) What could explain such a disparity?
Answer: two very small differences.
First, the flagged copy came through half an hour later, after the message had been reported to DCC (the Distributed Checksum Clearinghouse). Since DCC is technically a list of bulk mail and not a list of spam, it’s theoretically not a false positive. That added 1 point.
Second, the text must have been slightly different, because the other copy triggered a rule that looks for the phrase “no cost” or “no charge.” That added 1.7 points on its own, and the difference in wording also left the Bayesian classifier less certain the message was legit: a shift from 0% to a 1-9% likelihood of being spam. So instead of subtracting 4.9 points from the score, Bayes subtracted only 1.5, a net gain of 3.4 points.
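To put numbers on it (my own arithmetic, not anything SpamAssassin reports directly), those changes account for the entire gap between the two copies:

```python
# The two small differences, in points (figures from the rule hits above):
dcc_listed     = 1.0     # reported to DCC before the second copy arrived
no_cost_phrase = 1.7     # "no cost" / "no charge" body rule
bayes_mine     = -4.9    # Bayes at 0% spam probability (my copy)
bayes_other    = -1.5    # Bayes at a 1-9% spam probability (the other copy)

gap = dcc_listed + no_cost_phrase + (bayes_other - bayes_mine)
print(round(gap, 1))     # 6.1 -- exactly the spread between 6.4 and 0.3
```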
Of course, these differences wouldn’t have been an issue if the message hadn’t been formatted in a very spam-like way. Before Bayes made its adjustment, the copy sent to me had already racked up 5.2 points, just over the threshold. Another recipient apparently got the text-only version (it didn’t trip any HTML rules), and that copy’s final score was -1!
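Assuming nothing else differed between the two copies, the same handful of numbers reconstructs both scores from the bottom up and shows just how close my copy came to being tagged:

```python
threshold = 5.0

mine_before_bayes  = 5.2                         # spammy HTML formatting and other base rules
mine_final         = mine_before_bayes - 4.9     # Bayes at 0% spam -> 0.3, kept as ham

other_before_bayes = mine_before_bayes + 1.0 + 1.7   # + DCC + "no cost" -> 7.9
other_final        = other_before_bayes - 1.5         # Bayes at 1-9%    -> 6.4, tagged as spam

for label, score in (("my copy", mine_final), ("other copy", other_final)):
    verdict = "spam" if score >= threshold else "ham"
    print(label, round(score, 1), verdict)
```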