One of the things you learn when you analyze spam texts is how narrow a subset of the language spammers operate in. It’s that fact, together with the equally characteristic vocabulary of any individual user’s mail, that makes Bayesian filtering a good bet.
This whole article is a great read, even if you don’t understand the code fragments (I don’t, but the text makes it clear enough). Mozilla 1.3 now has a Bayesian filter based on Graham’s ideas.
What struck me about this was the notion of vocabulary shallowness as a filtering strategy. I was catching up on my reading last week (doctors’ offices are great for that) and there was an article about the genesis of The Cat in the Hat and the simultaneous demise of Dick and Jane. The problem with Dick and Jane, as explained by Rudolf Flesch, was that those books introduced words as whole words, rather than as phonemes or sounds: the expectation was that children would learn to recognize words without knowing the sounds that made them up. The success of this approach is illustrated by the title of Flesch’s book: Why Johnny Can’t Read.
(Flesch is also known for the Flesch Reading Ease score, a measure of the readability of a document based on sentence length and syllables per word. The similar “Fog Index” is Robert Gunning’s.)
Dr. Seuss was commissioned to write a book using some of the simple lists of words Flesch determined contained the core phonemes: it took him a year and 222 words. Not only were The Cat in the Hat and its successor monstrous successes, but a whole line of books, Beginner Books, was created to keep ’em coming.
This idea of being able to filter out junk email based on the message’s lack of vocabulary, either inherent or driven by formula, is fascinating. I’ll be testing this to see how it works.
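For anyone who wants to poke at the idea, here is a minimal sketch of word-based Bayesian scoring in Python. It is not Graham’s exact formula and not what Mozilla ships: it is a simplified naive Bayes score over the distinct words in a message, assuming equal odds of spam and non-spam to begin with and add-one smoothing for words it has rarely seen. The function names and the tiny sample messages are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def train(spam_msgs, ham_msgs):
    """Count how many spam and how many good messages contain each word."""
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokenize(m))
    return spam_counts, ham_counts, len(spam_msgs), len(ham_msgs)


def spam_probability(text, model):
    """Estimate the probability that text is spam (simplified naive Bayes)."""
    spam_counts, ham_counts, n_spam, n_ham = model
    # Assumption: spam and good mail are equally likely before we look.
    log_spam = math.log(0.5)
    log_ham = math.log(0.5)
    for word in tokenize(text):
        # Add-one smoothing keeps unseen words from zeroing out the score.
        log_spam += math.log((spam_counts[word] + 1) / (n_spam + 2))
        log_ham += math.log((ham_counts[word] + 1) / (n_ham + 2))
    # Turn the two log scores back into one probability, clamping the
    # difference so math.exp cannot overflow on long messages.
    diff = max(min(log_ham - log_spam, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))


# Made-up training mail, purely for illustration.
spam = ["buy cheap pills now", "cheap pills free offer", "free offer click now"]
ham = ["lunch meeting moved to noon", "draft of the report attached",
       "see you at the meeting"]
model = train(spam, ham)

print(spam_probability("cheap pills offer", model))
print(spam_probability("report for the meeting", model))
```

The narrower the spammers’ vocabulary, the more lopsided the per-word counts become, which is why a scheme this simple has a chance of working; a real filter would train on far more mail and treat headers and punctuation with more care.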
I know Apple’s Mail.app has a teachable filter in it, but it doesn’t seem to work all that well: we’ll have to examine Mozilla’s take on it.