Comments

There are some weirdnesses with SpamBayes. It depends a bit on how you train. Your ALL-CAPS SUBJECTS don't get caught because the tokenizer folds case, so an all-caps word ends up as the same token as its lowercase form. Check tokenizer.py for a huge comment at the top discussing this and other aspects of tokenization.

Also, as SpamBayes is statistical, single words don't carry a lot of weight in themselves (unless they have been seen in a lot of one type of mail). So the mere fact that a message mentions "Nigeria" isn't enough, especially if many of the other words are ham clues.
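
To illustrate why one spammy word can be outvoted, here is a minimal sketch that combines per-token spam probabilities in the simple naive-Bayes style. This is an assumption made for illustration, not the combining scheme SpamBayes actually uses, and the clue values and the `combine` helper are hypothetical.

```python
from math import prod  # Python 3.8+


def combine(probs):
    """Combine per-token spam probabilities, naive-Bayes style.

    Illustrative only: a hypothetical helper, not SpamBayes's own method.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Made-up clue values: one spammy word among three hammy words.
clues = [0.9, 0.2, 0.2, 0.2]
score = combine(clues)
print(f"combined spam score: {score:.2f}")
```

With these made-up numbers the combined score lands well below 0.5, even though one clue on its own looks strongly spammy.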

I don't know what interface to SpamBayes you use (the POP3 proxy?) but most of them have some form of "show the clues for a message" gizmo. That can often be illuminating to look at...
posted by Paul Moore at 08:41:13 AM on February 14, 2004
I use K9. It doesn't fold case, which is why it may be better at recognizing such e-mails. Even so, traditionally it's had problems with the 419 scams, and the most recent one I got had a low, though spammy, rating. I think those types of spams are just harder to recognize since they often say things in such a roundabout way ("I am writing to you in view of the fact that we would be of great assistance to each other like developing acordial relationship.")

I just checked my spam ratings for that recent e-mail, and even the word Nigerian was more hammy than spammy, probably from other e-mails talking about the 419 scams. But for an example of how being case-sensitive is useful, the word "Dollars" was one of the most spammy words in the e-mail (99% spammy, with 118 occurrences in my spam db, 0 in my good db), while "dollars" only had a spam rating of 60.3%. For whatever reason, one of the most recognizable spammy indicators is the version of Outlook Express they use (or that their bulk mailing software claims to be).
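
As a rough sketch of the case-sensitivity point: the snippet below counts tokens with and without case folding, then scores a token from its spam and ham counts. The formula is a simple clamped ratio assumed for illustration (it is not claimed to be K9's actual formula), `token_spam_prob` is a hypothetical helper, and the corpus sizes of 1000 messages each are made up. Only the 118 and 0 counts come from the numbers above.

```python
from collections import Counter


def token_spam_prob(spam_count, ham_count, nspam, nham):
    """Clamped per-token spam probability (illustrative formula only)."""
    s = spam_count / nspam
    h = ham_count / nham
    if s + h == 0:
        return 0.5  # token never seen: no evidence either way
    p = s / (s + h)
    return min(0.99, max(0.01, p))


# Case-sensitive tokenizing keeps "Dollars" apart from "dollars";
# folding case merges every variant into a single token.
tokens = "Send Dollars now DOLLARS dollars".split()
case_sensitive = Counter(tokens)
folded = Counter(t.lower() for t in tokens)
print(case_sensitive["Dollars"], folded["dollars"])

# 118 spam occurrences and 0 ham occurrences, as reported above;
# the corpus sizes (1000 each) are hypothetical.
capitalized = token_spam_prob(118, 0, nspam=1000, nham=1000)
print(capitalized)
```

A case-folding tokenizer would lump the capitalized occurrences in with the lowercase ones, so the strong signal from "Dollars" would be diluted by however "dollars" shows up in good mail.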

Also, re: "Career News"... K9 lets me use a blacklist so I can compensate if there's ever anything it consistently misses. For a long time it had a lot of trouble recognizing these "stock alert" e-mails from "OTCBB". Now it consistently gets them, even though they seem to have stopped using the domains (otcbb.com and otcbfirstalert.com) I put in my blacklist :)

What kind of accuracy ratings are you getting with SpamBayes? For me, K9 is almost 99.5% accurate over the past ~4600 e-mails, and 100% over the past 630.
posted by Keith at 12:43:13 PM on February 14, 2004
Oops, correction, it was otcfirstalert.com.
posted by Keith at 12:45:18 PM on February 14, 2004
Check your mail headers. Some of the spam that got past my SpamBayes had this in the headers:

X-Spambayes-Exception: Traceback (most recent call last):
. File "C:\Python23\Scripts\sb_server.py", line 438, in onRetr
. msg.setPayload(messageText)
. File "c:\python23\Lib\site-packages\spambayes\message.py", line 231, in setPayload
. prs._parsebody(self, fp)
. File "C:\Python23\lib\email\Parser.py", line 239, in _parsebody
. msgobj = self.parsestr(part)
. File "C:\Python23\lib\email\Parser.py", line 75, in parsestr
. return self.parse(StringIO(text), headersonly=headersonly)
. File "C:\Python23\lib\email\Parser.py", line 64, in parse
. self._parsebody(root, fp, firstbodyline)
. File "C:\Python23\lib\email\Parser.py", line 245, in _parsebody
. raise Errors.BoundaryError(
.BoundaryError: multipart message with no defined boundary
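
For anyone who wants to check their own mail for this kind of malformed message, here is a small sketch using the email package in a current Python 3. It is not what SpamBayes itself does. The current parser records the problem as a defect on the message instead of raising, and the sample message is made up.

```python
import email
from email import errors

# A made-up message that claims to be multipart but defines no boundary.
raw = (
    "From: sender@example.com\n"
    "Subject: test\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "body text with no boundary markers\n"
)

msg = email.message_from_string(raw)

# The parser does not raise here; it attaches defects to the message.
defect_names = [type(d).__name__ for d in msg.defects]
print(defect_names)

has_no_boundary = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect) for d in msg.defects
)
print("missing boundary:", has_no_boundary)
```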
posted by AC at 11:12:22 AM on February 16, 2004