Bayesian Statistical Filtering

Introduction

Bayesian statistical filtering of email was popularized by Paul Graham in his August 2002 paper A Plan for Spam.  The basic idea is to classify email as spam or non-spam (also called "ham") by breaking each message down into components he calls tokens (words, IP addresses, and other significant features of the email) and then comparing those tokens against a database.  This database is simply a frequency table that counts how often each token has been seen in spam and in ham.
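The token database described above can be sketched in a few lines of Python. This is only an illustration, not Praetor's implementation: the tokenizer pattern, the `TokenDatabase` class, and its field names are all hypothetical, and the sketch assumes each token is counted at most once per message.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of letters, digits and a few symbols,
# so that words and dotted IP addresses both survive as single tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'.-]+")


def tokenize(text):
    """Return the set of distinct, lower-cased tokens in a message."""
    tokens = {t.strip(".'-").lower() for t in TOKEN_RE.findall(text)}
    tokens.discard("")  # drop matches that were pure punctuation
    return tokens


class TokenDatabase:
    """Frequency table: in how many spam and ham messages each token appeared."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, text, is_spam):
        """Record one training message as a spam or ham sample."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_messages += 1


db = TokenDatabase()
db.train("Get cash now from 10.0.0.1", is_spam=True)
db.train("Lunch at noon? Bring cash.", is_spam=False)
```

After these two calls, the token "cash" has been seen in one spam message and one ham message, and the IP address is stored as a single token on the spam side.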

Let us take a quantitative example with the word "cash".  If the database indicates that this word appeared in 200 out of 1000 spam messages and in only 3 out of 500 ham messages, then its spam ratio (its frequency in spam divided by the sum of its frequencies in spam and ham) is:

 

$$P = \frac{200/1000}{3/500 + 200/1000} = \frac{0.200}{0.006 + 0.200} \approx 0.971$$
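The same arithmetic can be written as a small Python function. The function and parameter names are illustrative, and the sketch assumes the token has been seen at least once, since otherwise the denominator is zero.

```python
def spam_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """Per-token spam ratio: spam frequency over (ham frequency + spam frequency)."""
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    return spam_freq / (ham_freq + spam_freq)


# The "cash" example: seen in 200 of 1000 spams and 3 of 500 hams.
p_cash = spam_ratio(200, 1000, 3, 500)
print(round(p_cash, 3))  # 0.971
```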

Now imagine you did this for all n tokens seen in the message under review and computed the ratios P1 ... Pn.  Using naive Bayesian statistics, you can compute an overall probability that the message is spam using the following equation:

 

$$\text{spam probability} = \frac{P_1 P_2 \cdots P_n}{P_1 P_2 \cdots P_n + (1 - P_1)(1 - P_2) \cdots (1 - P_n)}$$
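A minimal Python sketch of this combining formula follows. The function name and the sample ratios are made up for illustration. The sketch assumes the list does not contain both a ratio of exactly 0 and a ratio of exactly 1, which would make the denominator zero.

```python
import math


def combine(ratios):
    """Combine per-token spam ratios P1..Pn into one spam probability."""
    spam_product = math.prod(ratios)                 # P1 P2 ... Pn
    ham_product = math.prod(1 - p for p in ratios)   # (1 - P1)(1 - P2) ... (1 - Pn)
    return spam_product / (spam_product + ham_product)


# Two tokens that lean towards spam and one that leans towards ham.
score = combine([0.971, 0.80, 0.10])
```

With these inputs the two spam-leaning tokens outweigh the single ham-leaning one, so the score comes out close to 1. A list of entirely neutral ratios (all 0.5) yields exactly 0.5.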

 

While simplistic, this overview summarizes what Graham originally proposed.  The technique has since been optimized further.

 

Pre-configured token database

Praetor is installed with a default token database trained from several thousand samples, 85% of which are spam.  You therefore do not need to perform any initial bulk training; instead, you can train incrementally from this default starting point.  Since the database already contains plenty of spam examples, you should concentrate your efforts on training with false positives: good messages that were classified as spam or unsure.

(If you were using Praetor v1.5, please note that all our test sites have found that the Bayesian filtering method does a far better job than the many rules used previously.  Even better, less administration was needed to tweak the rules.)

To check whether Bayesian filtering is being performed, look at the system filters on the spam control screen.

If the BASIC pre-configured rule filter is checked, verify that Bayesian classifying is enabled (the default) on the Classify tab page.

When you want to capture some samples of spam for further training, you may enable the checkbox that saves any message classified as Unsure, i.e. any message whose "spamicity" (spam probability) falls between the two cut-off values on the Options tab page (the default range is 0.30 to 0.60).  If such "unsure" messages are saved, then they will appear in the category in the   node so that you may review them and queue them for training as good or bad samples.  How periodic incremental training is performed is described separately.
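The three-way decision implied by the two cut-off values can be sketched as follows. The function name, the parameter names, and the choice to treat a probability exactly on a cut-off as Unsure are assumptions for illustration. Only the default values 0.30 and 0.60 come from the text above.

```python
def classify(spam_probability, ham_cutoff=0.30, spam_cutoff=0.60):
    """Map a spam probability to "ham", "unsure" or "spam".

    Assumption: a value exactly equal to a cut-off counts as unsure.
    """
    if spam_probability < ham_cutoff:
        return "ham"
    if spam_probability > spam_cutoff:
        return "spam"
    return "unsure"


for probability in (0.10, 0.45, 0.95):
    print(probability, classify(probability))
```

Narrowing the gap between the two cut-offs shrinks the Unsure band, so fewer messages would be saved for review.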

Warning:

Save Unsure messages sparingly, because doing so can capture many message files, which can slow down the Windows operating system.  Depending on how much email traffic your site receives, you should do this only for brief periods.