
These parameters control the statistical computations used by Praetor.
|
Minimum deviation |
Each token has a frequency in good messages and a frequency in spam. The difference between the two, expressed as a percentage, must exceed this value for the token to count toward the overall computation of the message's spam probability index, or spamicity. A small difference means the token was seen almost as many times in good messages as in spam, so it can be ignored. Paul Graham's original proposal used only the top 15 or 20 tokens to compute the spamicity value. Praetor takes a different approach and uses a minimum difference to weed out the unimportant tokens. The default value is 0.1.
|
|
X coefficient |
This parameter is the value assigned to a token that is encountered for the very first time. Essentially, it represents a "first guess" at the token's contribution to the overall spamicity score. The value should be close to 0.5; the default is 0.415.
|
|
S coefficient |
If a token is encountered for the very first time, we only have the X coefficient to use, but what if we have seen the token before only a few times? Because of statistical variation, the ratio of two small numbers is rather unreliable, so a compromise is needed between that ratio and the first-guess value X. The S coefficient serves as a weighting factor: the larger it is, the greater the importance given to X when the token counts are low. Setting S too low causes trouble when a token has been seen in spam but never in ham, or vice versa, because that thin evidence is then taken almost at face value. Choosing this value is largely a matter of trial and error, but studies have shown that it should be in the range of 0.01 to 0.1. In one test on a spam sample where ten out of 78 tokens contributed to the calculation, the computed spamicity changed from 0.999 to 0.505 when S went from 0.001 to 0.00000001! The default is 0.01.
|
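
The interaction of the three parameters can be sketched in a few lines of Python. This is an illustration only, not Praetor's actual code. It assumes the weighting commonly attributed to Gary Robinson, f = (S*X + n*p) / (S + n), where n is the number of times the token has been seen and p is the raw spam ratio. It also assumes that "minimum deviation" means the distance of the token's probability from the neutral value 0.5, which is one plausible reading. The function names and the use of raw counts instead of normalized frequencies are hypothetical simplifications.

```python
def token_probability(spam_count, ham_count, x=0.415, s=0.01):
    """Blend the observed spam ratio with the first guess X.

    Assumed form (not confirmed for Praetor):
        f = (s * x + n * p) / (s + n)
    where n is the total number of sightings and p the raw spam ratio.
    """
    n = spam_count + ham_count
    if n == 0:
        # Token never seen before: only the first guess X is available.
        return x
    p = spam_count / n  # raw ratio, unreliable when n is small
    return (s * x + n * p) / (s + n)


def is_significant(prob, min_deviation=0.1):
    """Keep a token only if it is far enough from the neutral 0.5.

    Assumed interpretation of the "minimum deviation" parameter.
    """
    return abs(prob - 0.5) > min_deviation


if __name__ == "__main__":
    # A token seen three times, always in spam, under different S values.
    for s in (0.01, 0.1, 1.0):
        print(s, token_probability(3, 0, s=s))
    # An unseen token gets X and, under these assumptions, is ignored.
    print(is_significant(token_probability(0, 0)))
```

With a larger S, the token seen three times in spam is pulled further toward X. With a very small S, the same token is scored almost exactly 1.0 on the strength of only three sightings.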