Useful Bayesian points to keep in mind
Tokens expire
What does this mean? Each sample you provide to the Bayesian engine for learning becomes signature tokens in the Bayesian database. Tokens expire FIFO (First In, First Out) and are removed from the database, so a neglected Bayesian database will degrade over time: tokens from old spam campaigns don't match current ones, and effectiveness decreases.
Score adjusting strategy
Even with a good, healthy Bayesian database, the default rule scores can cause a good number of false negatives. It can be a good idea to test some different score strategies.
The following scores have proven to be very effective:
score BAYES_00 -1
score BAYES_01 -0.5
score BAYES_10 -0.25
score BAYES_20 0.001
score BAYES_30 0.5
score BAYES_40 1.5
score BAYES_44 2
score BAYES_50 2.5
score BAYES_56 3
score BAYES_60 3.5
score BAYES_70 4
score BAYES_80 9.2
score BAYES_90 12.3
score BAYES_99 17.5
Observe how the scoring is arranged, with low probability having negative values, high probability scores having a positive number, and mid range probability having a score near zero.
Adjusting your Bayesian scores in this way has proven to have a very positive effect on identifying spam without a noticeable increase in false positives (ham caught as spam). You can copy and paste the scores above into your SpamAssassin local.cf file.
Note: Always use local.cf to make any changes to SA scores and rules. The local.cf file is designed for this purpose. It gets loaded last, so its entries override all SA files loaded previously, and it will not be overwritten when you upgrade your SpamAssassin version.
Why don't scores always show up in headers?
The default Bayesian scores can have null values for the mid range, meaning that, for example, an email about which the Bayesian engine is unsure might not have any scoring value associated with it. Suppose the Bayesian engine determines a 50% probability that an email is spam. If the rule for 50% does not specify a score, it will not appear in the headers.
The best solution is to assign a minuscule value to these rules, one that will not contribute a significant amount to the spam score. With mid-range Bayesian scores assigned values such as .001 or -.001, the Bayesian score will always appear along with the other SpamAssassin scores in the log files and headers.
Why don't someone just gimme a database?
Starter databases are not readily available for download from websites. They are out there, but they can be difficult to locate. The reason this is not more widely done is a simple rule: one man's spam is another man's ham. Your company might be in a business that regularly deals with messages that look very spammy. Or perhaps the opposite extreme is true. If you receive a lot of trade journals or emails of that nature, a generic database might handle them poorly.
It's just not a good idea to use a starter database. Follow the steps in bayesian sample rules and build up to your 200 samples. It doesn't take that much time. Bayesian scoring will not be applied during that period, but your results will be far more accurate than with a starter database that doesn't match your business content.
But if you insist, some starter databases can be found here: http://www.fsl.com/support/
You can always query the database to see how many samples are present and the balance of ham to spam samples with the following command (don't forget to cd to your SpamAssassin/Bayes directory first):
sa-learn --dump magic
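If you only want the ham and spam sample counts, the output of that command can be filtered. The following is a minimal sketch, not part of SpamAssassin: the helper name bayes_counts is made up for illustration, and it assumes the summary lines end in "non-token data: nspam" and "non-token data: nham" with the count in the third column, which may differ between versions.

```shell
# Hypothetical helper: summarize ham/spam sample counts from the
# output of `sa-learn --dump magic`, read on standard input.
# Assumption: the count lines end in "non-token data: nspam" and
# "non-token data: nham", and the count is the third column.
bayes_counts() {
    awk '
        /non-token data: nspam$/ { spam = $3 }   # number of spam samples learned
        /non-token data: nham$/  { ham = $3 }    # number of ham samples learned
        END { printf "spam=%d ham=%d\n", spam, ham }
    '
}

# Usage (run where sa-learn can find your Bayes database):
#   sa-learn --dump magic | bayes_counts
```

Comparing the two numbers shows at a glance how balanced your training is and how far you are from the 200 samples mentioned above.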
You can diagnose problems with SpamAssassin by running it in debug mode with the following syntax:
spamassassin -D --lint