The Bayesian Spam Filtering Software
A Bayesian system consists of a database and an engine. The engine tests incoming emails against samples contained in the database and determines the probability that an email is spam. SpamAssassin then uses this probability to add to or subtract from the overall spamminess score applied to the email.
The database is populated by a learning process in which, over time, you identify what is ham and what is spam by providing samples. Signatures of these samples are then processed into tokens in the database for comparison against future incoming emails.
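To make the database-and-engine idea concrete, here is a minimal sketch in Python. It is not SpamAssassin's implementation: the `TinyBayes` class, its crude tokenizer, and the plain naive Bayes combining with add-one smoothing are all assumptions chosen for brevity, and real engines use more careful tokenizing and combining methods.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TinyBayes:
    """Hypothetical toy database plus engine, for illustration only."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam samples containing it
        self.ham_tokens = Counter()   # token -> number of ham samples containing it
        self.spam_count = 0           # spam samples learned so far
        self.ham_count = 0            # ham samples learned so far

    def learn(self, text, is_spam):
        """Add one sample to the database as either spam or ham."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_count += 1

    def spam_probability(self, text):
        """Estimate the probability that text is spam, assuming equal priors."""
        log_spam = 0.0
        log_ham = 0.0
        for token in set(tokenize(text)):
            # Add-one smoothing keeps unseen tokens from forcing a 0 or 1 result.
            p_spam = (self.spam_tokens[token] + 1) / (self.spam_count + 2)
            p_ham = (self.ham_tokens[token] + 1) / (self.ham_count + 2)
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
        # Clamp the log-odds so math.exp cannot overflow on very long messages.
        diff = max(-700.0, min(700.0, log_ham - log_spam))
        return 1.0 / (1.0 + math.exp(diff))


db = TinyBayes()
db.learn("cheap pills buy now", is_spam=True)
db.learn("buy cheap watches now", is_spam=True)
db.learn("meeting agenda for monday", is_spam=False)
db.learn("lunch on monday?", is_spam=False)
print(db.spam_probability("buy cheap pills"))
print(db.spam_probability("monday meeting"))
```

With only four samples the estimates are rough, but the first message leans clearly toward spam and the second toward ham. An empty database returns 0.5 for everything, which is one way to see why a minimum number of samples matters.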
This system allows for constant updating of what is spam and what isn't. That's important because spammers evolve new and clever ways to slip past single-layered spam-fighting strategies. And as spammers sign up new and different clients, or their campaigns change, new sample signatures will need to be learned.
There can be pitfalls in the use of Bayesian filtering. A badly trained or neglected Bayesian database will perform poorly, and can actually increase your workload by causing more false positives than the spam it recognises. This set of articles was written to advise on the best practices to follow to build an effective, healthy Bayesian database.
It's important to recognise that a Bayesian engine is not a blacklist. Feeding it a sample spam and expecting the next identical spam to be stopped is not going to happen, at least not on Bayesian scores alone. That's why spam identification is best done on the server side with a comprehensive layering of tools, each one best suited to one particular method of spam identification. Alas, many people try to force a Bayesian database into being a blacklist by dumping thousands of copies of the same email into the learning folder. This is not good, and you will find out why in other chapters of this comprehensive subject.
Any client-side add-on or module that requires you to identify emails you receive as spam ('This email is junk') is a Bayesian or Bayesian-like system. Proprietary algorithms may be used, but the idea is the same.
Although this article set focuses on server-side Bayesian systems, with particular emphasis on their use embedded in SpamAssassin, the principles described also apply to developing a healthy Bayesian database in client-side implementations.
Two folders are needed for training a Bayesian database: one for learning what is ham, and another for learning what is spam. Usually an installation will take these samples once a day and apply them to the Bayesian database for learning. Often this is configured to run during off-peak hours. But the command to learn the samples can be issued at any time, and SpamAssassin includes scripts for doing this.
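As a rough sketch of that daily learning run, the Python below builds the two commands for SpamAssassin's `sa-learn` tool, one per folder, using its `--ham` and `--spam` options. The folder paths and the function names are hypothetical, and the sketch defaults to a dry run that only prints the commands, so it is safe to try on a machine without SpamAssassin installed.

```python
import subprocess


def build_learn_commands(ham_dir, spam_dir):
    """Return the two sa-learn command lines, one per training folder."""
    return [
        ["sa-learn", "--ham", str(ham_dir)],
        ["sa-learn", "--spam", str(spam_dir)],
    ]


def run_learning(ham_dir, spam_dir, dry_run=True):
    """Print the learning commands, or execute them when dry_run is False."""
    commands = build_learn_commands(ham_dir, spam_dir)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if sa-learn exits non-zero.
            subprocess.run(argv, check=True)
    return commands


# Hypothetical folder locations; substitute the ones your installation uses.
run_learning("/var/spool/training/ham", "/var/spool/training/spam")
```

A scheduler such as cron could call a script like this during quiet hours, with `dry_run=False` once the paths are confirmed.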
SpamAssassin passes each email through the Bayesian engine and adds or subtracts some amount of score based on how the scoring is configured (we'll cover that on another page) and how confident the Bayesian engine is of the hamminess or spamminess of the email.
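The relationship between confidence and score can be pictured as a lookup from probability bands to score adjustments. The sketch below is only an illustration: the band boundaries and score values are invented for the example, and are not SpamAssassin's defaults.

```python
import bisect

# Hypothetical probability bands (upper bounds) and the score each one applies.
# A confident "ham" verdict subtracts from the total; confident "spam" adds to it.
BUCKET_UPPER_BOUNDS = [0.05, 0.40, 0.60, 0.95, 1.00]
BUCKET_SCORES = [-2.0, -0.5, 0.0, 2.0, 3.5]


def bayes_score(probability):
    """Map a spam probability in [0, 1] to an illustrative score adjustment."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Find the first band whose upper bound is >= probability.
    index = bisect.bisect_left(BUCKET_UPPER_BOUNDS, probability)
    return BUCKET_SCORES[index]


def total_score(other_rules_score, probability):
    """Combine the score from all other tests with the Bayesian adjustment."""
    return other_rules_score + bayes_score(probability)


print(total_score(4.0, 0.99))  # Bayesian engine is confident the email is spam
print(total_score(4.0, 0.01))  # Bayesian engine is confident the email is ham
```

The point of the design is that the Bayesian verdict is one contribution among several: an undecided result near 0.5 leaves the total unchanged, while a confident result can push a borderline email either way.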
As always, the headers of any email will reveal what the Bayesian engine thought of it. Always look at those headers and become familiar with them and what they mean. They contain a wealth of information, including the complete SpamAssassin scores and the Bayesian contribution to the overall score.
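For readers who prefer to inspect headers with a script, here is a small Python sketch that pulls the overall score and any Bayesian test names out of an `X-Spam-Status` header. The sample message is made up for illustration, and the exact header layout depends on how a given installation is configured, so treat the parsing patterns as assumptions to adapt.

```python
import re
from email import message_from_string

# Made-up sample message; real headers vary with the server's configuration.
SAMPLE_MESSAGE = (
    "From: sender@example.com\n"
    "Subject: Example\n"
    "X-Spam-Status: Yes, score=7.2 required=5.0\n"
    "\ttests=BAYES_99,HTML_MESSAGE,\n"
    "\tURIBL_BLOCKED autolearn=no\n"
    "\n"
    "Body text.\n"
)


def read_spam_status(raw_message):
    """Return the overall score and the BAYES_* tests named in X-Spam-Status."""
    message = message_from_string(raw_message)
    status = message.get("X-Spam-Status", "")
    # Rejoin a test list that was folded after a comma, then unfold the rest.
    status = re.sub(r",\s+", ",", status)
    status = " ".join(status.split())

    score_match = re.search(r"score=(-?\d+(?:\.\d+)?)", status)
    tests_match = re.search(r"tests=(\S*)", status)
    tests = tests_match.group(1).split(",") if tests_match else []

    return {
        "score": float(score_match.group(1)) if score_match else None,
        "bayes_rules": [name for name in tests if name.startswith("BAYES_")],
    }


print(read_spam_status(SAMPLE_MESSAGE))
```

A message without the header yields a score of `None` and an empty list, which makes it easy to spot mail that never passed through the filter.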
We'll discuss the 200-sample minimum and score configuration strategies on separate pages of this article set, along with best and worst practices for achieving an effective, healthy Bayesian database.