spam filtering software

 

Effective rules for training a Bayesian database

A Bayesian system of spam recognition can be very effective. It works like this:

You supply samples of both spam emails and ham (legitimate) emails for the Bayesian system to learn from. Learning here means that it gradually becomes able to recognise what is spam and what is ham based on the signature, or appearance, of the samples you provide.

Once a Bayesian database has learned a minimum number of samples (200), it has a statistically significant number of samples and starts grading incoming emails. Based on how confident the Bayesian engine is that an email is ham or spam, a score is applied to the email. The score is negative or low if the email is recognized as ham, and positive if the email looks like spam. These scores can be adjusted, but deciding to modify the stock scores is a separate strategy of its own.

Bayesian database development is an ongoing task. There is never a time to stop the training process. As spammers evolve or their campaigns change, so will the signatures of their spam, and you will need to keep one step ahead of them.

There are a handful of rules you should strive to follow when training your Bayesian database:

  1. 200 samples
  2. Don't overdo it
  3. Keep an even hand


1. As mentioned already, your Bayesian engine will not contribute any scoring until it has been fed at least 200 samples each of spam and ham.

2. Don't overdo it. Some people feel compelled to dump hundreds or even thousands of spams into their Bayesian learning folder. A Bayesian engine is not a blacklist, and that is important to realize. It only offers an opinion on the spamminess of an email signature. Extensive testing indicates that once your Bayesian database is performing fairly well, best practice is to feed it only a small number of samples on a daily or near-daily basis. That means 3 or 4 samples every day or so.

Why is it important to avoid overdoing it? When you go on a learning binge, over time you build up so much noise relative to signal that your Bayesian database actually decreases in effectiveness. So keep it simple, don't overwork yourself, and don't overwhelm your Bayesian database.

3. Keep an even hand. This is probably the most important rule of all. It may be tempting to cut corners and feed your Bayesian engine only spam. If you are going to do this, then stop using a Bayesian engine altogether, because you're wasting your time. Training a Bayesian database is like training a dog. You can't just tell it what spam looks like; you must also tell it what ham looks like. If you fail to feed your Bayesian engine a comparable amount of ham, it will eventually think that everything looks like spam, and you will increase your administrative workload by having to hunt down false positives.

Always feed your Bayesian database roughly equal amounts of both spam and ham. You're not doomed if the amounts are not exactly equal; they only need to be roughly equal. The simple message here is: don't concentrate only on learning spam (or only ham, for that matter), but learn roughly equal amounts of both.
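As a rough illustration of rules 2 and 3 together, here is a minimal sketch of a small, balanced training run. It assumes one message per file and two folders of hand-sorted samples. The function name `train_balanced`, the folder arguments and the `SA_LEARN` override are all made up for this example; only `sa-learn --spam` and `sa-learn --ham` come from the real tool.

```shell
# Learn the same number of spam and ham messages in one small batch.
# Assumptions (not from the article): one message per file, the caller
# supplies the two folders, and SA_LEARN may be set (e.g. SA_LEARN=echo)
# to do a dry run instead of calling the real sa-learn.
train_balanced() {
  spam_dir="$1"
  ham_dir="$2"
  count="${3:-4}"   # samples of EACH kind; the article suggests 3 or 4

  # Count the available samples (tr strips padding that some wc builds add).
  n_spam=$(find "$spam_dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
  n_ham=$(find "$ham_dir" -maxdepth 1 -type f | wc -l | tr -d ' ')

  # Never learn more of one kind than is available of the other.
  if [ "$n_spam" -lt "$count" ]; then count="$n_spam"; fi
  if [ "$n_ham" -lt "$count" ]; then count="$n_ham"; fi
  if [ "$count" -eq 0 ]; then
    echo "nothing to learn: need at least one spam and one ham sample" >&2
    return 1
  fi

  for kind in spam ham; do
    if [ "$kind" = spam ]; then dir="$spam_dir"; else dir="$ham_dir"; fi
    # Take the first $count files (sorted by name) and learn each one.
    find "$dir" -maxdepth 1 -type f | sort | head -n "$count" |
      while IFS= read -r msg; do
        ${SA_LEARN:-sa-learn} "--$kind" "$msg" < /dev/null
      done
  done
}
```

A dry run such as `SA_LEARN=echo train_balanced ~/samples/spam ~/samples/ham 4` (the paths are placeholders) prints the commands that would be run, which is a cheap way to check the batch before touching the real database.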

So remember: you won't see any contribution from your Bayesian engine until it has learned 200 samples. Don't overfeed your Bayesian engine; once it is performing nicely, 3 to 5 samples per day or every other day is good. And always feed it roughly equal amounts of both ham and spam.

Also, you can always query the database to see how many samples are present, and the balance of ham to spam samples, with the following command:

sa-learn --dump magic
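
The output of that command can be condensed with a small helper. The sketch below is only an illustration: `bayes_summary` is a made-up name, and it assumes the usual dump layout in which the third column holds the value and the relevant lines end in `nspam` and `nham`. The default minimum of 200 is taken from rule 1 above.

```shell
# Summarise `sa-learn --dump magic` output read from standard input.
# Assumption: the third whitespace-separated column is the value and the
# last word on the line is "nspam" or "nham" (typical dump layout).
bayes_summary() {
  min="${1:-200}"   # minimum samples of each kind before scoring starts
  awk -v min="$min" '
    $NF == "nspam" { nspam = $3 }
    $NF == "nham"  { nham  = $3 }
    END {
      printf "spam=%d ham=%d\n", nspam, nham
      if (nspam + 0 >= min + 0 && nham + 0 >= min + 0)
        print "ready: yes"
      else
        print "ready: no"
    }'
}
```

Typical use would be `sa-learn --dump magic | bayes_summary`. If the two counts drift far apart, that is a hint to feed more of the under-represented kind.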