Where to get Bayesian Ham samples
A similar problem exists for collecting ham samples. See bayesian sample rules to learn why you must have ham samples as well as spam samples. Hams get delivered to the end users, who are less compelled to supply you with ham samples, since ham is what they expect to get.
Configuration-wise, this is an easier problem to solve. Any decent spam filter or mail server provides server-side rules. With a full complement of conditional statements and action statements, you can easily create a rule that looks at the spam score of an email and copies it to a folder for ham samples if its score is below X points.
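As a rough illustration of such a rule, the sketch below reads a spam score from a message header and decides whether the message qualifies as a ham sample. The header name X-Spam-Score, the function name is_ham_sample, and the cutoff passed in are assumptions made for this example only; your filter will have its own rule language and its own way of exposing the score.

```python
from email import message_from_string

# Hypothetical header name; real filters record the score in their own header.
SCORE_HEADER = "X-Spam-Score"


def is_ham_sample(raw_message: str, max_score: float) -> bool:
    """Return True if the message's spam score is below max_score (the "X points" cutoff)."""
    msg = message_from_string(raw_message)
    value = msg.get(SCORE_HEADER)
    if value is None:
        return False  # no score recorded, so do not treat the message as a sample
    try:
        score = float(value.strip())
    except ValueError:
        return False  # score could not be parsed as a number
    return score < max_score
```

In a real deployment the "copy to ham samples folder" action would be carried out by the mail server's rule engine; this sketch covers only the condition.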
This ham samples folder should also be an IMAP folder with only administrator ACL rights. Then, just as with the spam samples IMAP folder outlined in the bayesian spam samples article, the administrator can select ham samples to balance out the spam learning samples. This balancing is a requirement, as outlined in bayesian sample rules.
There may be privacy issues to consider with this, but for the most part the administrator is the one person who already has access to the system and its store of data and emails. So having the administrator look through hams is probably not a privacy issue for most, but do be aware of this before implementing it.
On the other hand, if you are already archiving emails, you have a ready-made source of clean hams to balance out the spam samples you will be feeding to the bayesian engine for learning.
In selecting your ham samples, you might want to select those that scored rather high, near the threshold of being stopped as spam. But how can you easily select which ones those are without opening each one? Again, your server-side filtering rules can come into play.
Let's suppose your spam / ham threshold is 6. Anything with a spam score of 6 and above is considered spam, while anything scoring less than 6 is considered ham. The filter rule you use to collect ham samples could say something to the effect of (pseudo code): if score < 5 and score > 3 then copy to hamsamples folder. With your ham sample rule written this way, you are saying to skim only the higher scoring hams, those scoring between 3 and 5, as your ham samples. This guarantees that your ham samples are high scoring.
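The pseudo code above can be sketched in Python as follows. This is only an illustration of the score band, not any particular filter's rule syntax; the names BAND_LOW, BAND_HIGH, in_ham_sample_band and select_ham_samples are made up for the example, and the input is assumed to be a list of (message id, score) pairs.

```python
SPAM_THRESHOLD = 6.0  # scores at or above this are treated as spam
BAND_LOW = 3.0        # lower edge of the ham sample band (exclusive)
BAND_HIGH = 5.0       # upper edge of the ham sample band (exclusive)


def in_ham_sample_band(score: float, low: float = BAND_LOW, high: float = BAND_HIGH) -> bool:
    """Mirror of the rule: score > 3 and score < 5."""
    return low < score < high


def select_ham_samples(scored_messages):
    """Return the ids of messages whose score falls inside the band.

    scored_messages is an iterable of (message_id, score) pairs (an assumed shape).
    """
    return [mid for mid, score in scored_messages if in_ham_sample_band(score)]


scored = [("a", 1.2), ("b", 3.5), ("c", 4.9), ("d", 5.0), ("e", 6.3)]
print(select_ham_samples(scored))  # ['b', 'c']
```

Note that with the rule written this way, a message scoring exactly 5.0, or anywhere from 5 up to the threshold of 6, is still ham but is not collected as a sample.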
If you are using your archive folder for ham samples, implement a rule as described above, and keep the archive folder and the ham samples folder separate. Just clear out the ham samples folder after each learning session.