spam filtering software

 

Where to get Bayesian Spam samples

Several problems may have become obvious. Where do we get samples of both ham and spam to feed to the Bayesian engine? If a spam gets through our system, it is delivered to the end user, and the administrator may no longer have access to it, depending on how the end user collects his email.

Often, people will forward these spams to the administrator for learning. The problem is that when a spam is forwarded to anyone, including the administrator, many of the email's headers are rewritten or replaced, rendering it useless as a sample.
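To make the header problem concrete, here is a minimal sketch in Python that compares an original message with a forwarded copy. Both messages, all addresses, and the function name headers_lost_in_forwarding are invented for illustration; real clients differ in exactly which headers they rewrite.

```python
from email import message_from_string

# Hypothetical spam as it arrived at the server.
ORIGINAL = """\
Received: from mail.example.net (unknown [203.0.113.7]) by mx.example.org
From: "Cheap Pills" <promo@example.net>
To: user@example.org
Subject: Buy now
Message-ID: <abc123@example.net>
X-Mailer: BulkSender 1.0

Click here.
"""

# The same spam after the user forwarded it to the administrator.
FORWARDED = """\
Received: from desk.example.org (desk.example.org [192.0.2.10]) by mx.example.org
From: user@example.org
To: admin@example.org
Subject: Fwd: Buy now
Message-ID: <def456@example.org>

---- Forwarded message ----
Click here.
"""


def headers_lost_in_forwarding(original_text, forwarded_text):
    """Return the original header names whose values changed or vanished."""
    original = message_from_string(original_text)
    forwarded = message_from_string(forwarded_text)
    return sorted(
        name
        for name in set(original.keys())
        if original.get_all(name) != forwarded.get_all(name)
    )


print(headers_lost_in_forwarding(ORIGINAL, FORWARDED))
```

In this made-up pair, every header a filter could learn from now describes the forwarding user rather than the spammer.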

One way around this is to provide an IMAP folder with default ACL rights sufficient for all users to copy their spams into. Doing this maintains the integrity of the sample. Another, less workable option is for the administrator to share out a public folder on the file system. Some email clients allow users to drag emails onto an Explorer folder, but not the more popular ones, so the public IMAP folder option is preferred.
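As a rough sketch of the public IMAP folder approach, the following Python uses the standard imaplib module to create a folder and grant rights on it. It assumes the server supports the IMAP ACL extension and the "anyone" identifier; the folder name, the rights string "li" (lookup and insert), and the function name publish_spam_dropbox are illustrative choices, not settings taken from any particular server.

```python
import imaplib


def publish_spam_dropbox(conn, folder="Public/SpamSamples", rights="li"):
    """Create `folder` and let every user copy messages into it.

    `conn` is an already authenticated imaplib.IMAP4 (or IMAP4_SSL) object.
    Assumed rights: "l" makes the folder visible, "i" allows COPY/APPEND
    into it. Some clients may also need "r" before they will show the folder.
    """
    # A "NO" reply here usually means the folder already exists, so it is
    # ignored; a real problem will surface in the SETACL step below.
    conn.create(folder)

    typ, data = conn.setacl(folder, "anyone", rights)
    if typ != "OK":
        raise RuntimeError(f"SETACL failed for {folder!r}: {data!r}")
    return folder


# Usage (hypothetical host and credentials):
#   conn = imaplib.IMAP4_SSL("mail.example.org")
#   conn.login("admin", "secret")
#   publish_spam_dropbox(conn)
#   conn.logout()
```

The hierarchy separator ("/" here) varies between servers, so the folder name will likely need adjusting.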

An additional benefit of publishing a public IMAP folder is that the administrator can use any IMAP client to view the samples his users have provided, meaning he can view the subject and other header information more easily than if the emails were flat files in a regular folder.
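Any IMAP client will do for browsing the submissions, and the same view can be scripted. The sketch below, again using Python's imaplib, lists the sender and subject of each submitted sample. The folder name and the function name list_sample_headers are assumptions made for illustration.

```python
import imaplib
from email import message_from_bytes


def list_sample_headers(conn, folder="Public/SpamSamples"):
    """Return (sequence number, From, Subject) for every message in `folder`.

    `conn` is an already authenticated imaplib.IMAP4 (or IMAP4_SSL) object.
    """
    # Open read-only so that looking at samples never changes their flags.
    typ, data = conn.select(folder, readonly=True)
    if typ != "OK":
        raise RuntimeError(f"cannot open {folder!r}: {data!r}")

    typ, data = conn.search(None, "ALL")
    if typ != "OK":
        raise RuntimeError(f"SEARCH failed: {data!r}")

    summaries = []
    for num in data[0].split():
        # Fetch only the two headers of interest, without marking as seen.
        typ, parts = conn.fetch(num, "(BODY.PEEK[HEADER.FIELDS (FROM SUBJECT)])")
        if typ != "OK":
            continue
        for part in parts:
            if isinstance(part, tuple):  # (envelope info, raw header bytes)
                msg = message_from_bytes(part[1])
                summaries.append(
                    (num.decode(), msg.get("From", ""), msg.get("Subject", ""))
                )
    return summaries
```

Encoded (non-ASCII) subjects are returned as they appear on the wire; decoding them is left out to keep the sketch short.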

At this point, you might be tempted to simply publish the Bayesian learning folders themselves with IMAP ACL permissions so that users can feed the Bayesian system their spams directly. Do not do this. Samples for Bayesian learning should be selected by an administrator, for a number of reasons:

  • Liability can arise if sensitive eyes are allowed to read the spams collected.
  • One man's ham is another man's spam. The administrator must make this decision.
  • Allowing users to affect what is caught as spam and what is not will result in more complaints and wasted hours for the administrator.
  • Samples must be fed to the Bayesian learning process carefully and selectively; the system should not be flooded with samples from everyone.

Do not allow end users to directly supply samples to a server-side Bayesian system. Outfit their clients with one of the countless client-side products instead.

OK, you've got an IMAP folder with public ACL rights, and end users are supplying any spams that they receive. You, the administrator, use your IMAP client to view this folder, select samples for learning, and move them into your Bayesian learning folder for spam samples. Again, follow the Bayesian sample rules. Once you've got your samples for this session, go ahead and empty the folder containing all the users' contributions.
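The session described above can also be scripted once the administrator has decided which messages to keep. The following is a minimal sketch using Python's imaplib; both folder names and the function name harvest_samples are hypothetical, and choosing the UIDs remains a manual decision made by the administrator.

```python
import imaplib


def harvest_samples(conn, selected_uids,
                    public_folder="Public/SpamSamples",
                    learn_folder="Learn/Spam"):
    """Copy the chosen samples to the learning folder, then empty the
    public folder. Returns how many messages the public folder held.

    `conn` is an already authenticated imaplib.IMAP4 (or IMAP4_SSL) object;
    `selected_uids` are the IMAP UIDs the administrator picked by hand.
    """
    typ, data = conn.select(public_folder)
    if typ != "OK":
        raise RuntimeError(f"cannot open {public_folder!r}: {data!r}")
    message_count = int(data[0])  # SELECT reports the number of messages

    if selected_uids:
        uid_set = ",".join(str(uid) for uid in selected_uids)
        typ, data = conn.uid("COPY", uid_set, learn_folder)
        if typ != "OK":
            # Stop before deleting anything if the copy did not succeed.
            raise RuntimeError(f"COPY to {learn_folder!r} failed: {data!r}")

    if message_count:
        # Flag every contribution as deleted, then remove them for good.
        conn.store("1:*", "+FLAGS", r"(\Deleted)")
        conn.expunge()
    return message_count
```

Copying before deleting means a failed copy leaves the users' contributions in place for another attempt.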

If your end users are configured to use the POP3 protocol to fetch emails, then you have a problem: POP3 does not support public folders, so they cannot submit samples. If this is the case, you might want to set up a filter rule to collect high-scoring hams. High-scoring hams are often actually spams that scored just below your filtering threshold. Write a rule that copies these into a folder before they are delivered to the end user, and you have your spam samples.
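How such a rule is written depends entirely on the filtering product in use. As one hedged illustration, the Python sketch below assumes the filter stamps each message with a SpamAssassin-style header such as "X-Spam-Status: No, score=4.2 required=5.0". The header format, the margin of 2.0 points, and the function name is_near_miss are all assumptions; adapt them to whatever your filter actually emits.

```python
import re
from email import message_from_string

# Assumed header format: "No, score=4.2 required=5.0 tests=..."
_STATUS_RE = re.compile(
    r"^(Yes|No),\s*score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def is_near_miss(raw_message, margin=2.0):
    """Return True for a message classified as ham whose score is within
    `margin` points of the spam threshold (a "high-scoring ham")."""
    msg = message_from_string(raw_message)
    # Collapse folded header lines into a single line before matching.
    status = " ".join(msg.get("X-Spam-Status", "").split())
    match = _STATUS_RE.match(status)
    if match is None:
        return False  # not scored, or an unfamiliar header format
    verdict = match.group(1).lower()
    score = float(match.group(2))
    required = float(match.group(3))
    return verdict == "no" and score >= required - margin
```

A delivery rule would copy any message for which is_near_miss returns True into a review folder and then deliver the original as usual. The copies are candidates only, and the administrator still picks which ones to learn from.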