Linux, Unix, /etc

Danger Will Robinson! You are now entering a condescending Unix user zone!
Spam-filtering

"Spam belt tightening done badly rejects legitimate mail", says a recent article on Advogato, the collaborative weblog for free software developers. Well, it's already happened! I can't mail AOL users, for instance. AOL refuse to accept e-mail direct from my mail server. They require me to go through my ISP's server. This is A Bad Thing for many reasons: a) how do they know my ISP provides such a server? b) it breaks the Internet; c) limiting general freedoms purportedly to combat a specific evil is an old, old trick.

I thought it was worth outlining a better way of dealing with spam: specifically, how I deal with it. I have two lines of defence.

1. blockmail

I run sendmail on my Internet gateway. Sendmail has a simple method of access control, configured in the file /etc/mail/access. This is a list of users and/or hostnames and/or networks that may be treated in various ways. In my case, I use it simply as a "reject list", to block certain hosts and networks from sending e-mail to my machine at all. Clearly, this is a facility that must be used with some care, or we shall end up blocking legitimate e-mail, just as these damned blacklists, the likes of ORBS, do.

Here's the shell script "blockmail":

#!/bin/sh
# blockmail: add new reject rules to /etc/mail/access
# takes a list of domains and/or e-mail addresses as arguments
for i
do
    # match the entry followed by any run of spaces or tabs
    if grep -q "^$i[[:space:]][[:space:]]*REJECT" /etc/mail/access
    then
        echo "$i already in /etc/mail/access; skipping" >&2
    else
        printf '%s\tREJECT\n' "$i" >> /etc/mail/access
    fi
done
# rebuild access.db so that sendmail sees the new entries
newaccess

Note the use of newaccess (a symlink to sendmail) to build the access.db file, a db(3) equivalent of the access file that speeds up lookups when the file is large. The limitation of the access file is that, for addresses and domains, it only looks at the sender address the remote server supplies, which these days is usually (but not always) forged.
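
One thing to watch when checking the access file with grep is that the dots in a domain name are regular-expression wildcards, so a lookup can match more than was intended. Here is a minimal sketch of an exact lookup, assuming each entry has its key in the first whitespace-separated field and its action in the second. The function name is_rejected and its optional file argument are my own inventions for illustration, not part of sendmail.

```shell
#!/bin/sh
# is_rejected KEY [FILE]: succeed if KEY has a REJECT entry in the access file.
# Hypothetical helper; FILE defaults to /etc/mail/access.
is_rejected() {
    key=$1
    file=${2:-/etc/mail/access}
    # Compare fields as plain strings, so "." in a domain is not a wildcard.
    # Lines starting with "#" are treated as comments and skipped.
    awk -v key="$key" '
        /^#/ { next }
        $1 == key && $2 == "REJECT" { found = 1 }
        END { exit !found }
    ' "$file"
}
```

For example, "is_rejected spammer.example && echo blocked" (spammer.example being a made-up name). If your system has no newaccess, the usual way to rebuild the database on a stock sendmail is "makemap hash /etc/mail/access < /etc/mail/access".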

2. bfilter

So, most spam gets through. Then what? Well, before my e-mail reaches the stage where I actually have to read it, it goes through another process, being filtered through a very useful little program called procmail. So, over the years, I took the approach of adding rules to my procmail configuration file, .procmailrc, to catch spam e-mails, storing them in a folder that I checked periodically for "false positives". However, spam changes, and keeping up with it sufficiently to catch most of it is not a trivial task. Enter Bayesian filtering...

Now, Bayesian analysis seemed just the trick, but finding a decent simple Unix filter to do it wasn't as easy as I'd thought. I eventually discovered a C program written by Chris Lightfoot that suits me. With bfilter, there's no need for some outlandish perl setup or python or v2.3.5.6 of this AND version 7.9.9 of this OR version 5.222 of the other: just a C compiler and the standard Unix db libraries. bfilter is just what I want: a small and simple C program that requires no extra libraries or general jumping through hoops. I haven't been running it long enough to give a judgment on how effective it is, but so far it's been catching most spam, and hopefully its accuracy will improve greatly as it "learns" about the sort of spam I get.

So now, instead of a big .procmailrc file containing hundreds of rules through which every single e-mail message must pass, I have just one anti-spam rule, which passes every message through bfilter, and chucks it into the spam bin if bfilter denounces it as such.

:0 fw
| bfilter test

:0:
* ^X-Spam-Probability: YES
$MAIL/spam/.
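
To see by hand what the second recipe would do with a message, one can run the same header test outside procmail. This is only a sketch: the function name flagged_as_spam is hypothetical, and it assumes the message on stdin has already been passed through "bfilter test".

```shell
#!/bin/sh
# flagged_as_spam: read one message on stdin; succeed if its header section
# contains the line that the second recipe looks for.
# Hypothetical helper for testing by hand; not part of bfilter or procmail.
flagged_as_spam() {
    # The header section ends at the first empty line, so stop reading there:
    # a matching line quoted in the body must not count.
    # -i because procmail conditions ignore case by default.
    sed '/^$/q' | grep -qi '^X-Spam-Probability: YES'
}
```

Something like "bfilter test < message.txt | flagged_as_spam && echo spam" then shows whether a saved message (message.txt here is a made-up file name) would end up in the spam folder.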

Of course, the spam bin still has to be checked periodically for "false positives", and spam does sometimes still get through. I'm probably on about 30-40 spam messages a day nowadays, a drastic increase on even this time last year, but of those, between 0 and 1 reach my inbox each day.
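
Checking the spam bin need not mean opening every message. A rough sketch of a subject-line listing follows, assuming the folder holds one message per file, as the trailing "/." in the recipe implies. The function name spam_subjects is hypothetical, and the directory is passed as an argument rather than guessed.

```shell
#!/bin/sh
# spam_subjects DIR: print "file: subject" for each message in a folder that
# stores one message per file, for a quick scan for false positives.
# Hypothetical helper; not part of procmail or bfilter.
spam_subjects() {
    dir=$1
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue    # skip subdirectories and an empty folder
        # Look at the header section only, and keep the first Subject: line.
        subject=$(sed -n -e '/^$/q' -e 's/^[Ss]ubject: *//p' "$msg" | head -n 1)
        printf '%s: %s\n' "${msg##*/}" "${subject:-(no subject)}"
    done
}
```

Called as spam_subjects "$MAIL/spam", it prints one line per message, which is usually enough to spot legitimate mail that has been misfiled.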

Copyright © 1995-2007 Paul Dunne,
