The Heart of a Spam Filter
In this chapter, you’ll implement the core of a spam-filtering engine. You won’t write a soup-to-nuts spam-filtering application; rather, you’ll focus on the functions for classifying new messages and training the filter.
This application is going to be large enough that it’s worth defining a new package to avoid name conflicts. For instance, in the source code you can download from this book’s Web site, I use the package name COM.GIGAMONKEYS.SPAM
, defining a package that uses both the standard COMMON-LISP
package and the COM.GIGAMONKEYS.PATHNAMES
package from Chapter 15, like this:
(defpackage :com.gigamonkeys.spam
(:use :common-lisp :com.gigamonkeys.pathnames))
Any file containing code for this application should start with this line:
(in-package :com.gigamonkeys.spam)
You can use the same package name or replace com.gigamonkeys
with some domain you control.3
You can also type this same form at the REPL to switch to this package to test the functions you write. In SLIME this will change the prompt from CL-USER>
to SPAM>
like this:
CL-USER> (in-package :com.gigamonkeys.spam)
#<The COM.GIGAMONKEYS.SPAM package>
SPAM>
Once you have a package defined, you can start on the actual code. The main function you’ll need to implement has a simple job—take the text of a message as an argument and classify the message as spam, ham, or unsure. You can easily implement this basic function by defining it in terms of other functions that you’ll write in a moment.
(defun classify (text)
(classification (score (extract-features text))))
Reading from the inside out, the first step in classifying a message is to extract features to pass to the score
function. In score
you’ll compute a value that can then be translated into one of three classifications—spam, ham, or unsure—by the function classification
. Of the three functions, classification
is the simplest. You can assume score
will return a value near 1 if the message is a spam, near 0 if it’s a ham, and near .5 if it’s unclear.
Thus, you can implement classification
like this:
(defparameter *max-ham-score* .4)
(defparameter *min-spam-score* .6)
(defun classification (score)
(cond
((<= score *max-ham-score*) 'ham)
((>= score *min-spam-score*) 'spam)
(t 'unsure)))
The extract-features
function is almost as straightforward, though it requires a bit more code. For the moment, the features you’ll extract will be the words appearing in the text. For each word, you need to keep track of the number of times it has been seen in a spam and the number of times it has been seen in a ham. A convenient way to keep those pieces of data together with the word itself is to define a class, word-feature
, with three slots.
(defclass word-feature ()
((word
:initarg :word
:accessor word
:initform (error "Must supply :word")
:documentation "The word this feature represents.")
(spam-count
:initarg :spam-count
:accessor spam-count
:initform 0
:documentation "Number of spams we have seen this feature in.")
(ham-count
:initarg :ham-count
:accessor ham-count
:initform 0
:documentation "Number of hams we have seen this feature in.")))
You’ll keep the database of features in a hash table so you can easily find the object representing a given feature. You can define a special variable, *feature-database*
, to hold a reference to this hash table.
(defvar *feature-database* (make-hash-table :test #'equal))
You should use **DEFVAR**
rather than **DEFPARAMETER**
because you don’t want *feature-database*
to be reset if you happen to reload the file containing this definition during development—you might have data stored in *feature-database*
that you don’t want to lose. Of course, that means if you do want to clear out the feature database, you can’t just reevaluate the **DEFVAR**
form. So you should define a function clear-database
.
(defun clear-database ()
(setf *feature-database* (make-hash-table :test #'equal)))
To find the features present in a given message, the code will need to extract the individual words and then look up the corresponding word-feature
object in *feature-database*
. If *feature-database*
contains no such feature, it’ll need to create a new word-feature
to represent the word. You can encapsulate that bit of logic in a function, intern-feature
, that takes a word and returns the appropriate feature, creating it if necessary.
(defun intern-feature (word)
(or (gethash word *feature-database*)
(setf (gethash word *feature-database*)
(make-instance 'word-feature :word word))))
You can extract the individual words from the message text using a regular expression. For example, using the Common Lisp Portable Perl-Compatible Regular Expression (CL-PPCRE) library written by Edi Weitz, you can write extract-words
like this:4
(defun extract-words (text)
(delete-duplicates
(cl-ppcre:all-matches-as-strings "[a-zA-Z]{3,}" text)
:test #'string=))
Now all that remains to implement extract-features
is to put extract-features
and intern-feature
together. Since extract-words
returns a list of strings and you want a list with each string translated to the corresponding word-feature
, this is a perfect time to use **MAPCAR**
.
(defun extract-features (text)
(mapcar #'intern-feature (extract-words text)))
You can test these functions at the REPL like this:
SPAM> (extract-words "foo bar baz")
("foo" "bar" "baz")
And you can make sure the **DELETE-DUPLICATES**
is working like this:
SPAM> (extract-words "foo bar baz foo bar")
("baz" "foo" "bar")
You can also test extract-features
.
SPAM> (extract-features "foo bar baz foo bar")
(#<WORD-FEATURE @ #x71ef28da> #<WORD-FEATURE @ #x71e3809a>
#<WORD-FEATURE @ #x71ef28aa>)
However, as you can see, the default method for printing arbitrary objects isn’t very informative. As you work on this program, it’ll be useful to be able to print word-feature
objects in a less opaque way. Luckily, as I mentioned in Chapter 17, the printing of all objects is implemented in terms of a generic function **PRINT-OBJECT**
, so to change the way word-feature
objects are printed, you just need to define a method on **PRINT-OBJECT**
that specializes on word-feature
. To make implementing such methods easier, Common Lisp provides the macro **PRINT-UNREADABLE-OBJECT**
.5
The basic form of **PRINT-UNREADABLE-OBJECT**
is as follows:
(print-unreadable-object (object stream-variable &key type identity)
body-form*)
The object argument is an expression that evaluates to the object to be printed. Within the body of **PRINT-UNREADABLE-OBJECT**
, stream-variable is bound to a stream to which you can print anything you want. Whatever you print to that stream will be output by **PRINT-UNREADABLE-OBJECT**
and enclosed in the standard syntax for unreadable objects, #<>
.6
**PRINT-UNREADABLE-OBJECT**
also lets you include the type of the object and an indication of the object’s identity via the keyword parameters type and identity. If they’re non-**NIL**
, the output will start with the name of the object’s class and end with an indication of the object’s identity similar to what’s printed by the default **PRINT-OBJECT**
method for **STANDARD-OBJECT**
s. For word-feature
, you probably want to define a **PRINT-OBJECT**
method that includes the type but not the identity along with the values of the word
, ham-count
, and spam-count
slots. Such a method would look like this:
(defmethod print-object ((object word-feature) stream)
(print-unreadable-object (object stream :type t)
(with-slots (word ham-count spam-count) object
(format stream "~s :hams ~d :spams ~d" word ham-count spam-count))))
Now when you test extract-features
at the REPL, you can see more clearly what features are being extracted.
SPAM> (extract-features "foo bar baz foo bar")
(#<WORD-FEATURE "baz" :hams 0 :spams 0>
#<WORD-FEATURE "foo" :hams 0 :spams 0>
#<WORD-FEATURE "bar" :hams 0 :spams 0>)