Training the Filter
Now that you have a way to keep track of individual features, you’re almost ready to implement score. But first you need to write the code you’ll use to train the spam filter so score will have some data to use. You’ll define a function, train, that takes some text and a symbol indicating what kind of message it is (ham or spam) and that increments either the ham count or the spam count of all the features present in the text as well as a global count of hams or spams processed. Again, you can take a top-down approach and implement it in terms of other functions that don’t yet exist.
(defun train (text type)
  (dolist (feature (extract-features text))
    (increment-count feature type))
  (increment-total-count type))
You’ve already written extract-features, so next up is increment-count, which takes a word-feature and a message type and increments the appropriate slot of the feature. Since there’s no reason to think that the logic of incrementing these counts is going to change for different kinds of objects, you can write this as a regular function.[7] Because you defined both ham-count and spam-count with an :accessor option, you can use **INCF** and the accessor functions created by **DEFCLASS** to increment the appropriate slot.
(defun increment-count (feature type)
  (ecase type
    (ham (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))
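To see why **INCF** works here, it may help to try increment-count on a single object. The following is a minimal sketch, not the chapter's actual class definition: the word-feature class below is a simplified stand-in, and *demo-feature* and the word "pills" are names invented for illustration. Because each slot has an :accessor, **DEFCLASS** also defines a **SETF** function for it, so (incf (ham-count feature)) behaves roughly like (setf (ham-count feature) (+ (ham-count feature) 1)).

```lisp
;; Simplified stand-in for the chapter's word-feature class (an assumption,
;; not the original definition). The :accessor option creates both a reader
;; and a SETF writer, which is what lets INCF treat the slot as a place.
(defclass word-feature ()
  ((word       :initarg :word :reader word)
   (spam-count :initform 0 :accessor spam-count)
   (ham-count  :initform 0 :accessor ham-count)))

(defun increment-count (feature type)
  (ecase type
    (ham (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))

;; *demo-feature* is a hypothetical name used only in this sketch.
(defparameter *demo-feature* (make-instance 'word-feature :word "pills"))

(increment-count *demo-feature* 'spam)
(increment-count *demo-feature* 'spam)
(increment-count *demo-feature* 'ham)

(list (ham-count *demo-feature*) (spam-count *demo-feature*)) ; => (1 2)
```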
The **ECASE** construct is a variant of **CASE**, both of which are similar to case statements in Algol-derived languages (renamed switch in C and its progeny). They both evaluate their first argument (the key form) and then find the clause whose first element (the key) is the same value according to **EQL**. In this case, that means the variable type is evaluated, yielding whatever value was passed as the second argument to increment-count.
The keys aren’t evaluated. In other words, the value of type will be compared to the literal objects read by the Lisp reader as part of the **ECASE** form. In this function, that means the keys are the symbols ham and spam, not the values of any variables named ham and spam. So, if increment-count is called like this:
(increment-count some-feature 'ham)
the value of type will be the symbol ham, and the first branch of the **ECASE** will be evaluated and the feature’s ham count incremented. On the other hand, if it’s called like this:
(increment-count some-feature 'spam)
then the second branch will run, incrementing the spam count. Note that the symbols ham and spam are quoted when calling increment-count since otherwise they’d be evaluated as the names of variables. But they’re not quoted when they appear in **ECASE** since **ECASE** doesn’t evaluate the keys.[8]
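The difference between an evaluated argument and an unevaluated key can be checked directly. In this small sketch, describe-type is a hypothetical helper (not part of the chapter's code) that simply reports which branch ran. The local variable named ham is deliberately bound to the symbol spam to show that an unquoted argument is looked up as a variable, while the keys inside **ECASE** remain literal symbols.

```lisp
;; DESCRIBE-TYPE is a made-up helper for illustration only.
(defun describe-type (type)
  (ecase type
    (ham  "ham branch")
    (spam "spam branch")))

;; Quoted argument: TYPE is the symbol HAM, which matches the literal key HAM.
(describe-type 'ham)      ; => "ham branch"

;; Unquoted argument: HAM is evaluated as a variable. Here it holds the
;; symbol SPAM, so the SPAM clause runs even though the form says HAM.
(let ((ham 'spam))
  (describe-type ham))    ; => "spam branch"
```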
The E in **ECASE** stands for “exhaustive” or “error,” meaning **ECASE** should signal an error if the key value is anything other than one of the keys listed. The regular **CASE** is looser, returning **NIL** if no matching clause is found.
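The contrast between the two constructs can be sketched with a pair of otherwise identical functions. The names classify-loosely and classify-strictly are invented for this example. The standard specifies that the error signaled by **ECASE** is of type **TYPE-ERROR**, so the sketch uses **HANDLER-CASE** to catch it and return the offending value instead of dropping into the debugger.

```lisp
;; Both functions are hypothetical and exist only to compare CASE and ECASE.
(defun classify-loosely (type)
  ;; CASE quietly returns NIL when no clause matches.
  (case type
    (ham  "good mail")
    (spam "junk mail")))

(defun classify-strictly (type)
  ;; ECASE signals a TYPE-ERROR when no clause matches.
  (ecase type
    (ham  "good mail")
    (spam "junk mail")))

(classify-loosely 'ham)       ; => "good mail"
(classify-loosely 'unknown)   ; => NIL

;; Catch the error so the example runs to completion; TYPE-ERROR-DATUM
;; returns the value that failed to match any key.
(handler-case (classify-strictly 'unknown)
  (type-error (e) (type-error-datum e)))   ; => UNKNOWN
```

In increment-count, the stricter behavior is arguably the safer choice: a misspelled message type stops the program immediately instead of silently leaving the counts unchanged.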
To implement increment-total-count, you need to decide where to store the counts; for the moment, two more special variables, *total-spams* and *total-hams*, will do fine.
(defvar *total-spams* 0)
(defvar *total-hams* 0)

(defun increment-total-count (type)
  (ecase type
    (ham (incf *total-hams*))
    (spam (incf *total-spams*))))
You should use **DEFVAR** to define these two variables for the same reason you used it with *feature-database*: they’ll hold data built up while you run the program that you don’t necessarily want to throw away just because you happen to reload your code during development. But you’ll want to reset those variables if you ever reset *feature-database*, so you should add a few lines to clear-database as shown here:
(defun clear-database ()
  (setf
   *feature-database* (make-hash-table :test #'equal)
   *total-spams* 0
   *total-hams* 0))
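Putting the pieces together, the following self-contained sketch trains on three tiny messages and then resets everything. Several parts are assumptions made so the example runs on its own: split-words is an invented helper, extract-features is a simplified stand-in that splits on spaces instead of using the chapter's real tokenizer, and word-feature and intern-feature are pared-down versions of the earlier definitions. The train, increment-count, increment-total-count, and clear-database functions are the ones from this section.

```lisp
(defvar *feature-database* (make-hash-table :test #'equal))
(defvar *total-spams* 0)
(defvar *total-hams* 0)

;; Simplified stand-ins (assumptions, not the chapter's definitions).
(defclass word-feature ()
  ((word       :initarg :word :reader word)
   (spam-count :initform 0 :accessor spam-count)
   (ham-count  :initform 0 :accessor ham-count)))

(defun intern-feature (word)
  ;; Return the existing feature for WORD, creating it on first use.
  (or (gethash word *feature-database*)
      (setf (gethash word *feature-database*)
            (make-instance 'word-feature :word word))))

(defun split-words (text)
  ;; Hypothetical helper: split TEXT on single spaces, dropping empty strings.
  (loop with start = 0
        for end = (position #\Space text :start start)
        for word = (subseq text start end)
        unless (string= word "") collect word
        while end
        do (setf start (1+ end))))

(defun extract-features (text)
  ;; Each distinct word counts once per message.
  (mapcar #'intern-feature
          (remove-duplicates (split-words text) :test #'string=)))

;; The functions from this section.
(defun increment-count (feature type)
  (ecase type
    (ham (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))

(defun increment-total-count (type)
  (ecase type
    (ham (incf *total-hams*))
    (spam (incf *total-spams*))))

(defun train (text type)
  (dolist (feature (extract-features text))
    (increment-count feature type))
  (increment-total-count type))

(defun clear-database ()
  (setf
   *feature-database* (make-hash-table :test #'equal)
   *total-spams* 0
   *total-hams* 0))

;; Start from a known state, then train on three made-up messages.
(clear-database)
(train "buy cheap pills" 'spam)
(train "cheap lunch today" 'ham)
(train "buy pills now" 'spam)

(let ((cheap (gethash "cheap" *feature-database*)))
  (list (spam-count cheap) (ham-count cheap)))   ; => (1 1)
(list *total-spams* *total-hams*)                ; => (2 1)

;; Re-evaluating a DEFVAR, as happens when the file is reloaded,
;; leaves the accumulated value alone.
(defvar *total-spams* 0)
*total-spams*                                    ; => 2

;; CLEAR-DATABASE is the explicit way to start over.
(clear-database)
(list *total-spams* *total-hams*
      (hash-table-count *feature-database*))     ; => (0 0 0)
```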