23. Practical: A Spam Filter - A Couple of Utility Functions - 《Practical Common Lisp》

A Couple of Utility Functions

To finish the implementation of test-classifier, you need to write two utility functions that have little to do with spam filtering as such: shuffle-vector and start-of-file.

An easy and efficient way to implement shuffle-vector is to use the Fisher-Yates algorithm.[14] You can start by implementing a function, nshuffle-vector, that shuffles a vector in place. The name follows the same naming convention as other destructive functions such as **NCONC** and **NREVERSE**. It looks like this:

(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)

The nondestructive version simply makes a copy of the original vector and passes it to the destructive version.

(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))
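
To see the difference between the two versions, a small check along the following lines may help. It is only a sketch: the two definitions are repeated from the text so the snippet runs on its own, while `*sample*`, `*shuffled*`, and `same-elements-p` are hypothetical names introduced for this illustration.

```lisp
;; Definitions repeated from the text so this snippet runs on its own.
(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)

(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))

;; Hypothetical sample data, not from the original text.
(defparameter *sample* (vector 1 2 3 4 5 6 7 8 9 10))

;; SHUFFLE-VECTOR works on a copy, so *SAMPLE* keeps its order.
(defparameter *shuffled* (shuffle-vector *sample*))

(defun same-elements-p (a b)
  "True if the numeric vectors A and B hold the same elements, ignoring order."
  (equalp (sort (copy-seq a) #'<)
          (sort (copy-seq b) #'<)))

;; A shuffle is a permutation: nothing is added, dropped, or duplicated.
(same-elements-p *sample* *shuffled*)
```

By contrast, nshuffle-vector returns the very vector it was given, so calling it on `*sample*` would reorder `*sample*` itself.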

The other utility function, start-of-file, is almost as straightforward with just one wrinkle. The most efficient way to read the contents of a file into memory is to create an array of the appropriate size and use **READ-SEQUENCE** to fill it in. So it might seem you could make a character array that’s either the size of the file or the maximum number of characters you want to read, whichever is smaller. Unfortunately, as I mentioned in Chapter 14, the function **FILE-LENGTH** isn’t entirely well defined when dealing with character streams since the number of characters encoded in a file can depend on both the character encoding used and the particular text in the file. In the worst case, the only way to get an accurate measure of the number of characters in a file is to actually read the whole file. Thus, it’s ambiguous what **FILE-LENGTH** should do when passed a character stream; in most implementations, **FILE-LENGTH** always returns the number of octets in the file, which may be greater than the number of characters that can be read from the file.
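
The mismatch described above can be observed with a sketch like the following. It rests on assumptions that go beyond the original text: that the implementation accepts `:utf-8` as an external format (the standard leaves external-format designators implementation-defined), and that `(code-char 233)` yields the character é, as it does in Unicode-based implementations. The file name and the function `length-vs-characters` are hypothetical.

```lisp
;; Hypothetical file name, chosen for this example only.
(defparameter *utf8-demo-file* "file-length-demo.txt")

;; ASSUMPTION: the implementation accepts :utf-8 as an external format.
;; ASSUMPTION: (code-char 233) is the character e-acute, which takes
;; two octets when encoded as UTF-8.
(with-open-file (out *utf8-demo-file* :direction :output
                                      :if-exists :supersede
                                      :external-format :utf-8)
  (write-string (concatenate 'string "caf" (string (code-char 233))) out))

(defun length-vs-characters (file)
  "Return two values: what FILE-LENGTH reports for FILE opened as a
character stream, and the number of characters actually read from it."
  (with-open-file (in file :external-format :utf-8)
    (let* ((reported (file-length in))
           (buffer (make-string reported))
           (chars-read (read-sequence buffer in)))
      (values reported chars-read))))

(length-vs-characters *utf8-demo-file*)
```

On an implementation where **FILE-LENGTH** counts octets, the first value should be larger than the second, because the four characters occupy five octets on disk.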

However, **READ-SEQUENCE** returns the number of characters actually read. So, you can attempt to read the number of characters reported by **FILE-LENGTH**, capped at max-chars, and return a substring if fewer characters were actually read.

(defun start-of-file (file max-chars)
  (with-open-file (in file)
    (let* ((length (min (file-length in) max-chars))
           (text (make-string length))
           (read (read-sequence text in)))
      (if (< read length)
        (subseq text 0 read)
        text))))
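
As a quick check of start-of-file, you could write a small file and read it back with different limits. This is a minimal sketch: the definition is repeated from the text so the snippet runs on its own, the file name is hypothetical, and the sample text is plain ASCII so that the octet count and the character count should agree under the usual default encodings.

```lisp
;; Definition repeated from the text so this snippet runs on its own.
(defun start-of-file (file max-chars)
  (with-open-file (in file)
    (let* ((length (min (file-length in) max-chars))
           (text (make-string length))
           (read (read-sequence text in)))
      (if (< read length)
        (subseq text 0 read)
        text))))

;; Hypothetical file name, created in the current default directory.
(defparameter *demo-file* "start-of-file-demo.txt")

(with-open-file (out *demo-file* :direction :output :if-exists :supersede)
  (write-string "Hello, spam filter!" out))

;; A limit smaller than the file yields only the first MAX-CHARS characters.
(start-of-file *demo-file* 5)

;; A limit larger than the file yields the whole contents.
(start-of-file *demo-file* 1000)
```

The first call should return "Hello" and the second the full text. The **SUBSEQ** branch only comes into play when the file holds fewer characters than **FILE-LENGTH** reports, for example with multi-octet encodings.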