Strings in Binary Files
Textual strings are another kind of primitive data type you’ll find in many binary formats. When you read files one byte at a time, you can’t read and write strings directly—you need to decode and encode them one byte at a time, just as you do with binary-encoded numbers. And just as you can encode an integer in several ways, you can encode a string in many ways. To start with, the binary format must specify how individual characters are encoded.
To translate bytes to characters, you need to know both what character code and what character encoding you’re using. A character code defines a mapping from positive integers to characters. Each number in the mapping is called a code point. For instance, ASCII is a character code that maps the numbers from 0-127 to particular characters used in the Latin alphabet. A character encoding, on the other hand, defines how the code points are represented as a sequence of bytes in a byte-oriented medium such as a file. For codes that use eight or fewer bits, such as ASCII and ISO-8859-1, the encoding is trivial—each numeric value is encoded as a single byte.
Nearly as straightforward are pure double-byte encodings, such as UCS-2, which map between 16-bit values and characters. The only reason double-byte encodings can be more complex than single-byte encodings is that you may also need to know whether the 16-bit values are supposed to be encoded in big-endian or little-endian format.
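A UCS-2 reader, for instance, only has to combine pairs of octets in the right order. Here’s a minimal sketch; the function name and the endianness keyword are my own invention, and it assumes a binary stream of (unsigned-byte 8) values and a Lisp whose **CODE-CHAR** (discussed below) accepts 16-bit code points:

(defun read-ucs-2-char (in &key (endianness :big-endian))
  ;; Combine two octets into a 16-bit code point, ordered according
  ;; to ENDIANNESS, then translate the result to a character.
  (let ((b1 (read-byte in))
        (b2 (read-byte in)))
    (code-char
     (ecase endianness
       (:big-endian    (+ (ash b1 8) b2))
       (:little-endian (+ (ash b2 8) b1))))))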
Variable-width encodings use different numbers of octets for different numeric values, making them more complex but allowing them to be more compact in many cases. For instance, UTF-8, an encoding designed for use with the Unicode character code, uses a single octet to encode the values 0-127 while using up to four octets to encode values up to 1,114,111.
Since the code points from 0-127 map to the same characters in Unicode as they do in ASCII, a UTF-8 encoding of text consisting only of characters also in ASCII is the same as the ASCII encoding. On the other hand, texts consisting mostly of characters that require three bytes in UTF-8 could be more compactly encoded in a straight double-byte encoding.
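To make the variable-width scheme concrete, here’s a sketch of decoding a single UTF-8 code point from a byte stream; the function name is hypothetical, and it trusts its input, doing no validation of continuation octets or overlong encodings:

(defun read-utf-8-code-point (in)
  ;; The high bits of the first octet say how many octets follow;
  ;; each continuation octet contributes six more bits.
  (let ((b (read-byte in)))
    (multiple-value-bind (value remaining)
        (cond ((< b #x80) (values b 0))                  ; 0xxxxxxx
              ((< b #xe0) (values (ldb (byte 5 0) b) 1)) ; 110xxxxx
              ((< b #xf0) (values (ldb (byte 4 0) b) 2)) ; 1110xxxx
              (t          (values (ldb (byte 3 0) b) 3))); 11110xxx
      (dotimes (i remaining value)
        (setf value (logior (ash value 6)
                            (ldb (byte 6 0) (read-byte in))))))))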
Common Lisp provides two functions for translating between numeric character codes and character objects: **CODE-CHAR**, which takes a numeric code and returns the corresponding character, and **CHAR-CODE**, which takes a character and returns its numeric code. The language standard doesn’t specify what character encoding an implementation must use, so there’s no guarantee you can represent every character that can possibly be encoded in a given file format as a Lisp character. However, almost all contemporary Common Lisp implementations use ASCII, ISO-8859-1, or Unicode as their native character code. Because Unicode is a superset of ISO-8859-1, which is in turn a superset of ASCII, if you’re using a Unicode Lisp, **CODE-CHAR** and **CHAR-CODE** can be used directly for translating any of those three character codes.
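For example, in any Lisp whose native character code is a superset of ASCII:

CL-USER> (code-char 97)
#\a
CL-USER> (char-code #\a)
97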
In addition to specifying a character encoding, a string encoding must also specify how to encode the length of the string. Three techniques are typically used in binary file formats.
The simplest is to not encode it but to let it be implicit in the position of the string in some larger structure: a particular element of a file may always be a string of a certain length, or a string may be the last element of a variable-length data structure whose overall size determines how many bytes are left to read as string data. Both these techniques are used in ID3 tags, as you’ll see in the next chapter.
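Reading an implicit-length string then boils down to reading a known number of bytes. Here’s a minimal sketch, with a hypothetical function name, where the caller supplies the length dictated by the surrounding structure:

(defun read-fixed-length-ascii (in length)
  ;; LENGTH is implicit in the enclosing format, so the caller must
  ;; supply it; each octet is translated to a single character.
  (let ((string (make-string length)))
    (dotimes (i length string)
      (setf (char string i) (code-char (read-byte in))))))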
The other two techniques can be used to encode variable-length strings without relying on context. One is to write the length of the string followed by the character data; the parser reads an integer value (in some specified integer format) and then reads that number of characters. Another is to write the character data followed by a delimiter that can’t appear in the string, such as a null character.
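A length-prefixed reader might look like the following sketch, which assumes a one-octet length field; a real format would specify the exact size and endianness of the prefix:

(defun read-length-prefixed-ascii (in)
  ;; Read the one-octet length prefix, then exactly that many
  ;; character bytes.
  (let* ((length (read-byte in))
         (string (make-string length)))
    (dotimes (i length string)
      (setf (char string i) (code-char (read-byte in))))))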
The different representations have different advantages and disadvantages, but when you’re dealing with already specified binary formats, you won’t have any control over which encoding is used. However, none of the encodings is particularly more difficult to read and write than any other. Here, as an example, is a function that reads a null-terminated ASCII string, assuming your Lisp implementation uses ASCII or one of its supersets, such as ISO-8859-1 or full Unicode, as its native character code:
(defconstant +null+ (code-char 0))

(defun read-null-terminated-ascii (in)
  (with-output-to-string (s)
    (loop for char = (code-char (read-byte in))
          until (char= char +null+) do (write-char char s))))
The **WITH-OUTPUT-TO-STRING** macro, which I mentioned in Chapter 14, is an easy way to build up a string when you don’t know how long it’ll be. It creates a **STRING-STREAM** and binds it to the variable name specified, s in this case. All characters written to the stream are collected into a string, which is then returned as the value of the **WITH-OUTPUT-TO-STRING** form.
To write a string back out, you just need to translate the characters back to numeric values that can be written with **WRITE-BYTE** and then write the null terminator after the string contents.
(defun write-null-terminated-ascii (string out)
  (loop for char across string
        do (write-byte (char-code char) out))
  (write-byte (char-code +null+) out))
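To check that the two functions are inverses, you can round-trip a string through a file opened with an (unsigned-byte 8) element type; the filename here is arbitrary:

(with-open-file (out "strings.bin" :direction :output
                     :element-type '(unsigned-byte 8)
                     :if-exists :supersede)
  (write-null-terminated-ascii "hello" out))

(with-open-file (in "strings.bin" :element-type '(unsigned-byte 8))
  (read-null-terminated-ascii in)) ; ==> "hello"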
As these examples show, the main intellectual challenge—such as it is—of reading and writing primitive elements of binary files is understanding how exactly to interpret the bytes that appear in a file and to map them to Lisp data types. If a binary file format is well specified, this should be a straightforward proposition. Actually writing functions to read and write a particular encoding is, as they say, a simple matter of programming.
Now you can turn to the issue of reading and writing more complex on-disk structures and how to map them to Lisp objects.