17. Unicode – a brief introduction (advanced)
Unicode is a standard for representing and managing text in most of the world’s writing systems. Virtually all modern software that works with text supports Unicode. The standard is maintained by the Unicode Consortium. A new version of the standard is published every year (with new emojis, etc.). Unicode 1.0 was published in 1991.
17.1. Code points vs. code units
Two concepts are crucial for understanding Unicode:
- Code points are numbers that represent Unicode characters.
- Code units are pieces of data with fixed sizes. One or more code units encode a single code point. The size of code units depends on the encoding format. The most popular format, UTF-8, has 8-bit code units.
17.1.1. Code points
The first version of Unicode had 16-bit code points. Since then, the number of characters has grown considerably and the size of code points was extended to 21 bits. The resulting range of code points, 0x0000–0x10FFFF, is partitioned into 17 planes with 16 bits (0x10000 code points) each:
- Plane 0: Basic Multilingual Plane (BMP), 0x0000–0xFFFF
  - This is the most frequently used plane. Roughly, it comprises the original Unicode.
- Plane 1: Supplementary Multilingual Plane (SMP), 0x10000–0x1FFFF
- Plane 2: Supplementary Ideographic Plane (SIP), 0x20000–0x2FFFF
- Plane 3: Tertiary Ideographic Plane (TIP), 0x30000–0x3FFFF
- Planes 4–13: Unassigned
- Plane 14: Supplementary Special-Purpose Plane (SSP), 0xE0000–0xEFFFF
- Planes 15–16: Supplementary Private Use Area (S PUA A/B), 0xF0000–0x10FFFF
Planes 1–16 are called supplementary planes or astral planes.
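Because every plane spans exactly 16 bits, the plane of a code point can be read off its upper bits. The following is a minimal sketch of that idea; `planeOf` is a hypothetical helper name, not part of any standard API:

```javascript
// Hypothetical helper: returns the plane number (0–16) of a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  // Each plane holds 0x10000 code points, so dropping the
  // lower 16 bits leaves the plane number.
  return codePoint >> 16;
}

console.log(planeOf(0x0041));   // 0 (BMP)
console.log(planeOf(0x1F642));  // 1 (SMP)
console.log(planeOf(0x10FFFF)); // 16 (last private use plane)
```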
Let’s check the code points of a few characters:
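One way to do that is sketched below. The four sample characters are assumptions picked for illustration (three from the BMP and one emoji), and `codePointHex` is a hypothetical helper name:

```javascript
// Hypothetical helper: the code point of the first character, in hexadecimal.
const codePointHex = (str) => str.codePointAt(0).toString(16);

// Escapes are used so that the example does not depend on the file encoding.
console.log(codePointHex('A'));         // '41'
console.log(codePointHex('\u00FC'));    // 'fc'    (ü)
console.log(codePointHex('\u03C0'));    // '3c0'   (π)
console.log(codePointHex('\u{1F642}')); // '1f642' (🙂)
```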
The hexadecimal values of the code points tell us that the first three characters reside in plane 0 (they fit within 16 bits), while the emoji resides in plane 1.
17.1.2. Encoding formats for code units: UTF-32, UTF-16, UTF-8
Let’s cover three ways of encoding code points as code units.
17.1.2.1. UTF-32 (Unicode Transformation Format 32)
UTF-32 uses 32-bit code units, resulting in one code unit per code point. This format is the only one with fixed-length encoding; all others use a varying number of code units to encode a single code point.
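JavaScript has no built-in UTF-32 strings, but the one-to-one mapping can be sketched with a `Uint32Array`. The function name `toUtf32CodeUnits` is an assumption made for this example:

```javascript
// Hypothetical helper: one 32-bit code unit per code point.
function toUtf32CodeUnits(str) {
  // Iterating over a string yields one substring per code point,
  // so astral characters are not split into surrogates here.
  return Uint32Array.from(str, (ch) => ch.codePointAt(0));
}

const units = toUtf32CodeUnits('A\u{1F642}'); // 'A🙂'
console.log(units.length);                // 2
console.log(units[0].toString(16));       // '41'
console.log(units[1].toString(16));       // '1f642'
```

The string itself has three UTF-16 code units, but only two UTF-32 code units, one for each code point.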
17.1.2.2. UTF-16 (Unicode Transformation Format 16)
UTF-16 uses 16-bit code units. It encodes code points as follows:
- BMP (first 16 bits of Unicode): Code points are stored in single code units.
- Astral planes: After subtracting the BMP’s count of 0x10000 characters from Unicode’s count of 0x110000 characters, 0x100000 characters (20 bits) remain. These 20 bits are split into two halves of 10 bits each, which are stored in unoccupied “holes” in the BMP:
  - Most significant 10 bits (leading surrogate): 0xD800–0xDBFF
  - Least significant 10 bits (trailing surrogate): 0xDC00–0xDFFF
As a consequence, by looking at a UTF-16 code unit, we can tell if it is a BMP character, the first part (leading surrogate) of an astral plane character or the last part (trailing surrogate) of an astral plane character.
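The arithmetic described above can be sketched as follows. Both function names are hypothetical, and the encoder does no validation beyond distinguishing BMP from astral code points:

```javascript
// Hypothetical helper: encodes one code point as UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint]; // BMP: a single code unit
  }
  const offset = codePoint - 0x10000;          // 20 bits remain
  const leading = 0xD800 + (offset >> 10);     // most significant 10 bits
  const trailing = 0xDC00 + (offset & 0x3FF);  // least significant 10 bits
  return [leading, trailing];
}

// Hypothetical helper: classifies a single UTF-16 code unit by its range.
function classifyCodeUnit(unit) {
  if (unit >= 0xD800 && unit <= 0xDBFF) return 'leading surrogate';
  if (unit >= 0xDC00 && unit <= 0xDFFF) return 'trailing surrogate';
  return 'BMP character';
}

const [lead, trail] = toUtf16CodeUnits(0x1F642); // 🙂
console.log(lead.toString(16), trail.toString(16)); // d83d de42
console.log(classifyCodeUnit(lead));  // leading surrogate
console.log(classifyCodeUnit(trail)); // trailing surrogate
console.log(classifyCodeUnit(0x41));  // BMP character
```

The result matches what JavaScript stores: `'\u{1F642}'.charCodeAt(0)` is 0xD83D and `'\u{1F642}'.charCodeAt(1)` is 0xDE42.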
17.1.2.3. UTF-8 (Unicode Transformation Format 8)
UTF-8 has 8-bit code units. It uses 1–4 code units to encode a code point:
Code points | Code units |
---|---|
0000–007F | 0xxxxxxx (7 bits) |
0080–07FF | 110xxxxx, 10xxxxxx (5+6 bits) |
0800–FFFF | 1110xxxx, 10xxxxxx, 10xxxxxx (4+6+6 bits) |
10000–10FFFF | 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx (3+6+6+6 bits) |
Notes:
- The bit prefix of each code unit tells us:
  - Is it first in a series of code units? If yes, how many code units will follow?
  - Is it second or later in a series of code units?
- The character mappings in the 0000–007F range are the same as ASCII, which leads to a degree of backward compatibility with older software.
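To make the bit patterns in the table more concrete, here is a minimal sketch of an encoder for a single code point. The name `encodeUtf8` is hypothetical, and the function does not validate its input (for example, it does not reject surrogates). The built-in `TextEncoder` is used as a cross-check:

```javascript
// Hypothetical helper: encodes one code point as UTF-8 code units,
// following the bit patterns in the table above.
function encodeUtf8(codePoint) {
  // Continuation code unit: prefix 10, followed by 6 payload bits
  const cont = (bits) => 0b10000000 | (bits & 0b111111);
  if (codePoint <= 0x7F) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint <= 0x7FF) {
    return [0b11000000 | (codePoint >> 6), cont(codePoint)];
  }
  if (codePoint <= 0xFFFF) {
    return [0b11100000 | (codePoint >> 12), cont(codePoint >> 6), cont(codePoint)];
  }
  return [
    0b11110000 | (codePoint >> 18),
    cont(codePoint >> 12), cont(codePoint >> 6), cont(codePoint),
  ];
}

const hex = (units) => units.map((u) => u.toString(16)).join(' ');
console.log(hex(encodeUtf8(0x41)));    // 41
console.log(hex(encodeUtf8(0xFC)));    // c3 bc
console.log(hex(encodeUtf8(0x3C0)));   // cf 80
console.log(hex(encodeUtf8(0x1F642))); // f0 9f 99 82

// Cross-check against the built-in encoder
const builtIn = Array.from(new TextEncoder().encode('\u{1F642}'));
console.log(hex(builtIn) === hex(encodeUtf8(0x1F642))); // true
```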
17.2. Web development: UTF-16 and UTF-8
For web development, two Unicode encoding formats are relevant: UTF-16 and UTF-8.
17.2.1. Source code internally: UTF-16
The ECMAScript specification internally represents source code as UTF-16.
17.2.2. Strings: UTF-16
The characters in JavaScript strings are UTF-16 code units:
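A small sketch of what that means in practice, using a smiley emoji as an assumed example:

```javascript
const smiley = '\u{1F642}'; // 🙂, a code point in plane 1

// .length counts UTF-16 code units, not code points
console.log(smiley.length); // 2

// The same string, written as its two code units (a surrogate pair)
console.log(smiley === '\uD83D\uDE42'); // true

// Indexed access and charCodeAt() work with code units
console.log(smiley.charCodeAt(0).toString(16)); // 'd83d'
console.log(smiley.charCodeAt(1).toString(16)); // 'de42'

// Iteration (here via spreading) works with code points
console.log([...smiley].length); // 1
console.log(smiley.codePointAt(0).toString(16)); // '1f642'
```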
For more information on Unicode and strings, consult the section on atoms of text in the chapter on strings.
17.2.3. Source code in files: UTF-8
When JavaScript is stored in .html and .js files, the encoding is almost always UTF-8 these days.
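As a sketch of the relationship between the two formats, the following code decodes UTF-8 bytes into a JavaScript string (UTF-16) and encodes it again. The byte sequence is a made-up stand-in for file content; `TextDecoder` and `TextEncoder` are assumed to be available as globals, which is the case in browsers and in current versions of Node.js:

```javascript
// Hypothetical file content: the UTF-8 bytes of the text 'π🙂'
const fileBytes = Uint8Array.of(0xCF, 0x80, 0xF0, 0x9F, 0x99, 0x82);

// Decoding turns UTF-8 code units into a string of UTF-16 code units
const text = new TextDecoder('utf-8').decode(fileBytes);
console.log(text === '\u03C0\u{1F642}'); // true
console.log(fileBytes.length); // 6 (UTF-8 code units)
console.log(text.length);      // 3 (UTF-16 code units)

// Encoding goes the other way and reproduces the original bytes
const roundTrip = new TextEncoder().encode(text);
console.log(roundTrip.join() === fileBytes.join()); // true
```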
17.3. Grapheme clusters – the real characters
The concept of a character becomes remarkably complex once you consider many of the world’s writing systems.
On one hand, code points can be said to represent Unicode “characters”.
On the other hand, there are grapheme clusters. A grapheme cluster corresponds most closely to a symbol displayed on screen or paper. It is defined as “a horizontally segmentable unit of text”. One or more code points are needed to encode a grapheme cluster.
For example, one emoji of a family is composed of 7 code points – 4 of them are graphemes themselves and they are joined by invisible code points:
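The following sketch builds such a family emoji from its parts. The particular family members and the helper name `countGraphemes` are assumptions made for illustration, and `Intl.Segmenter` is assumed to be supported by the JavaScript engine:

```javascript
const ZWJ = '\u200D'; // zero width joiner, an invisible code point

// Man, woman, girl, boy: four emojis joined by three ZWJs
const family = ['\u{1F468}', '\u{1F469}', '\u{1F467}', '\u{1F466}'].join(ZWJ);

console.log([...family].length); // 7 code points
console.log(family.length);      // 11 UTF-16 code units (4 × 2 + 3)

// Hypothetical helper: counts grapheme clusters via Intl.Segmenter
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

if (typeof Intl.Segmenter === 'function') {
  // With up-to-date Unicode data, the whole family is a single grapheme cluster
  console.log(countGraphemes(family));
}
```

Whether the family is also rendered as a single symbol depends on the font and the platform.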
Another example is flag emojis:
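A flag emoji consists of two code points, so-called regional indicator symbols, one for each letter of a two-letter country code. The sketch below assumes that encoding; `flagEmoji` is a hypothetical helper name and Japan is an arbitrary example:

```javascript
// Code point of REGIONAL INDICATOR SYMBOL LETTER A
const REGIONAL_INDICATOR_A = 0x1F1E6;

// Hypothetical helper: maps a two-letter country code to a flag emoji.
function flagEmoji(countryCode) {
  if (!/^[A-Za-z]{2}$/.test(countryCode)) {
    throw new RangeError('Expected two letters A–Z: ' + countryCode);
  }
  const codePoints = [...countryCode.toUpperCase()].map(
    (letter) => REGIONAL_INDICATOR_A + (letter.charCodeAt(0) - 'A'.charCodeAt(0))
  );
  return String.fromCodePoint(...codePoints);
}

const japan = flagEmoji('JP'); // 🇯🇵
console.log([...japan].map((cp) => cp.codePointAt(0).toString(16)).join(' '));
// 1f1ef 1f1f5
console.log([...japan].length); // 2 code points
console.log(japan.length);      // 4 UTF-16 code units
```

Each of the two code points is a letter-like symbol on its own; only the pair forms the grapheme cluster that is displayed as a flag.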