Better Unicode Support
Before ECMAScript 6, JavaScript strings revolved around 16-bit character encoding (UTF-16). Each 16-bit sequence is a code unit representing a character. All string properties and methods, like the `length` property and the `charAt()` method, were based on these 16-bit code units. Of course, 16 bits used to be enough to contain any character. That’s no longer true thanks to the expanded character set introduced by Unicode.
UTF-16 Code Points
Limiting character length to 16 bits wasn’t possible for Unicode’s stated goal of providing a globally unique identifier to every character in the world. These globally unique identifiers, called code points, are simply numbers starting at 0. Code points are what you may think of as character codes, where a number represents a character. A character encoding must encode code points into code units that are internally consistent. For UTF-16, code points can be made up of many code units.
The first 2^16^ code points in UTF-16 are represented as single 16-bit code units. This range is called the Basic Multilingual Plane (BMP). Everything beyond that is considered to be in one of the supplementary planes, where the code points can no longer be represented in just 16 bits. UTF-16 solves this problem by introducing surrogate pairs in which a single code point is represented by two 16-bit code units. That means any single character in a string can be either one code unit for BMP characters, giving a total of 16 bits, or two code units for supplementary plane characters, giving a total of 32 bits.
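To make that split concrete, here’s a quick sketch of the arithmetic UTF-16 uses to turn a supplementary-plane code point into its surrogate pair (the `toSurrogatePair()` function is purely illustrative and isn’t part of any standard API):

```js
// Illustrative only: encode a supplementary-plane code point (> 0xFFFF)
// as the two 16-bit code units UTF-16 actually stores.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;       // shift into the 20-bit supplementary range
    var high = 0xD800 + (offset >> 10);     // top 10 bits become the high surrogate
    var low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits become the low surrogate
    return [high, low];
}

console.log(toSurrogatePair(0x20BB7));      // [55362, 57271] - the code units for "𠮷"
```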
In ECMAScript 5, all string operations work on 16-bit code units, meaning that you can get unexpected results from UTF-16 encoded strings containing surrogate pairs, as in this example:
```js
var text = "𠮷";

console.log(text.length);           // 2
console.log(/^.$/.test(text));      // false
console.log(text.charAt(0));        // ""
console.log(text.charAt(1));        // ""
console.log(text.charCodeAt(0));    // 55362
console.log(text.charCodeAt(1));    // 57271
```
The single Unicode character `"𠮷"` is represented using surrogate pairs, and as such, the JavaScript string operations above treat the string as having two 16-bit characters. That means:
- The `length` of `text` is 2, when it should be 1.
- A regular expression trying to match a single character fails because it thinks there are two characters.
- The `charAt()` method is unable to return a valid character string, because neither set of 16 bits corresponds to a printable character.
The `charCodeAt()` method also just can’t identify the character properly. It returns the appropriate 16-bit number for each code unit, but that is the closest you could get to the real value of `text` in ECMAScript 5.
ECMAScript 6, on the other hand, enforces UTF-16 string encoding to address problems like these. Standardizing string operations based on this character encoding means that JavaScript can support functionality designed to work specifically with surrogate pairs. The rest of this section discusses a few key examples of that functionality.
The codePointAt() Method
One method ECMAScript 6 added to fully support UTF-16 is the `codePointAt()` method, which retrieves the Unicode code point that maps to a given position in a string. This method accepts the code unit position rather than the character position and returns an integer value, as these `console.log()` examples show:
```js
var text = "𠮷a";

console.log(text.charCodeAt(0));    // 55362
console.log(text.charCodeAt(1));    // 57271
console.log(text.charCodeAt(2));    // 97

console.log(text.codePointAt(0));   // 134071
console.log(text.codePointAt(1));   // 57271
console.log(text.codePointAt(2));   // 97
```
The `codePointAt()` method returns the same value as the `charCodeAt()` method unless it operates on non-BMP characters. The first character in `text` is non-BMP and is therefore made up of two code units, meaning the `length` property is 3 rather than 2. The `charCodeAt()` method returns only the first code unit for position 0, but `codePointAt()` returns the full code point even though the code point spans multiple code units. Both methods return the same value for positions 1 (the second code unit of the first character) and 2 (the `"a"` character).
Calling the `codePointAt()` method on a character is the easiest way to determine if that character is represented by one or two code units. Here’s a function you could write to check:
```js
function is32Bit(c) {
    return c.codePointAt(0) > 0xFFFF;
}

console.log(is32Bit("𠮷"));         // true
console.log(is32Bit("a"));          // false
```
The upper bound of 16-bit characters is represented in hexadecimal as `FFFF`, so any code point above that number must be represented by two code units, for a total of 32 bits.
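You could build on this idea to walk a string one code point at a time, advancing by two code units whenever a position starts a surrogate pair. Here’s one possible sketch (the `forEachCodePoint()` helper is just illustrative):

```js
// Sketch: visit each full code point, advancing by 2 code units
// whenever the current position begins a surrogate pair.
function forEachCodePoint(text, callback) {
    var i = 0;

    while (i < text.length) {
        var codePoint = text.codePointAt(i);
        callback(codePoint);
        i += codePoint > 0xFFFF ? 2 : 1;
    }
}

forEachCodePoint("𠮷a", function(codePoint) {
    console.log(codePoint);     // 134071, then 97
});
```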
The String.fromCodePoint() Method
When ECMAScript provides a way to do something, it also tends to provide a way to do the reverse. You can use `codePointAt()` to retrieve the code point for a character in a string, while `String.fromCodePoint()` produces a single-character string from a given code point. For example:
```js
console.log(String.fromCodePoint(134071));  // "𠮷"
```
Think of `String.fromCodePoint()` as a more complete version of the `String.fromCharCode()` method. Both give the same result for all characters in the BMP. There’s only a difference when you pass code points for characters outside of the BMP.
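For example, both methods can produce the character `"𠮷"`, but `String.fromCharCode()` needs the two surrogate code units while `String.fromCodePoint()` takes the code point directly:

```js
console.log(String.fromCharCode(97));               // "a"
console.log(String.fromCodePoint(97));              // "a"

console.log(String.fromCharCode(55362, 57271));     // "𠮷" - built from two code units
console.log(String.fromCodePoint(134071));          // "𠮷" - built from one code point

// codePointAt() and String.fromCodePoint() round-trip cleanly.
var character = "𠮷";
console.log(String.fromCodePoint(character.codePointAt(0)) === character);  // true
```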
The normalize() Method
Another interesting aspect of Unicode is that different characters may be considered equivalent for the purpose of sorting or other comparison-based operations. There are two ways to define these relationships. First, canonical equivalence means that two sequences of code points are considered interchangeable in all respects. For example, a combination of two characters can be canonically equivalent to one character. The second relationship is compatibility. Two compatible sequences of code points look different but can be used interchangeably in certain situations.
Due to these relationships, two strings representing fundamentally the same text can contain different code point sequences. For example, the character “æ” and the two-character string “ae” may be used interchangeably but are strictly not equivalent unless normalized in some way.
ECMAScript 6 supports Unicode normalization forms by giving strings a `normalize()` method. This method optionally accepts a single string parameter indicating one of the following Unicode normalization forms to apply (a brief example follows the list):

- Normalization Form Canonical Composition (`"NFC"`), which is the default
- Normalization Form Canonical Decomposition (`"NFD"`)
- Normalization Form Compatibility Composition (`"NFKC"`)
- Normalization Form Compatibility Decomposition (`"NFKD"`)
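To see canonical equivalence in practice, consider the accented character “é”: it can be stored either as the single precomposed code point U+00E9 or as `"e"` followed by a combining acute accent (U+0301). The two strings aren’t equal until both are normalized to the same form:

```js
var composed = "\u00E9";        // "é" as a single precomposed code point
var decomposed = "e\u0301";     // "e" plus a combining acute accent

console.log(composed === decomposed);                                       // false
console.log(composed.normalize() === decomposed.normalize());               // true (NFC)
console.log(composed.normalize("NFD") === decomposed.normalize("NFD"));     // true
```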
It’s beyond the scope of this book to explain the differences between these four forms. Just keep in mind that when comparing strings, both strings must be normalized to the same form. For example:
```js
var normalized = values.map(function(text) {
    return text.normalize();
});

normalized.sort(function(first, second) {
    if (first < second) {
        return -1;
    } else if (first === second) {
        return 0;
    } else {
        return 1;
    }
});
```
This code converts the strings in the `values` array into a normalized form so that the array can be sorted appropriately. You can also sort the original array by calling `normalize()` as part of the comparator, as follows:
```js
values.sort(function(first, second) {
    var firstNormalized = first.normalize(),
        secondNormalized = second.normalize();

    if (firstNormalized < secondNormalized) {
        return -1;
    } else if (firstNormalized === secondNormalized) {
        return 0;
    } else {
        return 1;
    }
});
```
Once again, the most important thing to note about this code is that both `first` and `second` are normalized in the same way. These examples have used the default, NFC, but you can just as easily specify one of the others, like this:
```js
values.sort(function(first, second) {
    var firstNormalized = first.normalize("NFD"),
        secondNormalized = second.normalize("NFD");

    if (firstNormalized < secondNormalized) {
        return -1;
    } else if (firstNormalized === secondNormalized) {
        return 0;
    } else {
        return 1;
    }
});
```
If you’ve never worried about Unicode normalization before, then you probably won’t have much use for this method now. But if you ever work on an internationalized application, you’ll definitely find the `normalize()` method helpful.
Methods aren’t the only improvements that ECMAScript 6 provides for working with Unicode strings, though. The standard also adds two useful syntax elements.
The Regular Expression u Flag
You can accomplish many common string operations through regular expressions. But remember, regular expressions assume 16-bit code units, where each represents a single character. To address this problem, ECMAScript 6 defines a `u` flag for regular expressions, which stands for Unicode.
The u Flag in Action
When a regular expression has the `u` flag set, it switches modes to work on characters, not code units. That means the regular expression should no longer get confused about surrogate pairs in strings and should behave as expected. For example, consider this code:
```js
var text = "𠮷";

console.log(text.length);           // 2
console.log(/^.$/.test(text));      // false
console.log(/^.$/u.test(text));     // true
```
The regular expression `/^.$/` matches any input string with a single character. When used without the `u` flag, this regular expression matches on code units, and so the Japanese character (which is represented by two code units) doesn’t match the regular expression. When used with the `u` flag, the regular expression compares characters instead of code units and so the Japanese character matches.
Counting Code Points
Unfortunately, ECMAScript 6 doesn’t add a method to determine how many code points a string has, but with the `u` flag, you can use regular expressions to figure it out as follows:
```js
function codePointLength(text) {
    var result = text.match(/[\s\S]/gu);
    return result ? result.length : 0;
}

console.log(codePointLength("abc"));    // 3
console.log(codePointLength("𠮷bc"));   // 3
```
This example calls `match()` to check `text` for both whitespace and non-whitespace characters (using `[\s\S]` to ensure the pattern matches newlines), using a regular expression that is applied globally with Unicode enabled. The `result` contains an array of matches when there’s at least one match, so the array length is the number of code points in the string. In Unicode, the strings `"abc"` and `"𠮷bc"` both have three characters, so the array length is three.
W> Although this approach works, it’s not very fast, especially when applied to long strings. You can use a string iterator (discussed in Chapter 8) as well. In general, try to minimize counting code points whenever possible.
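If you’d rather avoid the regular expression, here’s one way you might sketch the iterator-based approach the note mentions, since a for-of loop walks a string one code point at a time in ECMAScript 6:

```js
function iteratorCodePointLength(text) {
    var count = 0;

    // for-of uses the string iterator, which yields whole code points
    // even when a character occupies two code units.
    for (var character of text) {
        count++;
    }

    return count;
}

console.log(iteratorCodePointLength("abc"));    // 3
console.log(iteratorCodePointLength("𠮷bc"));   // 3
```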
Determining Support for the u Flag
Since the `u` flag is a syntax change, attempting to use it in JavaScript engines that aren’t compatible with ECMAScript 6 throws a syntax error. The safest way to determine if the `u` flag is supported is with a function, like this one:
```js
function hasRegExpU() {
    try {
        var pattern = new RegExp(".", "u");
        return true;
    } catch (ex) {
        return false;
    }
}
```
This function uses the `RegExp` constructor to pass in the `u` flag as an argument. This syntax is valid even in older JavaScript engines, but the constructor will throw an error if `u` isn’t supported.
I> If your code still needs to work in older JavaScript engines, always use the `RegExp` constructor when using the `u` flag. This will prevent syntax errors and allow you to optionally detect and use the `u` flag without aborting execution.
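For instance, you might combine `hasRegExpU()` with the constructor like this; the fallback to code-unit matching is just one possible choice rather than anything the specification prescribes:

```js
function isSingleCharacter(text) {
    if (hasRegExpU()) {
        // Unicode mode: the pattern matches one full code point.
        return new RegExp("^.$", "u").test(text);
    }

    // Fallback for older engines: matches one code unit instead.
    return /^.$/.test(text);
}

console.log(isSingleCharacter("a"));    // true
console.log(isSingleCharacter("𠮷"));   // true when the u flag is supported, false otherwise
```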