Better Unicode Support
Before ECMAScript 6, JavaScript strings revolved around 16-bit character encoding (UTF-16). Each 16-bit sequence is a code unit representing a character. All string properties and methods, like the `length` property and the `charAt()` method, were based on these 16-bit code units. Of course, 16 bits used to be enough to contain any character. That’s no longer true thanks to the expanded character set introduced by Unicode.
UTF-16 Code Points
Limiting character length to 16 bits wasn’t possible for Unicode’s stated goal of providing a globally unique identifier to every character in the world. These globally unique identifiers, called code points, are simply numbers starting at 0. Code points are what you may think of as character codes, where a number represents a character. A character encoding must encode code points into code units that are internally consistent. For UTF-16, code points can be made up of many code units.
The first 2^16^ code points in UTF-16 are represented as single 16-bit code units. This range is called the Basic Multilingual Plane (BMP). Everything beyond that is considered to be in one of the supplementary planes, where the code points can no longer be represented in just 16 bits. UTF-16 solves this problem by introducing surrogate pairs in which a single code point is represented by two 16-bit code units. That means any single character in a string can be either one code unit for BMP characters, giving a total of 16 bits, or two code units for supplementary plane characters, giving a total of 32 bits.
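To make that split concrete, here’s a quick sketch of the arithmetic UTF-16 uses to turn a supplementary-plane code point into its surrogate pair (the `toSurrogatePair()` function is purely illustrative and isn’t part of any standard API):

```js
// Illustrative only: encode a supplementary-plane code point (> 0xFFFF)
// as the two 16-bit code units UTF-16 actually stores.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;       // shift into the 20-bit supplementary range
    var high = 0xD800 + (offset >> 10);     // top 10 bits become the high surrogate
    var low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits become the low surrogate
    return [high, low];
}

console.log(toSurrogatePair(0x20BB7));      // [55362, 57271] - the code units for "𠮷"
```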
In ECMAScript 5, all string operations work on 16-bit code units, meaning that you can get unexpected results from UTF-16 encoded strings containing surrogate pairs, as in this example:
```js
var text = "𠮷";

console.log(text.length);           // 2
console.log(/^.$/.test(text));      // false
console.log(text.charAt(0));        // ""
console.log(text.charAt(1));        // ""
console.log(text.charCodeAt(0));    // 55362
console.log(text.charCodeAt(1));    // 57271
```
The single Unicode character `"𠮷"` is represented using surrogate pairs, and as such, the JavaScript string operations above treat the string as having two 16-bit characters. That means:
- The `length` of `text` is 2, when it should be 1.
- A regular expression trying to match a single character fails because it thinks there are two characters.
- The `charAt()` method is unable to return a valid character string, because neither set of 16 bits corresponds to a printable character.
The `charCodeAt()` method also just can’t identify the character properly. It returns the appropriate 16-bit number for each code unit, but that is the closest you could get to the real value of `text` in ECMAScript 5.
ECMAScript 6, on the other hand, enforces UTF-16 string encoding to address problems like these. Standardizing string operations based on this character encoding means that JavaScript can support functionality designed to work specifically with surrogate pairs. The rest of this section discusses a few key examples of that functionality.
The codePointAt() Method
One method ECMAScript 6 added to fully support UTF-16 is the `codePointAt()` method, which retrieves the Unicode code point that maps to a given position in a string. This method accepts the code unit position rather than the character position and returns an integer value, as these `console.log()` examples show:
```js
var text = "𠮷a";

console.log(text.charCodeAt(0));    // 55362
console.log(text.charCodeAt(1));    // 57271
console.log(text.charCodeAt(2));    // 97

console.log(text.codePointAt(0));   // 134071
console.log(text.codePointAt(1));   // 57271
console.log(text.codePointAt(2));   // 97
```
The `codePointAt()` method returns the same value as the `charCodeAt()` method unless it operates on non-BMP characters. The first character in `text` is non-BMP and is therefore made up of two code units, meaning the `length` property is 3 rather than 2. The `charCodeAt()` method returns only the first code unit for position 0, but `codePointAt()` returns the full code point even though the code point spans multiple code units. Both methods return the same value for positions 1 (the second code unit of the first character) and 2 (the `"a"` character).
Calling the `codePointAt()` method on a character is the easiest way to determine if that character is represented by one or two code units. Here’s a function you could write to check:
```js
function is32Bit(c) {
    return c.codePointAt(0) > 0xFFFF;
}

console.log(is32Bit("𠮷"));         // true
console.log(is32Bit("a"));          // false
```
The upper bound of 16-bit characters is represented in hexadecimal as `FFFF`, so any code point above that number must be represented by two code units, for a total of 32 bits.
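You could build on this idea to walk a string one code point at a time, advancing by two code units whenever a position starts a surrogate pair. Here’s one possible sketch (the `forEachCodePoint()` helper is just illustrative):

```js
// Sketch: visit each full code point, advancing by 2 code units
// whenever the current position begins a surrogate pair.
function forEachCodePoint(text, callback) {
    var i = 0;

    while (i < text.length) {
        var codePoint = text.codePointAt(i);
        callback(codePoint);
        i += codePoint > 0xFFFF ? 2 : 1;
    }
}

forEachCodePoint("𠮷a", function(codePoint) {
    console.log(codePoint);     // 134071, then 97
});
```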
The String.fromCodePoint() Method
When ECMAScript provides a way to do something, it also tends to provide a way to do the reverse. You can use `codePointAt()` to retrieve the code point for a character in a string, while `String.fromCodePoint()` produces a single-character string from a given code point. For example:
```js
console.log(String.fromCodePoint(134071));  // "𠮷"
```
Think of `String.fromCodePoint()` as a more complete version of the `String.fromCharCode()` method. Both give the same result for all characters in the BMP. There’s only a difference when you pass code points for characters outside of the BMP.
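For example, both methods can produce the character `"𠮷"`, but `String.fromCharCode()` needs the two surrogate code units while `String.fromCodePoint()` takes the code point directly:

```js
console.log(String.fromCharCode(97));               // "a"
console.log(String.fromCodePoint(97));              // "a"

console.log(String.fromCharCode(55362, 57271));     // "𠮷" - built from two code units
console.log(String.fromCodePoint(134071));          // "𠮷" - built from one code point

// codePointAt() and String.fromCodePoint() round-trip cleanly.
var character = "𠮷";
console.log(String.fromCodePoint(character.codePointAt(0)) === character);  // true
```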
The normalize() Method
Another interesting aspect of Unicode is that different characters may be considered equivalent for the purpose of sorting or other comparison-based operations. There are two ways to define these relationships. First, canonical equivalence means that two sequences of code points are considered interchangeable in all respects. For example, a combination of two characters can be canonically equivalent to one character. The second relationship is compatibility. Two compatible sequences of code points look different but can be used interchangeably in certain situations.
Due to these relationships, two strings representing fundamentally the same text can contain different code point sequences. For example, the character “æ” and the two-character string “ae” may be used interchangeably but are strictly not equivalent unless normalized in some way.
ECMAScript 6 supports Unicode normalization forms by giving strings a `normalize()` method. This method optionally accepts a single string parameter indicating one of the following Unicode normalization forms to apply (a brief example follows the list):

- Normalization Form Canonical Composition (`"NFC"`), which is the default
- Normalization Form Canonical Decomposition (`"NFD"`)
- Normalization Form Compatibility Composition (`"NFKC"`)
- Normalization Form Compatibility Decomposition (`"NFKD"`)
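To see canonical equivalence in practice, consider the accented character “é”: it can be stored either as the single precomposed code point U+00E9 or as `"e"` followed by a combining acute accent (U+0301). The two strings aren’t equal until both are normalized to the same form:

```js
var composed = "\u00E9";        // "é" as a single precomposed code point
var decomposed = "e\u0301";     // "e" plus a combining acute accent

console.log(composed === decomposed);                                       // false
console.log(composed.normalize() === decomposed.normalize());               // true (NFC)
console.log(composed.normalize("NFD") === decomposed.normalize("NFD"));     // true
```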
It’s beyond the scope of this book to explain the differences between these four forms. Just keep in mind that when comparing strings, both strings must be normalized to the same form. For example:
```js
var normalized = values.map(function(text) {
    return text.normalize();
});

normalized.sort(function(first, second) {
    if (first < second) {
        return -1;
    } else if (first === second) {
        return 0;
    } else {
        return 1;
    }
});
```
This code converts the strings in the `values` array into a normalized form so that the array can be sorted appropriately. You can also sort the original array by calling `normalize()` as part of the comparator, as follows:
```js
values.sort(function(first, second) {
    var firstNormalized = first.normalize(),
        secondNormalized = second.normalize();

    if (firstNormalized < secondNormalized) {
        return -1;
    } else if (firstNormalized === secondNormalized) {
        return 0;
    } else {
        return 1;
    }
});
```
Once again, the most important thing to note about this code is that both `first` and `second` are normalized in the same way. These examples have used the default, NFC, but you can just as easily specify one of the others, like this:
```js
values.sort(function(first, second) {
    var firstNormalized = first.normalize("NFD"),
        secondNormalized = second.normalize("NFD");

    if (firstNormalized < secondNormalized) {
        return -1;
    } else if (firstNormalized === secondNormalized) {
        return 0;
    } else {
        return 1;
    }
});
```
If you’ve never worried about Unicode normalization before, then you probably won’t have much use for this method now. But if you ever work on an internationalized application, you’ll definitely find the `normalize()` method helpful.
Methods aren’t the only improvements that ECMAScript 6 provides for working with Unicode strings, though. The standard also adds two useful syntax elements.
The Regular Expression u Flag
You can accomplish many common string operations through regular expressions. But remember, regular expressions assume 16-bit code units, where each represents a single character. To address this problem, ECMAScript 6 defines a `u` flag for regular expressions, which stands for Unicode.
The u Flag in Action
When a regular expression has the `u` flag set, it switches modes to work on characters, not code units. That means the regular expression should no longer get confused about surrogate pairs in strings and should behave as expected. For example, consider this code:
```js
var text = "𠮷";

console.log(text.length);           // 2
console.log(/^.$/.test(text));      // false
console.log(/^.$/u.test(text));     // true
```
The regular expression `/^.$/` matches any input string with a single character. When used without the `u` flag, this regular expression matches on code units, and so the Japanese character (which is represented by two code units) doesn’t match the regular expression. When used with the `u` flag, the regular expression compares characters instead of code units and so the Japanese character matches.
Counting Code Points
Unfortunately, ECMAScript 6 doesn’t add a method to determine how many code points a string has, but with the `u` flag, you can use regular expressions to figure it out as follows:
```js
function codePointLength(text) {
    var result = text.match(/[\s\S]/gu);
    return result ? result.length : 0;
}

console.log(codePointLength("abc"));    // 3
console.log(codePointLength("𠮷bc"));   // 3
```
This example calls `match()` to check `text` for both whitespace and non-whitespace characters (using `[\s\S]` to ensure the pattern matches newlines), using a regular expression that is applied globally with Unicode enabled. The `result` contains an array of matches when there’s at least one match, so the array length is the number of code points in the string. In Unicode, the strings `"abc"` and `"𠮷bc"` both have three characters, so the array length is three.
W> Although this approach works, it’s not very fast, especially when applied to long strings. You can use a string iterator (discussed in Chapter 8) as well. In general, try to minimize counting code points whenever possible.
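If you’d rather avoid the regular expression, here’s one way you might sketch the iterator-based approach the note mentions, since a for-of loop walks a string one code point at a time in ECMAScript 6:

```js
function iteratorCodePointLength(text) {
    var count = 0;

    // for-of uses the string iterator, which yields whole code points
    // even when a character occupies two code units.
    for (var character of text) {
        count++;
    }

    return count;
}

console.log(iteratorCodePointLength("abc"));    // 3
console.log(iteratorCodePointLength("𠮷bc"));   // 3
```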
Determining Support for the u Flag
Since the `u` flag is a syntax change, attempting to use it in JavaScript engines that aren’t compatible with ECMAScript 6 throws a syntax error. The safest way to determine if the `u` flag is supported is with a function, like this one:
```js
function hasRegExpU() {
    try {
        var pattern = new RegExp(".", "u");
        return true;
    } catch (ex) {
        return false;
    }
}
```
This function uses the `RegExp` constructor to pass in the `u` flag as an argument. This syntax is valid even in older JavaScript engines, but the constructor will throw an error if `u` isn’t supported.
I> If your code still needs to work in older JavaScript engines, always use the `RegExp` constructor when using the `u` flag. This will prevent syntax errors and allow you to optionally detect and use the `u` flag without aborting execution.
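For instance, you might combine `hasRegExpU()` with the constructor like this; the fallback to code-unit matching is just one possible choice rather than anything the specification prescribes:

```js
function isSingleCharacter(text) {
    if (hasRegExpU()) {
        // Unicode mode: the pattern matches one full code point.
        return new RegExp("^.$", "u").test(text);
    }

    // Fallback for older engines: matches one code unit instead.
    return /^.$/.test(text);
}

console.log(isSingleCharacter("a"));    // true
console.log(isSingleCharacter("𠮷"));   // true when the u flag is supported, false otherwise
```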