8. RegExp Unicode property escapes
This chapter explains the proposal “RegExp Unicode Property Escapes” by Mathias Bynens.
8.1. Overview
JavaScript lets you match characters by mentioning the “names” of sets of characters. For example, \s stands for “whitespace”:
> /^\s+$/u.test('\t \n\r')
true
The proposal lets you additionally match characters by mentioning their Unicode character properties (what those are is explained next) inside the curly braces of \p{}. Two examples:
> /^\p{White_Space}+$/u.test('\t \n\r')
true
> /^\p{Script=Greek}+$/u.test('μετά')
true
As you can see, one of the benefits of property escapes is that they make regular expressions more self-descriptive. Additional benefits will become clear later.
Before we delve into how property escapes work, let’s examine what Unicode character properties are.
8.2. Unicode character properties
In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:
The semantics of a character are determined by its identity, normative properties, and behavior.
8.2.1. Examples of properties
These are a few examples of properties:
- Name: a unique name, composed of uppercase letters, digits, hyphens and spaces. For example:
  - A: Name = LATIN CAPITAL LETTER A
  - 🙂: Name = SLIGHTLY SMILING FACE
- General_Category: categorizes characters. For example:
  - x: General_Category = Lowercase_Letter
  - $: General_Category = Currency_Symbol
- White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines. For example:
  - \t: White_Space = True
  - π: White_Space = False
- Age: version of the Unicode Standard in which a character was introduced. For example: The Euro sign € was added in version 2.1 of the Unicode standard.
  - €: Age = 2.1
- Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
  - S: Block = Basic_Latin (range U+0000..U+007F)
  - 🙂: Block = Emoticons (range U+1F600..U+1F64F)
- Script: a collection of characters used by one or more writing systems.
  - Some scripts support several writing systems. For example, the Latin script supports the writing systems English, French, German, Latin, etc.
  - Some languages can be written in multiple alternate writing systems that are supported by multiple scripts. For example, Turkish used the Arabic script before it transitioned to the Latin script in the early 20th century.
  - Examples:
    - α: Script = Greek
    - Д: Script = Cyrillic
8.2.2. Types of properties
The following types of properties exist:
- Enumerated property: a property whose values are few and named. General_Category is an enumerated property.
- Closed enumerated property: an enumerated property whose set of values is fixed and will not be changed in future versions of the Unicode Standard.
- Boolean property: a closed enumerated property whose values are True and False. Boolean properties are also called binary, because they are like markers that characters either have or not. White_Space is a binary property.
- Numeric property: has values that are integers or real numbers.
- String-valued property: a property whose values are strings.
- Catalog property: an enumerated property that may be extended as the Unicode Standard evolves. Age and Script are catalog properties.
- Miscellaneous property: a property whose values are not Boolean, enumerated, numeric, string or catalog values. Name is a miscellaneous property.
8.2.3. Matching properties and property values
Properties and property values are matched as follows:
- Loose matching: case, whitespace, underscores and hyphens are ignored when comparing properties and property values. For example, "General_Category", "general category", "-general-category-", "GeneralCategory" are all considered to be the same property.
- Aliases: the data files PropertyAliases.txt and PropertyValueAliases.txt define alternative ways of referring to properties and property values.
  - Most aliases have long forms and short forms. For example:
    - Long form: General_Category
    - Short form: gc
  - Examples of property value aliases (per line, all values are considered equal):
    - Lowercase_Letter, Ll
    - Currency_Symbol, Sc
    - True, T, Yes, Y
    - False, F, No, N
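The equivalence of long and short aliases can be checked from JavaScript with the \p{} syntax shown in the overview. The following is only a sketch: the helper sameMatches and the sampled code point range are assumptions made for illustration, not part of the proposal.

```javascript
// Hypothetical helper: returns true if both regular expressions agree
// on every single code point in the range [from, to].
// The regular expressions must not use the /g or /y flags.
function sameMatches(reA, reB, from = 0x0000, to = 0x02FF) {
  for (let cp = from; cp <= to; cp++) {
    const ch = String.fromCodePoint(cp);
    if (reA.test(ch) !== reB.test(ch)) {
      return false;
    }
  }
  return true;
}

// Long form vs. short form of both the property and its value.
console.log(sameMatches(/^\p{General_Category=Lowercase_Letter}$/u, /^\p{gc=Ll}$/u)); // true

// Long and short alias of a General_Category value.
console.log(sameMatches(/^\p{Currency_Symbol}$/u, /^\p{Sc}$/u)); // true

// Different values are, of course, not equivalent.
console.log(sameMatches(/^\p{Ll}$/u, /^\p{Lu}$/u)); // false
```

The check only samples a small range of code points, so it illustrates the equivalence rather than proving it.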
8.3. Unicode property escapes for regular expressions
Unicode property escapes look like this:
1. \p{prop=value}: Match all characters whose property prop has the value value.
2. \P{prop=value}: Match all characters that do not have a property prop whose value is value.
3. \p{bin_prop}: Match all characters whose binary property bin_prop is True.
4. \P{bin_prop}: Match all characters whose binary property bin_prop is False.
Comments:
- You can only use Unicode property escapes if the flag /u is set. Without /u, \p is the same as p.
- Forms (3) and (4) can be used as abbreviations if the property is General_Category. For example, \p{Lowercase_Letter} is an abbreviation for \p{General_Category=Lowercase_Letter}.
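The four forms and the two comments can be tried side by side. This is a minimal sketch; the constant names are chosen for illustration only.

```javascript
// Forms 1 and 2: property=value, positive and negated.
const greek = /^\p{Script=Greek}+$/u;
const notGreek = /^\P{Script=Greek}+$/u;

// Forms 3 and 4: binary property, positive and negated.
const whiteSpace = /^\p{White_Space}+$/u;
const notWhiteSpace = /^\P{White_Space}+$/u;

// Abbreviated and full spelling of a General_Category value.
const lowerShort = /^\p{Lowercase_Letter}+$/u;
const lowerLong = /^\p{General_Category=Lowercase_Letter}+$/u;

// Without /u, \p is just an escaped letter p.
const plainP = /^\p$/;

console.log(greek.test('αβγ'));          // true
console.log(notGreek.test('abc'));       // true
console.log(notGreek.test('αβγ'));       // false
console.log(whiteSpace.test('\t \n'));   // true
console.log(notWhiteSpace.test('a b'));  // false (contains a space)
console.log(lowerShort.test('abc'));     // true
console.log(lowerLong.test('abc'));      // true
console.log(plainP.test('p'));           // true
```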
8.3.1. Details
Things to note:
- Property escapes do not support loose matching. You must use aliases exactly as they are mentioned in PropertyAliases.txt and PropertyValueAliases.txt.
- Implementations must support at least the following Unicode properties and their aliases:
  - General_Category
  - Script
  - Script_Extensions
  - The binary properties listed in the specification (and no others, to guarantee interoperability). These include, among others: Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, Assigned, ID_Start, ID_Continue, Join_Control, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base.
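One way to see the strict matching in action is to compile candidate escapes and check whether the engine rejects them. The helper isValidPropertyEscape below is a hypothetical name introduced for this sketch; it assumes an engine that implements the proposal and reports unknown properties or values as a SyntaxError.

```javascript
// Hypothetical helper: tries to compile \p{<body>} with the /u flag and
// reports whether the engine accepts it.
function isValidPropertyEscape(body) {
  try {
    new RegExp(`\\p{${body}}`, 'u');
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err; // anything else is unexpected
  }
}

// Exact names and aliases are accepted.
console.log(isValidPropertyEscape('General_Category=Lowercase_Letter')); // true
console.log(isValidPropertyEscape('gc=Ll'));                             // true
console.log(isValidPropertyEscape('Script_Extensions=Greek'));           // true

// Loosely matched spellings are rejected.
console.log(isValidPropertyEscape('general_category=lowercase_letter')); // false
console.log(isValidPropertyEscape('GeneralCategory=LowercaseLetter'));   // false

// Properties outside the required set, such as Block, are rejected as well.
console.log(isValidPropertyEscape('Block=Basic_Latin'));                 // false
```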
8.4. Examples
Matching whitespace:
> /^\p{White_Space}+$/u.test('\t \n\r')
true
Matching letters:
> /^\p{Letter}+$/u.test('πüé')
true
Matching Greek letters:
> /^\p{Script=Greek}+$/u.test('μετά')
true
Matching Latin letters:
> /^\p{Script=Latin}+$/u.test('Grüße')
true
> /^\p{Script=Latin}+$/u.test('façon')
true
> /^\p{Script=Latin}+$/u.test('mañana')
true
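Building on these examples, property escapes can be wrapped in small utilities. The functions extractWords and isWrittenIn are hypothetical names used for this sketch; note that isWrittenIn throws a SyntaxError if it is given a script name the engine does not know.

```javascript
// Returns all maximal runs of letters (General_Category=Letter) in a string.
function extractWords(text) {
  return text.match(/\p{Letter}+/gu) || [];
}

// Checks whether a non-empty string consists only of characters
// whose Script property has the given value.
function isWrittenIn(script, text) {
  const re = new RegExp(`^\\p{Script=${script}}+$`, 'u');
  return re.test(text);
}

console.log(extractWords('abc, αβγ 123 Д')); // [ 'abc', 'αβγ', 'Д' ]
console.log(isWrittenIn('Greek', 'αβγ'));    // true
console.log(isWrittenIn('Latin', 'αβγ'));    // false
```

Compared to hand-written character ranges such as [a-zA-Z], the property-based version also covers letters outside ASCII.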
Matching lone surrogate characters:
> /^\p{Surrogate}+$/u.test('\u{D83D}')
true
> /^\p{Surrogate}+$/u.test('\u{DE00}')
true
Note that Unicode code points in astral planes (such as emojis) are composed of two JavaScript characters (a leading surrogate and a trailing surrogate). Therefore, you’d expect the previous regular expression to match the emoji 🙂, which is all surrogates:
> '🙂'.length
2
> '🙂'.charCodeAt(0).toString(16)
'd83d'
> '🙂'.charCodeAt(1).toString(16)
'de42'
However, with the /u flag, property escapes match code points, not JavaScript characters:
> /^\p{Surrogate}+$/u.test('🙂')
false
In other words, 🙂 is considered to be a single character:
> /^.$/u.test('🙂')
true
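This behavior makes \p{Surrogate} usable for detecting lone surrogates, because well-formed surrogate pairs are consumed as single code points and never match. A minimal sketch, in which hasLoneSurrogate is a hypothetical helper name:

```javascript
const emoji = '\u{1F642}'; // SLIGHTLY SMILING FACE

// With /u, a well-formed surrogate pair is one code point, so only
// unpaired surrogates can match \p{Surrogate}.
function hasLoneSurrogate(str) {
  return /\p{Surrogate}/u.test(str);
}

console.log(emoji.length);                    // 2 (JavaScript characters)
console.log(Array.from(emoji).length);        // 1 (code point)
console.log(hasLoneSurrogate(emoji));         // false
console.log(hasLoneSurrogate('\uD83D'));      // true
console.log(hasLoneSurrogate('a\uDE42b'));    // true
```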
8.5. Trying it out
V8 5.8+ implements this proposal; it is switched on via --harmony_regexp_property:
- Node.js: node --harmony_regexp_property
  - Check Node’s version of V8 via npm version
- Chrome:
  - Go to chrome://version/
  - Check the version of V8.
  - Find the “Executable Path”. For example: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
  - Start Chrome: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --js-flags="--harmony_regexp_property"
8.6. Further reading
JavaScript:
- “Unicode and JavaScript” (in “Speaking JavaScript”)
- Regular expressions: “New flag /u (unicode)” (in “Exploring ES6”)
The Unicode standard:
- Unicode Technical Report #23: The Unicode Character Property Model (Editors: Ken Whistler, Asmus Freytag)
- Unicode Standard Annex #44: Unicode Character Database (Editors: Mark Davis, Laurențiu Iancu, Ken Whistler)
- Unicode Character Database: PropList.txt, PropertyAliases.txt, PropertyValueAliases.txt
- “Unicode character property” (Wikipedia)