- 23. New regular expression features
- 23.1 Overview
- 23.2 New flag /y (sticky)
- 23.2.1 RegExp.prototype.exec(str)
- 23.2.2 RegExp.prototype.test(str)
- 23.2.3 String.prototype.search(regex)
- 23.2.4 String.prototype.match(regex)
- 23.2.5 String.prototype.split(separator, limit)
- 23.2.6 String.prototype.replace(search, replacement)
- 23.2.7 Example: using sticky matching for tokenizing
- 23.2.8 Example: manually implementing sticky matching
- 23.3 New flag /u (unicode)
- 23.4 New data property flags
- 23.5 RegExp() can be used as a copy constructor
- 23.6 String methods that delegate to regular expression methods
- Further reading
23. New regular expression features
This chapter explains new regular expression features in ECMAScript 6. It helps if you are familiar with ES5 regular expression features and Unicode. If necessary, consult the chapters on regular expressions and on Unicode in “Speaking JavaScript”.
23.1 Overview
The following regular expression features are new in ECMAScript 6:
- The new flag /y (sticky) anchors each match of a regular expression to the end of the previous match.
- The new flag /u (unicode) handles surrogate pairs (such as \uD83D\uDE80) as code points and lets you use Unicode code point escapes (such as \u{1F680}) in regular expressions.
- The new data property flags gives you access to the flags of a regular expression, just like source already gives you access to the pattern in ES5:

      > /abc/ig.source // ES5
      'abc'
      > /abc/ig.flags // ES6
      'gi'

- You can use the constructor RegExp() to make a copy of a regular expression:

      > new RegExp(/abc/ig).flags
      'gi'
      > new RegExp(/abc/ig, 'i').flags // change flags
      'i'
23.2 New flag /y (sticky)
The new flag /y changes two things while matching a regular expression re against a string:
- Anchored to re.lastIndex: The match must start at re.lastIndex (the index after the previous match). This behavior is similar to the ^ anchor, but with that anchor, matches must always start at index 0.
- Match repeatedly: If a match was found, re.lastIndex is set to the index after the match. This behavior is similar to the /g flag. Like /g, /y is normally used to match multiple times.

The main use case for this matching behavior is tokenizing, where you want each match to immediately follow its predecessor. An example of tokenizing via a sticky regular expression and exec() is given later.
Let’s look at how various regular expression operations react to the /y flag. The following tables give an overview. I’ll provide more details afterwards.
Methods of regular expressions (re is the regular expression that a method is invoked on):

Method | Flags | Start matching | Anchored to | Result if match | No match | re.lastIndex
---|---|---|---|---|---|---
exec() | – | 0 | – | Match object | null | unchanged
exec() | /g | re.lastIndex | – | Match object | null | index after match
exec() | /y | re.lastIndex | re.lastIndex | Match object | null | index after match
exec() | /gy | re.lastIndex | re.lastIndex | Match object | null | index after match
test() | (Any) | (like exec()) | (like exec()) | true | false | (like exec())
Methods of strings (str is the string that a method is invoked on, r is the regular expression parameter). The rows for split() and for replace() with only /y describe engines that follow the final ES6 specification:

Method | Flags | Start matching | Anchored to | Result if match | No match | r.lastIndex
---|---|---|---|---|---|---
search() | –, /g | 0 | – | Index of match | -1 | unchanged
search() | /y, /gy | 0 | 0 | Index of match | -1 | unchanged
match() | – | 0 | – | Match object | null | unchanged
match() | /y | r.lastIndex | r.lastIndex | Match object | null | index after match
match() | /g | After prev. match (loop) | – | Array with matches | null | 0
match() | /gy | After prev. match (loop) | After prev. match | Array with matches | null | 0
split() | –, /g, /y, /gy | After prev. match (loop) | – | Array with strings between matches | [str] | unchanged
replace() | – | 0 | – | First match replaced | No repl. | unchanged
replace() | /y | r.lastIndex | r.lastIndex | First match replaced | No repl. | index after match
replace() | /g | After prev. match (loop) | – | All matches replaced | No repl. | unchanged
replace() | /gy | After prev. match (loop) | After prev. match | All matches replaced | No repl. | unchanged
23.2.1 RegExp.prototype.exec(str)
If /g is not set, matching always starts at the beginning, but skips ahead until a match is found. REGEX.lastIndex is not changed.

    const REGEX = /a/;

    REGEX.lastIndex = 7; // ignored
    const match = REGEX.exec('xaxa');
    console.log(match.index); // 1
    console.log(REGEX.lastIndex); // 7 (unchanged)
If /g is set, matching starts at REGEX.lastIndex and skips ahead until a match is found. REGEX.lastIndex is set to the position after the match. That means that you receive all matches if you loop until exec() returns null.

    const REGEX = /a/g;

    REGEX.lastIndex = 2;
    const match = REGEX.exec('xaxa');
    console.log(match.index); // 3
    console.log(REGEX.lastIndex); // 4 (updated)

    // No match at index 4 or later
    console.log(REGEX.exec('xaxa')); // null
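The loop mentioned above can be sketched as follows. The helper name findAllIndices() is made up for this illustration; the guard for empty matches is an extra precaution that the text does not discuss.

```javascript
// Hypothetical helper: collects the index of every match of `regex` in `str`.
function findAllIndices(regex, str) {
  if (!regex.global) {
    // Without /g, exec() would return the first match again and again
    throw new TypeError('findAllIndices() needs a regular expression with /g');
  }
  regex.lastIndex = 0; // start at the beginning, whatever happened before
  const indices = [];
  let match;
  while ((match = regex.exec(str)) !== null) {
    indices.push(match.index);
    // An empty match does not advance lastIndex, so do it manually
    if (match[0].length === 0) {
      regex.lastIndex++;
    }
  }
  return indices;
}

console.log(findAllIndices(/a/g, 'xaxa')); // [ 1, 3 ]
```

Once exec() returns null, lastIndex is reset to 0, so the regular expression can be reused afterwards.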
If only /y is set, matching starts at REGEX.lastIndex and is anchored to that position (no skipping ahead until a match is found). REGEX.lastIndex is updated similarly to when /g is set.

    const REGEX = /a/y;

    // No match at index 2
    REGEX.lastIndex = 2;
    console.log(REGEX.exec('xaxa')); // null

    // Match at index 3
    REGEX.lastIndex = 3;
    const match = REGEX.exec('xaxa');
    console.log(match.index); // 3
    console.log(REGEX.lastIndex); // 4

Setting both /y and /g is the same as only setting /y.
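To check that claim for exec(), here is a small sketch that runs the same two probes against /a/y and /a/gy. The function probe() is a name invented for this example.

```javascript
// Hypothetical helper: runs exec() once from `startIndex` and reports
// the index of the match (or null) and the resulting lastIndex.
function probe(regex, str, startIndex) {
  regex.lastIndex = startIndex;
  const match = regex.exec(str);
  return {
    index: match ? match.index : null,
    lastIndex: regex.lastIndex,
  };
}

for (const regex of [/a/y, /a/gy]) {
  console.log(regex.flags, probe(regex, 'xaxa', 2), probe(regex, 'xaxa', 3));
}
```

Both regular expressions report { index: null, lastIndex: 0 } for start index 2 (a failed match resets lastIndex) and { index: 3, lastIndex: 4 } for start index 3.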
23.2.2 RegExp.prototype.test(str)
test() works the same as exec(), but it returns true or false (instead of a match object or null) when matching succeeds or fails:

    const REGEX = /a/y;

    REGEX.lastIndex = 2;
    console.log(REGEX.test('xaxa')); // false

    REGEX.lastIndex = 3;
    console.log(REGEX.test('xaxa')); // true
    console.log(REGEX.lastIndex); // 4
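Because test() updates lastIndex like exec(), calling it repeatedly on the same sticky regular expression can give alternating results. The following sketch records each result together with lastIndex (the variable names are only for illustration):

```javascript
const STICKY = /a/y;
const observed = [];
for (let i = 0; i < 3; i++) {
  // Each entry: [result of test(), lastIndex afterwards]
  observed.push([STICKY.test('ab'), STICKY.lastIndex]);
}
console.log(observed);
// [ [ true, 1 ], [ false, 0 ], [ true, 1 ] ]
```

The second call fails because there is no 'a' at index 1; the failure resets lastIndex to 0, which is why the third call succeeds again.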
23.2.3 String.prototype.search(regex)
search() ignores the flag /g and lastIndex (which is not changed, either). Starting at the beginning of the string, it looks for the first match and returns its index (or -1 if there was no match):

    const REGEX = /a/;

    REGEX.lastIndex = 2; // ignored
    console.log('xaxa'.search(REGEX)); // 1

If you set the flag /y, lastIndex is still ignored, but the regular expression is now anchored to index 0.

    const REGEX = /a/y;

    REGEX.lastIndex = 1; // ignored
    console.log('xaxa'.search(REGEX)); // -1 (no match)
23.2.4 String.prototype.match(regex)
match() has two modes:
- If /g is not set, it works like exec().
- If /g is set, it returns an Array with the string parts that matched, or null.

If the flag /g is not set, match() captures groups like exec():

    {
        const REGEX = /a/;

        REGEX.lastIndex = 7; // ignored
        console.log('xaxa'.match(REGEX).index); // 1
        console.log(REGEX.lastIndex); // 7 (unchanged)
    }
    {
        const REGEX = /a/y;

        REGEX.lastIndex = 2;
        console.log('xaxa'.match(REGEX)); // null

        REGEX.lastIndex = 3;
        console.log('xaxa'.match(REGEX).index); // 3
        console.log(REGEX.lastIndex); // 4
    }
If only the flag /g is set then match() returns all matching substrings in an Array (or null). Matching always starts at position 0.

    const REGEX = /a|b/g;

    REGEX.lastIndex = 7;
    console.log('xaxb'.match(REGEX)); // ['a', 'b']
    console.log(REGEX.lastIndex); // 0
If you additionally set the flag /y, then matching is still performed repeatedly, while anchoring the regular expression to the index after the previous match (or 0).

    const REGEX = /a|b/gy;

    REGEX.lastIndex = 0; // ignored
    console.log('xab'.match(REGEX)); // null
    REGEX.lastIndex = 1; // ignored
    console.log('xab'.match(REGEX)); // null

    console.log('ab'.match(REGEX)); // ['a', 'b']
    console.log('axb'.match(REGEX)); // ['a']
23.2.5 String.prototype.split(separator, limit)
The complete details of split() are explained in Speaking JavaScript.
For ES6, it is interesting to see how things change if you use the flag /y.
In engines that follow the final ES6 specification, nothing changes: RegExp.prototype[Symbol.split]() never matches with the separator itself. It works with a copy that always has the flag /y set and it moves the lastIndex of that copy through the string on its own, one index at a time. Therefore, the flag /y of the separator has no visible effect and the string does not have to start with a separator:

    > 'x##'.split(/#/y)
    [ 'x', '', '' ]
    > 'x##'.split(/#/)
    [ 'x', '', '' ]
    > '#x#'.split(/#/y)
    [ '', 'x', '' ]

As usual, you can use groups to put parts of the separators into the result array:

    > '##'.split(/(#)/y)
    [ '', '#', '', '#', '' ]
23.2.6 String.prototype.replace(search, replacement)
Without the flag /g, replace() only replaces the first match:

    const REGEX = /a/;

    // One match
    console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'
If only /y is set, you also get at most one match, but that match is anchored to lastIndex (as with exec()). In engines that follow the final ES6 specification, lastIndex is set to the index after the match if there is one, and reset to 0 otherwise.

    const REGEX = /a/y;

    // Anchored to index 0 (the initial lastIndex), no match
    console.log('xaxa'.replace(REGEX, '-')); // 'xaxa'

    // Anchored to index 1, one match
    REGEX.lastIndex = 1;
    console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'
    console.log(REGEX.lastIndex); // 2 (updated)

    // Anchored to index 0, one match
    REGEX.lastIndex = 0;
    console.log('axa'.replace(REGEX, '-')); // '-xa'
With /g set, replace() replaces all matches:

    const REGEX = /a/g;

    // Multiple matches
    console.log('xaxa'.replace(REGEX, '-')); // 'x-x-'

With /gy set, replace() replaces all matches, but each match is anchored to the end of the previous match:

    const REGEX = /a/gy;

    // Multiple matches
    console.log('aaxa'.replace(REGEX, '-')); // '--xa'
The parameter replacement can also be a function; consult “Speaking JavaScript” for details.
23.2.7 Example: using sticky matching for tokenizing
The main use case for sticky matching is tokenizing, turning a text into a sequence of tokens. One important trait about tokenizing is that tokens are fragments of the text and that there must be no gaps between them. Therefore, sticky matching is perfect here.

    function tokenize(TOKEN_REGEX, str) {
        const result = [];
        let match;
        while (match = TOKEN_REGEX.exec(str)) {
            result.push(match[1]);
        }
        return result;
    }

    const TOKEN_GY = /\s*(\+|[0-9]+)\s*/gy;
    const TOKEN_G = /\s*(\+|[0-9]+)\s*/g;

In a legal sequence of tokens, sticky matching and non-sticky matching produce the same output:

    > tokenize(TOKEN_GY, '3 + 4')
    [ '3', '+', '4' ]
    > tokenize(TOKEN_G, '3 + 4')
    [ '3', '+', '4' ]

If, however, there is non-token text in the string then sticky matching stops tokenizing, while non-sticky matching skips the non-token text:

    > tokenize(TOKEN_GY, '3x + 4')
    [ '3' ]
    > tokenize(TOKEN_G, '3x + 4')
    [ '3', '+', '4' ]

The behavior of sticky matching during tokenizing helps with error handling.
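One way to turn that behavior into an error message is sketched below. The function tokenizeStrict() and the wording of its error are assumptions made for this illustration; the token pattern is the one from TOKEN_GY, with only /y set.

```javascript
// Hypothetical variant of tokenize() that reports where tokenizing stopped.
function tokenizeStrict(str) {
  // A regex literal creates a fresh object per call,
  // so no lastIndex state leaks between calls.
  const tokenRegex = /\s*(\+|[0-9]+)\s*/y;
  const tokens = [];
  let pos = 0;
  while (pos < str.length) {
    tokenRegex.lastIndex = pos;
    const match = tokenRegex.exec(str);
    if (match === null) {
      // Sticky matching failed exactly at `pos`: that is the error location
      throw new SyntaxError(`Unexpected character '${str[pos]}' at index ${pos}`);
    }
    tokens.push(match[1]);
    pos = tokenRegex.lastIndex; // continue right after the token
  }
  return tokens;
}

console.log(tokenizeStrict('3 + 4')); // [ '3', '+', '4' ]
try {
  tokenizeStrict('3x + 4');
} catch (e) {
  console.log(e.message); // Unexpected character 'x' at index 1
}
```

With a non-sticky regular expression, the position of the failure would be lost, because exec() would silently skip ahead to the next token.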
23.2.8 Example: manually implementing sticky matching
If you wanted to manually implement sticky matching, you’d do it roughly as follows: The function execSticky() works like RegExp.prototype.exec() in sticky mode.

    function execSticky(regex, str) {
        // Anchor the regex to the beginning of the string. The non-capturing
        // group makes `^` apply to all alternatives (e.g. in /a|b/).
        const matchSource = '^(?:' + regex.source + ')';
        // Ensure that instance property `lastIndex` is updated
        let matchFlags = regex.flags; // ES6 feature!
        if (!regex.global) {
            matchFlags = matchFlags + 'g';
        }
        const matchRegex = new RegExp(matchSource, matchFlags);

        // Ensure we start matching `str` at `regex.lastIndex`
        const matchOffset = regex.lastIndex;
        const matchStr = str.slice(matchOffset);
        const match = matchRegex.exec(matchStr);

        // Like sticky exec(): no match resets `lastIndex` and returns null
        if (match === null) {
            regex.lastIndex = 0;
            return null;
        }

        // Translate indices from `matchStr` to `str`
        regex.lastIndex = matchRegex.lastIndex + matchOffset;
        match.index = match.index + matchOffset;
        match.input = str;
        return match;
    }

23.3 New flag /u (unicode)
The flag /u switches on a special Unicode mode for a regular expression. That mode has two features:
- You can use Unicode code point escape sequences such as \u{1F42A} for specifying characters via code points. Normal Unicode escapes such as \u03B1 only have a range of four hexadecimal digits (which equals the basic multilingual plane).
- “Characters” in the regular expression pattern and the string are code points (not UTF-16 code units). Code units are converted into code points.

A section in the chapter on Unicode has more information on escape sequences. I’ll explain the consequences of feature 2 next. Instead of Unicode code point escapes (e.g., \u{1F680}), I’m using two UTF-16 code units (e.g., \uD83D\uDE80). That makes it clear that surrogate pairs are grouped in Unicode mode and works in both Unicode mode and non-Unicode mode.

    > '\u{1F680}' === '\uD83D\uDE80' // code point vs. surrogate pairs
    true
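For feature 1, a short sketch of what changes when /u is missing. The variable names are only for illustration; the non-Unicode behavior shown relies on the lenient legacy syntax that web engines accept for regular expressions.

```javascript
const rocket = '\u{1F680}'; // same string as '\uD83D\uDE80'

// With /u, \u{1F680} in the pattern denotes a single code point
const withUnicodeFlag = /^\u{1F680}$/u;
console.log(withUnicodeFlag.test(rocket)); // true

// Without /u, \u is just the letter 'u' and the braces are literal text
const withoutUnicodeFlag = /^\u{1F680}$/;
console.log(withoutUnicodeFlag.test(rocket)); // false
console.log(withoutUnicodeFlag.test('u{1F680}')); // true
```

In other words, forgetting /u does not raise an error here; the pattern silently means something else.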
23.3.1 Consequence: lone surrogates in the regular expression only match lone surrogates
In non-Unicode mode, a lone surrogate in a regular expression is even found inside (surrogate pairs encoding) code points:

    > /\uD83D/.test('\uD83D\uDC2A')
    true

In Unicode mode, surrogate pairs become atomic units and lone surrogates are not found “inside” them:

    > /\uD83D/u.test('\uD83D\uDC2A')
    false

Actual lone surrogates are still found:

    > /\uD83D/u.test('\uD83D \uD83D\uDC2A')
    true
    > /\uD83D/u.test('\uD83D\uDC2A \uD83D')
    true

23.3.2 Consequence: you can put code points in character classes
In Unicode mode, you can put code points into character classes and they won’t be interpreted as two characters anymore.

    > /^[\uD83D\uDC2A]$/u.test('\uD83D\uDC2A')
    true
    > /^[\uD83D\uDC2A]$/.test('\uD83D\uDC2A')
    false
    > /^[\uD83D\uDC2A]$/u.test('\uD83D')
    false
    > /^[\uD83D\uDC2A]$/.test('\uD83D')
    true
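The same holds for ranges inside character classes. The following sketch uses the Unicode block Emoticons (U+1F600 to U+1F64F) as an example range; the constant name EMOTICON is made up.

```javascript
// A character class range whose end points are astral code points
const EMOTICON = /^[\u{1F600}-\u{1F64F}]$/u;

console.log(EMOTICON.test('\u{1F600}')); // true  (first code point of the range)
console.log(EMOTICON.test('\u{1F642}')); // true  (inside the range)
console.log(EMOTICON.test('\u{1F680}')); // false (outside the range)
console.log(EMOTICON.test('\uD83D'));    // false (lone surrogate)
```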
23.3.3 Consequence: the dot operator (.) matches code points, not code units
In Unicode mode, the dot operator matches code points (one or two code units). In non-Unicode mode, it matches single code units. For example:

    > '\uD83D\uDE80'.match(/./gu).length
    1
    > '\uD83D\uDE80'.match(/./g).length
    2
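That difference can be used to count code points. A minimal sketch, with the made-up function name countCodePoints(); it uses [\s\S] instead of the dot, because the dot does not match line terminators.

```javascript
// Hypothetical helper: counts code points instead of UTF-16 code units.
function countCodePoints(str) {
  // In Unicode mode, each match of [\s\S] is one complete code point
  const matches = str.match(/[\s\S]/gu);
  return matches === null ? 0 : matches.length;
}

console.log('a\uD83D\uDE80b'.length);           // 4 (code units)
console.log(countCodePoints('a\uD83D\uDE80b')); // 3 (code points)
console.log(countCodePoints(''));               // 0
```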
23.3.4 Consequence: quantifiers apply to code points, not code units
In Unicode mode, quantifiers apply to code points (one or two code units). In non-Unicode mode, they apply to single code units. For example:

    > /\uD83D\uDE80{2}/u.test('\uD83D\uDE80\uD83D\uDE80')
    true
    > /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uD83D\uDE80')
    false
    > /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uDE80')
    true

23.4 New data property flags
In ECMAScript 6, regular expressions have the following data properties:
- The pattern: source
- The flags: flags
- Individual flags: global, ignoreCase, multiline, sticky, unicode
- Other: lastIndex

As an aside, lastIndex is the only instance property now; all other data properties are implemented via internal instance properties and getters such as get RegExp.prototype.global.
The property source (which already existed in ES5) contains the regular expression pattern as a string:

    > /abc/ig.source
    'abc'

The property flags is new. It contains the flags as a string, with one character per flag:

    > /abc/ig.flags
    'gi'
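As the example suggests ('ig' becomes 'gi'), flags does not preserve the order in which you wrote the flags. A small sketch that also reads the individual flag properties (the variable name is only for illustration):

```javascript
// The flags are deliberately given in an unusual order
const re = new RegExp('abc', 'yig');

// `flags` always lists them in a fixed order
console.log(re.flags); // 'giy'

// The individual flag properties are booleans
console.log(re.global, re.ignoreCase, re.multiline, re.sticky, re.unicode);
// true true false true false
```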
You can’t change the flags of an existing regular expression (ignoreCase etc. have always been immutable), but flags allows you to make a copy where the flags are changed:

    function copyWithIgnoreCase(re) {
        return new RegExp(re.source,
            re.flags.includes('i') ? re.flags : re.flags + 'i');
    }

The next section explains another way to make modified copies of regular expressions.
23.5 RegExp() can be used as a copy constructor
In ES6 there are two variants of the constructor RegExp() (the second one is new):
- new RegExp(pattern : string, flags = '')
  A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
- new RegExp(regex : RegExp, flags = regex.flags)
  regex is cloned. If flags is provided then it determines the flags of the copy.

The following interaction demonstrates the latter variant:

    > new RegExp(/abc/ig).flags
    'gi'
    > new RegExp(/abc/ig, 'i').flags // change flags
    'i'

Therefore, the RegExp constructor gives us another way to change flags:

    function copyWithIgnoreCase(re) {
        return new RegExp(re,
            re.flags.includes('i') ? re.flags : re.flags + 'i');
    }

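A copy made via the constructor is a separate object with its own lastIndex, which starts at 0. The following sketch illustrates that; the variable names are made up.

```javascript
const original = /a/g;
original.exec('aaa'); // advances original.lastIndex to 1

const copy = new RegExp(original);
console.log(copy === original);       // false (a new object)
console.log(copy.source, copy.flags); // a g
console.log(copy.lastIndex);          // 0 (not copied from the original)

// Matching with the copy does not affect the original
copy.exec('aaa');
copy.exec('aaa');
console.log(copy.lastIndex);     // 2
console.log(original.lastIndex); // 1
```

This independence is what makes copying useful when a function must not change the state of a regular expression that was passed in.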
23.5.1 Example: an iterable version of exec()
The following function execAll() is an iterable version of exec() that fixes several issues with using exec() to retrieve all matches of a regular expression:
- Looping over the matches is unnecessarily complicated (you call exec() until it returns null).
- exec() mutates the regular expression, which means that side effects can become a problem.
- The flag /g must be set. Otherwise, only the first match is returned.

    function* execAll(regex, str) {
        // Make sure flag /g is set and regex.lastIndex isn’t changed
        const localCopy = copyAndEnsureFlag(regex, 'g');
        let match;
        while (match = localCopy.exec(str)) {
            yield match;
        }
    }
    function copyAndEnsureFlag(re, flag) {
        return new RegExp(re,
            re.flags.includes(flag) ? re.flags : re.flags + flag);
    }

Using execAll():

    const str = '"fee" "fi" "fo" "fum"';
    const regex = /"([^"]*)"/;

    // Access capture of group #1 via destructuring
    for (const [, group1] of execAll(regex, str)) {
        console.log(group1);
    }
    // Output:
    // fee
    // fi
    // fo
    // fum

23.6 String methods that delegate to regular expression methods
The following string methods now delegate some of their work to regular expression methods:
- String.prototype.match calls RegExp.prototype[Symbol.match].
- String.prototype.replace calls RegExp.prototype[Symbol.replace].
- String.prototype.search calls RegExp.prototype[Symbol.search].
- String.prototype.split calls RegExp.prototype[Symbol.split].

For more information, consult Sect. “String methods that delegate regular expression work to their parameters” in the chapter on strings.
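One consequence of this delegation is that the parameter does not have to be a regular expression: any object with a method whose key is the right symbol will do. A minimal sketch with a made-up object ignoreCaseSearcher that implements Symbol.search:

```javascript
// Hypothetical searcher object: finds a word while ignoring case.
const ignoreCaseSearcher = {
  word: 'es6',
  // String.prototype.search() calls this method with the string
  [Symbol.search](str) {
    return str.toLowerCase().indexOf(this.word);
  },
};

console.log('Exploring ES6'.search(ignoreCaseSearcher)); // 10
console.log('Exploring ES5'.search(ignoreCaseSearcher)); // -1
```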
Further reading
If you want to know in more detail how the regular expression flag /u works, I recommend the article “Unicode-aware regular expressions in ECMAScript 6” by Mathias Bynens.