39. Regular expressions (RegExp)
39.1. Creating regular expressions
39.1.1. Literal vs. constructor
The two main ways of creating regular expressions are:
- Literal: /abc/ui, compiled statically (at load time).
- Constructor: new RegExp('abc', 'ui'), compiled dynamically (at runtime).
  - The second parameter is optional.
Both regular expressions have the same two parts:
- The body abc – the actual regular expression.
- The flags u and i. Flags configure how the pattern is interpreted:
  - u switches on the Unicode mode.
  - i enables case-insensitive matching.
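As a brief illustration of the two ways of creating a regular expression, here is a minimal sketch (the variable names are made up for this example):

```javascript
// Both expressions describe the same body and the same flags.
const viaLiteral = /abc/ui;
const viaConstructor = new RegExp('abc', 'ui');

console.log(viaLiteral.source === viaConstructor.source); // true
console.log(viaLiteral.flags === viaConstructor.flags);   // true
console.log(viaConstructor.test('ABC'));                  // true (flag i)

// Inside a string, each backslash of the pattern must be doubled:
const digitsViaConstructor = new RegExp('\\d+');
console.log(digitsViaConstructor.test('abc123')); // true
```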
39.1.2. Cloning and non-destructively modifying regular expressions
There are two variants of the constructor RegExp():
- new RegExp(pattern : string, flags = '')
  A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
- new RegExp(regExp : RegExp, flags = regExp.flags) [ES6]
  regExp is cloned. If flags is provided, then it determines the flags of the copy.
The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them. For example:
function copyAndAddFlags(regExp, flags='') {
  // The constructor doesn’t allow duplicate flags,
  // make sure there aren’t any:
  const newFlags = [...new Set(regExp.flags + flags)].join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');
39.2. Syntax
39.2.1. Syntax characters
At the top level of a regular expression, the following syntax characters are special. They are escaped by prefixing a backslash (\).
\ ^ $ . * + ? ( ) [ ] { } |
In regular expression literals, you must also escape the slash (not necessary with new RegExp()).
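The difference between the two notations can be sketched as follows (the variable names are only illustrative):

```javascript
// In a literal, each slash in the pattern needs a backslash:
const slashesViaLiteral = /\/\//;
// With the constructor, slashes can be written as-is:
const slashesViaConstructor = new RegExp('//');

console.log(slashesViaLiteral.test('https://example.com'));     // true
console.log(slashesViaConstructor.test('https://example.com')); // true

// Syntax characters such as * must be escaped in both notations:
console.log(/^\*$/.test('*'));               // true
console.log(new RegExp('^\\*$').test('*')); // true
```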
39.2.2. Basic atoms
Atoms are the basic building blocks of regular expressions.
- Pattern characters are all characters except syntax characters (^, $, etc.). Pattern characters match themselves. Examples: A b %
- . matches any character. You can use the flag /s (dotAll) to control if the dot matches line terminators or not (more below).
- Character escapes (each escape matches a single fixed character):
  - Control escapes (for a few control characters):
    - \f: form feed (FF)
    - \n: line feed (LF)
    - \r: carriage return (CR)
    - \t: character tabulation
    - \v: line tabulation
  - Arbitrary control characters: \cA (Ctrl-A), \cB (Ctrl-B), etc.
  - Unicode code units: \u00E4
  - Unicode code points (require flag /u): \u{1F44D}
- Character class escapes (each escape matches one out of a set of characters):
  - \d: digits (same as [0-9])
    - \D: non-digits
  - \w: “word” characters (same as [A-Za-z0-9_])
    - \W: non-word characters
  - \s: whitespace (space, tab, line terminators, etc.)
    - \S: non-whitespace
  - Unicode property escapes (ES2018): \p{White_Space}, \P{White_Space}, etc.
    - Require flag /u.
    - Described in the next subsection.
39.2.2.1. Unicode property escapes
Unicode property escapes look like this:
1. \p{prop=value}: matches all characters whose property prop has the value value.
2. \P{prop=value}: matches all characters that do not have a property prop whose value is value.
3. \p{bin_prop}: matches all characters whose binary property bin_prop is True.
4. \P{bin_prop}: matches all characters whose binary property bin_prop is False.
Comments:
- You can only use Unicode property escapes if the flag /u is set. Without /u, \p is the same as p.
- Forms (3) and (4) can be used as abbreviations if the property is General_Category. For example, \p{Lowercase_Letter} is an abbreviation for \p{General_Category=Lowercase_Letter}.
Examples:
- Checking for whitespace:
- Checking for Greek letters:
- Deleting any letters:
- Deleting lowercase letters:
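The four use cases listed above could look roughly as follows. This is a sketch with sample strings chosen for this illustration:

```javascript
// Checking for whitespace:
console.log(/^\p{White_Space}+$/u.test('\t \n\r')); // true

// Checking for Greek letters:
console.log(/^\p{Script=Greek}+$/u.test('αβγ')); // true

// Deleting any letters:
console.log('1π2ж3'.replace(/\p{Letter}/gu, '')); // 123

// Deleting lowercase letters:
console.log('AbCdEf'.replace(/\p{Lowercase_Letter}/gu, '')); // ACE

// Without /u, \p is the same as p:
console.log(/^\p$/.test('p')); // true
```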
Further reading:
- Lists of Unicode properties and their values: “Unicode Standard Annex #44: Unicode Character Database” (Editors: Mark Davis, Laurențiu Iancu, Ken Whistler)
- Unicode property escapes in more depth: Chapter “RegExp Unicode property escapes” in “Exploring ES2018 and ES2019”
39.2.3. Character classes
- Match one of a set of characters: [abc]
  - Match any character not in a set: [^abc]
- Inside the square brackets, only the following characters are special and must be escaped:
  ^ \ - ]
  - ^ only has to be escaped if it comes first.
  - - need not be escaped if it comes first or last.
- Character escapes (\n, \u{1F44D}) and character class escapes (\d, \p{White_Space}) work as usual.
  - Exception: Inside square brackets, \b matches backspace. Elsewhere, it matches word boundaries.
- Character ranges are specified via dashes: [a-z], [^a-z]
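A few quick checks illustrate these rules (a sketch, the inputs are arbitrary):

```javascript
console.log(/^[abc]$/.test('b'));  // true
console.log(/^[^abc]$/.test('b')); // false
console.log(/^[a-z]$/.test('q'));  // true

// A dash that comes last needs no escape:
console.log(/^[a-z-]$/.test('-')); // true
// A caret that comes first must be escaped:
console.log(/^[\^a]$/.test('^'));  // true
// Inside square brackets, \b is the backspace character:
console.log(/^[\b]$/.test('\b'));  // true
// Character class escapes work as usual:
console.log(/^[\d.]+$/.test('3.14')); // true
```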
39.2.4. Groups
- Positional capturing group: (#+)
  - Backreference: \1, \2, etc.
- Named capturing group (ES2018): (?<hashes>#+)
  - Backreference: \k<hashes>
- Noncapturing group: (?:#+)
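The following sketch shows the three kinds of groups in action. The patterns are made up for this illustration: they require the same run of hashes at the start and at the end of a string.

```javascript
// Positional group with backreference \1:
const positional = /^(#+) .* \1$/;
console.log(positional.test('## title ##')); // true
console.log(positional.test('## title #'));  // false

// Named group with backreference \k<hashes>:
const named = /^(?<hashes>#+) .* \k<hashes>$/;
console.log(named.test('### title ###')); // true

// A noncapturing group groups without capturing,
// so (x) is still group 1:
console.log(/^(?:#+)(x)$/.exec('##x')[1]); // x
```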
39.2.5. Quantifiers
By default, all of the following quantifiers are greedy:
- ?: match never or once
- *: match zero or more times
- +: match one or more times
- {n}: match n times
- {n,}: match n or more times
- {n,m}: match at least n times, at most m times.
To make them reluctant, put question marks (?) after them.
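The difference between greedy and reluctant matching can be seen in a small sketch (the input string is arbitrary):

```javascript
const input = '"abc"def"';

// Greedy: .* consumes as many characters as possible.
console.log(/".*"/.exec(input)[0]);  // "abc"def"

// Reluctant: .*? consumes as few characters as possible.
console.log(/".*?"/.exec(input)[0]); // "abc"
```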
39.2.6. Assertions
- ^ matches only at the beginning of the input
- $ matches only at the end of the input
- \b matches only at a word boundary
  - \B matches only when not at a word boundary
- Lookahead:
  - (?=«pattern») matches if pattern matches what comes next (positive lookahead). Example: “sequences of lowercase letters that are followed by an X” (note that the X itself is not part of the matched substring).
  - (?!«pattern») matches if pattern does not match what comes next (negative lookahead). Example: “sequences of lowercase letters that are not followed by an X”.
  - Further reading: “RegExp lookbehind assertions” in “Exploring ES2018 and ES2019” (covers lookahead assertions, too)
- Lookbehind (ES2018):
  - (?<=«pattern») matches if pattern matches what came before (positive lookbehind)
  - (?<!«pattern») matches if pattern does not match what came before (negative lookbehind)
  - Further reading: “RegExp lookbehind assertions” in “Exploring ES2018 and ES2019”
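The lookahead and lookbehind assertions described above can be tried out as follows (a sketch with sample strings chosen for this illustration):

```javascript
// Positive lookahead: lowercase letters followed by an X
console.log('abcX def'.match(/[a-z]+(?=X)/g)); // [ 'abc' ]

// Negative lookahead: lowercase letters not followed by an X
console.log('abcX def'.match(/[a-z]+(?!X)/g)); // [ 'ab', 'def' ]

// Positive lookbehind: lowercase letters preceded by an X
console.log('Xabc def'.match(/(?<=X)[a-z]+/g)); // [ 'abc' ]

// Negative lookbehind: lowercase letters not preceded by an X
console.log('Xabc def'.match(/(?<!X)[a-z]+/g)); // [ 'bc', 'def' ]
```

Note that the negative variants still find partial sequences such as 'ab' and 'bc', because the assertion is only checked at the boundary of each candidate match.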
39.2.7. Disjunction (|)
Caveat: this operator has low precedence. Use groups if necessary:
- ^aa|zz$ matches all strings that start with aa and/or end with zz. Note that | has a lower precedence than ^ and $.
- ^(aa|zz)$ matches the two strings 'aa' and 'zz'.
- ^a(a|z)z$ matches the two strings 'aaz' and 'azz'.
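The three patterns can be checked directly (a small sketch):

```javascript
// ^aa|zz$ means: (^aa) or (zz$)
console.log(/^aa|zz$/.test('aaxx')); // true
console.log(/^aa|zz$/.test('xxzz')); // true
console.log(/^aa|zz$/.test('xaax')); // false

// The group restricts the alternatives:
console.log(/^(aa|zz)$/.test('aaxx')); // false
console.log(/^(aa|zz)$/.test('zz'));   // true

console.log(/^a(a|z)z$/.test('aaz')); // true
console.log(/^a(a|z)z$/.test('azz')); // true
console.log(/^a(a|z)z$/.test('aa'));  // false
```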
39.3. Flags
Literal flag | Property name | ES | Description |
---|---|---|---|
g | global | ES3 | Match multiple times |
i | ignoreCase | ES3 | Match case-insensitively |
m | multiline | ES3 | ^ and $ match per line |
s | dotAll | ES2018 | Dot matches line terminators |
u | unicode | ES6 | Unicode mode (recommended) |
y | sticky | ES6 | No characters between matches |
The following regular expression flags are available in JavaScript (tbl. 20 provides a compact overview):
- /g (.global): fundamentally changes how the methods RegExp.prototype.test(), RegExp.prototype.exec() and String.prototype.match() work. It is explained in detail along with these methods. In a nutshell: Without /g, the methods only consider the first match for a regular expression in an input string. With /g, they consider all matches.
- /i (.ignoreCase): switches on case-insensitive matching.
- /m (.multiline): If this flag is on, ^ matches the beginning of each line and $ matches the end of each line. If it is off, ^ matches the beginning of the whole input string and $ matches the end of the whole input string.
- /u (.unicode): This flag switches on the Unicode mode for a regular expression. That mode is explained in the next subsection.
- /y (.sticky): This flag mainly makes sense in conjunction with /g. When both are switched on, each match must start exactly at .lastIndex, so any match after the first one must directly follow the previous match (without any characters between them).
- /s (.dotAll): By default, the dot does not match line terminators. With this flag, it does. Alternative for older ECMAScript versions: use [^] instead of the dot (a negated empty character class matches any character).
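The effect of the individual flags can be sketched as follows (the sample strings are arbitrary):

```javascript
// /i: case-insensitive matching
console.log(/a/.test('A'));  // false
console.log(/a/i.test('A')); // true

// /m: ^ and $ match per line
const lines = 'a1\na2\na3';
console.log(lines.match(/^a./g));  // [ 'a1' ]
console.log(lines.match(/^a./gm)); // [ 'a1', 'a2', 'a3' ]

// /y with /g: matches must directly follow each other
console.log('a1a2 a3'.match(/a\d/g));  // [ 'a1', 'a2', 'a3' ]
console.log('a1a2 a3'.match(/a\d/gy)); // [ 'a1', 'a2' ]

// /s: the dot matches line terminators
console.log(/^.$/.test('\n'));   // false
console.log(/^.$/s.test('\n'));  // true
// Alternative without /s:
console.log(/^[^]$/.test('\n')); // true
```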
39.3.1. Flag: Unicode mode via /u
The flag /u switches on a special Unicode mode for a regular expression. That mode enables several features:
- In patterns, you can use Unicode code point escapes such as \u{1F42A} to specify characters. Code unit escapes such as \u03B1 only have a range of four hexadecimal digits (which equals the basic multilingual plane).
- In patterns, you can use Unicode property escapes (ES2018) such as \p{White_Space}.
- Many escapes are now forbidden (which enables the previous feature).
- The atomic units for matching (“characters”) are code points, not code units.
The following subsections explain the last item in more detail. They use the following Unicode character to explain when the atomic units are code points and when they are code units: 🙂 (code point U+1F642), which consists of the two code units \uD83D\uDE42.
I’m only switching between 🙂 and \uD83D\uDE42 to illustrate how JavaScript sees things. Both are equivalent and can be used interchangeably in strings and regular expressions.
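The following sketch illustrates two of these points: the equivalence of the two notations and the forbidden escapes. The helper compiles() is hypothetical and only exists for this example. It uses the constructor so that an invalid pattern produces a catchable exception instead of a syntax error at load time.

```javascript
// One code point, two code units:
const smiley = '\u{1F642}';
console.log(smiley === '\uD83D\uDE42'); // true
console.log(smiley.length);             // 2

// Hypothetical helper: can the pattern be compiled with the given flags?
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(compiles('\\a', ''));  // true: without /u, \a is the same as a
console.log(compiles('\\a', 'u')); // false: forbidden escape
console.log(compiles('\\-', 'u')); // false: forbidden escape

// Code point escapes require /u:
console.log(new RegExp('^\\u{1F642}$', 'u').test(smiley)); // true
```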
39.3.1.1. Consequence: you can put code points in character classes
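With /u, the two code units of 🙂 are interpreted as a single character, so /^[🙂]$/u matches the string '🙂'.
Without /u, 🙂 is interpreted as two characters: the class [\uD83D\uDE42] then matches either one of the two code units.
Note that ^ and $ demand that the input string have a single character. That’s why, without /u, /^[\uD83D\uDE42]$/ does not match '\uD83D\uDE42' (two code units) but does match '\uDE42' (one code unit).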
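The behavior of character classes with and without /u can be checked as follows. The sketch writes the character as the escape \u{1F642} so that it does not depend on the encoding of the source file:

```javascript
// With /u, the class contains a single code point:
console.log(/^[\u{1F642}]$/u.test('\u{1F642}')); // true

// Without /u, the class contains two code units:
console.log(/^[\uD83D\uDE42]$/.test('\uD83D\uDE42')); // false
console.log(/^[\uD83D\uDE42]$/.test('\uDE42'));       // true
```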
39.3.1.2. Consequence: the dot operator (.) matches code points, not code units
With /u, the dot operator matches code points (.match() plus /g returns an Array with all the matches of a regular expression): for the string '🙂', /./gu produces one match.
Without /u, the dot operator matches single code units: for the same string, /./g produces two matches.
39.3.1.3. Consequence: quantifiers apply to code points, not code units
With /u, a quantifier applies to the whole preceding code point: /^🙂{3}$/u matches '🙂🙂🙂'.
Without /u, a quantifier only applies to the preceding code unit: /^\uD83D\uDE42{3}$/ matches '\uD83D\uDE42\uDE42\uDE42'.
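The behavior of the dot operator and of quantifiers, as described in this subsection and the previous one, can be sketched like this (the variable name face is made up for the example):

```javascript
const face = '\u{1F642}'; // one code point, two code units

// The dot matches code points with /u, code units without it:
console.log(face.match(/./gu).length); // 1
console.log(face.match(/./g).length);  // 2

// With /u, {3} applies to the whole code point:
console.log(/^\u{1F642}{3}$/u.test(face.repeat(3))); // true

// Without /u, {3} only applies to the second code unit:
console.log(/^\uD83D\uDE42{3}$/.test('\uD83D\uDE42\uDE42\uDE42')); // true
console.log(/^\uD83D\uDE42{3}$/.test(face.repeat(3)));             // false
```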
39.4. Properties of regular expression objects
Noteworthy:
- Strictly speaking, only .lastIndex is a real instance property. All other properties are implemented via getters.
- Accordingly, .lastIndex is the only mutable property. All other properties are read-only. If you want to change them, you need to copy the regular expression (consult the section on cloning for details).
39.4.1. Flags as properties
Each regular expression flag exists as a property with a longer, more descriptive name. For example, /a/i.ignoreCase is true and /a/.ignoreCase is false.
This is the complete list of flag properties:
- .dotAll (/s)
- .global (/g)
- .ignoreCase (/i)
- .multiline (/m)
- .sticky (/y)
- .unicode (/u)
39.4.2. Other properties
Each regular expression also has the following properties:
- .source: The regular expression pattern.
- .flags: The flags of the regular expression.
- .lastIndex: Used when flag /g is switched on. Consult the section on that flag for details.
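A short sketch that reads the properties of a sample regular expression (the pattern is arbitrary):

```javascript
const sample = /a+b/gi;

// Flags as properties:
console.log(sample.global);     // true
console.log(sample.ignoreCase); // true
console.log(sample.multiline);  // false

// Other properties:
console.log(sample.source);    // a+b
console.log(sample.flags);     // gi
console.log(sample.lastIndex); // 0
```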
39.5. Methods for working with regular expressions
39.5.1. regExp.test(str): is there a match?
The regular expression method .test() returns true if regExp matches str.
With .test() you should normally avoid the /g flag. If you use it, you generally don’t get the same result every time you call the method.
For example, /a/g has two matches in the string 'aa', so the first two calls of .test() return true. After all of those were found, .test() returns false.
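Both behaviors can be seen in a minimal sketch (inputs chosen for illustration):

```javascript
// Without /g, the result is the same every time:
console.log(/bc/.test('ABCD'));  // false
console.log(/bc/i.test('ABCD')); // true

// With /g, each call continues after the previous match:
const withG = /a/g;
const results = [withG.test('aa'), withG.test('aa'), withG.test('aa')];
console.log(results); // [ true, true, false ]
```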
39.5.2. str.search(regExp): at what index is the match?
The string method .search() returns the first index of str at which there is a match for regExp (or -1 if there is no match).
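For example (a sketch with arbitrary inputs):

```javascript
console.log('a2b'.search(/[0-9]/)); // 1
console.log('a2b'.search(/[xyz]/)); // -1
```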
39.5.3. regExp.exec(str): capturing groups
39.5.3.1. Getting a match object for the first match
Without the flag /g, .exec() returns all captures of the first match for regExp in str.
The result is a match object with the following properties:
- [0]: the complete substring matched by the regular expression
- [1]: capture of positional group 1 (etc.)
- .index: where did the match occur?
- .input: the string that was matched against
- .groups: captures of named groups
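The properties listed above can be inspected as follows. This is a sketch; the pattern with a single positional group is chosen for illustration:

```javascript
const matchObj = /(a+)b/.exec('ab aab');

console.log(matchObj[0]);     // ab
console.log(matchObj[1]);     // a
console.log(matchObj.index);  // 0
console.log(matchObj.input);  // ab aab
console.log(matchObj.groups); // undefined (there are no named groups)
```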
39.5.3.2. Named groups (ES2018)
The captures of named groups are stored in the property .groups of the match object. In addition, named groups also exist as positional groups: if a regular expression has the two named groups key and value (in this order), their captures are also available at the indices 1 and 2.
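A minimal sketch with two named groups, key and value (the pattern and the input are assumptions made for this example):

```javascript
const namedMatch = /(?<key>[a-z]+)=(?<value>[0-9]+)/.exec('width=300');

// Named access:
console.log(namedMatch.groups.key);   // width
console.log(namedMatch.groups.value); // 300

// The same captures via positions:
console.log(namedMatch[1]); // width
console.log(namedMatch[2]); // 300
```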
39.5.3.3. Looping over multiple matches
If you want to retrieve all matches of a regular expression (not just the first one), you need to switch on the flag /g. Then you can call .exec() multiple times and get another match each time. After the last match, .exec() returns null.
Therefore, you can loop over all matches by calling .exec() until it returns null.
Sharing regular expressions with /g has a few pitfalls, which are explained later.
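Such a loop could look as follows (a sketch; the names globalRegExp, found and match are only illustrative):

```javascript
const globalRegExp = /(a+)b/g;
const str = 'ab aab';

const found = [];
let match;
// .exec() returns null after the last match, which ends the loop.
while ((match = globalRegExp.exec(str)) !== null) {
  found.push({ capture: match[1], index: match.index });
}

console.log(found);
// [ { capture: 'a', index: 0 }, { capture: 'aa', index: 3 } ]
```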
39.5.4. str.match(regExp): return all matching substrings
Without /g, .match() works like .exec() – it returns a single match object.
With /g, .match() returns all substrings of str that match regExp.
If there is no match, .match() returns null.
You can use the Or operator (||) to protect yourself against null, for example by falling back to an empty Array.
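The different results of .match() can be sketched as follows (arbitrary sample inputs):

```javascript
// Without /g: a single match object
const single = 'ab aab'.match(/(a+)b/);
console.log(single[0], single[1], single.index); // ab a 0

// With /g: all matching substrings
console.log('ab aab'.match(/(a+)b/g)); // [ 'ab', 'aab' ]

// No match: null
console.log('xyz'.match(/(a+)b/g)); // null

// Protecting against null via ||
const numberOfMatches = ('xyz'.match(/(a+)b/g) || []).length;
console.log(numberOfMatches); // 0
```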
39.5.5. str.replace(searchValue, replacementValue)
.replace() has several different modes, depending on what values you provide for its parameters:
- searchValue is …
  - a regular expression without /g: replace first occurrence.
  - a regular expression with /g: replace all occurrences.
  - a string: replace first occurrence (the string is interpreted verbatim, not as a regular expression). Alas, that means that strings are of limited use as search values. Later in this chapter, you’ll find a tool function for turning an arbitrary text into a regular expression.
- replacementValue is …
  - a string: describe replacement
  - a function: compute replacement
The next subsections assume that a regular expression with /g is being used.
39.5.5.1. replacementValue is a string
If the replacement value is a string, the dollar sign has special meaning – it inserts things matched by the regular expression:
Text | Result |
---|---|
$$ | single $ |
$& | complete match |
$` | text before match |
$' | text after match |
$n | capture of positional group n (n > 0) |
$<name> | capture of named group name |
Example: Inserting the text before, inside, and after the matched substring.
Example: Inserting the captures of positional groups.
Example: Inserting the captures of named groups.
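The three examples named above could look like this. It is a sketch: the input strings and patterns are assumptions made for the illustration.

```javascript
// Text before, inside, and after the matched substring:
const around = 'x-y'.replace(/-/g, "[$`|$&|$']");
console.log(around); // x[x|-|y]y

// Captures of positional groups:
const positionalResult = 'first: Jane'
  .replace(/^([A-Za-z]+): (.*)$/ug, 'KEY: $1, VALUE: $2');
console.log(positionalResult); // KEY: first, VALUE: Jane

// Captures of named groups:
const namedResult = 'first: Jane'
  .replace(/^(?<key>[A-Za-z]+): (?<value>.*)$/ug, 'KEY: $<key>, VALUE: $<value>');
console.log(namedResult); // KEY: first, VALUE: Jane
```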
39.5.5.2. replacementValue is a function
If the replacement value is a function, you can compute each replacement. For example, you can multiply each non-negative integer that you find by two.
The replacement function gets the following parameters. Note how similar they are to match objects. The parameters are all positional, but I’ve included how one usually names them:
- all: complete match
- g1: capture of positional group 1
- Etc.
- index: where did the match occur?
- input: the string that was matched against
- groups: captures of named groups (an object)
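Here is a sketch of both a simple and a more complete replacement function (the input strings are made up for this example):

```javascript
// Multiply each non-negative integer by two:
const doubled = '3 cats and 4 dogs'
  .replace(/[0-9]+/g, (all) => String(2 * Number(all)));
console.log(doubled); // 6 cats and 8 dogs

// Using more of the parameters:
const swapped = 'a=1'.replace(
  /(?<key>[a-z])=(?<value>[0-9])/g,
  (all, g1, g2, index, input, groups) =>
    `${groups.value}=${groups.key}@${index}`
);
console.log(swapped); // 1=a@0
```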
39.5.6. Other methods for working with regular expressions
The first parameter of String.prototype.split() is either a string or a regular expression. If it is the latter, then substrings captured by groups are added to the result of the method.
Consult the chapter on strings for more information.
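The effect of capturing groups on .split() can be sketched as follows (arbitrary sample input):

```javascript
// Without groups, the separators are dropped:
const withoutGroups = 'a, b,c'.split(/\s*,\s*/);
console.log(withoutGroups); // [ 'a', 'b', 'c' ]

// With a group, the captured separators are included:
const withGroups = 'a, b,c'.split(/(\s*,\s*)/);
console.log(withGroups); // [ 'a', ', ', 'b', ',', 'c' ]
```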
39.6. Flag /g and its pitfalls
The following two regular expression methods do something unusual if /g is switched on:
- RegExp.prototype.exec()
- RegExp.prototype.test()
Then they can be called repeatedly and deliver all matches inside a string. Property .lastIndex of the regular expression is used to track the current position inside the string. For example:
const r = /a/g;
assert.equal(r.lastIndex, 0);
assert.equal(r.test('aa'), true); // 1st match?
assert.equal(r.lastIndex, 1); // after 1st match
assert.equal(r.test('aa'), true); // 2nd match?
assert.equal(r.lastIndex, 2); // after 2nd match
assert.equal(r.test('aa'), false); // 3rd match?
assert.equal(r.lastIndex, 0); // start over
So how is flag /g problematic? We’ll first explore the problems and then solutions.
39.6.1. Problem: You can’t inline a regular expression with flag /g
A regular expression with /g can’t be inlined. For example, if the condition of a while loop contains a regular expression literal such as /a/g, the regular expression is created fresh every time the condition is checked. Therefore, its .lastIndex is always zero and the loop never terminates.
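The problem can be demonstrated safely with an iteration cap. In this sketch, the cap of 10 and the variable names are assumptions made for the demonstration:

```javascript
// Inlined /a/g: a fresh regular expression per check, .lastIndex is always 0.
let iterations = 0;
while (/a/g.test('babaa')) {
  iterations++;
  if (iterations >= 10) break; // safety cap, only for this demonstration
}
console.log(iterations); // 10 (the loop would never end on its own)

// Fix: create the regular expression once, outside the loop.
const sharedRegExp = /a/g;
let count = 0;
while (sharedRegExp.test('babaa')) {
  count++;
}
console.log(count); // 3
```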
39.6.2. Problem: Removing /g can break code
If code expects a regular expression with /g and has a loop over the results of .exec() or .test(), then a regular expression without /g can cause an infinite loop.
Why? Because without /g, .test() always returns the first result, true, and never false.
39.6.3. Problem: Adding /g can break code
With .test(), there is another caveat: If you want to check exactly once if a regular expression matches a string, then the regular expression must not have /g. Otherwise, you generally get a different result every time you call .test().
Normally, you won’t add /g if you intend to use .test() in this manner. But it can happen if, e.g., you use the same regular expression for testing and for replacing. Or if you get the regular expression via a parameter.
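A sketch of this situation, where one regular expression is shared between replacing and testing (the variable name shared is made up):

```javascript
// /g was added because the regular expression is also used for replacing:
const shared = /a/g;
console.log('banana'.replace(shared, 'o')); // bonono

// The same check now gives alternating results:
console.log(shared.test('a')); // true
console.log(shared.test('a')); // false
console.log(shared.test('a')); // true

// Without /g, the result is stable:
console.log(/a/.test('a')); // true
console.log(/a/.test('a')); // true
```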
39.6.4. Problem: Code can break if .lastIndex isn’t zero
When a regular expression is created, .lastIndex is initialized to zero. If code ever receives a regular expression whose .lastIndex is not zero, it can break. For example, with /g, .test() and .exec() start searching at .lastIndex and therefore miss matches that come before that index.
.lastIndex not being zero can happen relatively easily if a regular expression is shared and not handled properly.
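A minimal sketch of how a non-zero .lastIndex leads to a missed match (the variable name is illustrative):

```javascript
const leftover = /a/g;
leftover.lastIndex = 2; // e.g. left over from earlier use

// The search starts at index 2, so the 'a' at index 0 is missed:
console.log(leftover.test('a b')); // false

// The failed search has reset .lastIndex to zero:
console.log(leftover.lastIndex);   // 0
console.log(leftover.test('a b')); // true
```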
39.6.5. Dealing with /g and .lastIndex
Consider the following scenario: You want to implement a function countOccurrences(regExp, str) that counts how often regExp has a match inside str. How do you prevent a wrong regExp from breaking your code? Let’s look at three approaches.
First, you can throw exceptions if /g isn’t set or .lastIndex isn’t zero.
Second, you can clone the parameter. That has the added benefit that regExp won’t be changed.
Third, you can use .match() to count occurrences – which doesn’t depend on .lastIndex (with /g, the method resets it to zero before matching).
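The three approaches might be sketched as follows. The function names with suffixes are hypothetical and only distinguish the variants. The loops assume that regExp never matches the empty string, because .test() would then not advance .lastIndex.

```javascript
// Approach 1: reject unsuitable regular expressions.
function countOccurrencesChecked(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  if (regExp.lastIndex !== 0) {
    throw new Error('regExp.lastIndex must be zero');
  }
  let count = 0;
  while (regExp.test(str)) count++;
  return count;
}

// Approach 2: clone the parameter, adding /g if necessary.
// The clone starts with .lastIndex === 0; regExp is not changed.
function countOccurrencesCloned(regExp, str) {
  const flags = regExp.global ? regExp.flags : regExp.flags + 'g';
  const clone = new RegExp(regExp, flags);
  let count = 0;
  while (clone.test(str)) count++;
  return count;
}

// Approach 3: count via .match(), which does not depend on .lastIndex.
function countOccurrencesViaMatch(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  return (str.match(regExp) || []).length;
}

console.log(countOccurrencesChecked(/a/g, 'ababaa'));  // 4
console.log(countOccurrencesCloned(/a/, 'ababaa'));    // 4
console.log(countOccurrencesViaMatch(/a/g, 'ababaa')); // 4
```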
39.7. Techniques for working with regular expressions
39.7.1. Escaping arbitrary text for regular expressions
The following function escapes an arbitrary text so that it is matched verbatim if you put it inside a regular expression:
In line A, we escape all syntax characters. Note that /u forbids many escapes: among others, \: and \-.
This is how you can use escapeForRegExp() to replace an arbitrary text multiple times.
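A possible implementation of escapeForRegExp() is sketched below, with line A marked by a comment. The function body and the sample text ':-)' are assumptions made for this sketch:

```javascript
function escapeForRegExp(str) {
  // Prefix each syntax character with a backslash:
  return str.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&'); // (A)
}

console.log(escapeForRegExp('a.b')); // a\.b

// Replacing an arbitrary text multiple times:
const smileyRegExp = new RegExp(escapeForRegExp(':-)'), 'ug');
const replaced = ':-) :-) :-)'.replace(smileyRegExp, 'OK');
console.log(replaced); // OK OK OK
```

Only syntax characters are escaped, so the result remains a valid pattern when flag /u is used.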
39.7.2. Matching everything or nothing
Sometimes, you may need a regular expression that matches everything or nothing, for example as a sentinel value.
- Match everything: /(?:)/ (the empty group matches everything; making it noncapturing avoids unnecessary work)
- Match nothing: /.^/ (once matching has progressed beyond the first character, ^ doesn’t match anymore)
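Both sentinels can be checked quickly (a sketch; the constant names are made up):

```javascript
const MATCH_EVERYTHING = /(?:)/;
const MATCH_NOTHING = /.^/;

console.log(MATCH_EVERYTHING.test(''));    // true
console.log(MATCH_EVERYTHING.test('abc')); // true

console.log(MATCH_NOTHING.test(''));    // false
console.log(MATCH_NOTHING.test('abc')); // false

// The empty group matches at every position:
console.log('abc'.replace(/(?:)/g, '_')); // _a_b_c_
```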