IX. More standard library - 39. Regular expressions (RegExp) - 《JavaScript for impatient programmers (beta)》

39. Regular expressions (RegExp)


39.1. Creating regular expressions

39.1.1. Literal vs. constructor

The two main ways of creating regular expressions are:

Literal: /abc/ui, compiled statically (at load time).
Constructor: new RegExp('abc', 'ui'), compiled dynamically (at runtime).
- The second parameter is optional.

Both regular expressions have the same two parts:

The body abc – the actual regular expression.
The flags u and i. Flags configure how the pattern is interpreted: u switches on the Unicode mode. i enables case-insensitive matching.
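
As a minimal sketch of the equivalence described above (the variable names are made up for illustration), both forms produce regular expressions with the same body and flags. One practical difference worth keeping in mind: inside the string passed to the constructor, backslashes must be doubled, because the string literal consumes one level of escaping.

```javascript
// Two ways of creating the same regular expression.
const viaLiteral = /abc/ui;
const viaConstructor = new RegExp('abc', 'ui');

console.log(viaLiteral.source === viaConstructor.source); // true
console.log(viaLiteral.flags === viaConstructor.flags);   // true

// In a string, the backslash of \d must be written as \\.
const digitsLiteral = /^\d+$/;
const digitsConstructor = new RegExp('^\\d+$');

console.log(digitsConstructor.source === digitsLiteral.source); // true
console.log(digitsConstructor.test('2019')); // true
console.log(digitsConstructor.test('20x9')); // false
```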

39.1.2. Cloning and non-destructively modifying regular expressions

There are two variants of the constructor RegExp():

new RegExp(pattern : string, flags = ''): A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
new RegExp(regExp : RegExp, flags = regExp.flags) [ES6]: regExp is cloned. If flags is provided, then it determines the flags of the copy.

The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them. For example:

function copyAndAddFlags(regExp, flags='') {
  // The constructor doesn’t allow duplicate flags,
  // make sure there aren’t any:
  const newFlags = [...new Set(regExp.flags + flags)].join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');
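
The same cloning technique can also be used to remove flags. The following is only a sketch; the helper name copyAndRemoveFlags is hypothetical and not part of any library.

```javascript
// Hypothetical helper: returns a copy of regExp without the given flags.
function copyAndRemoveFlags(regExp, flagsToRemove = '') {
  const newFlags = [...regExp.flags]
    .filter((flag) => !flagsToRemove.includes(flag))
    .join('');
  // Passing an explicit flags string overrides the flags of the original.
  return new RegExp(regExp, newFlags);
}

const original = /abc/gi;
const copy = copyAndRemoveFlags(original, 'g');
console.log(copy.flags);     // 'i'
console.log(original.flags); // 'gi' (the original is unchanged)
```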

39.2. Syntax

39.2.1. Syntax characters

At the top level of a regular expression, the following syntax characters are special. They are escaped by prefixing a backslash (\).

\ ^ $ . * + ? ( ) [ ] { } |

In regular expression literals, you must also escape the slash (not necessary with new RegExp()):

> /\//.test('/')
true
> new RegExp('/').test('/')
true

39.2.2. Basic atoms

Atoms are the basic building blocks of regular expressions.

Pattern characters are all characters except syntax characters (^, $, etc.). Pattern characters match themselves. Examples: A b %
. matches any character. You can use the flag /s (dotAll) to control whether the dot matches line terminators (more below).
Character escapes (each escape matches a single fixed character):
- Control escapes (for a few control characters):
  - \f: form feed (FF)
  - \n: line feed (LF)
  - \r: carriage return (CR)
  - \t: character tabulation
  - \v: line tabulation
- Arbitrary control characters: \cA (Ctrl-A), \cB (Ctrl-B), etc.
- Unicode code units: \u00E4
- Unicode code points (require flag /u): \u{1F44D}
Character class escapes (each escape matches one out of a set of characters):
- \d: digits (same as [0-9])
  - \D: non-digits
- \w: “word” characters (same as [A-Za-z0-9_])
  - \W: non-word characters
- \s: whitespace (space, tab, line terminators, etc.)
  - \S: non-whitespace
- Unicode property escapes (ES2018): \p{White_Space}, \P{White_Space}, etc.
  - Require flag /u.
  - Described in the next subsection.
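
The following sketch exercises a few of the atoms listed above. The input strings are made up for illustration.

```javascript
// Pattern characters match themselves. The dot matches any character
// except line terminators (unless the flag /s is used).
console.log(/^a.c$/.test('abc'));  // true
console.log(/^a.c$/.test('a\nc')); // false

// Control escape and code unit escape.
console.log(/^\t$/.test('\t'));     // true
console.log(/^\u00E4$/.test('ä')); // true

// Code point escapes require the flag /u.
console.log(/^\u{1F44D}$/u.test('👍')); // true

// Character class escapes and their negations.
console.log(/^\d\w\s$/.test('1_ ')); // true
console.log(/^\D\W\S$/.test('a-b')); // true
```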

39.2.2.1. Unicode property escapes

Unicode property escapes look like this:

1. \p{prop=value}: matches all characters whose property prop has the value value.
2. \P{prop=value}: matches all characters that do not have a property prop whose value is value.
3. \p{bin_prop}: matches all characters whose binary property bin_prop is True.
4. \P{bin_prop}: matches all characters whose binary property bin_prop is False.

Comments:

You can only use Unicode property escapes if the flag /u is set. Without /u, \p is the same as p.
Forms (3) and (4) can be used as abbreviations if the property is General_Category. For example, \p{Lowercase_Letter} is an abbreviation for \p{General_Category=Lowercase_Letter}.

Examples:

Checking for whitespace:

> /^\p{White_Space}+$/u.test('\t \n\r')
true

Checking for Greek letters:

> /^\p{Script=Greek}+$/u.test('μετά')
true

Deleting any letters:

> '1π2ü3é4'.replace(/\p{Letter}/ug, '')
'1234'

Deleting lowercase letters:

> 'AbCdEf'.replace(/\p{Lowercase_Letter}/ug, '')
'ACE'

39.2.3. Character classes

Match one of a set of characters: [abc]
- Match any character not in a set: [^abc]
Inside the square brackets, only the following characters are special and must be escaped:

^ \ - ]

^ only has to be escaped if it comes first. - need not be escaped if it comes first or last.

Character escapes (\n, \u{1F44D}) and character class escapes (\d, \p{White_Space}) work as usual.
- Exception: Inside square brackets, \b matches backspace. Elsewhere, it matches word boundaries.
Character ranges are specified via dashes: [a-z], [^a-z]
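
A short sketch of these rules in action (the test strings are arbitrary):

```javascript
// Sets and negated sets.
console.log(/^[abc]$/.test('b'));  // true
console.log(/^[^abc]$/.test('b')); // false

// Ranges.
console.log(/^[a-z]+$/.test('hello')); // true
console.log(/^[^a-z]$/.test('h'));     // false

// A dash that comes first or last is a literal dash.
console.log(/^[-+]$/.test('-')); // true
console.log(/^[a-]$/.test('-')); // true

// ^ is only special if it comes first.
console.log(/^[a^]$/.test('^')); // true

// Inside square brackets, \b matches a backspace (U+0008).
console.log(/^[\b]$/.test('\u0008')); // true
```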

39.2.4. Groups

Positional capturing group: (#+)
- Backreference: \1, \2, etc.
Named capturing group (ES2018): (?<hashes>#+)
- Backreference: \k<hashes>
Noncapturing group: (?:#+)
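
To illustrate the three kinds of groups, here is a small sketch in which a run of hashes must appear twice, separated by a space (the patterns and inputs are made up for this example):

```javascript
// Positional group with backreference \1.
const positional = /^(#+) \1$/;
console.log(positional.test('## ##')); // true
console.log(positional.test('## #'));  // false

// Named group with backreference \k<hashes>.
const named = /^(?<hashes>#+) \k<hashes>$/;
console.log(named.test('### ###'));               // true
console.log(named.exec('### ###').groups.hashes); // '###'

// A noncapturing group groups without capturing,
// so the first capture is the group (x+).
const noncapturing = /^(?:#+) (x+)$/;
console.log(noncapturing.exec('## xx')[1]); // 'xx'
```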

39.2.5. Quantifiers

By default, all of the following quantifiers are greedy:

?: match never or once
*: match zero or more times
+: match one or more times
{n}: match n times
{n,}: match n or more times
{n,m}: match at least n times, at most m times.
To make them reluctant, put question marks (?) after them:

> /".*"/.exec('"abc"def"')[0]  // greedy
'"abc"def"'
> /".*?"/.exec('"abc"def"')[0] // reluctant
'"abc"'
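
The counted quantifiers follow the same rules. A brief sketch with made-up inputs:

```javascript
console.log(/^a{2}$/.test('aa'));       // true
console.log(/^a{2}$/.test('aaa'));      // false
console.log(/^a{2,}$/.test('aaaa'));    // true
console.log(/^a{1,2}$/.test('aaa'));    // false
console.log(/^colou?r$/.test('color')); // true

// Greedy vs. reluctant with a counted quantifier.
console.log(/a{1,3}/.exec('aaaa')[0]);  // 'aaa'
console.log(/a{1,3}?/.exec('aaaa')[0]); // 'a'
```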

39.2.6. Assertions

^ matches only at the beginning of the input
$ matches only at the end of the input
\b matches only at a word boundary
- \B matches only when not at a word boundary
Lookahead:
- (?=«pattern») matches if pattern matches what comes next (positive lookahead). Example (“sequences of lowercase letters that are followed by an X” – note that the X itself is not part of the matched substring):

> 'abcX def'.match(/[a-z]+(?=X)/g)
[ 'abc' ]

(?!«pattern») matches if pattern does not match what comes next (negative lookahead). Example (“sequences of lowercase letters that are not followed by an X”):

> 'abcX def'.match(/[a-z]+(?!X)/g)
[ 'ab', 'def' ]

Further reading: “RegExp lookbehind assertions” in “Exploring ES2018 and ES2019” (covers lookahead assertions, too)

Lookbehind (ES2018):
- (?<=«pattern») matches if pattern matches what came before (positive lookbehind)

> 'Xabc def'.match(/(?<=X)[a-z]+/g)
[ 'abc' ]

(?<!«pattern») matches if pattern does not match what came before (negative lookbehind)

> 'Xabc def'.match(/(?<!X)[a-z]+/g)
[ 'bc', 'def' ]

Further reading: “RegExp lookbehind assertions” in “Exploring ES2018 and ES2019”
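
The assertions ^, $, \b and \B have no examples above, so here is a minimal sketch (the sample strings are arbitrary):

```javascript
// ^ and $ anchor the pattern to the beginning and end of the input.
console.log(/^abc$/.test('abc'));  // true
console.log(/^abc$/.test('xabc')); // false

// \b: 'cat' only as a whole word ('concat' is skipped).
console.log('cat concat cat'.match(/\bcat\b/g)); // [ 'cat', 'cat' ]

// \B: 'cat' only if it does not start at a word boundary.
console.log(/\Bcat/.exec('cat concat').index); // 7
```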

39.2.7. Disjunction (|)

Caveat: this operator has low precedence. Use groups if necessary:

^aa|zz$ matches all strings that start with aa and/or end with zz. Note that | has a lower precedence than ^ and $.
^(aa|zz)$ matches the two strings 'aa' and 'zz'.
^a(a|z)z$ matches the two strings 'aaz' and 'azz'.
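
The difference between these three patterns can be checked with a quick sketch (the variable names are made up):

```javascript
// Without a group: "starts with aa" OR "ends with zz".
const loose = /^aa|zz$/;
console.log(loose.test('aaXYZ')); // true
console.log(loose.test('XYZzz')); // true
console.log(loose.test('XaaX'));  // false

// With a group: the whole string must be 'aa' or 'zz'.
const exact = /^(aa|zz)$/;
console.log(exact.test('aa'));   // true
console.log(exact.test('aazz')); // false

// The alternatives only cover the middle character.
const middle = /^a(a|z)z$/;
console.log(middle.test('aaz')); // true
console.log(middle.test('azz')); // true
console.log(middle.test('aa'));  // false
```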

39.3. Flags

Table 20: These are the regular expression flags supported by JavaScript.
Literal flag	Property name	ES	Description
`g`	`global`	ES3	Match multiple times
`i`	`ignoreCase`	ES3	Match case-insensitively
`m`	`multiline`	ES3	`^` and `$` match per line
`s`	`dotAll`	ES2018	Dot matches line terminators
`u`	`unicode`	ES6	Unicode mode (recommended)
`y`	`sticky`	ES6	No characters between matches

The following regular expression flags are available in JavaScript (tbl. 20 provides a compact overview):

/g (.global): fundamentally changes how the methods RegExp.prototype.test(), RegExp.prototype.exec() and String.prototype.match() work. It is explained in detail along with these methods. In a nutshell: Without /g, the methods only consider the first match for a regular expression in an input string. With /g, they consider all matches.
/i (.ignoreCase): switches on case-insensitive matching:

> /a/.test('A')
false
> /a/i.test('A')
true

/m (.multiline): If this flag is on, ^ matches the beginning of each line and $ matches the end of each line. If it is off, ^ matches the beginning of the whole input string and $ matches the end of the whole input string.

> 'a1\na2\na3'.match(/^a./gm)
[ 'a1', 'a2', 'a3' ]
> 'a1\na2\na3'.match(/^a./g)
[ 'a1' ]

/u (.unicode): This flag switches on the Unicode mode for a regular expression. That mode is explained in the next subsection.
/y (.sticky): This flag mainly makes sense in conjunction with /g. When both are switched on, every match must directly follow the previous match (without any characters between them), and the first match must start at index 0.

> 'a1a2 a3'.match(/a./gy)
[ 'a1', 'a2' ]
> 'a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]

/s (.dotAll): By default, the dot does not match line terminators. With this flag, it does:

> /./.test('\n')
false
> /./s.test('\n')
true

Alternative for older ECMAScript versions:

> /[^]/.test('\n')
true

39.3.1. Flag: Unicode mode via /u

The flag /u switches on a special Unicode mode for a regular expression. That mode enables several features:

In patterns, you can use Unicode code point escapes such as \u{1F42A} to specify characters. Code unit escapes such as \u03B1 only have a range of four hexadecimal digits (which equals the basic multilingual plane).
In patterns, you can use Unicode property escapes (ES2018) such as \p{White_Space}.
Many escapes are now forbidden (which enables the previous feature):

> /\a/
/\a/
> /\a/u
SyntaxError: Invalid regular expression: /\a/: Invalid escape
> /\-/
/\-/
> /\-/u
SyntaxError: Invalid regular expression: /\-/: Invalid escape
> /\:/
/\:/
> /\:/u
SyntaxError: Invalid regular expression: /\:/: Invalid escape

The atomic units for matching (“characters”) are code points, not code units.

The following subsections explain the last item in more detail. They use the following Unicode character to explain when the atomic units are code points and when they are code units:

const codePoint = '🙂';
const codeUnits = '\uD83D\uDE42'; // UTF-16
assert.equal(codePoint, codeUnits); // same string!

I’m only switching between 🙂 and \uD83D\uDE42 to illustrate how JavaScript sees things. Both are equivalent and can be used interchangeably in strings and regular expressions.

39.3.1.1. Consequence: you can put code points in character classes

With /u, the two code units of 🙂 are interpreted as a single character:

> /^[🙂]$/u.test('🙂')
true

Without /u, 🙂 is interpreted as two characters:

> /^[\uD83D\uDE42]$/.test('\uD83D\uDE42')
false
> /^[\uD83D\uDE42]$/.test('\uDE42')
true

Note that ^ and $ demand that the input string have a single character. That’s why the first result is false.

39.3.1.2. Consequence: the dot operator (.) matches code points, not code units

With /u, the dot operator matches code points (.match() plus /g returns an Array with all the matches of a regular expression):

> '🙂'.match(/./gu).length
1

Without /u, the dot operator matches single code units:

> '\uD83D\uDE80'.match(/./g).length
2

39.3.1.3. Consequence: quantifiers apply to code points, not code units

With /u, a quantifier applies to the whole preceding code point:

> /^🙂{3}$/u.test('🙂🙂🙂')
true

Without /u, a quantifier only applies to the preceding code unit:

> /^\uD83D\uDE80{3}$/.test('\uD83D\uDE80\uDE80\uDE80')
true

39.4. Properties of regular expression objects

Noteworthy:

Strictly speaking, only .lastIndex is a real instance property. All other properties are implemented via getters.
Accordingly, .lastIndex is the only mutable property. All other properties are read-only. If you want to change them, you need to copy the regular expression (consult the section on cloning for details).
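
Both statements can be observed via reflection. The following is only a sketch; it inspects the flag .global as a representative of the getter-based properties.

```javascript
const regExp = /a/g;

// .lastIndex is the only own (instance) property.
console.log(Object.getOwnPropertyNames(regExp)); // [ 'lastIndex' ]

// .global is implemented as a getter (without a setter) on the prototype.
const desc = Object.getOwnPropertyDescriptor(RegExp.prototype, 'global');
console.log(typeof desc.get); // 'function'
console.log(desc.set);        // undefined

// .lastIndex can be changed directly.
regExp.lastIndex = 3;
console.log(regExp.lastIndex); // 3
```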

39.4.1. Flags as properties

Each regular expression flag exists as a property, with a longer, more descriptive name:

> /a/i.ignoreCase
true
> /a/.ignoreCase
false

This is the complete list of flag properties:

.dotAll (/s)
.global (/g)
.ignoreCase (/i)
.multiline (/m)
.sticky (/y)
.unicode (/u)

39.4.2. Other properties

Each regular expression also has the following properties:

.source: The regular expression pattern.

> /abc/ig.source
'abc'

.flags: The flags of the regular expression.

> /abc/ig.flags
'gi'

.lastIndex: Used when flag /g is switched on. Consult the section on that flag for details.

39.5. Methods for working with regular expressions

39.5.1. regExp.test(str): is there a match?

The regular expression method .test() returns true if regExp matches str:

> /abc/.test('ABC')
false
> /abc/i.test('ABC')
true
> /\.js$/.test('main.js')
true

With .test() you should normally avoid the /g flag. If you use it, you generally don’t get the same result every time you call the method:

> const r = /a/g;
> r.test('aab')
true
> r.test('aab')
true
> r.test('aab')
false

The results are due to /a/ having two matches in the string. After all of those were found, .test() returns false.

39.5.2. str.search(regExp): at what index is the match?

The string method .search() returns the first index of str at which there is a match for regExp:

> '_abc_'.search(/abc/)
1
> 'main.js'.search(/\.js$/)
4
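
If there is no match, .search() returns -1, similar to .indexOf(). A small sketch (the helper name hasExtension is hypothetical):

```javascript
console.log('main.js'.search(/\.css$/)); // -1

// Hypothetical helper that relies on the -1 result.
function hasExtension(fileName, extensionRegExp) {
  return fileName.search(extensionRegExp) !== -1;
}
console.log(hasExtension('main.js', /\.js$/));  // true
console.log(hasExtension('main.js', /\.css$/)); // false
```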

39.5.3. regExp.exec(str): capturing groups

39.5.3.1. Getting a match object for the first match

Without the flag /g, .exec() returns all captures of the first match for regExp in str:

assert.deepEqual(
  /(a+)b/.exec('ab aab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aab',
    groups: undefined,
  }
);

The result is a match object with the following properties:

[0]: the complete substring matched by the regular expression
[1]: capture of positional group 1 (etc.)
.index: where did the match occur?
.input: the string that was matched against
.groups: captures of named groups
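
In practice, you usually check for null first (that is what .exec() returns if there is no match) and then access the properties of the match object. A minimal sketch with made-up variable names:

```javascript
const match = /(a+)b/.exec('ab aab');
if (match !== null) {
  // Match objects are Arrays, so Array-destructuring works.
  const [all, g1] = match;
  console.log(all);         // 'ab'
  console.log(g1);          // 'a'
  console.log(match.index); // 0
  console.log(match.input); // 'ab aab'
}

// No match: the result is null.
const noMatch = /(a+)b/.exec('xyz');
console.log(noMatch); // null
```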

39.5.3.2. Named groups (ES2018)

The previous example contained a single positional group. The following example demonstrates named groups:

const regExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/u;
assert.deepEqual(
  regExp.exec('first: Jane'),
  {
    0: 'first: Jane',
    1: 'first',
    2: 'Jane',
    index: 0,
    input: 'first: Jane',
    groups: { key: 'first', value: 'Jane' },
  }
);

As you can see, the named groups key and value also exist as positional groups.
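
Named captures are convenient to access via destructuring. A sketch that reuses the regular expression from above (the variable name keyValueRegExp is made up):

```javascript
const keyValueRegExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/u;

// Destructure the property .groups of the match object.
const {groups: {key, value}} = keyValueRegExp.exec('first: Jane');
console.log(key);   // 'first'
console.log(value); // 'Jane'
```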

39.5.3.3. Looping over multiple matches

If you want to retrieve all matches of a regular expression (not just the first one), you need to switch on the flag /g. Then you can call .exec() multiple times and get another match each time. After the last match, .exec() returns null.

> const regExp = /(a+)b/g;
> regExp.exec('ab aab')
{ 0: 'ab', 1: 'a', index: 0, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
{ 0: 'aab', 1: 'aa', index: 3, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
null

Therefore, you can loop over all matches as follows:

const regExp = /(a+)b/g;
const str = 'ab aab';
let match;
// Check for null via truthiness
// Alternative: while ((match = regExp.exec(str)) !== null)
while (match = regExp.exec(str)) {
  console.log(match[1]);
}
// Output:
// 'a'
// 'aa'

Sharing regular expressions with /g has a few pitfalls, which are explained later.

39.5.4. str.match(regExp): return all matching substrings

Without /g, .match() works like .exec() – it returns a single match object.

With /g, .match() returns all substrings of str that match regExp:

> 'ab aab'.match(/(a+)b/g)  // important: /g
[ 'ab', 'aab' ]

If there is no match, .match() returns null:

> 'xyz'.match(/(a+)b/g)
null

You can use the Or operator to protect yourself against null:

const numberOfMatches = (str.match(regExp) || []).length;

39.5.5. str.replace(searchValue, replacementValue)

.replace() has several different modes, depending on what values you provide for its parameters:

searchValue is …
- a regular expression without /g: replace first occurrence.
- a regular expression with /g: replace all occurrences.
- a string: replace first occurrence (the string is interpreted verbatim, not as a regular expression). Alas, that means that strings are of limited use as search values. Later in this chapter, you’ll find a tool function for turning an arbitrary text into a regular expression.
replacementValue is …
- a string: describe replacement
- a function: compute replacement

The next subsections assume that a regular expression with /g is being used.
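
The three kinds of search values can be compared side by side. A short sketch with an arbitrary input string:

```javascript
const str = 'a.b.c';

// Regular expression without /g: only the first occurrence is replaced.
console.log(str.replace(/\./, '-'));  // 'a-b.c'

// Regular expression with /g: all occurrences are replaced.
console.log(str.replace(/\./g, '-')); // 'a-b-c'

// String: interpreted verbatim, only the first occurrence is replaced.
console.log(str.replace('.', '-'));   // 'a-b.c'
```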

39.5.5.1. replacementValue is a string

If the replacement value is a string, the dollar sign has special meaning – it inserts things matched by the regular expression:

Text	Result
`$$`	single `$`
`$&`	complete match
$`	text before match
`$'`	text after match
`$n`	capture of positional group `n` (`n` > 0)
`$<name>`	capture of named group `name`

Example: Inserting the text before, inside, and after the matched substring.

> 'a1 a2'.replace(/a/g, "($`|$&|$')")
'(|a|1 a2)1 (a1 |a|2)2'

Example: Inserting the captures of positional groups.

> const regExp = /^([A-Za-z]+): (.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $1, VALUE: $2')
'KEY: first, VALUE: Jane'

Example: Inserting the captures of named groups.

> const regExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $<key>, VALUE: $<value>')
'KEY: first, VALUE: Jane'

39.5.5.2. replacementValue is a function

If the replacement value is a function, you can compute each replacement. In the following example, we multiply each non-negative integer that we find by two.

assert.equal(
  '3 cats and 4 dogs'.replace(/[0-9]+/g, (all) => 2 * Number(all)),
  '6 cats and 8 dogs'
);

The replacement function gets the following parameters. Note how similar they are to match objects. The parameters are all positional, but I’ve included how one usually names them:

all: complete match
g1: capture of positional group 1
Etc.
index: where did the match occur?
input: the string that was matched against
groups: captures of named groups (an object)
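
The following sketch shows these parameters in use. The patterns and output formats are made up for illustration. In the second call, groups is picked from the end of the argument list because its position depends on the number of positional groups.

```javascript
// Positional parameters: complete match, captures, index.
const swapped = 'a1 b2'.replace(
  /([a-z])([0-9])/g,
  (all, g1, g2, index) => `${g2}${g1}@${index}`
);
console.log(swapped); // '1a@0 2b@3'

// Named groups: the last argument is the groups object.
const reversed = 'first: Jane'.replace(
  /^(?<key>[A-Za-z]+): (?<value>.*)$/u,
  (...args) => {
    const groups = args[args.length - 1];
    return `${groups.value} <- ${groups.key}`;
  }
);
console.log(reversed); // 'Jane <- first'
```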

39.5.6. Other methods for working with regular expressions

The first parameter of String.prototype.split() is either a string or a regular expression. If it is the latter then substrings captured by groups are added to the result of the method:

> 'a : b : c'.split(/( *):( *)/)
[ 'a', ' ', ' ', 'b', ' ', ' ', 'c' ]

Consult the chapter on strings for more information.

39.6. Flag /g and its pitfalls

The following two regular expression methods do something unusual if /g is switched on:

RegExp.prototype.exec()
RegExp.prototype.test()

Then they can be called repeatedly and deliver all matches inside a string. Property .lastIndex of the regular expression is used to track the current position inside the string. For example:

const r = /a/g;
assert.equal(r.lastIndex, 0);
assert.equal(r.test('aa'), true); // 1st match?
assert.equal(r.lastIndex, 1); // after 1st match
assert.equal(r.test('aa'), true); // 2nd match?
assert.equal(r.lastIndex, 2); // after 2nd match
assert.equal(r.test('aa'), false); // 3rd match?
assert.equal(r.lastIndex, 0); // start over

So how is flag /g problematic? We’ll first explore the problems and then solutions.

39.6.1. Problem: You can’t inline a regular expression with flag /g

A regular expression with /g can’t be inlined: For example, in the following while loop, the regular expression is created fresh, every time the condition is checked. Therefore, its .lastIndex is always zero and the loop never terminates.

let count = 0;
// Infinite loop
while (/a/g.test('babaa')) {
  count++;
}

39.6.2. Problem: Removing /g can break code

If code expects a regular expression with /g and has a loop over the results of .exec() or .test() then a regular expression without /g can cause an infinite loop:

const regExp = /a/; // Missing: flag /g
let count = 0;
// Infinite loop
while (regExp.test('babaa')) {
  count++;
}

Why? Because .test() always returns the first result, true, and never false.

39.6.3. Problem: Adding /g can break code

With .test(), there is another caveat: If you want to check exactly once if a regular expression matches a string then the regular expression must not have /g. Otherwise, you generally get a different result, every time you call .test():

> const r = /^X/g;
> r.test('Xa')
true
> r.test('Xa')
false

Normally, you won’t add /g if you intend to use .test() in this manner. But it can happen if, e.g., you use the same regular expression for testing and for replacing. Or if you get the regular expression via a parameter.

39.6.4. Problem: Code can break if .lastIndex isn’t zero

When a regular expression is created, .lastIndex is initialized to zero. If code ever receives a regular expression whose .lastIndex is not zero, it can break. For example:

const regExp = /a/g;
regExp.lastIndex = 4;
let count = 0;
while (regExp.test('babaa')) {
  count++;
}
assert.equal(count, 1); // should be 3

.lastIndex not being zero can happen relatively easily if a regular expression is shared and not handled properly.

39.6.5. Dealing with /g and .lastIndex

Consider the following scenario: You want to implement a function countOccurrences(regExp, str) that counts how often regExp has a match inside str. How do you prevent a wrong regExp from breaking your code? Let’s look at three approaches.

First, you can throw exceptions if /g isn’t set or .lastIndex isn’t zero:

function countOccurrences(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  if (regExp.lastIndex !== 0) {
    throw new Error('regExp.lastIndex must be zero');
  }
  
  let count = 0;
  while (regExp.test(str)) {
    count++;
  }
  return count;
}

Second, you can clone the parameter. That has the added benefit that regExp won’t be changed.

function countOccurrences(regExp, str) {
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);
  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}

Third, you can use .match() to count occurrences – which doesn’t change or depend on .lastIndex.

function countOccurrences(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  return (str.match(regExp) || []).length;
}
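
To see how the cloning approach copes with the problematic inputs from the previous sections, here is a sketch that repeats the second implementation and feeds it a regular expression with a non-zero .lastIndex and one without /g (the variable name shared is made up):

```javascript
function countOccurrences(regExp, str) {
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  // The clone starts with .lastIndex zero.
  const clone = new RegExp(regExp, cloneFlags);
  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}

// Non-zero .lastIndex: ignored, and the parameter is not changed.
const shared = /a/g;
shared.lastIndex = 4;
console.log(countOccurrences(shared, 'babaa')); // 3
console.log(shared.lastIndex); // 4

// Missing flag /g: added to the clone, so the loop terminates.
console.log(countOccurrences(/a/, 'babaa')); // 3
```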

39.7. Techniques for working with regular expressions

39.7.1. Escaping arbitrary text for regular expressions

The following function escapes an arbitrary text so that it is matched verbatim if you put it inside a regular expression:

function escapeForRegExp(str) {
  return str.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&'); // (A)
}
assert.equal(escapeForRegExp('[yes?]'), String.raw`\[yes\?\]`);
assert.equal(escapeForRegExp('_g_'), String.raw`_g_`);

In line A, we escape all syntax characters. Note that /u forbids many escapes (among others, \: and \-), which is why only the syntax characters are escaped.

This is how you can use escapeForRegExp() to replace an arbitrary text multiple times:

> const re = new RegExp(escapeForRegExp(':-)'), 'ug');
> ':-) :-) :-)'.replace(re, '🙂')
'🙂 🙂 🙂'
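
Building on that, a verbatim replace-all function could be sketched as follows. The name replaceAllVerbatim is hypothetical. It passes the replacement via a function so that dollar signs in it have no special meaning.

```javascript
function escapeForRegExp(str) {
  return str.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

// Hypothetical helper: replaces every occurrence of searchText verbatim.
function replaceAllVerbatim(str, searchText, replacement) {
  const regExp = new RegExp(escapeForRegExp(searchText), 'ug');
  // A function as replacement value prevents $ from being interpreted.
  return str.replace(regExp, () => replacement);
}

console.log(replaceAllVerbatim('1+1=2, 1+1=2', '1+1', '$&'));
// '$&=2, $&=2'

// Without escaping, + would be interpreted as a quantifier.
console.log(new RegExp('1+1').test('1+1'));                  // false
console.log(new RegExp(escapeForRegExp('1+1')).test('1+1')); // true
```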

39.7.2. Matching everything or nothing

Sometimes, you may need a regular expression that matches everything or nothing. For example, as a sentinel value.

Match everything: /(?:)/ (the empty group matches everything; making it noncapturing avoids unnecessary work)

> /(?:)/.test('')
true
> /(?:)/.test('abc')
true

Match nothing: /.^/ (once matching has progressed beyond the first character, ^ doesn’t match anymore)

> /.^/.test('')
false
> /.^/.test('abc')
false
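
One possible use as sentinel values is as default parameters, so that callers can omit a filter. The following is only a sketch; filterLines and the constant names are made up for this example.

```javascript
const MATCH_EVERYTHING = /(?:)/;
const MATCH_NOTHING = /.^/;

// Hypothetical function: keeps lines that match include
// and do not match exclude.
function filterLines(lines, include = MATCH_EVERYTHING, exclude = MATCH_NOTHING) {
  return lines.filter((line) => include.test(line) && !exclude.test(line));
}

const lines = ['a.js', 'b.css', 'c.js'];
console.log(filterLines(lines));                 // [ 'a.js', 'b.css', 'c.js' ]
console.log(filterLines(lines, /\.js$/));        // [ 'a.js', 'c.js' ]
console.log(filterLines(lines, undefined, /^a/)); // [ 'b.css', 'c.js' ]
```

Neither sentinel has the flag /g, so calling .test() repeatedly is safe.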