III. JavaScript in Depth - 19. Regular Expressions - 《Speaking JavaScript: An In-Depth Guide for Programmers》

Chapter 19. Regular Expressions

This chapter gives an overview of the JavaScript API for regular expressions. It assumes that you are roughly familiar with how they work. If you are not, there are many good tutorials on the Web. Two examples are:

Regular-Expressions.info by Jan Goyvaerts
JavaScript Regular Expression Enlightenment by Cody Lindley

Regular Expression Syntax

The terms used here closely reflect the grammar in the ECMAScript specification. I sometimes deviate to make things easier to understand.

Atoms: General

The syntax for general atoms is as follows:

Special characters
All of the following characters have special meaning:

\ ^ $ . * + ? ( ) [ ] { } |

You can escape them by prefixing a backslash. For example:

> /^(ab)$/.test('(ab)')
false
> /^\(ab\)$/.test('(ab)')
true

Additional special characters are:

Inside a character class […]:

-

Inside a group that starts with a question mark (?…):

: = ! < >

The angle brackets are used only by the XRegExp library (see Chapter 30), to name groups.

Pattern characters
All characters except the aforementioned special ones match themselves.
. (dot)
Matches any JavaScript character (UTF-16 code unit) except line terminators (newline, carriage return, etc.). To really match any character, use [\s\S]. For example:

> /./.test('\n')
false
> /[\s\S]/.test('\n')
true

Character escapes (match single characters)
- Specific control characters include \f (form feed), \n (line feed, newline), \r (carriage return), \t (horizontal tab), and \v (vertical tab).
- \0 matches the NUL character (\u0000).
- Any control character: \cA – \cZ.
- Unicode character escapes: \u0000 – \uFFFF (Unicode code units; see Chapter 24).
- Hexadecimal character escapes: \x00 – \xFF.
Character class escapes (match one of a set of characters)
- Digits: \d matches any digit (same as [0-9]); \D matches any nondigit (same as [^0-9]).
- Alphanumeric characters: \w matches any Latin alphanumeric character plus underscore (same as [A-Za-z0-9_]); \W matches all characters not matched by \w.
- Whitespace: \s matches whitespace characters (space, tab, line feed, carriage return, form feed, all Unicode spaces, etc.); \S matches all nonwhitespace characters.
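
The following sketch probes the character class escapes just listed. The sample strings are arbitrary and chosen only for illustration. It shows that \d and \w are limited to ASCII characters, while \s also covers Unicode spaces:

```javascript
// \d and \w only cover ASCII characters
var onlyDigits = /^\d+$/.test('2014');         // true
var asciiWord = /^\w+$/.test('foo_bar42');     // true
var accentedWord = /^\w+$/.test('caf\u00E9');  // false: é is not in [A-Za-z0-9_]

// \s also covers Unicode spaces such as the no-break space (U+00A0)
var noBreakSpace = /^\s$/.test('\u00A0');      // true
```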

Atoms: Character Classes

The syntax for character classes is as follows:

[«charSpecs»] matches any single character that matches at least one of the charSpecs.
[^«charSpecs»] matches any single character that does not match any of the charSpecs.

The following constructs are all character specifications:

Source characters match themselves. Most characters are source characters (even many characters that are special elsewhere). Only three characters are not:

    \ ] -

As usual, you escape via a backslash. If you want to match a dash without escaping it, it must be the first character after the opening bracket or the right side of a range, as described shortly.

Class escapes: Any of the character escapes and character class escapes listed previously are allowed. There is one additional escape:

Backspace (\b): Outside a character class, \b matches word boundaries. Inside a character class, it matches the control character backspace.

Ranges comprise a source character or a class escape, followed by a dash (-), followed by a source character or a class escape.

To demonstrate using character classes, this example parses a date formatted in the ISO 8601 standard:

function parseIsoDate(str) {
    var match = /^([0-9]{4})-([0-9]{2})-([0-9]{2})$/.exec(str);
 
    // Other ways of writing the regular expression:
    // /^([0-9][0-9][0-9][0-9])-([0-9][0-9])-([0-9][0-9])$/
    // /^(\d\d\d\d)-(\d\d)-(\d\d)$/
 
    if (!match) {
        throw new Error('Not an ISO date: '+str);
    }
    console.log('Year: '  + match[1]);
    console.log('Month: ' + match[2]);
    console.log('Day: '   + match[3]);
}

And here is the interaction:

> parseIsoDate('2001-12-24')
Year: 2001
Month: 12
Day: 24

Atoms: Groups

The syntax for groups is as follows:

(«pattern») is a capturing group. Whatever is matched by pattern can be accessed via backreferences or as the result of a match operation.
(?:«pattern») is a noncapturing group. pattern is still matched against the input, but not saved as a capture. Therefore, the group does not have a number you can refer to (e.g., via a backreference).

\1, \2, and so on are known as backreferences; they refer back to a previously matched group. The number after the backslash can be any integer greater than or equal to 1, but the first digit must not be 0.

In this example, a backreference guarantees the same number of a’s before and after the dash:

> /^(a+)-\1$/.test('a-a')
true
> /^(a+)-\1$/.test('aaa-aaa')
true
> /^(a+)-\1$/.test('aa-a')
false

This example uses a backreference to match an HTML tag (obviously, you should normally use a proper parser to process HTML):

> var tagName = /<([^>]+)>[^<]*<\/\1>/;
> tagName.exec('<b>bold</b>')[1]
'b'
> tagName.exec('<strong>text</strong>')[1]
'strong'
> tagName.exec('<strong>text</stron>')
null

Quantifiers

Any atom (including character classes and groups) can be followed by a quantifier:

? means match never or once.
* means match zero or more times.
+ means match one or more times.
{n} means match exactly n times.
{n,} means match n or more times.
{n,m} means match at least n, at most m, times.

By default, quantifiers are greedy; that is, they match as much as possible. You can get reluctant matching (as little as possible) by suffixing any of the preceding quantifiers (including the ranges in curly braces) with a question mark (?). For example:

> '<a> <strong>'.match(/^<(.*)>/)[1]  // greedy
'a> <strong'
> '<a> <strong>'.match(/^<(.*?)>/)[1]  // reluctant
'a'

Thus, .*? is a useful pattern for matching everything until the next occurrence of the following atom. For example, the following is a more compact version of the regular expression for HTML tags just shown (which used [^<]* instead of .*?):

/<(.+?)>.*?<\/\1>/

Assertions

Assertions, shown in the following list, are checks about the current position in the input:

`^`	Matches only at the beginning of the input.
`$`	Matches only at the end of the input.
`\b`	Matches only at a word boundary. Don’t confuse with `[\b]`, which matches a backspace.
`\B`	Matches only if not at a word boundary.
`(?=«pattern»)`	Positive lookahead: Matches only if `pattern` matches what comes next. `pattern` is used only to look ahead, but otherwise ignored.
`(?!«pattern»)`	Negative lookahead: Matches only if `pattern` does not match what comes next. `pattern` is used only to look ahead, but otherwise ignored.

This example matches a word boundary via \b:

> /\bell\b/.test('hello')
false
> /\bell\b/.test('ello')
false
> /\bell\b/.test('ell')
true

This example matches the inside of a word via \B:

> /\Bell\B/.test('ell')
false
> /\Bell\B/.test('hell')
false
> /\Bell\B/.test('hello')
true
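
The two lookahead assertions from the list can be sketched in the same way. The patterns and input strings here are made up for illustration. Note that the text examined by the lookahead does not become part of the match:

```javascript
// Positive lookahead: 'q' must be followed by 'u',
// but the 'u' is not included in the match
var positive = /q(?=u)/.exec('quit');
var matchedText = positive[0];     // 'q'
var matchedIndex = positive.index; // 0

// Negative lookahead: 'q' must NOT be followed by 'u'
var negativeHit = /q(?!u)/.test('Iraq');   // true: nothing follows the 'q'
var negativeMiss = /q(?!u)/.test('quit');  // false: the only 'q' is followed by 'u'
```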

Note

Lookbehind is not supported. The section “Manually Implementing Lookbehind” explains how to implement it manually.

Disjunction

A disjunction operator (|) separates two alternatives; either of the alternatives must match for the disjunction to match. The alternatives are atoms (optionally including quantifiers).

The operator binds very weakly, so you have to be careful that the alternatives don’t extend too far. For example, the following regular expression matches all strings that either start with aa or end with bb:

> /^aa|bb$/.test('aaxx')
true
> /^aa|bb$/.test('xxbb')
true

In other words, the disjunction binds more weakly than even ^ and $ and the two alternatives are ^aa and bb$. If you want to match the two strings 'aa' and 'bb', you need parentheses:

/^(aa|bb)$/

Similarly, if you want to match the strings 'aab' and 'abb':

/^a(a|b)b$/
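
To make the difference concrete, here is a small sketch that runs the unparenthesized and the parenthesized variants against the same inputs. The variable names are only illustrative:

```javascript
var loose = /^aa|bb$/;     // alternatives are  ^aa  and  bb$
var strict = /^(aa|bb)$/;  // alternatives are  aa  and  bb, both anchored

var looseResult = loose.test('aaxx');    // true: the input starts with aa
var strictResult = strict.test('aaxx');  // false: the input is neither 'aa' nor 'bb'
var strictAa = strict.test('aa');        // true
var strictBb = strict.test('bb');        // true
```

If you don’t need the capture, a noncapturing group such as (?:aa|bb) delimits the alternatives just as well.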

Unicode and Regular Expressions

JavaScript’s regular expressions have only very limited support for Unicode. Especially when it comes to code points in the astral planes, you have to be careful. Chapter 24 explains the details.

Creating a Regular Expression

You can create a regular expression via either a literal or a constructor and configure how it works via flags.

Literal Versus Constructor

There are two ways to create a regular expression: you can use a literal or the constructor RegExp:

Literal	`/xyz/i`	Compiled at load time
Constructor (second argument is optional)	`new RegExp('xyz', 'i')`	Compiled at runtime

A literal and a constructor differ in when they are compiled:

The literal is compiled at load time. The following code will cause an exception when it is evaluated:

function foo() {
    /[/;
}

The constructor compiles the regular expression when it is called. The following code will not cause an exception, but calling foo() will:

function foo() {
    new RegExp('[');
}

Thus, you should normally use literals, but you need the constructor if you want to dynamically assemble a regular expression.
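
As a minimal sketch of dynamic assembly, the following hypothetical helper wholeWordRegex() (not part of any API) builds a regular expression from a string. It assumes that the string contains no special characters; otherwise they would have to be escaped first:

```javascript
// Hypothetical helper: builds a case-insensitive regular expression
// that matches `word` as a whole word.
// Assumption: `word` contains no special characters.
function wholeWordRegex(word) {
    // Backslashes must be doubled inside a string literal
    return new RegExp('\\b' + word + '\\b', 'i');
}

var catRegex = wholeWordRegex('cat');
var standalone = catRegex.test('A Cat sat');   // true
var embedded = catRegex.test('concatenate');   // false: 'cat' is inside a word
```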

Flags

Flags are a suffix of regular expression literals and a parameter of regular expression constructors; they modify the matching behavior of regular expressions. The following flags exist:

Short name	Long name	Description
`g`	`global`	The given regular expression is matched multiple times. Influences several methods, especially `replace()`.
`i`	`ignoreCase`	Case is ignored when trying to match the given regular expression.
`m`	`multiline`	In multiline mode, the begin operator `^` and the end operator `$` match each line, instead of the complete input string.

The short name is used for literal suffixes and constructor parameters (see examples in the next section). The long name is used for properties of a regular expression that indicate what flags were set during its creation.
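
The effect of the flag /m is perhaps easiest to see in a short sketch. The input strings are arbitrary examples:

```javascript
var text = 'first\nsecond';

// Without /m, ^ and $ only match at the beginning and end of the whole input
var wholeInput = /^second$/.test(text);   // false

// With /m, ^ and $ also match at the beginning and end of each line
var eachLine = /^second$/m.test(text);    // true

// Flags can be combined: /g returns all matches, /m anchors them per line
var letters = 'a\nb\nc'.match(/^\w$/gm);  // [ 'a', 'b', 'c' ]
```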

Instance Properties of Regular Expressions

Regular expressions have the following instance properties:

Flags: boolean values indicating what flags are set:

global: Is flag /g set?
ignoreCase: Is flag /i set?
multiline: Is flag /m set?

Data for matching multiple times (flag /g is set):

lastIndex is the index where to continue the search next time.

The following is an example of accessing the instance properties for flags:

> var regex = /abc/i;
> regex.ignoreCase
true
> regex.multiline
false

Examples of Creating Regular Expressions

In this example, we create the same regular expression first with a literal, then with a constructor, and use the test() method to determine whether it matches a string:

> /abc/.test('ABC')
false
> new RegExp('abc').test('ABC')
false

In this example, we create a regular expression that ignores case (flag /i):

> /abc/i.test('ABC')
true
> new RegExp('abc', 'i').test('ABC')
true

RegExp.prototype.test: Is There a Match?

The test() method checks whether a regular expression, regex, matches a string, str:

regex.test(str)

test() operates differently depending on whether the flag /g is set or not.

If the flag /g is not set, then the method checks whether there is a match somewhere in str. For example:

> var str = '_x_x';
 
> /x/.test(str)
true
> /a/.test(str)
false

If the flag /g is set, then the method returns true as many times as there are matches for regex in str. The property regex.lastIndex contains the index after the last match:

> var regex = /x/g;
> regex.lastIndex
0
 
> regex.test(str)
true
> regex.lastIndex
2
 
> regex.test(str)
true
> regex.lastIndex
4
 
> regex.test(str)
false

String.prototype.search: At What Index Is There a Match?

The search() method looks for a match with regex within str:

str.search(regex)

If there is a match, the index where it was found is returned. Otherwise, the result is -1. The properties global and lastIndex of regex are ignored as the search is performed (and lastIndex is not changed).

For example:

> 'abba'.search(/b/)
1
> 'abba'.search(/x/)
-1

If the argument of search() is not a regular expression, it is converted to one:

> 'aaab'.search('^a+b+$')
0
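
One consequence of this conversion is that special characters in a string argument are interpreted as regular expression syntax. The following sketch, with an arbitrary sample string, contrasts search() with indexOf(), which looks for its argument literally:

```javascript
var str = 'abc.def';

// '.' is converted to /./, which matches the very first character
var viaSearch = str.search('.');       // 0

// indexOf() looks for the dot literally
var viaIndexOf = str.indexOf('.');     // 3

// Escaping the dot (with a doubled backslash inside the string) also works
var viaEscaped = str.search('\\.');    // 3
```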

RegExp.prototype.exec: Capture Groups

The following method call captures groups while matching regex against str:

var matchData = regex.exec(str);

If there was no match, matchData is null. Otherwise, matchData is a match result, an array with two additional properties:

Array elements
- Element 0 is the match for the complete regular expression (group 0, if you will).
- Element n ≥ 1 is the capture of group n.
Properties
- input is the complete input string.
- index is the index where the match was found.

First Match (Flag /g Not Set)

If the flag /g is not set, only the first match is returned:

> var regex = /a(b+)/;
> regex.exec('_abbb_ab_')
[ 'abbb',
  'bbb',
  index: 1,
  input: '_abbb_ab_' ]
> regex.lastIndex
0

All Matches (Flag /g Set)

If the flag /g is set, all matches are returned if you invoke exec() repeatedly. The return value null signals that there are no more matches. The property lastIndex indicates where matching will continue next time:

> var regex = /a(b+)/g;
> var str = '_abbb_ab_';
 
> regex.exec(str)
[ 'abbb',
  'bbb',
  index: 1,
  input: '_abbb_ab_' ]
> regex.lastIndex
5
 
> regex.exec(str)
[ 'ab',
  'b',
  index: 6,
  input: '_abbb_ab_' ]
> regex.lastIndex
8
 
> regex.exec(str)
null

Here we loop over matches:

var regex = /a(b+)/g;
var str = '_abbb_ab_';
var match;
while (match = regex.exec(str)) {
    console.log(match[1]);
}

and we get the following output:

bbb
b

String.prototype.match: Capture Groups or Return All Matching Substrings

The following method call matches regex against str:

var matchData = str.match(regex);

If the flag /g of regex is not set, this method works like RegExp.prototype.exec():

> 'abba'.match(/a/)
[ 'a', index: 0, input: 'abba' ]

If the flag is set, then the method returns an array with all matching substrings in str (i.e., group 0 of every match) or null if there is no match:

> 'abba'.match(/a/g)
[ 'a', 'a' ]
> 'abba'.match(/x/g)
null
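
Note that with the flag /g, match() returns only complete matches; captured groups are not included. If you need the groups of every match, a loop over exec() is one way to collect them. The following sketch uses an arbitrary sample string:

```javascript
var str = '"a" and "b"';

// match() with /g: complete matches only, including the quotes
var substrings = str.match(/"(.*?)"/g);  // [ '"a"', '"b"' ]

// exec() in a loop: group 1 of every match
var regex = /"(.*?)"/g;
var groups = [];
var match;
while ((match = regex.exec(str)) !== null) {
    groups.push(match[1]);
}
// groups is now [ 'a', 'b' ]
```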

String.prototype.replace: Search and Replace

The replace() method searches a string, str, for matches with search and replaces them with replacement:

str.replace(search, replacement)

There are several ways in which the two parameters can be specified:

search
Either a string or a regular expression:

String: To be found literally in the input string. Be warned that only the first occurrence of a string is replaced. If you want to replace multiple occurrences, you must use a regular expression with a /g flag. This is unexpected and a major pitfall.
Regular expression: To be matched against the input string. Warning: Use the global flag, otherwise only one attempt is made to match the regular expression.

replacement
Either a string or a function:

String: Describes how to replace what has been found.
Function: Computes a replacement and is given matching information via parameters.

Replacement Is a String

If replacement is a string, its content is used verbatim to replace the match. The only exception is the special character dollar sign ($), which starts so-called replacement directives:

Groups: $n inserts group n from the match. n must be at least 1 ($0 has no special meaning).
The matching substring:

$` (backtick) inserts the text before the match.
$& inserts the complete match.
$' (apostrophe) inserts the text after the match.

$$ inserts a single $.

This example refers to the matching substring and its prefix and suffix:

> 'axb cxd'.replace(/x/g, "[$`,$&,$']")
'a[a,x,b cxd]b c[axb c,x,d]d'

This example refers to a group:

> '"foo" and "bar"'.replace(/"(.*?)"/g, '#$1#')
'#foo# and #bar#'
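
Two more directives in action, as a brief sketch with made-up inputs: several groups can be reordered in the replacement, and $$ is needed wherever the result should contain a literal dollar sign:

```javascript
// Reorder two groups
var swapped = 'John Smith'.replace(/(\w+) (\w+)/, '$2, $1');  // 'Smith, John'

// '$$' inserts a single '$', followed by the complete match ('$&')
var price = '10'.replace(/\d+/, '$$$&');                      // '$10'
```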

Replacement Is a Function

If replacement is a function, it computes the string that is to replace the match. This function has the following signature:

function (completeMatch, group_1, ..., group_n, offset, inputStr)

completeMatch is the same as $& previously, offset indicates where the match was found, and inputStr is what is being matched against. Because the number of group parameters depends on the regular expression, you can also use the special variable arguments to access groups (group 1 via arguments[1], and so on). For example:

> function replaceFunc(match) { return 2 * match }
> '3 apples and 5 oranges'.replace(/[0-9]+/g, replaceFunc)
'6 apples and 10 oranges'
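
The preceding example uses only the complete match. The following sketch also uses a group and the offset. The function name and the input are hypothetical, chosen only to illustrate the parameters:

```javascript
var offsets = [];

// Parameters: complete match, group 1, offset of the match
function upperCaseAfterDash(completeMatch, letter, offset) {
    offsets.push(offset);        // remember where each match was found
    return letter.toUpperCase(); // the replacement drops the dash
}

var camelCased = 'border-top-width'.replace(/-([a-z])/g, upperCaseAfterDash);
// camelCased is 'borderTopWidth', offsets is [ 6, 10 ]
```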

Problems with the Flag /g

Regular expressions whose /g flag is set are problematic if a method invoked on them must be invoked multiple times to return all results. That’s the case for two methods:

RegExp.prototype.test()
RegExp.prototype.exec()

Then JavaScript abuses the regular expression as an iterator, as a pointer into the sequence of results. That causes problems:

Problem 1: /g regular expressions can’t be inlined
For example:

// Don’t do that:
var count = 0;
while (/a/g.test('babaa')) count++;

The preceding loop is infinite, because a new regular expression is created for each loop iteration, which restarts the iteration over the results. Therefore, the code must be rewritten:

var count = 0;
var regex = /a/g;
while (regex.test('babaa')) count++;

Here is another example:

// Don’t do that:
function extractQuoted(str) {
    var match;
    var result = [];
    while ((match = /"(.*?)"/g.exec(str)) != null) {
        result.push(match[1]);
    }
    return result;
}

Calling the preceding function will again result in an infinite loop. The correct version is (why lastIndex is set to 0 is explained shortly):

var QUOTE_REGEX = /"(.*?)"/g;
function extractQuoted(str) {
    QUOTE_REGEX.lastIndex = 0;
    var match;
    var result = [];
    while ((match = QUOTE_REGEX.exec(str)) != null) {
        result.push(match[1]);
    }
    return result;
}

Using the function:

> extractQuoted('"hello", "world"')
[ 'hello', 'world' ]

Tip

It’s a best practice not to inline anyway (then you can give regular expressions descriptive names). But you have to be aware that you can’t do it, not even in quick hacks.

Problem 2: /g regular expressions as parameters
Code that wants to invoke test() and exec() multiple times must be careful with a regular expression handed to it as a parameter. Its flag /g must be active and, to be safe, its lastIndex should be set to zero (an explanation is offered in the next example).
Problem 3: Shared /g regular expressions (e.g., constants)
Whenever you are referring to a regular expression that has not been freshly created, you should set its lastIndex property to zero, before using it as an iterator (an explanation is offered in the next example). As iteration depends on lastIndex, such a regular expression can’t be used in more than one iteration at the same time.

The following example illustrates problem 2. It is a naive implementation of a function that counts how many matches there are for the regular expression regex in the string str:

// Naive implementation
function countOccurrences(regex, str) {
    var count = 0;
    while (regex.test(str)) count++;
    return count;
}

Here’s an example of using this function:

> countOccurrences(/x/g, '_x_x')
2

The first problem is that this function goes into an infinite loop if the regular expression’s /g flag is not set. For example:

countOccurrences(/x/, '_x_x') // never terminates

The second problem is that the function doesn’t work correctly if regex.lastIndex isn’t 0, because that property indicates where to start the search. For example:

> var regex = /x/g;
> regex.lastIndex = 2;
> countOccurrences(regex, '_x_x')
1

The following implementation fixes the two problems:

function countOccurrences(regex, str) {
    if (! regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    var origLastIndex = regex.lastIndex;  // store
    regex.lastIndex = 0;
 
    var count = 0;
    while (regex.test(str)) count++;
 
    regex.lastIndex = origLastIndex;  // restore
    return count;
}

A simpler alternative is to use match():

function countOccurrences(regex, str) {
    if (! regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    return (str.match(regex) || []).length;
}

There’s one possible pitfall: str.match() returns null if the /g flag is set and there are no matches. We avoid that pitfall in the preceding code by using [] if the result of match() isn’t truthy.
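
Problem 3 can be sketched in a similar manner. The constant SHARED_REGEX and the input strings are made up for illustration. A leftover lastIndex from an earlier use causes a later search to start in the wrong place:

```javascript
var SHARED_REGEX = /x/g;

// First use: the match ends at index 2, so lastIndex becomes 2
var first = SHARED_REGEX.test('_x');         // true
var leftover = SHARED_REGEX.lastIndex;       // 2

// Second use on a different string: the search starts at index 2
// and therefore misses the x at index 0
var stale = SHARED_REGEX.test('x_');         // false

// Resetting lastIndex before each use avoids the problem
SHARED_REGEX.lastIndex = 0;
var fresh = SHARED_REGEX.test('x_');         // true
```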

Tips and Tricks

This section gives a few tips and tricks for working with regular expressions in JavaScript.

Quoting Text

Sometimes, when you assemble a regular expression manually, you want to use a given string verbatim. That means that none of the special characters (e.g., *, [) should be interpreted as such—all of them need to be escaped. JavaScript has no built-in means for this kind of quoting, but you can program your own function, quoteText, that would work as follows:

> console.log(quoteText('*All* (most?) aspects.'))
\*All\* \(most\?\) aspects\.

Such a function is especially handy if you need to do a search and replace with multiple occurrences. Then the value to search for must be a regular expression with the global flag set. With quoteText(), you can use arbitrary strings. The function looks like this:

function quoteText(text) {
    return text.replace(/[\\^$.*+?()[\]{}|=!<>:-]/g, '\\$&');
}

All special characters are escaped, because you may want to quote several characters inside parentheses or square brackets.
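
As a sketch of the search-and-replace use case mentioned earlier, the following hypothetical helper replaceAllLiteral() (the name is an assumption, not an existing API) combines quoteText() with the RegExp constructor:

```javascript
function quoteText(text) {
    return text.replace(/[\\^$.*+?()[\]{}|=!<>:-]/g, '\\$&');
}

// Hypothetical helper: replaces every literal occurrence of `search`
function replaceAllLiteral(str, search, replacement) {
    var regex = new RegExp(quoteText(search), 'g');
    // A function is used as the replacement so that a dollar sign
    // in `replacement` is not interpreted as a replacement directive
    return str.replace(regex, function () {
        return replacement;
    });
}

var result1 = replaceAllLiteral('1+1=2, so 1+1 is not 3', '1+1', 'x');
// 'x=2, so x is not 3'
var result2 = replaceAllLiteral('a.b.c', '.', '$&');
// 'a$&b$&c'
```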

Pitfall: Without an Assertion (e.g., ^, $), a Regular Expression Is Found Anywhere

If you don’t use assertions such as ^ and $, most regular expression methods find a pattern anywhere. For example:

> /aa/.test('xaay')
true
> /^aa$/.test('xaay')
false

Matching Everything or Nothing

It’s a rare use case, but sometimes you need a regular expression that matches everything or nothing. For example, a function may have a parameter with a regular expression that is used for filtering. If that parameter is missing, you give it a default value, a regular expression that matches everything.

Matching everything

The empty regular expression matches everything. We can create an instance of RegExp based on that regular expression like this:

> new RegExp('').test('dfadsfdsa')
true
> new RegExp('').test('')
true

However, the empty regular expression literal would be //, which is interpreted as a comment by JavaScript. Therefore, the following is the closest you can get via a literal: /(?:)/ (empty noncapturing group). The group matches everything, while not capturing anything, which prevents the group from influencing the result returned by exec(). Even JavaScript itself uses the preceding representation when displaying an empty regular expression:

> new RegExp('')
/(?:)/
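
The filtering use case mentioned at the beginning of this section could look roughly as follows. The function filterStrings() and the constant MATCH_EVERYTHING are hypothetical names used only for this sketch:

```javascript
var MATCH_EVERYTHING = /(?:)/;

// Hypothetical function: keeps the strings that match `regex`.
// Assumption: `regex` does not have the flag /g set, so that
// test() does not depend on lastIndex.
function filterStrings(strings, regex) {
    if (regex === undefined) {
        regex = MATCH_EVERYTHING; // default: let everything through
    }
    return strings.filter(function (s) {
        return regex.test(s);
    });
}

var unfiltered = filterStrings(['apple', 'banana', '']);
// [ 'apple', 'banana', '' ]
var filtered = filterStrings(['apple', 'banana', ''], /an/);
// [ 'banana' ]
```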

Matching nothing

The empty regular expression has an inverse—the regular expression that matches nothing:

> var never = /.^/;
> never.test('abc')
false
> never.test('')
false

Manually Implementing Lookbehind

Lookbehind is an assertion. Similar to lookahead, a pattern is used to check something about the current position in the input, but otherwise ignored. In contrast to lookahead, the match for the pattern has to end at the current position (not start at it).

The following function replaces each occurrence of the string 'NAME' with the value of the parameter name, but only if the occurrence is not preceded by a quote. We handle the quote by “manually” checking the character before the current match:

function insertName(str, name) {
    return str.replace(
        /NAME/g,
        function (completeMatch, offset) {
            if (offset === 0 ||
                (offset > 0 && str[offset-1] !== '"')) {
                return name;
            } else {
                return completeMatch;
            }
        }
    );
}

> insertName('NAME "NAME"', 'Jane')
'Jane "NAME"'
> insertName('"NAME" NAME', 'Jane')
'"NAME" Jane'

An alternative is to include the character before the match in the regular expression (here, any character except a quote). Then you have to temporarily add a prefix to the string you are searching in; otherwise, you’d miss matches at the beginning of that string:

function insertName(str, name) {
    var tmpPrefix = ' ';
    str = tmpPrefix + str;
    str = str.replace(
        /([^"])NAME/g,
        function (completeMatch, prefix) {
            return prefix + name;
        }
    );
    return str.slice(tmpPrefix.length); // remove tmpPrefix
}

Regular Expression Cheat Sheet

Atoms (see Atoms: General):

. (dot) matches everything except line terminators (e.g., newlines). Use [\s\S] to really match everything.
Character class escapes:

\d matches digits ([0-9]); \D matches nondigits ([^0-9]).
\w matches Latin alphanumeric characters plus underscore ([A-Za-z0-9_]); \W matches all other characters.
\s matches all whitespace characters (space, tab, line feed, etc.); \S matches all nonwhitespace characters.

Character class (set of characters): […] and [^…]

Source characters: [abc] (all characters except \ ] - match themselves)
Character class escapes (see previous): [\d\w]
Ranges: [A-Za-z0-9]

Groups:

Capturing group: (…); backreference: \1
Noncapturing group: (?:…)

Quantifiers (see Quantifiers):

Greedy:

? * +
{n} {n,} {n,m}

Reluctant: Put a ? after any of the greedy quantifiers.

Assertions (see Assertions):

Beginning of input, end of input: ^ $
At a word boundary, not at a word boundary: \b \B
Positive lookahead: (?=…) (pattern must come next, but is otherwise ignored)
Negative lookahead: (?!…) (pattern must not come next, but is otherwise ignored)

Disjunction: |

Creating a regular expression (see Creating a Regular Expression):

Literal: /xyz/i (compiled at load time)
Constructor: new RegExp('xyz', 'i') (compiled at runtime)

Flags (see Flags):

global: /g (influences several regular expression methods)
ignoreCase: /i
multiline: /m (^ and $ match per line, as opposed to the complete input)

Methods:

regex.test(str): Is there a match (see RegExp.prototype.test: Is There a Match?)?

/g is not set: Is there a match somewhere?
/g is set: Return true as many times as there are matches.

str.search(regex): At what index is there a match (see String.prototype.search: At What Index Is There a Match?)?
regex.exec(str): Capture groups (see RegExp.prototype.exec: Capture Groups)

/g is not set: Capture groups of first match only (invoked once)
/g is set: Capture groups of all matches (invoked repeatedly; returns null if there are no more matches)

str.match(regex): Capture groups or return all matching substrings (see String.prototype.match: Capture Groups or Return All Matching Substrings)

/g is not set: Capture groups
/g is set: Return all matching substrings in an array

str.replace(search, replacement): Search and replace (see String.prototype.replace: Search and Replace)

search: String or regular expression (use the latter, set /g!)
replacement: String (with $1, etc.) or function (arguments[1] is group 1, etc.) that returns a string

For tips on using the flag /g, see Problems with the Flag /g.

Acknowledgments

Mathias Bynens (@mathias) and Juan Ignacio Dopazo (@juandopazo) recommended using match() and test() for counting occurrences, and Šime Vidas (@simevidas) warned me about being careful with match() if there are no matches. The pitfall of the global flag causing infinite loops comes from a talk by Andrea Giammarchi (@webreflection). Claude Pache told me to escape more characters in quoteText().