Regular expression syntax

A regular expression is a way to match patterns in data using placeholder characters, called operators.

Elasticsearch supports regular expressions in the following queries:

  regexp
  query_string

Elasticsearch uses Apache Lucene’s regular expression engine to parse these queries.

Reserved characters

Lucene’s regular expression engine supports all Unicode characters. However, the following characters are reserved as operators:

  . ? + * | { } [ ] ( ) " \

Depending on the optional operators enabled, the following characters may also be reserved:

  # @ & < > ~

To use one of these characters literally, escape it with a preceding backslash or surround it with double quotes. For example:

  \@ # renders as a literal '@'
  \\ # renders as a literal '\'
  "john@smith.com" # renders as 'john@smith.com'
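When a pattern is assembled from arbitrary text, the backslash rule above can be applied mechanically. The following is a minimal Python sketch, not an official utility: `escape_lucene_regex` is a hypothetical helper name, and it conservatively escapes the optional-operator characters as well as the always-reserved ones.

```python
# Characters reserved by the standard operators, plus those that may be
# reserved when optional operators are enabled (both lists are from the text).
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Return text with every reserved character preceded by a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


print(escape_lucene_regex("john@smith.com"))  # john\@smith\.com
print(escape_lucene_regex("a\\b"))            # a\\b
```

The result is the pattern as the regular expression engine should see it. Sending it inside a JSON request body adds a second level of backslash escaping.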

The backslash is an escape character in both JSON strings and regular expressions, so a literal backslash in a query must be escaped at both levels. A language client takes care of the JSON level for you. For example, the string a\b needs to be indexed as "a\\b":

Python:

  resp = client.index(
      index="my-index-000001",
      id="1",
      document={
          "my_field": "a\\b"
      },
  )
  print(resp)

Ruby:

  response = client.index(
    index: 'my-index-000001',
    id: 1,
    body: {
      my_field: 'a\\b'
    }
  )
  puts response

JavaScript:

  const response = await client.index({
    index: "my-index-000001",
    id: 1,
    document: {
      my_field: "a\\b",
    },
  });
  console.log(response);

Console:

  PUT my-index-000001/_doc/1
  {
    "my_field": "a\\b"
  }

This document matches the following regexp query:

Python:

  resp = client.search(
      index="my-index-000001",
      query={
          "regexp": {
              "my_field.keyword": "a\\\\.*"
          }
      },
  )
  print(resp)

Ruby:

  response = client.search(
    index: 'my-index-000001',
    body: {
      query: {
        regexp: {
          'my_field.keyword' => 'a\\\\.*'
        }
      }
    }
  )
  puts response

JavaScript:

  const response = await client.search({
    index: "my-index-000001",
    query: {
      regexp: {
        "my_field.keyword": "a\\\\.*",
      },
    },
  });
  console.log(response);

Console:

  GET my-index-000001/_search
  {
    "query": {
      "regexp": {
        "my_field.keyword": "a\\\\.*"
      }
    }
  }
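To see why the query above carries four backslashes, it can help to separate the two escaping layers. The sketch below is illustrative only: it uses Python's standard `json` module for the JSON layer and Python's `re` module as a stand-in for Lucene's engine, which is an assumption that holds for this simple pattern but not for Lucene syntax in general.

```python
import json
import re

# The pattern as the regex engine should see it:
# 'a', an escaped (literal) backslash, then '.*'.
pattern = r"a\\.*"

# JSON serialization escapes each backslash again.
body = json.dumps({"query": {"regexp": {"my_field.keyword": pattern}}})
print(body)  # {"query": {"regexp": {"my_field.keyword": "a\\\\.*"}}}

# Decoding the JSON recovers the two-backslash pattern.
decoded = json.loads(body)["query"]["regexp"]["my_field.keyword"]

# The indexed value is the three-character string a\b.
document_value = "a\\b"
print(bool(re.fullmatch(decoded, document_value)))  # True
```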

Standard operators

Lucene’s regular expression engine does not use the Perl Compatible Regular Expressions (PCRE) library, but it does support the following standard operators.

.

Matches any character. For example:

  ab. # matches 'aba', 'abb', 'abz', etc.

?

Repeat the preceding character zero or one times. Often used to make the preceding character optional. For example:

  abc? # matches 'ab' and 'abc'

+

Repeat the preceding character one or more times. For example:

  ab+ # matches 'ab', 'abb', 'abbb', etc.

*

Repeat the preceding character zero or more times. For example:

  ab* # matches 'a', 'ab', 'abb', 'abbb', etc.

{}

Minimum and maximum number of times the preceding character can repeat. For example:

  a{2} # matches 'aa'
  a{2,4} # matches 'aa', 'aaa', and 'aaaa'
  a{2,} # matches 'a' repeated two or more times
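The repetition bounds above can be checked quickly outside Elasticsearch. This is a rough sketch that uses Python's `re.fullmatch` as a stand-in for Lucene's whole-term matching. The two engines are different implementations, so treat it as an approximation that happens to agree for these simple patterns. `whole_matches` is a hypothetical helper name.

```python
import re

candidates = ["a", "aa", "aaa", "aaaa", "aaaaa"]


def whole_matches(pattern, values):
    """Return the values that the pattern matches in full."""
    return [v for v in values if re.fullmatch(pattern, v)]


print(whole_matches("a{2}", candidates))    # ['aa']
print(whole_matches("a{2,4}", candidates))  # ['aa', 'aaa', 'aaaa']
print(whole_matches("a{2,}", candidates))   # ['aa', 'aaa', 'aaaa', 'aaaaa']
```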

|

OR operator. The match will succeed if the longest pattern on either the left side OR the right side matches. For example:

  abc|xyz # matches 'abc' and 'xyz'

( … )

Forms a group. You can use a group to treat part of the expression as a single unit, so that an operator such as ? applies to the whole group rather than to one character. For example:

  abc(def)? # matches 'abc' and 'abcdef' but not 'abcd'

[ … ]

Match one of the characters in the brackets. For example:

  [abc] # matches 'a', 'b', 'c'

Inside the brackets, - indicates a range unless - is the first character or escaped. For example:

  [a-c] # matches 'a', 'b', or 'c'
  [-abc] # '-' is first character. Matches '-', 'a', 'b', or 'c'
  [abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'

A ^ before a character in the brackets negates the character or range. For example:

  [^abc] # matches any character except 'a', 'b', or 'c'
  [^a-c] # matches any character except 'a', 'b', or 'c'
  [^-abc] # matches any character except '-', 'a', 'b', or 'c'
  [^abc\-] # matches any character except 'a', 'b', 'c', or '-'
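The treatment of - and ^ inside brackets is easy to get wrong, so a small experiment may help. The sketch below again uses Python's `re` module as an approximation of Lucene's behavior, an assumption that holds for these bracket expressions. The candidate list is made up for illustration.

```python
import re

candidates = ["a", "b", "c", "d", "-"]
patterns = ["[a-c]", "[-abc]", r"[abc\-]", "[^a-c]", "[^-abc]"]

# Map each pattern to the single-character candidates it matches in full.
results = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, matched)
# [a-c] ['a', 'b', 'c']
# [-abc] ['a', 'b', 'c', '-']
# [abc\-] ['a', 'b', 'c', '-']
# [^a-c] ['d', '-']
# [^-abc] ['d']
```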

Optional operators

You can use the flags parameter to enable more optional operators for Lucene’s regular expression engine.

To enable multiple operators, use a | separator. For example, a flags value of COMPLEMENT|INTERVAL enables the COMPLEMENT and INTERVAL operators.
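As an illustration of the | separator, the sketch below builds a regexp query body with a `flags` value. `regexp_query` is a hypothetical helper, and the field name and pattern are placeholders reused from the examples in this document. The `value` and `flags` keys are parameters of the regexp query.

```python
import json


def regexp_query(field, value, flags):
    """Build a regexp query body with the given optional operators enabled."""
    return {
        "query": {
            "regexp": {
                field: {
                    "value": value,
                    # Multiple operators are joined with '|'.
                    "flags": "|".join(flags),
                }
            }
        }
    }


body = regexp_query("my_field.keyword", "foo<1-100>", ["COMPLEMENT", "INTERVAL"])
print(json.dumps(body, indent=2))
```

The resulting dictionary could be passed to a client's search call or serialized and sent as the request body.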

Valid values

ALL (Default)

Enables all optional operators.

"" (empty string)

Alias for the ALL value.

COMPLEMENT

Enables the ~ operator. You can use ~ to negate the shortest following pattern. For example:

  a~bc # matches 'adc' and 'aec' but not 'abc'

EMPTY

Enables the # (empty language) operator. The # operator doesn’t match any string, not even an empty string.

If you create regular expressions by programmatically combining values, you can pass # to specify “no string.” This lets you avoid accidentally matching empty strings or other unwanted strings. For example:

  #|abc # matches 'abc' but nothing else, not even an empty string
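The programmatic case described above might look like the following sketch. `alternation` is a hypothetical helper. It assumes the terms are already escaped and that the EMPTY operator is enabled for the query that uses the result.

```python
def alternation(terms):
    """Join already-escaped terms with '|', or return '#' if there are none."""
    terms = list(terms)
    if not terms:
        # An empty pattern would match the empty string; '#' matches nothing.
        return "#"
    return "|".join(terms)


print(alternation(["abc", "xyz"]))  # abc|xyz
print(alternation([]))              # #
```

Without the special case, joining an empty list would produce an empty pattern, which is exactly the accidental empty-string match the # operator is meant to prevent.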

INTERVAL

Enables the <> operators. You can use <> to match a numeric range. For example:

  foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
  foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'

INTERSECTION

Enables the & operator, which acts as an AND operator. The match will succeed if the patterns on both the left side AND the right side match. For example:

  aaa.+&.+bbb # matches 'aaabbb'
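Python's `re` module has no & operator, but the AND semantics can be emulated by requiring both sides to match the whole term. This is only a sketch of the idea, with Python standing in for Lucene. `matches_intersection` is a hypothetical name.

```python
import re


def matches_intersection(left, right, term):
    """Return True if both patterns match the entire term."""
    return (
        re.fullmatch(left, term, flags=re.DOTALL) is not None
        and re.fullmatch(right, term, flags=re.DOTALL) is not None
    )


print(matches_intersection("aaa.+", ".+bbb", "aaabbb"))  # True
print(matches_intersection("aaa.+", ".+bbb", "aaab"))    # False
```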

ANYSTRING

Enables the @ operator. You can use @ to match any entire string.

You can combine the @ operator with & and ~ operators to create an “everything except” logic. For example:

  @&~(abc.+) # matches everything except terms beginning with 'abc'
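The "everything except" logic can be approximated in Python by negating a whole-term match. This sketch is an emulation under the assumption that Python's `re` agrees with Lucene for this pattern. `everything_except` is a hypothetical name.

```python
import re


def everything_except(excluded_pattern, term):
    """Return True unless the excluded pattern matches the entire term."""
    return re.fullmatch(excluded_pattern, term, flags=re.DOTALL) is None


print(everything_except("abc.+", "abcdef"))  # False
print(everything_except("abc.+", "xyz"))     # True
print(everything_except("abc.+", "abc"))     # True
```

The last line shows a subtlety: because .+ requires at least one character after 'abc', the bare term 'abc' is not excluded.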

NONE

Disables all optional operators.

Unsupported operators

Lucene’s regular expression engine does not support anchor operators, such as ^ (beginning of line) or $ (end of line). To match a term, the regular expression must match the entire string.
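In practice this means a pattern behaves as if it were implicitly anchored at both ends. The sketch below uses Python's `re.fullmatch` to approximate that behavior and contrasts it with an unanchored `re.search`. `lucene_style_match` is a hypothetical name and the term is made up.

```python
import re


def lucene_style_match(pattern, term):
    """Approximate whole-term matching: the pattern must cover the entire term."""
    return re.fullmatch(pattern, term) is not None


term = "foobar"
print(lucene_style_match("foo", term))      # False: 'foo' does not cover the whole term
print(lucene_style_match("foo.*", term))    # True: prefix match needs a trailing .*
print(lucene_style_match(".*oba.*", term))  # True: substring match needs .* on both sides
print(re.search("foo", term) is not None)   # True: unanchored search, shown for contrast
```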