6.10. Regular Expression Functions
All of the regular expression functions use the Java pattern syntax,with a few notable exceptions:
- When using multi-line mode (enabled via the (?m) flag),only \n is recognized as a line terminator. Additionally,the (?d) flag is not supported and must not be used.
- Case-insensitive matching (enabled via the (?i) flag) is alwaysperformed in a Unicode-aware manner. However, context-sensitive andlocal-sensitive matching is not supported. Additionally, the(?u) flag is not supported and must not be used.
- Surrogate pairs are not supported. For example, \uD800\uDC00 isnot treated as U+10000 and must be specified as \x{10000}.
- Boundaries (\b) are incorrectly handled for a non-spacing markwithout a base character.
- \Q and \E are not supported in character classes(such as [A-Z123]) and are instead treated as literals.
- Unicode character classes (\p{prop}) are supported withthe following differences:
- All underscores in names must be removed. For example, useOldItalic instead of Old_Italic.
- Scripts must be specified directly, without theIs, script= or sc= prefixes.Example: \p{Hiragana}
- Blocks must be specified with the In prefix.The block= and blk= prefixes are not supported.Example: \p{Mongolian}
- Categories must be specified directly, without the Is,general_category= or gc= prefixes.Example: \p{L}
- Binary properties must be specified directly, without the Is.Example: \p{NoncharacterCodePoint}
regexpextract_all
(_string, pattern) → array<varchar>
Returns the substring(s) matched by the regular expressionpattern
instring
:
- SELECT regexp_extract_all('1a 2b 14m', '\d+'); -- [1, 2, 14]
regexpextract_all
(_string, pattern, group) → array<varchar>
Finds all occurrences of the regular expressionpattern
instring
and returns the capturing group numbergroup
:- SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
regexpextract
(_string, pattern) → varchar
Returns the first substring matched by the regular expressionpattern
instring
:- SELECT regexp_extract('1a 2b 14m', '\d+'); -- 1
regexpextract
(_string, pattern, group) → varchar
Finds the first occurrence of the regular expressionpattern
instring
and returns the capturing group numbergroup
:- SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
regexplike
(_string, pattern) → boolean
Evaluates the regular expressionpattern
and determines if it iscontained withinstring
.This function is similar to the
LIKE
operator, expect that thepattern only needs to be contained withinstring
, rather thanneeding to match all ofstring
. In other words, this performs acontains operation rather than a match operation. You can matchthe entire string by anchoring the pattern using^
and$
:- SELECT regexp_like('1a 2b 14m', '\d+b'); -- true
regexpreplace
(_string, pattern) → varchar
Removes every instance of the substring matched by the regular expressionpattern
fromstring
:- SELECT regexp_replace('1a 2b 14m', '\d+[ab] '); -- '14m'
regexpreplace
(_string, pattern, replacement) → varchar
Replaces every instance of the substring matched by the regular expressionpattern
instring
withreplacement
. Capturing groups can bereferenced inreplacement
using$g
for a numbered group or${name}
for a named group. A dollar sign ($
) may be included in thereplacement by escaping it with a backslash (\$
):- SELECT regexp_replace('1a 2b 14m', '(\d+)([ab]) ', '3c$2 '); -- '3ca 3cb 14m'
regexpreplace
(_string, pattern, function) → varchar
Replaces every instance of the substring matched by the regular expressionpattern
instring
usingfunction
. The lambda expressionfunction
is invoked for each match with the capturing groups passed as anarray. Capturing group numbers start at one; there is no group for the entire match(if you need this, surround the entire expression with parenthesis).- SELECT regexp_replace('new york', '(\w)(\w*)', x -> upper(x[1]) || lower(x[2])); --'New York'
regexpsplit
(_string, pattern) → array<varchar>
Splitsstring
using the regular expressionpattern
and returns anarray. Trailing empty strings are preserved:- SELECT regexp_split('1a 2b 14m', '\s*[a-z]+\s*'); -- [1, 2, 14, ]