Re2

List of functions

  • Re2::Grep(String) -> (String?) -> Bool
  • Re2::Match(String) -> (String?) -> Bool
  • Re2::Capture(String) -> (String?) -> Struct<_1:String?,foo:String?,...>
  • Re2::FindAndConsume(String) -> (String?) -> List<String>
  • Re2::Replace(String) -> (String?, String) -> String?
  • Re2::Count(String) -> (String?) -> Uint32
  • Re2::Options([CaseSensitive:Bool?,DotNl:Bool?,Literal:Bool?,LogErrors:Bool?,LongestMatch:Bool?,MaxMem:Uint64?,NeverCapture:Bool?,NeverNl:Bool?,OneLine:Bool?,PerlClasses:Bool?,PosixSyntax:Bool?,Utf8:Bool?,WordBoundary:Bool?]) -> Struct<CaseSensitive:Bool,DotNl:Bool,Literal:Bool,LogErrors:Bool,LongestMatch:Bool,MaxMem:Uint64,NeverCapture:Bool,NeverNl:Bool,OneLine:Bool,PerlClasses:Bool,PosixSyntax:Bool,Utf8:Bool,WordBoundary:Bool>

Pire imposes certain limitations in order to match strings against regular expressions efficiently, so some tasks are too complex or even impossible to solve with it. For such cases, there is another regular expression module based on google::RE2. It offers a broader range of features (see the official documentation).

By default, UTF-8 mode is enabled automatically if the regular expression is a valid UTF-8-encoded string but not a valid ASCII string. To control the settings of the re2 library manually, pass the result of the Re2::Options function as the second argument to the other module functions, right after the regular expression.

Warning

Make sure to double all the backslashes in your regular expressions if they are written inside a quoted string: standard string literals are treated as C-escaped strings in SQL. Alternatively, write the regular expression as a raw string @@regexp@@: backslashes do not need to be doubled in this case.
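As a minimal sketch of the warning above, the two named expressions below define the same pattern (one or more digits), once as a quoted literal and once as a raw string. The names $quoted and $raw and the input string are illustrative only.

```yql
-- Quoted literal: the backslash must be doubled.
$quoted = Re2::Grep("\\d+");
-- Raw string: the backslash is written once.
$raw = Re2::Grep(@@\d+@@);

SELECT
    $quoted("order 42") AS quoted_result, -- expected: true
    $raw("order 42") AS raw_result;       -- expected: true
```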

Examples

    $value = "xaaxaaxaa";
    $options = Re2::Options(false AS CaseSensitive);
    $match = Re2::Match("[ax]+\\d");
    $grep = Re2::Grep("a.*");
    $capture = Re2::Capture(".*(?P<foo>xa?)(a{2,}).*");
    $replace = Re2::Replace("x(a+)x");
    $count = Re2::Count("a", $options);
    SELECT
        $match($value) AS match,
        $grep($value) AS grep,
        $capture($value) AS capture,
        $capture($value)._1 AS capture_member,
        $replace($value, "b\\1z") AS replace,
        $count($value) AS count;
    /*
    - match: `false`
    - grep: `true`
    - capture: `(_0: 'xaaxaaxaa', _1: 'aa', foo: 'x')`
    - capture_member: `"aa"`
    - replace: `"baazaaxaa"`
    - count: `6`
    */


Re2::Grep / Re2::Match

Apart from implementation details and regular expression syntax, these functions behave the same as their counterparts in the Pire module. All other things being equal, and if you have no specific preference, we recommend that you use Pire::Grep or Pire::Match.
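The sketch below contrasts the two functions. It assumes, by analogy with the Pire module, that Grep looks for a match anywhere in the string while Match requires the whole string to match. This page does not state that explicitly, so treat the expected values as an assumption to verify. The names and inputs are illustrative.

```yql
$grep = Re2::Grep(@@\d+@@);
$match = Re2::Match(@@\d+@@);

SELECT
    $grep("abc123") AS grep_substring,   -- expected: true (digits occur in the string)
    $match("abc123") AS match_substring, -- expected: false, assuming whole-string matching
    $match("123") AS match_whole;        -- expected: true
```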

Re2::Capture

Unlike Pire::Capture, Re2::Capture supports multiple and named capturing groups.
Result type: a structure with fields of type String?.

  • Each field corresponds to a capturing group with the same name.
  • For unnamed groups, the following names are generated: _1, _2, etc.
  • The result always includes the _0 field containing the entire substring matching the regular expression.

For more information about working with structures in YQL, see the section on containers.
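A short sketch of reading individual fields from the resulting structure. The pattern, the input string, and the names $parse and $result are illustrative. As in the example at the top of the page, the unnamed group that follows a named group is expected to appear as _1.

```yql
-- One named group (key) and one unnamed group.
$parse = Re2::Capture(@@(?P<key>\w+)=(\d+)@@);
$result = $parse("id=42");

SELECT
    $result.key AS key,     -- named group, expected: "id"
    $result._1 AS number,   -- unnamed group, expected: "42"
    $result._0 AS whole;    -- entire match, expected: "id=42"
```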

Re2::FindAndConsume

Searches for all occurrences of the regular expression in the passed text and returns a list of values corresponding to the parenthesized part of the regular expression for each occurrence.
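For illustration, a minimal sketch that collects every run of digits from a string. The pattern contains one parenthesized group, whose value is returned for each occurrence. The name $find_numbers and the input are illustrative.

```yql
$find_numbers = Re2::FindAndConsume(@@(\d+)@@);

SELECT
    $find_numbers("a1 b22 c333") AS numbers; -- expected: ["1", "22", "333"]
```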

Re2::Replace

Works as follows:

  • In the input string (first argument), all the non-overlapping substrings matching the regular expression are replaced by the specified string (second argument).
  • In the replacement string, you can use the contents of capturing groups from the regular expression using back-references in the format: \\1, \\2 etc. The \\0 back-reference stands for the whole substring that matches the regular expression.
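The rules above can be sketched as follows, using the \\0 back-reference to wrap each match in brackets. The name $wrap and the input string are illustrative. Note that the replacement string is a quoted literal here, so its backslash is doubled.

```yql
$wrap = Re2::Replace(@@\d+@@);

SELECT
    $wrap("a1 b22", "[\\0]") AS wrapped; -- expected: "a[1] b[22]"
```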

Re2::Count

Returns the number of non-overlapping substrings of the input string that match the regular expression.
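A small sketch of what non-overlapping means in practice: in a string of five "a" characters, the pattern "aa" can be matched twice, because each match consumes the characters it covers. The name $count_pairs is illustrative.

```yql
$count_pairs = Re2::Count("aa");

SELECT
    $count_pairs("aaaaa") AS pairs; -- expected: 2 (the last "a" is left unmatched)
```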

Re2::Options

Notes on Re2::Options from the official repository

| Parameter | Default | Comments |
| --- | --- | --- |
| CaseSensitive:Bool? | true | match is case-sensitive (regexp can override with (?i) unless in posix_syntax mode) |
| DotNl:Bool? | false | let . match \n |
| Literal:Bool? | false | interpret string as literal, not regexp |
| LogErrors:Bool? | true | log syntax and execution errors to ERROR |
| LongestMatch:Bool? | false | search for longest match, not first match |
| MaxMem:Uint64? | - | approx. max memory footprint of RE2 |
| NeverCapture:Bool? | false | parse all parens as non-capturing |
| NeverNl:Bool? | false | never match \n, even if it is in regexp |
| PosixSyntax:Bool? | false | restrict regexps to POSIX egrep syntax |
| Utf8:Bool? | true | text and pattern are UTF-8; otherwise Latin-1 |

The following options are only consulted when PosixSyntax == true. When PosixSyntax == false, these features are always enabled and cannot be turned off; to perform multi-line matching in that case, begin the regexp with (?m).

| Parameter | Default | Comments |
| --- | --- | --- |
| PerlClasses:Bool? | false | allow Perl's \d \s \w \D \S \W |
| WordBoundary:Bool? | false | allow Perl's \b \B (word boundary and not) |
| OneLine:Bool? | false | ^ and $ only match beginning and end of text |

We do not recommend using Re2::Options in your code: most parameters can be replaced with regular expression flags.
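As a sketch of that recommendation, the two counters below are expected to behave the same way: the first turns off case sensitivity through Re2::Options, the second uses the (?i) flag inside the pattern. The names and the input string are illustrative.

```yql
-- Case-insensitive counting configured through Re2::Options.
$with_options = Re2::Count("a", Re2::Options(false AS CaseSensitive));
-- The same behavior expressed with a regular expression flag.
$with_flag = Re2::Count("(?i)a");

SELECT
    $with_options("xAaXaA") AS with_options, -- expected: 4
    $with_flag("xAaXaA") AS with_flag;       -- expected: 4
```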

Flag usage examples

    $value = "Foo bar FOO"u;
    -- enable case-insensitive mode
    $capture = Re2::Capture(@@(?i)(foo)@@);
    SELECT
        $capture($value) AS capture;
    $capture = Re2::Capture(@@(?i)(?P<vasya>FOO).*(?P<banan>bar)@@);
    SELECT
        $capture($value) AS capture;


In both cases, the word Foo will be found, because the (?i) flag makes the match case-insensitive. Using the raw string @@regexp@@ lets you avoid doubling backslashes.