Regular Expressions

Regular expressions are represented by the Regex class.

A Regex is typically created with a regex literal using PCRE2 syntax. It consists of a string of UTF-8 characters enclosed in forward slashes (/):

  /foo|bar/
  /h(e+)llo/
  /\d+/
  /あ/
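
As a rough sketch of how such a literal is used once created, the following snippet matches one of the patterns above against a subject string. The variable name `pattern` and the sample strings are made up for illustration.

```crystal
# A regex literal evaluates to an instance of Regex.
pattern = /h(e+)llo/
puts pattern.class # Regex

# Regex#match returns a Regex::MatchData on success and nil otherwise.
if md = pattern.match("heeello world")
  puts md[0] # heeello (the whole match)
  puts md[1] # eee (the first capture group)
end

# String#=~ returns the index at which the match starts, or nil.
puts("say hello" =~ pattern) # 4
```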

Note

Prior to Crystal 1.8 the compiler expected regex literals to follow the original PCRE pattern syntax. The newer PCRE2 pattern syntax was introduced in 1.8.

Escaping

Regular expressions support the same escape sequences as String literals.

  /\// # slash
  /\\/ # backslash
  /\b/ # backspace
  /\e/ # escape
  /\f/ # form feed
  /\n/ # newline
  /\r/ # carriage return
  /\t/ # tab
  /\v/ # vertical tab
  /\NNN/ # octal ASCII character
  /\xNN/ # hexadecimal ASCII character
  /\x{FFFF}/ # hexadecimal unicode character
  /\x{10FFFF}/ # hexadecimal unicode character

The delimiter character / must be escaped inside slash-delimited regular expression literals. Note that special characters of the PCRE syntax need to be escaped if they are intended as literal characters.
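
The difference between an escaped and an unescaped special character can be sketched as follows. The variable names and sample strings are illustrative only.

```crystal
dot_any = /a.c/      # unescaped dot: matches any single character
dot_literal = /a\.c/ # escaped dot: matches only a literal "."

puts dot_any.matches?("abc")     # true
puts dot_literal.matches?("abc") # false
puts dot_literal.matches?("a.c") # true

# The delimiter itself has to be escaped in a slash-delimited literal.
path_sep = /\//
p "usr/bin".split(path_sep) # ["usr", "bin"]
```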

Interpolation

Interpolation works in regular expression literals just as it does in string literals. Be aware that an interpolated literal is only checked at runtime: if the resulting string is not a valid regular expression, an exception is raised when the literal is evaluated.
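
A minimal sketch of interpolation and its pitfalls follows. All variable names are hypothetical. It assumes that the exception raised for an invalid pattern is ArgumentError, as raised by Regex.new in current versions, and it uses Regex.escape to treat interpolated text literally.

```crystal
# Interpolating a plain value into a regex literal.
word = "bar"
pattern = /foo|#{word}/
puts pattern.matches?("bar") # true

# Interpolated text is treated as pattern syntax, not as literal text.
literal = "1+1"
as_pattern = /#{literal}/
as_text = /#{Regex.escape(literal)}/
puts as_pattern.matches?("1+1") # false, because "1+" means one or more "1"
puts as_text.matches?("1+1")    # true

# An invalid result is only detected when the literal is evaluated.
broken = "("
begin
  invalid = /#{broken}/
rescue ArgumentError # assumed exception class, see the note above
  puts "invalid regular expression"
end
```

Passing untrusted or user-supplied text through Regex.escape before interpolating it avoids both surprises shown here.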

Modifiers

The closing delimiter may be followed by a number of optional modifiers to adjust the matching behaviour of the regular expression.

  • i: case-insensitive matching (PCRE_CASELESS): Unicode letters in the pattern match both upper and lower case letters in the subject string.
  • m: multiline matching (PCRE_MULTILINE): The start of line (^) and end of line ($) metacharacters match immediately following or immediately before internal newlines in the subject string, respectively, as well as at the very start and end.
  • x: extended whitespace matching (PCRE_EXTENDED): Most whitespace characters in the pattern are ignored entirely, except when escaped or inside a character class. An unescaped hash character # denotes the start of a comment that extends to the end of the line.

  /foo/i.match("FOO") # => #<Regex::MatchData "FOO">
  /foo/m.match("bar\nfoo") # => #<Regex::MatchData "foo">
  /foo /x.match("foo") # => #<Regex::MatchData "foo">
  /foo /imx.match("bar\nFOO") # => #<Regex::MatchData "FOO">
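
To make the effect of each modifier visible, the sketch below runs the same anchored pattern with and without them. The subject string and variable names are chosen for illustration.

```crystal
subject = "bar\nFOO"

plain = /^foo$/
with_i = /^foo$/i
with_im = /^foo$/im

puts plain.matches?(subject)   # false: the case differs and ^ only matches at the very start
puts with_i.matches?(subject)  # false: ^ still only matches at the very start
puts with_im.matches?(subject) # true: ^ and $ also match next to the internal newline

# With x, unescaped whitespace in the pattern is ignored.
spaced = /(\d{4}) - (\d{2})/x
puts spaced.matches?("2024-05")   # true
puts spaced.matches?("2024 - 05") # false: the spaces are not part of the pattern
```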

Percent regex literals

Besides slash-delimited literals, regular expressions may also be expressed as a percent literal, indicated by %r and a pair of delimiters. Valid delimiters are parentheses (), square brackets [], curly braces {}, angle brackets <> and pipes ||. Except for the pipes, all delimiters can be nested, meaning that a start delimiter inside the literal escapes the next end delimiter.

These are handy for writing regular expressions that include slashes, which would have to be escaped in slash-delimited literals.

  %r((/)) # => /(\/)/
  %r[[/]] # => /[\/]/
  %r{{/}} # => /{\/}/
  %r<</>> # => /<\/>/
  %r|/| # => /\//
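
As an illustration, the sketch below writes the same path-matching pattern in both forms and shows a nested delimiter. The pattern, path and variable names are made up for the example.

```crystal
# The same pattern written as a slash literal and as a percent literal.
slash_form = /\/usr\/(\w+)/
percent_form = %r{/usr/(\w+)}

path = "/usr/bin"
puts path[slash_form, 1]   # bin
puts path[percent_form, 1] # bin

# Nested delimiters: the inner { } pair does not end the literal.
doubled = %r{a{2}}
puts doubled.matches?("aa") # true
puts doubled.matches?("a")  # false
```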