String Functions and Operators
String Operators
The ||
operator performs concatenation.
String Functions
Note
These functions assume that the input strings contain valid UTF-8 encoded Unicode code points. There are no explicit checks for valid UTF-8 and the functions may return incorrect results on invalid UTF-8. Invalid UTF-8 data can be corrected with from_utf8().
Additionally, the functions operate on Unicode code points and not user visible characters (or grapheme clusters). Some languages combine multiple code points into a single user-perceived character, the basic unit of a writing system for a language, but the functions will treat each code point as a separate unit.
The lower() and upper() functions do not perform locale-sensitive, context-sensitive, or one-to-many mappings required for some languages. Specifically, this will return incorrect results for Lithuanian, Turkish and Azeri.
chr(n) → varchar
Returns the Unicode code point n
as a single character string.
codepoint(string) → integer
Returns the Unicode code point of the only character of string
.
concat(string1, …, stringN) → varchar
Returns the concatenation of string1
, string2
, ...
, stringN
. This function provides the same functionality as the SQL-standard concatenation operator (||
).
ends_with(string, substring) → boolean
Returns whether string
ends_with with substring
.
hamming_distance(string1, string2) → bigint
Returns the Hamming distance of string1
and string2
, i.e. the number of positions at which the corresponding characters are different. Note that the two strings must have the same length.
length(string) → bigint
Returns the length of string
in characters.
levenshtein_distance(string1, string2) → bigint
Returns the Levenshtein edit distance of string1
and string2
, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) needed to change string1
into string2
.
lower(string) → varchar
Converts string
to lowercase.
lpad(string, size, padstring) → varchar
Left pads string
to size
characters with padstring
. If size
is less than the length of string
, the result is truncated to size
characters. size
must not be negative and padstring
must be non-empty.
ltrim(string) → varchar
Removes leading whitespace from string
. See trim() for the set of recognized whitespace characters.
ltrim(string, chars) → varchar
Removes the longest substring containing only characters in chars
from the beginning of the string
.
SELECT ltrim('test', 't'); -- est
SELECT ltrim('tetris', 'te'); -- ris
replace(string, search) → varchar
Removes all instances of search
from string
.
replace(string, search, replace) → varchar
Replaces all instances of search
with replace
in string
.
If search
is an empty string, inserts replace
in front of every character and at the end of the string
.
replace_first(string, search, replace) → varchar
Replaces the first instances of ``search`` with ``replace`` in ``string``.
If search
is an empty string, it inserts replace
at the beginning of the string
.
reverse(string) → varchar
Returns string
with the characters in reverse order.
rpad(string, size, padstring) → varchar
Right pads string
to size
characters with padstring
. If size
is less than the length of string
, the result is truncated to size
characters. size
must not be negative and padstring
must be non-empty.
rtrim(string) → varchar
Removes trailing whitespace from string
. See trim() for the set of recognized whitespace characters.
rtrim(string, chars) → varchar
Removes the longest substring containing only characters in chars
from the end of the string
.
SELECT rtrim('test', 't'); -- tes
SELECT rtrim('test...', '.'); -- test
split(string, delimiter)
Splits string
on delimiter
and returns an array.
split(string, delimiter, limit)
Splits string
on delimiter
and returns an array of size at most limit
. The last element in the array always contain everything left in the string
. limit
must be a positive number.
split_part(string, delimiter, index) → varchar
Splits string
on delimiter
and returns the field index
. Field indexes start with 1
. If the index is larger than than the number of fields, then null is returned.
split_to_map(string, entryDelimiter, keyValueDelimiter) → map<varchar, varchar>
Splits string
by entryDelimiter
and keyValueDelimiter
and returns a map. entryDelimiter
splits string
into key-value pairs. keyValueDelimiter
splits each pair into key and value. Note that entryDelimiter
and keyValueDelimiter
are interpreted literally, i.e., as full string matches.
split_to_map(string, entryDelimiter, keyValueDelimiter, function(K, V1, V2, R)) → map<varchar, varchar>
Splits string
by entryDelimiter
and keyValueDelimiter
and returns a map. entryDelimiter
splits string
into key-value pairs. keyValueDelimiter
splits each pair into key and value. Note that entryDelimiter
and keyValueDelimiter
are interpreted literally, i.e., as full string matches. function(K,V1,V2,R)
is invoked in cases of duplicate keys to resolve the value that should be in the map.
SELECT(split_to_map('a:1;b:2;a:3', ';', ':', (k, v1, v2) -> v1)); -- {"a": "1", "b": "2"}
SELECT(split_to_map('a:1;b:2;a:3', ';', ':', (k, v1, v2) -> CONCAT(v1, v2))); -- {"a": "13", "b": "2"}
split_to_multimap(string, entryDelimiter, keyValueDelimiter)
Splits string
by entryDelimiter
and keyValueDelimiter
and returns a map containing an array of values for each unique key. entryDelimiter
splits string
into key-value pairs. keyValueDelimiter
splits each pair into key and value. The values for each key will be in the same order as they appeared in string
. Note that entryDelimiter
and keyValueDelimiter
are interpreted literally, i.e., as full string matches.
strpos(string, substring) → bigint
Returns the starting position of the first instance of substring
in string
. Positions start with 1
. If not found, 0
is returned.
starts_with(string, substring) → boolean
Returns whether string
starts with substring
.
strpos(string, substring, instance) → bigint
Returns the position of the N-th instance
of substring
in string
. instance
must be a positive number. Positions start with 1
. If not found, 0
is returned.
strrpos(string, substring) → bigint
Returns the starting position of the last instance of substring
in string
. Positions start with 1
. If not found, 0
is returned.
strrpos(string, substring, instance) → bigint
Returns the position of the N-th instance
of substring
in string
starting from the end of the string. instance
must be a positive number. Positions start with 1
. If not found, 0
is returned.
position(substring IN string) → bigint
Returns the starting position of the first instance of substring
in string
. Positions start with 1
. If not found, 0
is returned.
substr(string, start) → varchar
Returns the rest of string
from the starting position start
. Positions start with 1
. A negative starting position is interpreted as being relative to the end of the string.
substr(string, start, length) → varchar
Returns a substring from string
of length length
from the starting position start
. Positions start with 1
. A negative starting position is interpreted as being relative to the end of the string.
trail(string, N) → varchar
Returns the last N characters of the input string.
trim(string) → varchar
Removes leading and trailing whitespace from string
.
Recognized whitespace characters:
Code | Description | Code | Description |
---|---|---|---|
9 | TAB (horizontal tab) | U+1680 | Ogham Space Mark |
10 | LF (NL line feed, new line) | U+2000 | En Quad |
11 | VT (vertical tab) | U+2001 | Em Quad |
12 | FF (NP form feed, new page) | U+2002 | En Space |
13 | CR (carriage return) | U+2003 | Em Space |
28 | FS (file separator) | U+2004 | Three-Per-Em Space |
29 | GS (group separator) | U+2005 | Four-Per-Em Space |
30 | RS (record separator) | U+2006 | Four-Per-Em Space |
31 | US (unit separator) | U+2008 | Punctuation Space |
32 | Space | U+2009 | Thin Space |
U+200a | Hair Space | ||
U+200a | Hair Space | ||
U+2028 | Line Separator | ||
U+2029 | Paragraph Separator | ||
U+205f | Medium Mathematical Space | ||
U+3000 | Ideographic Space |
trim(string, chars) → varchar
Removes the longest substring containing only characters in chars
from the beginning and end of the string
.
SELECT trim('test', 't'); -- es
SELECT trim('.t.e.s.t.', '.t'); -- e.s
upper(string) → varchar
Converts string
to uppercase.
word_stem(word) → varchar
Returns the stem of word
in the English language.
word_stem(word, lang) → varchar
Returns the stem of word
in the lang
language.
Unicode Functions
normalize(string) → varchar
Transforms string
with NFC normalization form.
normalize(string, form) → varchar
Transforms string
with the specified normalization form. form
must be be one of the following keywords:
Form | Description |
---|---|
| Canonical Decomposition |
| Canonical Decomposition, followed by Canonical Composition |
| Compatibility Decomposition |
| Compatibility Decomposition, followed by Canonical Composition |
Note
This SQL-standard function has special syntax and requires specifying form
as a keyword, not as a string.
to_utf8(string) → varbinary
Encodes string
into a UTF-8 varbinary representation.
from_utf8(binary) → varchar
Decodes a UTF-8 encoded string from binary
. Invalid UTF-8 sequences are replaced with the Unicode replacement character U+FFFD
.
from_utf8(binary, replace) → varchar
Decodes a UTF-8 encoded string from binary
. Invalid UTF-8 sequences are replaced with replace. The replacement string replace must either be a single character or empty (in which case invalid characters are removed).
key_sampling_percent(varchar) → double
Generates a double value between 0.0 and 1.0 based on the hash of the given varchar
. This function is useful for deterministic sampling of data.