Impala String Functions
String functions are classified as those primarily accepting or returning STRING
, VARCHAR
, or CHAR
data types, for example to measure the length of a string or concatenate two strings together.
- All the functions that accept
STRING
arguments also accept theVARCHAR
andCHAR
types introduced in Impala 2.0. - Whenever
VARCHAR
orCHAR
values are passed to a function that returns a string value, the return type is normalized toSTRING
. For example, a call toconcat()
with a mix ofSTRING
,VARCHAR
, andCHAR
arguments produces aSTRING
result.
Related information:
The string functions operate mainly on these data types: STRING Data Type, VARCHAR Data Type (Impala 2.0 or higher only), and CHAR Data Type (Impala 2.0 or higher only).
Function reference:
Impala supports the following string functions:
ascii(string str)
Purpose: Returns the numeric ASCII code of the first character of the argument.
Return type: int
base64decode(string str)
Purpose:
Return type: string
Usage notes:
For general information about Base64 encoding, see Base64 article on Wikipedia.
The functions base64encode()
and base64decode()
are typically used in combination, to store in an Impala table string data that is problematic to store or transmit. For example, you could use these functions to store string data that uses an encoding other than UTF-8, or to transform the values in contexts that require ASCII values, such as for partition key columns. Keep in mind that base64-encoded values produce different results for string functions such as LENGTH()
, MAX()
, and MIN()
than when those functions are called with the unencoded string values.
The set of characters that can be generated as output from base64encode()
, or specified in the argument string to base64decode()
, are the ASCII uppercase and lowercase letters (A-Z, a-z), digits (0-9), and the punctuation characters +
, /
, and =
.
All return values produced by base64encode()
are a multiple of 4 bytes in length. All argument values supplied to base64decode()
must also be a multiple of 4 bytes in length. If a base64-encoded value would otherwise have a different length, it can be padded with trailing =
characters to reach a length that is a multiple of 4 bytes.
If the argument string to base64decode()
does not represent a valid base64-encoded value, subject to the constraints of the Impala implementation such as the allowed character set, the function returns NULL
.
Examples:
The following examples show how to use base64encode()
and base64decode()
together to store and retrieve string values:
-- An arbitrary string can be encoded in base 64.
-- The length of the output is a multiple of 4 bytes,
-- padded with trailing = characters if necessary.
select base64encode('hello world') as encoded,
length(base64encode('hello world')) as length;
+------------------+--------+
| encoded | length |
+------------------+--------+
| aGVsbG8gd29ybGQ= | 16 |
+------------------+--------+
-- Passing an encoded value to base64decode() produces
-- the original value.
select base64decode('aGVsbG8gd29ybGQ=') as decoded;
+-------------+
| decoded |
+-------------+
| hello world |
+-------------+
These examples demonstrate incorrect encoded values that produce NULL
return values when decoded:
-- The input value to base64decode() must be a multiple of 4 bytes.
-- In this case, leaving off the trailing = padding character
-- produces a NULL return value.
select base64decode('aGVsbG8gd29ybGQ') as decoded;
+---------+
| decoded |
+---------+
| NULL |
+---------+
WARNINGS: UDF WARNING: Invalid base64 string; input length is 15,
which is not a multiple of 4.
-- The input to base64decode() can only contain certain characters.
-- The $ character in this case causes a NULL return value.
select base64decode('abc$');
+----------------------+
| base64decode('abc$') |
+----------------------+
| NULL |
+----------------------+
WARNINGS: UDF WARNING: Could not base64 decode input in space 4; actual output length 0
These examples demonstrate “round-tripping” of an original string to an encoded string, and back again. This technique is applicable if the original source is in an unknown encoding, or if some intermediate processing stage might cause national characters to be misrepresented:
select 'circumflex accents: â, ê, î, ô, û' as original,
base64encode('circumflex accents: â, ê, î, ô, û') as encoded;
+-----------------------------------+------------------------------------------------------+
| original | encoded |
+-----------------------------------+------------------------------------------------------+
| circumflex accents: â, ê, î, ô, û | Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= |
+-----------------------------------+------------------------------------------------------+
select base64encode('circumflex accents: â, ê, î, ô, û') as encoded,
base64decode(base64encode('circumflex accents: â, ê, î, ô, û')) as decoded;
+------------------------------------------------------+-----------------------------------+
| encoded | decoded |
+------------------------------------------------------+-----------------------------------+
| Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= | circumflex accents: â, ê, î, ô, û |
+------------------------------------------------------+-----------------------------------+
base64encode(string str)
Purpose:
Return type: string
Usage notes:
For general information about Base64 encoding, see Base64 article on Wikipedia.
The functions base64encode()
and base64decode()
are typically used in combination, to store in an Impala table string data that is problematic to store or transmit. For example, you could use these functions to store string data that uses an encoding other than UTF-8, or to transform the values in contexts that require ASCII values, such as for partition key columns. Keep in mind that base64-encoded values produce different results for string functions such as LENGTH()
, MAX()
, and MIN()
than when those functions are called with the unencoded string values.
The set of characters that can be generated as output from base64encode()
, or specified in the argument string to base64decode()
, are the ASCII uppercase and lowercase letters (A-Z, a-z), digits (0-9), and the punctuation characters +
, /
, and =
.
All return values produced by base64encode()
are a multiple of 4 bytes in length. All argument values supplied to base64decode()
must also be a multiple of 4 bytes in length. If a base64-encoded value would otherwise have a different length, it can be padded with trailing =
characters to reach a length that is a multiple of 4 bytes.
Examples:
The following examples show how to use base64encode()
and base64decode()
together to store and retrieve string values:
-- An arbitrary string can be encoded in base 64.
-- The length of the output is a multiple of 4 bytes,
-- padded with trailing = characters if necessary.
select base64encode('hello world') as encoded,
length(base64encode('hello world')) as length;
+------------------+--------+
| encoded | length |
+------------------+--------+
| aGVsbG8gd29ybGQ= | 16 |
+------------------+--------+
-- Passing an encoded value to base64decode() produces
-- the original value.
select base64decode('aGVsbG8gd29ybGQ=') as decoded;
+-------------+
| decoded |
+-------------+
| hello world |
+-------------+
These examples demonstrate incorrect encoded values that produce NULL
return values when decoded:
-- The input value to base64decode() must be a multiple of 4 bytes.
-- In this case, leaving off the trailing = padding character
-- produces a NULL return value.
select base64decode('aGVsbG8gd29ybGQ') as decoded;
+---------+
| decoded |
+---------+
| NULL |
+---------+
WARNINGS: UDF WARNING: Invalid base64 string; input length is 15,
which is not a multiple of 4.
-- The input to base64decode() can only contain certain characters.
-- The $ character in this case causes a NULL return value.
select base64decode('abc$');
+----------------------+
| base64decode('abc$') |
+----------------------+
| NULL |
+----------------------+
WARNINGS: UDF WARNING: Could not base64 decode input in space 4; actual output length 0
These examples demonstrate “round-tripping” of an original string to an encoded string, and back again. This technique is applicable if the original source is in an unknown encoding, or if some intermediate processing stage might cause national characters to be misrepresented:
select 'circumflex accents: â, ê, î, ô, û' as original,
base64encode('circumflex accents: â, ê, î, ô, û') as encoded;
+-----------------------------------+------------------------------------------------------+
| original | encoded |
+-----------------------------------+------------------------------------------------------+
| circumflex accents: â, ê, î, ô, û | Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= |
+-----------------------------------+------------------------------------------------------+
select base64encode('circumflex accents: â, ê, î, ô, û') as encoded,
base64decode(base64encode('circumflex accents: â, ê, î, ô, û')) as decoded;
+------------------------------------------------------+-----------------------------------+
| encoded | decoded |
+------------------------------------------------------+-----------------------------------+
| Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= | circumflex accents: â, ê, î, ô, û |
+------------------------------------------------------+-----------------------------------+
btrim(string a)
, btrim(string a, string chars_to_trim)
Purpose: Removes all instances of one or more characters from the start and end of a STRING
value. By default, removes only spaces. If a non-NULL
optional second argument is specified, the function removes all occurrences of characters in that second argument from the beginning and end of the string.
Return type: string
Added in: Impala 2.3.0
Examples:
The following examples show the default btrim()
behavior, and what changes when you specify the optional second argument. All the examples bracket the output value with [ ]
so that you can see any leading or trailing spaces in the btrim()
result. By default, the function removes and number of both leading and trailing spaces. When the second argument is specified, any number of occurrences of any character in the second argument are removed from the start and end of the input string; in this case, spaces are not removed (unless they are part of the second argument) and any instances of the characters are not removed if they do not come right at the beginning or end of the string.
-- Remove multiple spaces before and one space after.
select concat('[',btrim(' hello '),']');
+---------------------------------------+
| concat('[', btrim(' hello '), ']') |
+---------------------------------------+
| [hello] |
+---------------------------------------+
-- Remove any instances of x or y or z at beginning or end. Leave spaces alone.
select concat('[',btrim('xy hello zyzzxx','xyz'),']');
+------------------------------------------------------+
| concat('[', btrim('xy hello zyzzxx', 'xyz'), ']') |
+------------------------------------------------------+
| [ hello ] |
+------------------------------------------------------+
-- Remove any instances of x or y or z at beginning or end.
-- Leave x, y, z alone in the middle of the string.
select concat('[',btrim('xyhelxyzlozyzzxx','xyz'),']');
+----------------------------------------------------+
| concat('[', btrim('xyhelxyzlozyzzxx', 'xyz'), ']') |
+----------------------------------------------------+
| [helxyzlo] |
+----------------------------------------------------+
char_length(string a), character_length(string a)
Purpose: Returns the length in characters of the argument string, including any trailing spaces that pad a CHAR
value.
Return type: int
Usage notes:
When applied to a STRING
value, it returns the same result as the length()
function. When applied to a CHAR
value, it might return a larger value than length()
does, to account for trailing spaces in the CHAR
.
Examples:
The following example demonstrates how length()
and char_length()
sometimes produce the same result, and sometimes produce different results depending on the type of the argument and the presence of trailing spaces for CHAR
values. The S
and C
values are displayed with enclosing quotation marks to show any trailing spaces.
create table length_demo (s string, c char(5));
insert into length_demo values
('a',cast('a' as char(5))),
('abc',cast('abc' as char(5))),
('hello',cast('hello' as char(5)));
select concat('"',s,'"') as s, concat('"',c,'"') as c,
length(s), length(c),
char_length(s), char_length(c)
from length_demo;
+---------+---------+-----------+-----------+----------------+----------------+
| s | c | length(s) | length(c) | char_length(s) | char_length(c) |
+---------+---------+-----------+-----------+----------------+----------------+
| "a" | "a " | 1 | 1 | 1 | 5 |
| "abc" | "abc " | 3 | 3 | 3 | 5 |
| "hello" | "hello" | 5 | 5 | 5 | 5 |
+---------+---------+-----------+-----------+----------------+----------------+
chr(int character_code)
Purpose: Returns a character specified by a decimal code point value. The interpretation and display of the resulting character depends on your system locale. Because consistent processing of Impala string values is only guaranteed for values within the ASCII range, only use this function for values corresponding to ASCII characters. In particular, parameter values greater than 255 return an empty string.
Return type: string
Usage notes: Can be used as the inverse of the ascii()
function, which converts a character to its numeric ASCII code.
Added in: Impala 2.3.0
Examples:
SELECT chr(65);
+---------+
| chr(65) |
+---------+
| A |
+---------+
SELECT chr(97);
+---------+
| chr(97) |
+---------+
| a |
+---------+
concat(string a, string b...)
Purpose: Returns a single string representing all the argument values joined together.
Return type: string
Usage notes: concat()
and concat_ws()
are appropriate for concatenating the values of multiple columns within the same row, while group_concat()
joins together values from different rows.
concat_ws(string sep, string a, string b...)
Purpose: Returns a single string representing the second and following argument values joined together, delimited by a specified separator.
Return type: string
Usage notes: concat()
and concat_ws()
are appropriate for concatenating the values of multiple columns within the same row, while group_concat()
joins together values from different rows.
find_in_set(string str, string strList)
Purpose: Returns the position (starting from 1) of the first occurrence of a specified string within a comma-separated string. Returns NULL
if either argument is NULL
, 0 if the search string is not found, or 0 if the search string contains a comma.
Return type: int
group_concat(string s [, string sep])
Purpose: Returns a single string representing the argument value concatenated together for each row of the result set. If the optional separator string is specified, the separator is added between each pair of concatenated values.
Return type: string
Usage notes: concat()
and concat_ws()
are appropriate for concatenating the values of multiple columns within the same row, while group_concat()
joins together values from different rows.
By default, returns a single string covering the whole result set. To include other columns or values in the result set, or to produce multiple concatenated strings for subsets of rows, include a GROUP BY
clause in the query.
Strictly speaking, group_concat()
is an aggregate function, not a scalar function like the others in this list. For additional details and examples, see GROUP_CONCAT Function.
initcap(string str)
Purpose: Returns the input string with the first letter capitalized.
Return type: string
instr(string str, string substr [, bigint position [, bigint occurrence ] ])
Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer string.
Return type: int
Usage notes:
If the substring is not present in the string, the function returns 0:
select instr('foo bar bletch', 'z');
+------------------------------+
| instr('foo bar bletch', 'z') |
+------------------------------+
| 0 |
+------------------------------+
The optional third and fourth arguments let you find instances of the substring other than the first instance starting from the left:
The third argument lets you specify a starting point within the string other than 1:
-- Restricting the search to positions 7..end,
-- the first occurrence of 'b' is at position 9.
select instr('foo bar bletch', 'b', 7);
+---------------------------------+
| instr('foo bar bletch', 'b', 7) |
+---------------------------------+
| 9 |
+---------------------------------+
-- If there are no more occurrences after the
-- specified position, the result is 0.
select instr('foo bar bletch', 'b', 10);
+----------------------------------+
| instr('foo bar bletch', 'b', 10) |
+----------------------------------+
| 0 |
+----------------------------------+
If the third argument is negative, the search works right-to-left starting that many characters from the right. The return value still represents the position starting from the left side of the string.
-- Scanning right to left, the first occurrence of 'o'
-- is at position 8. (8th character from the left.)
select instr('hello world','o',-1);
+-------------------------------+
| instr('hello world', 'o', -1) |
+-------------------------------+
| 8 |
+-------------------------------+
-- Scanning right to left, starting from the 6th character
-- from the right, the first occurrence of 'o' is at
-- position 5 (5th character from the left).
select instr('hello world','o',-6);
+-------------------------------+
| instr('hello world', 'o', -6) |
+-------------------------------+
| 5 |
+-------------------------------+
-- If there are no more occurrences after the
-- specified position, the result is 0.
select instr('hello world','o',-10);
+--------------------------------+
| instr('hello world', 'o', -10) |
+--------------------------------+
| 0 |
+--------------------------------+
The fourth argument lets you specify an occurrence other than the first:
-- 2nd occurrence of 'b' is at position 9.
select instr('foo bar bletch', 'b', 1, 2);
+------------------------------------+
| instr('foo bar bletch', 'b', 1, 2) |
+------------------------------------+
| 9 |
+------------------------------------+
-- Negative position argument means scan right-to-left.
-- This example finds second instance of 'b' from the right.
select instr('foo bar bletch', 'b', -1, 2);
+-------------------------------------+
| instr('foo bar bletch', 'b', -1, 2) |
+-------------------------------------+
| 5 |
+-------------------------------------+
If the fourth argument is greater than the number of matching occurrences, the function returns 0:
-- There is no 3rd occurrence within the string.
select instr('foo bar bletch', 'b', 1, 3);
+------------------------------------+
| instr('foo bar bletch', 'b', 1, 3) |
+------------------------------------+
| 0 |
+------------------------------------+
-- There is not even 1 occurrence when scanning
-- the string starting at position 10.
select instr('foo bar bletch', 'b', 10, 1);
+-------------------------------------+
| instr('foo bar bletch', 'b', 10, 1) |
+-------------------------------------+
| 0 |
+-------------------------------------+
The fourth argument cannot be negative or zero. A non-positive value for this argument causes an error:
select instr('foo bar bletch', 'b', 1, 0);
ERROR: UDF ERROR: Invalid occurrence parameter to instr function: 0
select instr('aaaaaaaaa','aa', 1, -1);
ERROR: UDF ERROR: Invalid occurrence parameter to instr function: -1
If either of the optional arguments is
NULL
, the function also returnsNULL
:select instr('foo bar bletch', 'b', null);
+------------------------------------+
| instr('foo bar bletch', 'b', null) |
+------------------------------------+
| NULL |
+------------------------------------+
select instr('foo bar bletch', 'b', 1, null);
+---------------------------------------+
| instr('foo bar bletch', 'b', 1, null) |
+---------------------------------------+
| NULL |
+---------------------------------------+
left(string a, int num_chars)
See the strleft
function.
length(string a)
Purpose: Returns the length in characters of the argument string, ignoring any trailing spaces in CHAR
values.
Return type: int
Usage notes:
When applied to a STRING
value, it returns the same result as the char_length()
function. When applied to a CHAR
value, it might return a smaller value than char_length()
does, because length()
ignores any trailing spaces in the CHAR
.
Note: Because the behavior of length()
with CHAR
values containing trailing spaces is not standardized across the industry, when porting code from other database systems, evaluate the behavior of length()
on the source system and switch to char_length()
for Impala if necessary.
Examples:
The following example demonstrates how length()
and char_length()
sometimes produce the same result, and sometimes produce different results depending on the type of the argument and the presence of trailing spaces for CHAR
values. The S
and C
values are displayed with enclosing quotation marks to show any trailing spaces.
create table length_demo (s string, c char(5));
insert into length_demo values
('a',cast('a' as char(5))),
('abc',cast('abc' as char(5))),
('hello',cast('hello' as char(5)));
select concat('"',s,'"') as s, concat('"',c,'"') as c,
length(s), length(c),
char_length(s), char_length(c)
from length_demo;
+---------+---------+-----------+-----------+----------------+----------------+
| s | c | length(s) | length(c) | char_length(s) | char_length(c) |
+---------+---------+-----------+-----------+----------------+----------------+
| "a" | "a " | 1 | 1 | 1 | 5 |
| "abc" | "abc " | 3 | 3 | 3 | 5 |
| "hello" | "hello" | 5 | 5 | 5 | 5 |
+---------+---------+-----------+-----------+----------------+----------------+
locate(string substr, string str[, int pos])
Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer string, optionally after a particular position.
Return type: int
lower(string a), lcase(string a)
Purpose: Returns the argument string converted to all-lowercase.
Return type: string
Usage notes:
In Impala 2.5 and higher, you can simplify queries that use many UPPER()
and LOWER()
calls to do case-insensitive comparisons, by using the ILIKE
or IREGEXP
operators instead. See ILIKE Operator and IREGEXP Operator for details.
lpad(string str, int len, string pad)
Purpose: Returns a string of a specified length, based on the first argument string. If the specified string is too short, it is padded on the left with a repeating sequence of the characters from the pad string. If the specified string is too long, it is truncated on the right.
Return type: string
ltrim(string a [, string chars_to_trim])
Purpose: Returns the argument string with all occurrences of characters specified by the second argument removed from the left side. Removes spaces if the second argument is not specified.
Return type: string
parse_url(string urlString, string partToExtract [, string keyToExtract])
Purpose: Returns the portion of a URL corresponding to a specified part. The part argument can be 'PROTOCOL'
, 'HOST'
, 'PATH'
, 'REF'
, 'AUTHORITY'
, 'FILE'
, 'USERINFO'
, or 'QUERY'
. Uppercase is required for these literal values. When requesting the QUERY
portion of the URL, you can optionally specify a key to retrieve just the associated value from the key-value pairs in the query string.
Return type: string
Usage notes: This function is important for the traditional Hadoop use case of interpreting web logs. For example, if the web traffic data features raw URLs not divided into separate table columns, you can count visitors to a particular page by extracting the 'PATH'
or 'FILE'
field, or analyze search terms by extracting the corresponding key from the 'QUERY'
field.
regexp_escape(string source)
Purpose: The regexp_escape
function returns a string escaped for the special character in RE2 library so that the special characters are interpreted literally rather than as special characters. The following special characters are escaped by the function:
.\+*?[^]$(){}=!<>|:-
Return type: string
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]]
instead of \d
which you would have to escape as \\d
.
Examples:
This example shows escaping one of special characters in RE2.
+------------------------------------------------------+
| regexp_escape('Hello.world') |
+------------------------------------------------------+
| Hello\.world |
+------------------------------------------------------+
This example shows escaping all the special characters in RE2.
+------------------------------------------------------------+
| regexp_escape('a.b\\c+d*e?f[g]h$i(j)k{l}m=n!o<p>q|r:s-t') |
+------------------------------------------------------------+
| a\.b\\c\+d\*e\?f\[g\]h\$i\(j\)k\{l\}m\=n\!o\<p\>q\|r\:s\-t |
+------------------------------------------------------------+
regexp_extract(string subject, string pattern, int index)
Purpose: Returns the specified () group from a string based on a regular expression pattern. Group 0 refers to the entire extracted string, while group 1, 2, and so on refers to the first, second, and so on (...)
portion.
Return type: string
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]]
instead of \d
which you would have to escape as \\d
.
Examples:
This example shows how group 0 matches the full pattern string, including the portion outside any ()
group:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',0);
+------------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 0) |
+------------------------------------------------------+
| abcdef123ghi456 |
+------------------------------------------------------+
Returned 1 row(s) in 0.11s
This example shows how group 1 matches just the contents inside the first ()
group in the pattern string:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',1);
+------------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 1) |
+------------------------------------------------------+
| 456 |
+------------------------------------------------------+
Returned 1 row(s) in 0.11s
Unlike in earlier Impala releases, the regular expression library used in Impala 2.0 and later supports the .*?
idiom for non-greedy matches. This example shows how a pattern string starting with .*?
matches the shortest possible portion of the source string, returning the rightmost set of lowercase letters. A pattern string both starting and ending with .*?
finds two potential matches of equal length, and returns the first one found (the leftmost set of lowercase letters).
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+)',1);
+--------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+)', 1) |
+--------------------------------------------------------+
| def |
+--------------------------------------------------------+
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+).*?',1);
+-----------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+).*?', 1) |
+-----------------------------------------------------------+
| bcd |
+-----------------------------------------------------------+
regexp_like(string source, string pattern[, string options])
Purpose: Returns true
or false
to indicate whether the source string contains anywhere inside it the regular expression given by the pattern. The optional third argument consists of letter flags that change how the match is performed, such as i
for case-insensitive matching.
Syntax:
The flags that you can include in the optional third argument are:
c
: Case-sensitive matching (the default).i
: Case-insensitive matching. If multiple instances ofc
andi
are included in the third argument, the last such option takes precedence.m
: Multi-line matching. The^
and$
operators match the start or end of any line within the source string, not the start and end of the entire string.n
: Newline matching. The.
operator can match the newline character. A repetition operator such as.*
can match a portion of the source string that spans multiple lines.
Return type: boolean
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]]
instead of \d
which you would have to escape as \\d
.
Examples:
This example shows how regexp_like()
can test for the existence of various kinds of regular expression patterns within a source string:
-- Matches because the 'f' appears somewhere in 'foo'.
select regexp_like('foo','f');
+-------------------------+
| regexp_like('foo', 'f') |
+-------------------------+
| true |
+-------------------------+
-- Does not match because the comparison is case-sensitive by default.
select regexp_like('foo','F');
+-------------------------+
| regexp_like('foo', 'f') |
+-------------------------+
| false |
+-------------------------+
-- The 3rd argument can change the matching logic, such as 'i' meaning case-insensitive.
select regexp_like('foo','F','i');
+------------------------------+
| regexp_like('foo', 'f', 'i') |
+------------------------------+
| true |
+------------------------------+
-- The familiar regular expression notations work, such as ^ and $ anchors...
select regexp_like('foo','f$');
+--------------------------+
| regexp_like('foo', 'f$') |
+--------------------------+
| false |
+--------------------------+
select regexp_like('foo','o$');
+--------------------------+
| regexp_like('foo', 'o$') |
+--------------------------+
| true |
+--------------------------+
-- ...and repetition operators such as * and +
select regexp_like('foooooobar','fo+b');
+-----------------------------------+
| regexp_like('foooooobar', 'fo+b') |
+-----------------------------------+
| true |
+-----------------------------------+
select regexp_like('foooooobar','fx*y*o*b');
+---------------------------------------+
| regexp_like('foooooobar', 'fx*y*o*b') |
+---------------------------------------+
| true |
+---------------------------------------+
regexp_replace(string initial, string pattern, string replacement)
Purpose: Returns the initial argument with the regular expression pattern replaced by the final argument string.
Return type: string
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]]
instead of \d
which you would have to escape as \\d
.
Examples:
These examples show how you can replace parts of a string matching a pattern with replacement text, which can include backreferences to any ()
groups in the pattern string. The backreference numbers start at 1, and any \
characters must be escaped as \\
.
Replace a character pattern with new text:
[localhost:21000] > select regexp_replace('aaabbbaaa','b+','xyz');
+------------------------------------------+
| regexp_replace('aaabbbaaa', 'b+', 'xyz') |
+------------------------------------------+
| aaaxyzaaa |
+------------------------------------------+
Returned 1 row(s) in 0.11s
Replace a character pattern with substitution text that includes the original matching text:
[localhost:21000] > select regexp_replace('aaabbbaaa','(b+)','<\\1>');
+----------------------------------------------+
| regexp_replace('aaabbbaaa', '(b+)', '<\\1>') |
+----------------------------------------------+
| aaa<bbb>aaa |
+----------------------------------------------+
Returned 1 row(s) in 0.11s
Remove all characters that are not digits:
[localhost:21000] > select regexp_replace('123-456-789','[^[:digit:]]','');
+---------------------------------------------------+
| regexp_replace('123-456-789', '[^[:digit:]]', '') |
+---------------------------------------------------+
| 123456789 |
+---------------------------------------------------+
Returned 1 row(s) in 0.12s
repeat(string str, int n)
Purpose: Returns the argument string repeated a specified number of times.
Return type: string
replace(string initial, string target, string replacement)
Purpose: Returns the initial argument with all occurrences of the target string replaced by the replacement string.
Return type: string
Usage notes:
Because this function does not use any regular expression patterns, it is typically faster than regexp_replace()
for simple string substitutions.
If any argument is NULL
, the return value is NULL
.
Matching is case-sensitive.
If the replacement string contains another instance of the target string, the expansion is only performed once, instead of applying again to the newly constructed string.
Added in: Impala 2.9.0
Examples:
-- Replace one string with another.
select replace('hello world','world','earth');
+------------------------------------------+
| replace('hello world', 'world', 'earth') |
+------------------------------------------+
| hello earth |
+------------------------------------------+
-- All occurrences of the target string are replaced.
select replace('hello world','o','0');
+----------------------------------+
| replace('hello world', 'o', '0') |
+----------------------------------+
| hell0 w0rld |
+----------------------------------+
-- If no match is found, the original string is returned unchanged.
select replace('hello world','xyz','abc');
+--------------------------------------+
| replace('hello world', 'xyz', 'abc') |
+--------------------------------------+
| hello world |
+--------------------------------------+
reverse(string a)
Purpose: Returns the argument string with characters in reversed order.
Return type: string
right(string a, int num_chars)
See the strright
function.
rpad(string str, int len, string pad)
Purpose: Returns a string of a specified length, based on the first argument string. If the specified string is too short, it is padded on the right with a repeating sequence of the characters from the pad string. If the specified string is too long, it is truncated on the right.
Return type: string
rtrim(string a [, string chars_to_trim])
Purpose: Returns the argument string with all occurrences of characters specified by the second argument removed from the right side. Removes spaces if the second argument is not specified.
Return type: string
space(int n)
Purpose: Returns a concatenated string of the specified number of spaces. Shorthand for repeat(' ',n)
.
Return type: string
split_part(string source, string delimiter, bigint n)
Purpose: Returns the nth field within a delimited string. The fields are numbered starting from 1. The delimiter can consist of multiple characters, not just a single character. All matching of the delimiter is done exactly, not using any regular expression patterns.
Return type: string
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]]
instead of \d
which you would have to escape as \\d
.
Examples:
These examples show how to retrieve the nth field from a delimited string:
select split_part('x,y,z',',',1);
+-----------------------------+
| split_part('x,y,z', ',', 1) |
+-----------------------------+
| x |
+-----------------------------+
select split_part('x,y,z',',',2);
+-----------------------------+
| split_part('x,y,z', ',', 2) |
+-----------------------------+
| y |
+-----------------------------+
select split_part('x,y,z',',',3);
+-----------------------------+
| split_part('x,y,z', ',', 3) |
+-----------------------------+
| z |
+-----------------------------+
These examples show what happens for out-of-range field positions. Specifying a value less than 1 produces an error. Specifying a value greater than the number of fields returns a zero-length string (which is not the same as NULL
).
select split_part('x,y,z',',',0);
ERROR: Invalid field position: 0
with t1 as (select split_part('x,y,z',',',4) nonexistent_field)
select
nonexistent_field
, concat('[',nonexistent_field,']')
, length(nonexistent_field);
from t1
+-------------------+-------------------------------------+---------------------------+
| nonexistent_field | concat('[', nonexistent_field, ']') | length(nonexistent_field) |
+-------------------+-------------------------------------+---------------------------+
| | [] | 0 |
+-------------------+-------------------------------------+---------------------------+
These examples show how the delimiter can be a multi-character value:
select split_part('one***two***three','***',2);
+-------------------------------------------+
| split_part('one***two***three', '***', 2) |
+-------------------------------------------+
| two |
+-------------------------------------------+
select split_part('one\|/two\|/three','\|/',3);
+-------------------------------------------+
| split_part('one\|/two\|/three', '\|/', 3) |
+-------------------------------------------+
| three |
+-------------------------------------------+
strleft(string a, int num_chars)
Purpose: Returns the leftmost characters of the string. Shorthand for a call to substr()
with 2 arguments.
Return type: string
strright(string a, int num_chars)
Purpose: Returns the rightmost characters of the string. Shorthand for a call to substr()
with 2 arguments.
Return type: string
substr(string a, int start [, int len]), substring(string a, int start [, int len])
Purpose: Returns the portion of the string starting at a specified point, optionally with a specified maximum length. The characters in the string are indexed starting at 1.
Return type: string
translate(string input, string from, string to)
Purpose: Returns the input string with a set of characters replaced by another set of characters.
Return type: string
trim(string a)
Purpose: Returns the input string with both leading and trailing spaces removed. The same as passing the string through both ltrim()
and rtrim()
.
Usage notes: Often used during data cleansing operations during the ETL cycle, if input values might still have surrounding spaces. For a more general-purpose function that can remove other leading and trailing characters besides spaces, see btrim()
.
Return type: string
upper(string a), ucase(string a)
Purpose: Returns the argument string converted to all-uppercase.
Return type: string
Usage notes:
In Impala 2.5 and higher, you can simplify queries that use many UPPER()
and LOWER()
calls to do case-insensitive comparisons, by using the ILIKE
or IREGEXP
operators instead. See ILIKE Operator and IREGEXP Operator for details.
Parent topic: Impala Built-In Functions