- User-Defined Functions (UDFs)
- UDF Concepts
- Runtime Environment for UDFs
- Installing the UDF Development Package
- Writing User-Defined Functions (UDFs)
- Writing User-Defined Aggregate Functions (UDAFs)
- Building and Deploying UDFs
- Performance Considerations for UDFs
- Examples of Creating and Using UDFs
- Security Considerations for User-Defined Functions
- Limitations and Restrictions for Impala UDFs
User-Defined Functions (UDFs)
User-defined functions (frequently abbreviated as UDFs) let you code your own application logic for processing column values during an Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
You can use UDFs to simplify query logic when producing reports, or to transform data in flexible ways when copying from one table to another with the INSERT ... SELECT
syntax.
You might be familiar with this feature from other database products, under names such as stored functions or stored routines.
Impala support for UDFs is available in Impala 1.2 and higher:
- In Impala 1.1, using UDFs in a query required using the Hive shell. (Because Impala and Hive share the same metastore database, you could switch to Hive to run just those queries requiring UDFs, then switch back to Impala.)
- Starting in Impala 1.2, Impala can run both high-performance native code UDFs written in C++, and Java-based Hive UDFs that you might already have written.
- Impala can run scalar UDFs that return a single value for each row of the result set, and user-defined aggregate functions (UDAFs) that return a value based on a set of rows. Currently, Impala does not support user-defined table functions (UDTFs) or window functions.
Parent topic: Impala SQL Language Reference
UDF Concepts
Depending on your use case, you might write all-new functions, reuse Java UDFs that you have already written for Hive, or port Hive Java UDF code to higher-performance native Impala UDFs in C++. You can code either scalar functions for producing results one row at a time, or more complex aggregate functions for doing analysis across. The following sections discuss these different aspects of working with UDFs.
UDFs and UDAFs
Depending on your use case, the user-defined functions (UDFs) you write might accept or produce different numbers of input and output values:
The most general kind of user-defined function (the one typically referred to by the abbreviation UDF) takes a single input value and produces a single output value. When used in a query, it is called once for each row in the result set. For example:
select customer_name, is_frequent_customer(customer_id) from customers;
select obfuscate(sensitive_column) from sensitive_data;
A user-defined aggregate function (UDAF) accepts a group of values and returns a single value. You use UDAFs to summarize and condense sets of rows, in the same style as the built-in
COUNT
,MAX()
,SUM()
, andAVG()
functions. When called in a query that uses theGROUP BY
clause, the function is called once for each combination ofGROUP BY
values. For example:-- Evaluates multiple rows but returns a single value.
select closest_restaurant(latitude, longitude) from places;
-- Evaluates batches of rows and returns a separate value for each batch.
select most_profitable_location(store_id, sales, expenses, tax_rate, depreciation) from franchise_data group by year;
Currently, Impala does not support other categories of user-defined functions, such as user-defined table functions (UDTFs) or window functions.
Native Impala UDFs
Impala supports UDFs written in C++, in addition to supporting existing Hive UDFs written in Java. Where practical, use C++ UDFs because the compiled native code can yield higher performance, with UDF execution time often 10x faster for a C++ UDF than the equivalent Java UDF.
Using Hive UDFs with Impala
Impala can run Java-based user-defined functions (UDFs), originally written for Hive, with no changes, subject to the following conditions:
- The parameters and return value must all use scalar data types supported by Impala. For example, complex or nested types are not supported.
- Hive/Java UDFs must extend
org.apache.hadoop.hive.ql.exec.UDF
class. - Currently, Hive UDFs that accept or return the
TIMESTAMP
type are not supported. - Prior to Impala 2.5 the return type must be a “Writable” type such as
Text
orIntWritable
, rather than a Java primitive type such asString
orint
. Otherwise, the UDF returnsNULL
. In Impala 2.5 and higher, this restriction is lifted, and both UDF arguments and return values can be Java primitive types. - Hive UDAFs and UDTFs are not supported.
- Typically, a Java UDF will execute several times slower in Impala than the equivalent native UDF written in C++.
- In Impala 2.5 and higher, you can transparently call Hive Java UDFs through Impala, or call Impala Java UDFs through Hive. This feature does not apply to built-in Hive functions. Any Impala Java UDFs created with older versions must be re-created using new
CREATE FUNCTION
syntax, without any signature for arguments or the return value.
To take full advantage of the Impala architecture and performance features, you can also write Impala-specific UDFs in C++.
For background about Java-based Hive UDFs, see the Hive documentation for UDFs. For examples or tutorials for writing such UDFs, search the web for related blog posts.
The ideal way to understand how to reuse Java-based UDFs (originally written for Hive) with Impala is to take some of the Hive built-in functions (implemented as Java UDFs) and take the applicable JAR files through the UDF deployment process for Impala, creating new UDFs with different names:
- Take a copy of the Hive JAR file containing the Hive built-in functions. For example, the path might be like /usr/lib/hive/lib/hive-exec-0.10.0.jar, with different version numbers corresponding to your specific level of .
- Use
jar tf jar_file
to see a list of the classes inside the JAR. You will see names likeorg/apache/hadoop/hive/ql/udf/UDFLower.class
andorg/apache/hadoop/hive/ql/udf/UDFOPNegative.class
. Make a note of the names of the functions you want to experiment with. When you specify the entry points for the ImpalaCREATE FUNCTION
statement, change the slash characters to dots and strip off the.class
suffix, for exampleorg.apache.hadoop.hive.ql.udf.UDFLower
andorg.apache.hadoop.hive.ql.udf.UDFOPNegative
. - Copy that file to an HDFS location that Impala can read. (In the examples here, we renamed the file to hive-builtins.jar in HDFS for simplicity.)
- For each Java-based UDF that you want to call through Impala, issue a
CREATE FUNCTION
statement, with aLOCATION
clause containing the full HDFS path of the JAR file, and aSYMBOL
clause with the fully qualified name of the class, using dots as separators and without the.class
extension. Remember that user-defined functions are associated with a particular database, so issue aUSE
statement for the appropriate database first, or specify the SQL function name asdb_name.function_name
. Use completely new names for the SQL functions, because Impala UDFs cannot have the same name as Impala built-in functions. - Call the function from your queries, passing arguments of the correct type to match the function signature. These arguments could be references to columns, arithmetic or other kinds of expressions, the results of
CAST
functions to ensure correct data types, and so on.
Note:
In Impala 2.9 and higher, you can refresh the user-defined functions (UDFs) that Impala recognizes, at the database level, by running the REFRESH FUNCTIONS
statement with the database name as an argument. Java-based UDFs can be added to the metastore database through Hive CREATE FUNCTION
statements, and made visible to Impala by subsequently running REFRESH FUNCTIONS
. For example:
CREATE DATABASE shared_udfs;
USE shared_udfs;
...use CREATE FUNCTION statements in Hive to create some Java-based UDFs
that Impala is not initially aware of...
REFRESH FUNCTIONS shared_udfs;
SELECT udf_created_by_hive(c1) FROM ...
Java UDF Example: Reusing lower() Function
For example, the following impala-shell session creates an Impala UDF my_lower()
that reuses the Java code for the Hive lower()
: built-in function. We cannot call it lower()
because Impala does not allow UDFs to have the same name as built-in functions. From SQL, we call the function in a basic way (in a query with no WHERE
clause), directly on a column, and on the results of a string expression:
[localhost:21000] > create database udfs;
[localhost:21000] > use udfs;
localhost:21000] > create function lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';
ERROR: AnalysisException: Function cannot have the same name as a builtin: lower
[localhost:21000] > create function my_lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';
[localhost:21000] > select my_lower('Some String NOT ALREADY LOWERCASE');
+----------------------------------------------------+
| udfs.my_lower('some string not already lowercase') |
+----------------------------------------------------+
| some string not already lowercase |
+----------------------------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > create table t2 (s string);
[localhost:21000] > insert into t2 values ('lower'),('UPPER'),('Init cap'),('CamelCase');
Inserted 4 rows in 2.28s
[localhost:21000] > select * from t2;
+-----------+
| s |
+-----------+
| lower |
| UPPER |
| Init cap |
| CamelCase |
+-----------+
Returned 4 row(s) in 0.47s
[localhost:21000] > select my_lower(s) from t2;
+------------------+
| udfs.my_lower(s) |
+------------------+
| lower |
| upper |
| init cap |
| camelcase |
+------------------+
Returned 4 row(s) in 0.54s
[localhost:21000] > select my_lower(concat('ABC ',s,' XYZ')) from t2;
+------------------------------------------+
| udfs.my_lower(concat('abc ', s, ' xyz')) |
+------------------------------------------+
| abc lower xyz |
| abc upper xyz |
| abc init cap xyz |
| abc camelcase xyz |
+------------------------------------------+
Returned 4 row(s) in 0.22s
Java UDF Example: Reusing negative() Function
Here is an example that reuses the Hive Java code for the negative()
built-in function. This example demonstrates how the data types of the arguments must match precisely with the function signature. At first, we create an Impala SQL function that can only accept an integer argument. Impala cannot find a matching function when the query passes a floating-point argument, although we can call the integer version of the function by casting the argument. Then we overload the same function name to also accept a floating-point argument.
[localhost:21000] > create table t (x int);
[localhost:21000] > insert into t values (1), (2), (4), (100);
Inserted 4 rows in 1.43s
[localhost:21000] > create function my_neg(bigint) returns bigint location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFOPNegative';
[localhost:21000] > select my_neg(4);
+----------------+
| udfs.my_neg(4) |
+----------------+
| -4 |
+----------------+
[localhost:21000] > select my_neg(x) from t;
+----------------+
| udfs.my_neg(x) |
+----------------+
| -2 |
| -4 |
| -100 |
+----------------+
Returned 3 row(s) in 0.60s
[localhost:21000] > select my_neg(4.0);
ERROR: AnalysisException: No matching function with signature: udfs.my_neg(FLOAT).
[localhost:21000] > select my_neg(cast(4.0 as int));
+-------------------------------+
| udfs.my_neg(cast(4.0 as int)) |
+-------------------------------+
| -4 |
+-------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > create function my_neg(double) returns double location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFOPNegative';
[localhost:21000] > select my_neg(4.0);
+------------------+
| udfs.my_neg(4.0) |
+------------------+
| -4 |
+------------------+
Returned 1 row(s) in 0.11s
Runtime Environment for UDFs
By default, Impala copies UDFs into /tmp, and you can configure this location through the --local_library_dir
startup flag for the impalad daemon.
Installing the UDF Development Package
To develop UDFs for Impala, download and install the impala-udf-devel
package (RHEL-based distributions) or impala-udf-dev
(Ubuntu and Debian). This package contains header files, sample source, and build configuration files.
- Locate the appropriate
.repo
or list file for your operating system version. - Use the familiar
yum
,zypper
, orapt-get
commands depending on your operating system. For the package name, specifyimpala-udf-devel
(RHEL-based distributions) orimpala-udf-dev
(Ubuntu and Debian).
Note: The UDF development code does not rely on Impala being installed on the same machine. You can write and compile UDFs on a minimal development system, then deploy them on a different one for use with Impala.
When you are ready to start writing your own UDFs, download the sample code and build scripts from the Impala sample UDF github. Then see Writing User-Defined Functions (UDFs) for how to code UDFs, and Examples of Creating and Using UDFs for how to build and run UDFs.
Writing User-Defined Functions (UDFs)
Before starting UDF development, make sure to install the development package and download the UDF code samples, as described in Installing the UDF Development Package.
When writing UDFs:
- Keep in mind the data type differences as you transfer values from the high-level SQL to your lower-level UDF code. For example, in the UDF code you might be much more aware of how many bytes different kinds of integers require.
- Use best practices for function-oriented programming: choose arguments carefully, avoid side effects, make each function do a single thing, and so on.
Getting Started with UDF Coding
To understand the layout and member variables and functions of the predefined UDF data types, examine the header file /usr/include/impala_udf/udf.h:
// This is the only Impala header required to develop UDFs and UDAs. This header
// contains the types that need to be used and the FunctionContext object. The context
// object serves as the interface object between the UDF/UDA and the impala process.
For the basic declarations needed to write a scalar UDF, see the header file udf-sample.h within the sample build environment, which defines a simple function named AddUdf()
:
#ifndef IMPALA_UDF_SAMPLE_UDF_H
#define IMPALA_UDF_SAMPLE_UDF_H
#include <impala_udf/udf.h>
using namespace impala_udf;
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);
#endif
For sample C++ code for a simple function named AddUdf()
, see the source file udf-sample.cc within the sample build environment:
#include "udf-sample.h"
// In this sample we are declaring a UDF that adds two ints and returns an int.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
if (arg1.is_null || arg2.is_null) return IntVal::null();
return IntVal(arg1.val + arg2.val);
}
// Multiple UDFs can be defined in the same file
Data Types for Function Arguments and Return Values
Each value that a user-defined function can accept as an argument or return as a result value must map to a SQL data type that you could specify for a table column.
Currently, Impala UDFs cannot accept arguments or return values of the Impala complex types (STRUCT
, ARRAY
, or MAP
).
Each data type has a corresponding structure defined in the C++ and Java header files, with two member fields and some predefined comparison operators and constructors:
is_null
indicates whether the value isNULL
or not.val
holds the actual argument or return value when it is non-NULL
.Each struct also defines a
null()
member function that constructs an instance of the struct with theis_null
flag set.The built-in SQL comparison operators and clauses such as
<
,>=
,BETWEEN
, andORDER BY
all work automatically based on the SQL return type of each UDF. For example, Impala knows how to evaluateBETWEEN 1 AND udf_returning_int(col1)
orORDER BY udf_returning_string(col2)
without you declaring any comparison operators within the UDF itself.For convenience within your UDF code, each struct defines
==
and!=
operators for comparisons with other structs of the same type. These are for typical C++ comparisons within your own code, not necessarily reproducing SQL semantics. For example, if theis_null
flag is set in both structs, they compare as equal. That behavior ofnull
comparisons is different from SQL (whereNULL == NULL
isNULL
rather thantrue
), but more in line with typical C++ behavior.Each kind of struct has one or more constructors that define a filled-in instance of the struct, optionally with default values.
Impala cannot process UDFs that accept composite or nested types as arguments or return them as result values. This limitation applies both to Impala UDFs written in C++ and Java-based Hive UDFs.
You can overload functions by creating multiple functions with the same SQL name but different argument types. For overloaded functions, you must use different C++ or Java entry point names in the underlying functions.
The data types defined on the C++ side (in /usr/include/impala_udf/udf.h) are:
IntVal
represents anINT
column.BigIntVal
represents aBIGINT
column. Even if you do not need the full range of aBIGINT
value, it can be useful to code your function arguments asBigIntVal
to make it convenient to call the function with different kinds of integer columns and expressions as arguments. Impala automatically casts smaller integer types to larger ones when appropriate, but does not implicitly cast large integer types to smaller ones.SmallIntVal
represents aSMALLINT
column.TinyIntVal
represents aTINYINT
column.StringVal
represents aSTRING
column. It has alen
field representing the length of the string, and aptr
field pointing to the string data. It has constructors that create a newStringVal
struct based on a null-terminated C-style string, or a pointer plus a length; these new structs still refer to the original string data rather than allocating a new buffer for the data. It also has a constructor that takes a pointer to aFunctionContext
struct and a length, that does allocate space for a new copy of the string data, for use in UDFs that return string values.BooleanVal
represents aBOOLEAN
column.FloatVal
represents aFLOAT
column.DoubleVal
represents aDOUBLE
column.TimestampVal
represents aTIMESTAMP
column. It has adate
field, a 32-bit integer representing the Gregorian date, that is, the days past the epoch date. It also has atime_of_day
field, a 64-bit integer representing the current time of day in nanoseconds.
Variable-Length Argument Lists
UDFs typically take a fixed number of arguments, with each one named explicitly in the signature of your C++ function. Your function can also accept additional optional arguments, all of the same type. For example, you can concatenate two strings, three strings, four strings, and so on. Or you can compare two numbers, three numbers, four numbers, and so on.
To accept a variable-length argument list, code the signature of your function like this:
StringVal Concat(FunctionContext* context, const StringVal& separator,
int num_var_args, const StringVal* args);
In the CREATE FUNCTION
statement, after the type of the first optional argument, include ...
to indicate it could be followed by more arguments of the same type. For example, the following function accepts a STRING
argument, followed by one or more additional STRING
arguments:
[localhost:21000] > create function my_concat(string, string ...) returns string location '/user/test_user/udfs/sample.so' symbol='Concat';
The call from the SQL query must pass at least one argument to the variable-length portion of the argument list.
When Impala calls the function, it fills in the initial set of required arguments, then passes the number of extra arguments and a pointer to the first of those optional arguments.
Handling NULL Values
For correctness, performance, and reliability, it is important for each UDF to handle all situations where any NULL
values are passed to your function. For example, when passed a NULL
, UDFs typically also return NULL
. In an aggregate function, which could be passed a combination of real and NULL
values, you might make the final value into a NULL
(as in CONCAT()
), ignore the NULL
value (as in AVG()
), or treat it the same as a numeric zero or empty string.
Each parameter type, such as IntVal
or StringVal
, has an is_null
Boolean member. Test this flag immediately for each argument to your function, and if it is set, do not refer to the val
field of the argument structure. The val
field is undefined when the argument is NULL
, so your function could go into an infinite loop or produce incorrect results if you skip the special handling for NULL
.
If your function returns NULL
when passed a NULL
value, or in other cases such as when a search string is not found, you can construct a null instance of the return type by using its null()
member function.
Memory Allocation for UDFs
By default, memory allocated within a UDF is deallocated when the function exits, which could be before the query is finished. The input arguments remain allocated for the lifetime of the function, so you can refer to them in the expressions for your return values. If you use temporary variables to construct all-new string values, use the StringVal()
constructor that takes an initial FunctionContext*
argument followed by a length, and copy the data into the newly allocated memory buffer.
Thread-Safe Work Area for UDFs
One way to improve performance of UDFs is to specify the optional PREPARE_FN
and CLOSE_FN
clauses on the CREATE FUNCTION
statement. The “prepare” function sets up a thread-safe data structure in memory that you can use as a work area. The “close” function deallocates that memory. Each subsequent call to the UDF within the same thread can access that same memory area. There might be several such memory areas allocated on the same host, as UDFs are parallelized using multiple threads.
Within this work area, you can set up predefined lookup tables, or record the results of complex operations on data types such as STRING
or TIMESTAMP
. Saving the results of previous computations rather than repeating the computation each time is an optimization known as http://en.wikipedia.org/wiki/Memoization. For example, if your UDF performs a regular expression match or date manipulation on a column that repeats the same value over and over, you could store the last-computed value or a hash table of already-computed values, and do a fast lookup to find the result for subsequent iterations of the UDF.
Each such function must have the signature:
void function_name(impala_udf::FunctionContext*, impala_udf::FunctionContext::FunctionScope)
Currently, only THREAD_SCOPE
is implemented, not FRAGMENT_SCOPE
. See udf.h for details about the scope values.
Error Handling for UDFs
To handle errors in UDFs, you call functions that are members of the initial FunctionContext*
argument passed to your function.
A UDF can record one or more warnings, for conditions that indicate minor, recoverable problems that do not cause the query to stop. The signature for this function is:
bool AddWarning(const char* warning_msg);
For a serious problem that requires cancelling the query, a UDF can set an error flag that prevents the query from returning any results. The signature for this function is:
void SetError(const char* error_msg);
Writing User-Defined Aggregate Functions (UDAFs)
User-defined aggregate functions (UDAFs or UDAs) are a powerful and flexible category of user-defined functions. If a query processes N rows, calling a UDAF during the query condenses the result set, anywhere from a single value (such as with the SUM
or MAX
functions), or some number less than or equal to N (as in queries using the GROUP BY
or HAVING
clause).
The Underlying Functions for a UDA
A UDAF must maintain a state value across subsequent calls, so that it can accumulate a result across a set of calls, rather than derive it purely from one set of arguments. For that reason, a UDAF is represented by multiple underlying functions:
- An initialization function that sets any counters to zero, creates empty buffers, and does any other one-time setup for a query.
- An update function that processes the arguments for each row in the query result set and accumulates an intermediate result for each node. For example, this function might increment a counter, append to a string buffer, or set flags.
- A merge function that combines the intermediate results from two different nodes.
- A serialize function that flattens any intermediate values containing pointers, and frees any memory allocated during the init, update, and merge phases.
- A finalize function that either passes through the combined result unchanged, or does one final transformation.
Intermediate values returned by the init, update and merge functions that referred to allocations must be allocated using FunctionContext::Allocate()
and freed using FunctionContext::Free()
. Both serialize and finalize functions are responsible for cleaning up the intermediate value and freeing such allocations. StringVals returned to Impala directly by Serialize(), Finalize() or GetValue() functions should be backed by temporary results memory allocated using the StringVal(FunctionContext*, int) constructor
, StringVal::CopyFrom(FunctionContext*, const uint8_t*, size_t)
, or StringVal::Resize()
.
In the SQL syntax, you create a UDAF by using the statement CREATE AGGREGATE FUNCTION
. You specify the entry points of the underlying C++ functions using the clauses INIT_FN
, UPDATE_FN
, MERGE_FN
, SERIALIZE_FN
, and FINALIZE_FN
.
For convenience, you can use a naming convention for the underlying functions and Impala automatically recognizes those entry points. Specify the UPDATE_FN
clause, using an entry point name containing the string update
or Update
. When you omit the other _FN
clauses from the SQL statement, Impala looks for entry points with names formed by substituting the update
or Update
portion of the specified name.
Intermediate Results for UDAs
A user-defined aggregate function might produce and combine intermediate results during some phases of processing, using a different data type than the final return value. For example, if you implement a function similar to the built-in AVG()
function, it must keep track of two values, the number of values counted and the sum of those values. Or, you might accumulate a string value over the course of a UDA, then in the end return a numeric or Boolean result.
In such a case, specify the data type of the intermediate results using the optional INTERMEDIATE type_name
clause of the CREATE AGGREGATE FUNCTION
statement. If the intermediate data is a typeless byte array (for example, to represent a C++ struct or array), specify the type name as CHAR(n)
, with n representing the number of bytes in the intermediate result buffer.
For an example of this technique, see the trunc_sum()
aggregate function, which accumulates intermediate results of type DOUBLE
and returns BIGINT
at the end. View the CREATE FUNCTION
statement and the implementation of the underlying TruncSum*() functions on Github.
Building and Deploying UDFs
This section explains the steps to compile Impala UDFs from C++ source code, and deploy the resulting libraries for use in Impala queries.
Impala ships with a sample build environment for UDFs, that you can study, experiment with, and adapt for your own use. This sample build environment starts with the cmake configuration command, which reads the file CMakeLists.txt and generates a Makefile customized for your particular directory paths. Then the make command runs the actual build steps based on the rules in the Makefile.
Impala loads the shared library from an HDFS location. After building a shared library containing one or more UDFs, use hdfs dfs
or hadoop fs
commands to copy the binary file to an HDFS location readable by Impala.
The final step in deployment is to issue a CREATE FUNCTION
statement in the impala-shell interpreter to make Impala aware of the new function. See CREATE FUNCTION Statement for syntax details. Because each function is associated with a particular database, always issue a USE
statement to the appropriate database before creating a function, or specify a fully qualified name, that is, CREATE FUNCTION db_name.function_name
.
As you update the UDF code and redeploy updated versions of a shared library, use DROP FUNCTION
and CREATE FUNCTION
to let Impala pick up the latest version of the code.
Note:
In Impala 2.5 and higher, Impala UDFs and UDAs written in C++ are persisted in the metastore database. Java UDFs are also persisted, if they were created with the new CREATE FUNCTION
syntax for Java UDFs, where the Java function argument and return types are omitted. Java-based UDFs created with the old CREATE FUNCTION
syntax do not persist across restarts because they are held in the memory of the catalogd daemon. Until you re-create such Java UDFs using the new CREATE FUNCTION
syntax, you must reload those Java-based UDFs by running the original CREATE FUNCTION
statements again each time you restart the catalogd daemon. Prior to Impala 2.5 the requirement to reload functions after a restart applied to both C++ and Java functions.
See CREATE FUNCTION Statement and DROP FUNCTION Statement for the new syntax for the persistent Java UDFs.
Prerequisites for the build environment are:
# Use the appropriate package installation command for your Linux distribution.
sudo yum install gcc-c++ cmake boost-devel
sudo yum install impala-udf-devel
# The package name on Ubuntu and Debian is impala-udf-dev.
Then, unpack the sample code in udf_samples.tar.gz and use that as a template to set up your build environment.
To build the original samples:
# Process CMakeLists.txt and set up appropriate Makefiles.
cmake .
# Generate shared libraries from UDF and UDAF sample code,
# udf_samples/libudfsample.so and udf_samples/libudasample.so
make
The sample code to examine, experiment with, and adapt is in these files:
- udf-sample.h: Header file that declares the signature for a scalar UDF (
AddUDF
). - udf-sample.cc: Sample source for a simple UDF that adds two integers. Because Impala can reference multiple function entry points from the same shared library, you could add other UDF functions in this file and add their signatures to the corresponding header file.
- udf-sample-test.cc: Basic unit tests for the sample UDF.
- uda-sample.h: Header file that declares the signature for sample aggregate functions. The SQL functions will be called
COUNT
,AVG
, andSTRINGCONCAT
. Because aggregate functions require more elaborate coding to handle the processing for multiple phases, there are several underlying C++ functions such asCountInit
,AvgUpdate
, andStringConcatFinalize
. - uda-sample.cc: Sample source for simple UDAFs that demonstrate how to manage the state transitions as the underlying functions are called during the different phases of query processing.
- The UDAF that imitates the
COUNT
function keeps track of a single incrementing number; the merge functions combine the intermediate count values from each Impala node, and the combined number is returned verbatim by the finalize function. - The UDAF that imitates the
AVG
function keeps track of two numbers, a count of rows processed and the sum of values for a column. These numbers are updated and merged as withCOUNT
, then the finalize function divides them to produce and return the final average value. - The UDAF that concatenates string values into a comma-separated list demonstrates how to manage storage for a string that increases in length as the function is called for multiple rows.
- The UDAF that imitates the
- uda-sample-test.cc: basic unit tests for the sample UDAFs.
Performance Considerations for UDFs
Because a UDF typically processes each row of a table, potentially being called billions of times, the performance of each UDF is a critical factor in the speed of the overall ETL or ELT pipeline. Tiny optimizations you can make within the function body can pay off in a big way when the function is called over and over when processing a huge result set.
Examples of Creating and Using UDFs
This section demonstrates how to create and use all kinds of user-defined functions (UDFs).
Sample C++ UDFs: HasVowels, CountVowels, StripVowels
This example shows 3 separate UDFs that operate on strings and return different data types. In the C++ code, the functions are HasVowels()
(checks if a string contains any vowels), CountVowels()
(returns the number of vowels in a string), and StripVowels()
(returns a new string with vowels removed).
First, we add the signatures for these functions to udf-sample.h in the demo build environment:
BooleanVal HasVowels(FunctionContext* context, const StringVal& input);
IntVal CountVowels(FunctionContext* context, const StringVal& arg1);
StringVal StripVowels(FunctionContext* context, const StringVal& arg1);
Then, we add the bodies of these functions to udf-sample.cc:
BooleanVal HasVowels(FunctionContext* context, const StringVal& input)
{
if (input.is_null) return BooleanVal::null();
int index;
uint8_t *ptr;
for (ptr = input.ptr, index = 0; index <= input.len; index++, ptr++)
{
uint8_t c = tolower(*ptr);
if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
{
return BooleanVal(true);
}
}
return BooleanVal(false);
}
IntVal CountVowels(FunctionContext* context, const StringVal& arg1)
{
if (arg1.is_null) return IntVal::null();
int count;
int index;
uint8_t *ptr;
for (ptr = arg1.ptr, count = 0, index = 0; index <= arg1.len; index++, ptr++)
{
uint8_t c = tolower(*ptr);
if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
{
count++;
}
}
return IntVal(count);
}
StringVal StripVowels(FunctionContext* context, const StringVal& arg1)
{
if (arg1.is_null) return StringVal::null();
int index;
std::string original((const char *)arg1.ptr,arg1.len);
std::string shorter("");
for (index = 0; index < original.length(); index++)
{
uint8_t c = original[index];
uint8_t l = tolower(c);
if (l == 'a' || l == 'e' || l == 'i' || l == 'o' || l == 'u')
{
;
}
else
{
shorter.append(1, (char)c);
}
}
// The modified string is stored in 'shorter', which is destroyed when this function ends. We need to make a string val
// and copy the contents.
StringVal result(context, shorter.size()); // Only the version of the ctor that takes a context object allocates new memory
memcpy(result.ptr, shorter.c_str(), shorter.size());
return result;
}
We build a shared library, libudfsample.so, and put the library file into HDFS where Impala can read it:
$ make
[ 0%] Generating udf_samples/uda-sample.ll
[ 16%] Built target uda-sample-ir
[ 33%] Built target udasample
[ 50%] Built target uda-sample-test
[ 50%] Generating udf_samples/udf-sample.ll
[ 66%] Built target udf-sample-ir
Scanning dependencies of target udfsample
[ 83%] Building CXX object CMakeFiles/udfsample.dir/udf-sample.o
Linking CXX shared library udf_samples/libudfsample.so
[ 83%] Built target udfsample
Linking CXX executable udf_samples/udf-sample-test
[100%] Built target udf-sample-test
$ hdfs dfs -put ./udf_samples/libudfsample.so /user/hive/udfs/libudfsample.so
Finally, we go into the impala-shell interpreter where we set up some sample data, issue CREATE FUNCTION
statements to set up the SQL function names, and call the functions in some queries:
[localhost:21000] > create database udf_testing;
[localhost:21000] > use udf_testing;
[localhost:21000] > create function has_vowels (string) returns boolean location '/user/hive/udfs/libudfsample.so' symbol='HasVowels';
[localhost:21000] > select has_vowels('abc');
+------------------------+
| udfs.has_vowels('abc') |
+------------------------+
| true |
+------------------------+
Returned 1 row(s) in 0.13s
[localhost:21000] > select has_vowels('zxcvbnm');
+----------------------------+
| udfs.has_vowels('zxcvbnm') |
+----------------------------+
| false |
+----------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select has_vowels(null);
+-----------------------+
| udfs.has_vowels(null) |
+-----------------------+
| NULL |
+-----------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > select s, has_vowels(s) from t2;
+-----------+--------------------+
| s | udfs.has_vowels(s) |
+-----------+--------------------+
| lower | true |
| UPPER | true |
| Init cap | true |
| CamelCase | true |
+-----------+--------------------+
Returned 4 row(s) in 0.24s
[localhost:21000] > create function count_vowels (string) returns int location '/user/hive/udfs/libudfsample.so' symbol='CountVowels';
[localhost:21000] > select count_vowels('cat in the hat');
+-------------------------------------+
| udfs.count_vowels('cat in the hat') |
+-------------------------------------+
| 4 |
+-------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select s, count_vowels(s) from t2;
+-----------+----------------------+
| s | udfs.count_vowels(s) |
+-----------+----------------------+
| lower | 2 |
| UPPER | 2 |
| Init cap | 3 |
| CamelCase | 4 |
+-----------+----------------------+
Returned 4 row(s) in 0.23s
[localhost:21000] > select count_vowels(null);
+-------------------------+
| udfs.count_vowels(null) |
+-------------------------+
| NULL |
+-------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > create function strip_vowels (string) returns string location '/user/hive/udfs/libudfsample.so' symbol='StripVowels';
[localhost:21000] > select strip_vowels('abcdefg');
+------------------------------+
| udfs.strip_vowels('abcdefg') |
+------------------------------+
| bcdfg |
+------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > select strip_vowels('ABCDEFG');
+------------------------------+
| udfs.strip_vowels('abcdefg') |
+------------------------------+
| BCDFG |
+------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select strip_vowels(null);
+-------------------------+
| udfs.strip_vowels(null) |
+-------------------------+
| NULL |
+-------------------------+
Returned 1 row(s) in 0.16s
[localhost:21000] > select s, strip_vowels(s) from t2;
+-----------+----------------------+
| s | udfs.strip_vowels(s) |
+-----------+----------------------+
| lower | lwr |
| UPPER | PPR |
| Init cap | nt cp |
| CamelCase | CmlCs |
+-----------+----------------------+
Returned 4 row(s) in 0.24s
Sample C++ UDA: SumOfSquares
This example demonstrates a user-defined aggregate function (UDA) that produces the sum of the squares of its input values.
The coding for a UDA is a little more involved than a scalar UDF, because the processing is split into several phases, each implemented by a different function. Each phase is relatively straightforward: the “update” and “merge” phases, where most of the work is done, read an input value and combine it with some accumulated intermediate value.
As in our sample UDF from the previous example, we add function signatures to a header file (in this case, uda-sample.h). Because this is a math-oriented UDA, we make two versions of each function, one accepting an integer value and the other accepting a floating-point value.
void SumOfSquaresInit(FunctionContext* context, BigIntVal* val);
void SumOfSquaresInit(FunctionContext* context, DoubleVal* val);
void SumOfSquaresUpdate(FunctionContext* context, const BigIntVal& input, BigIntVal* val);
void SumOfSquaresUpdate(FunctionContext* context, const DoubleVal& input, DoubleVal* val);
void SumOfSquaresMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst);
void SumOfSquaresMerge(FunctionContext* context, const DoubleVal& src, DoubleVal* dst);
BigIntVal SumOfSquaresFinalize(FunctionContext* context, const BigIntVal& val);
DoubleVal SumOfSquaresFinalize(FunctionContext* context, const DoubleVal& val);
We add the function bodies to a C++ source file (in this case, uda-sample.cc):
void SumOfSquaresInit(FunctionContext* context, BigIntVal* val) {
val->is_null = false;
val->val = 0;
}
void SumOfSquaresInit(FunctionContext* context, DoubleVal* val) {
val->is_null = false;
val->val = 0.0;
}
void SumOfSquaresUpdate(FunctionContext* context, const BigIntVal& input, BigIntVal* val) {
if (input.is_null) return;
val->val += input.val * input.val;
}
void SumOfSquaresUpdate(FunctionContext* context, const DoubleVal& input, DoubleVal* val) {
if (input.is_null) return;
val->val += input.val * input.val;
}
void SumOfSquaresMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst) {
dst->val += src.val;
}
void SumOfSquaresMerge(FunctionContext* context, const DoubleVal& src, DoubleVal* dst) {
dst->val += src.val;
}
BigIntVal SumOfSquaresFinalize(FunctionContext* context, const BigIntVal& val) {
return val;
}
DoubleVal SumOfSquaresFinalize(FunctionContext* context, const DoubleVal& val) {
return val;
}
As with the sample UDF, we build a shared library and put it into HDFS:
$ make
[ 0%] Generating udf_samples/uda-sample.ll
[ 16%] Built target uda-sample-ir
Scanning dependencies of target udasample
[ 33%] Building CXX object CMakeFiles/udasample.dir/uda-sample.o
Linking CXX shared library udf_samples/libudasample.so
[ 33%] Built target udasample
Scanning dependencies of target uda-sample-test
[ 50%] Building CXX object CMakeFiles/uda-sample-test.dir/uda-sample-test.o
Linking CXX executable udf_samples/uda-sample-test
[ 50%] Built target uda-sample-test
[ 50%] Generating udf_samples/udf-sample.ll
[ 66%] Built target udf-sample-ir
[ 83%] Built target udfsample
[100%] Built target udf-sample-test
$ hdfs dfs -put ./udf_samples/libudasample.so /user/hive/udfs/libudasample.so
To create the SQL function, we issue a CREATE AGGREGATE FUNCTION
statement and specify the underlying C++ function names for the different phases:
[localhost:21000] > use udf_testing;
[localhost:21000] > create table sos (x bigint, y double);
[localhost:21000] > insert into sos values (1, 1.1), (2, 2.2), (3, 3.3), (4, 4.4);
Inserted 4 rows in 1.10s
[localhost:21000] > create aggregate function sum_of_squares(bigint) returns bigint
> location '/user/hive/udfs/libudasample.so'
> init_fn='SumOfSquaresInit'
> update_fn='SumOfSquaresUpdate'
> merge_fn='SumOfSquaresMerge'
> finalize_fn='SumOfSquaresFinalize';
[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1*1 + 2*2 + 3*3 + 4*4;
+-------------------------------+
| 1 * 1 + 2 * 2 + 3 * 3 + 4 * 4 |
+-------------------------------+
| 30 |
+-------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select sum_of_squares(x) from sos;
+------------------------+
| udfs.sum_of_squares(x) |
+------------------------+
| 30 |
+------------------------+
Returned 1 row(s) in 0.35s
Until we create the overloaded version of the UDA, it can only handle a single data type. To allow it to handle DOUBLE
as well as BIGINT
, we issue another CREATE AGGREGATE FUNCTION
statement:
[localhost:21000] > select sum_of_squares(y) from sos;
ERROR: AnalysisException: No matching function with signature: udfs.sum_of_squares(DOUBLE).
[localhost:21000] > create aggregate function sum_of_squares(double) returns double
> location '/user/hive/udfs/libudasample.so'
> init_fn='SumOfSquaresInit'
> update_fn='SumOfSquaresUpdate'
> merge_fn='SumOfSquaresMerge'
> finalize_fn='SumOfSquaresFinalize';
[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1.1*1.1 + 2.2*2.2 + 3.3*3.3 + 4.4*4.4;
+-----------------------------------------------+
| 1.1 * 1.1 + 2.2 * 2.2 + 3.3 * 3.3 + 4.4 * 4.4 |
+-----------------------------------------------+
| 36.3 |
+-----------------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select sum_of_squares(y) from sos;
+------------------------+
| udfs.sum_of_squares(y) |
+------------------------+
| 36.3 |
+------------------------+
Returned 1 row(s) in 0.35s
Typically, you use a UDA in queries with GROUP BY
clauses, to produce a result set with a separate aggregate value for each combination of values from the GROUP BY
clause. Let’s change our sample table to use 0
to indicate rows containing even values, and 1
to flag rows containing odd values. Then the GROUP BY
query can return two values, the sum of the squares for the even values, and the sum of the squares for the odd values:
[localhost:21000] > insert overwrite sos values (1, 1), (2, 0), (3, 1), (4, 0);
Inserted 4 rows in 1.24s
[localhost:21000] > -- Compute 1 squared + 3 squared, and 2 squared + 4 squared;
[localhost:21000] > select y, sum_of_squares(x) from sos group by y;
+---+------------------------+
| y | udfs.sum_of_squares(x) |
+---+------------------------+
| 1 | 10 |
| 0 | 20 |
+---+------------------------+
Returned 2 row(s) in 0.43s
Security Considerations for User-Defined Functions
When the Impala authorization feature is enabled:
- To call a UDF in a query, you must have the required read privilege for any databases and tables used in the query.
- The
CREATE FUNCTION
statement requires:- The
CREATE
privilege on the database. - The
ALL
privilege on URI where URI is the value you specified for theLOCATION
in theCREATE FUNCTION
statement.
- The
See Impala Authorization for details about authorization in Impala.
Limitations and Restrictions for Impala UDFs
The following limitations and restrictions apply to Impala UDFs in the current release:
- Impala does not support Hive UDFs that accept or return composite or nested types, or other types not available in Impala tables.
The Hive
current_user()
function cannot be called from a Java UDF through Impala.All Impala UDFs must be deterministic, that is, produce the same output each time when passed the same argument values. For example, an Impala UDF must not call functions such as
rand()
to produce different values for each invocation. It must not retrieve data from external sources, such as from disk or over the network.- An Impala UDF must not spawn other threads or processes.
- Prior to Impala 2.5 when the catalogd process is restarted, all UDFs become undefined and must be reloaded. In Impala 2.5 and higher, this limitation only applies to older Java UDFs. Re-create those UDFs using the new
CREATE FUNCTION
syntax for Java UDFs, which excludes the function signature, to remove the limitation entirely. - Impala currently does not support user-defined table functions (UDTFs).
- The
CHAR
andVARCHAR
types cannot be used as input arguments or return values for UDFs.