Using the Avro File Format with Impala Tables
Impala supports tables whose data files use the Avro file format. Impala can query Avro tables, and in Impala 1.4.0 and higher it can create them, but it cannot insert data into them. For insert operations, use Hive, then switch back to Impala to run queries.
File Type | Format | Compression Codecs | Impala Can CREATE? | Impala Can INSERT? |
---|---|---|---|---|
Avro | Structured | Snappy, gzip, deflate | Yes, in Impala 1.4.0 and higher. In lower versions, create the table using Hive. | No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala. |
Parent topic: How Impala Works with Hadoop File Formats
Creating Avro Tables
To create a new table using the Avro file format, issue the CREATE TABLE statement through Impala with the STORED AS AVRO clause, or through Hive. If you create the table through Impala, you must include column definitions that match the fields specified in the Avro schema. With Hive, you can omit the columns and just specify the Avro schema.
In Impala 2.3 and higher, the CREATE TABLE statement for Avro tables can include SQL-style column definitions rather than specifying Avro notation through the TBLPROPERTIES clause. Impala issues warning messages if there are any mismatches between the types specified in the SQL column definitions and the underlying Avro types; for example, any TINYINT or SMALLINT columns are treated as INT in the underlying Avro files, and therefore are displayed as INT in any DESCRIBE or SHOW CREATE TABLE output.
Note: Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the EXTRACT() function.
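As an illustration of these workarounds, the following sketch uses a hypothetical table and column names; it stores the value as a STRING and derives numeric forms at query time:

```sql
-- Hypothetical Avro table that stores the event time as a STRING:
CREATE TABLE avro_events (event_time STRING, payload STRING) STORED AS AVRO;

-- Workaround 1: convert the STRING value to BIGINT with UNIX_TIMESTAMP().
-- (Assumes the string uses the default 'yyyy-MM-dd HH:mm:ss' format.)
SELECT UNIX_TIMESTAMP(event_time) AS event_epoch FROM avro_events;

-- Workaround 2: pull out individual date and time fields with EXTRACT():
SELECT EXTRACT(YEAR FROM CAST(event_time AS TIMESTAMP)) AS event_year
  FROM avro_events;
```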
The following examples demonstrate creating an Avro table in Impala, using either an inline column specification or one taken from a JSON file stored in HDFS:
[localhost:21000] > CREATE TABLE avro_only_sql_columns
> (
> id INT,
> bool_col BOOLEAN,
> tinyint_col TINYINT, /* Gets promoted to INT */
> smallint_col SMALLINT, /* Gets promoted to INT */
> int_col INT,
> bigint_col BIGINT,
> float_col FLOAT,
> double_col DOUBLE,
> date_string_col STRING,
> string_col STRING
> )
> STORED AS AVRO;
[localhost:21000] > CREATE TABLE impala_avro_table
> (bool_col BOOLEAN, int_col INT, long_col BIGINT, float_col FLOAT, double_col DOUBLE, string_col STRING, nullable_int INT)
> STORED AS AVRO
> TBLPROPERTIES ('avro.schema.literal'='{
> "name": "my_record",
> "type": "record",
> "fields": [
> {"name":"bool_col", "type":"boolean"},
> {"name":"int_col", "type":"int"},
> {"name":"long_col", "type":"long"},
> {"name":"float_col", "type":"float"},
> {"name":"double_col", "type":"double"},
> {"name":"string_col", "type":"string"},
> {"name": "nullable_int", "type": ["null", "int"]}]}');
[localhost:21000] > CREATE TABLE avro_examples_of_all_types (
> id INT,
> bool_col BOOLEAN,
> tinyint_col TINYINT,
> smallint_col SMALLINT,
> int_col INT,
> bigint_col BIGINT,
> float_col FLOAT,
> double_col DOUBLE,
> date_string_col STRING,
> string_col STRING
> )
> STORED AS AVRO
> TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/avro_schemas/alltypes.json');
The following example demonstrates creating an Avro table in Hive:
hive> CREATE TABLE hive_avro_table
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
> "name": "my_record",
> "type": "record",
> "fields": [
> {"name":"bool_col", "type":"boolean"},
> {"name":"int_col", "type":"int"},
> {"name":"long_col", "type":"long"},
> {"name":"float_col", "type":"float"},
> {"name":"double_col", "type":"double"},
> {"name":"string_col", "type":"string"},
> {"name": "nullable_int", "type": ["null", "int"]}]}');
Each field of the record becomes a column of the table. Note that any other information, such as the record name, is ignored.
Note: For nullable Avro columns, make sure to put the "null" entry before the actual type name. In Impala, all columns are nullable; Impala currently does not have a NOT NULL clause. Any non-nullable property is only enforced on the Avro side.
Most column types map directly from Avro to Impala under the same names. These are the exceptions and special cases to consider:
- The DECIMAL type is defined in Avro as a BYTES type with the logicalType property set to "decimal" and a specified precision and scale.
- The Avro long type maps to BIGINT in Impala.
If you create the table through Hive, switch back to impala-shell and issue an INVALIDATE METADATA table_name statement. Then you can run queries for that table through impala-shell.
In rare instances, a mismatch could occur between the Avro schema and the column definitions in the metastore database. In Impala 2.3 and higher, Impala checks for such inconsistencies during a CREATE TABLE statement and each time it loads the metadata for a table (for example, after INVALIDATE METADATA). Impala uses the following rules to determine how to treat mismatching columns, a process known as schema reconciliation:
- If there is a mismatch in the number of columns, Impala uses the column definitions from the Avro schema.
- If there is a mismatch in column name or type, Impala uses the column definition from the Avro schema. Because a CHAR or VARCHAR column in Impala maps to an Avro STRING, this case is not considered a mismatch and the column is preserved as CHAR or VARCHAR in the reconciled schema. Prior to Impala 2.7, the column name and comment for such CHAR and VARCHAR columns were also taken from the SQL column definition. In Impala 2.7 and higher, the column name and comment from the Avro schema file take precedence for such columns, and only the CHAR or VARCHAR type is preserved from the SQL column definition.
- An Impala TIMESTAMP column definition maps to an Avro STRING and is presented as a STRING in the reconciled schema, because Avro has no binary TIMESTAMP representation. As a result, no Avro table can have a TIMESTAMP column; this restriction is the same as in earlier Impala releases.
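The CHAR/VARCHAR case above can be sketched as follows (the table name is hypothetical); declaring a VARCHAR column against an Avro string field is not treated as a mismatch, so the reconciled schema keeps the SQL type:

```sql
-- Hypothetical example: VARCHAR in the SQL definition vs. "string" in the
-- Avro schema is not a mismatch; the reconciled schema keeps VARCHAR(64).
CREATE TABLE avro_char_demo (name VARCHAR(64))
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal'='{
  "name": "char_demo",
  "type": "record",
  "fields": [{"name": "name", "type": "string"}]}');
```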
Complex type considerations: Although you can create tables in this file format using the complex types (ARRAY, STRUCT, and MAP) available in Impala 2.3 and higher, currently, Impala can query these types only in Parquet tables. The one exception to the preceding rule is COUNT(*) queries on RCFile tables that include complex types. Such queries are allowed in Impala 2.6 and higher.
Using a Hive-Created Avro Table in Impala
If you have an Avro table created through Hive, you can use it in Impala as long as it contains only Impala-compatible data types. It cannot contain:
- Complex types: array, map, record, struct, and union other than [supported_type, null] or [null, supported_type]
- The Avro-specific types enum, bytes, and fixed
- Any scalar type other than those listed in Data Types
Because Impala and Hive share the same metastore database, Impala can directly access the table definitions and data for tables that were created in Hive.
If you create an Avro table in Hive, issue an INVALIDATE METADATA statement the next time you connect to Impala through impala-shell. This is a one-time operation to make Impala aware of the new table. You can issue the statement while connected to any Impala node, and the catalog service broadcasts the change to all other Impala nodes.
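For example, assuming the hive_avro_table shown earlier was just created in Hive, the one-time step in impala-shell might look like this:

```sql
-- One-time step after creating the table in Hive:
INVALIDATE METADATA hive_avro_table;
-- The table is now visible to Impala and can be queried:
SELECT COUNT(*) FROM hive_avro_table;
```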
If you load new data into an Avro table through Hive, either through a Hive LOAD DATA or INSERT statement, or by manually copying or moving files into the data directory for the table, issue a REFRESH table_name statement the next time you connect to Impala through impala-shell. You can issue the statement while connected to any Impala node, and the catalog service broadcasts the change to all other Impala nodes. If you issue the LOAD DATA statement through Impala, you do not need a REFRESH afterward.
Impala only supports fields of type boolean, int, long, float, double, and string, or unions of these types with null; for example, ["string", "null"]. Unions with null essentially create a nullable type.
Specifying the Avro Schema through JSON
While you can embed a schema directly in your CREATE TABLE statement, as shown above, column width restrictions in the Hive metastore limit the length of schema you can specify. If you encounter problems with long schema literals, try storing your schema as a JSON file in HDFS instead. Specify the schema in HDFS using table properties similar to the following:
tblproperties ('avro.schema.url'='hdfs://your-name-node:port/path/to/schema.json');
Loading Data into an Avro Table
Currently, Impala cannot write Avro data files. Therefore, an Avro table cannot be used as the destination of an Impala INSERT statement or CREATE TABLE AS SELECT.
To copy data from another table, issue any INSERT statements through Hive. For information about loading data into Avro tables through Hive, see the Avro page on the Hive wiki.
If you already have data files in Avro format, you can also issue LOAD DATA in either Impala or Hive. Impala can move existing Avro data files into an Avro table; it just cannot create new Avro data files.
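For example, a sketch that moves existing Avro data files into the table created earlier (the HDFS staging path is hypothetical):

```sql
-- Assumes the files under this path are already in Avro format and match
-- the table's schema; LOAD DATA moves them into the table's data directory.
LOAD DATA INPATH '/user/etl/avro_staging'
  INTO TABLE avro_examples_of_all_types;
```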
Enabling Compression for Avro Tables
To enable compression for Avro tables, specify settings in the Hive shell to enable compression and to specify a codec, then issue a CREATE TABLE statement as in the preceding examples. Impala supports the snappy and deflate codecs for Avro tables.
For example:
hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=snappy;
How Impala Handles Avro Schema Evolution
Starting in Impala 1.1, Impala can deal with Avro data files that employ schema evolution, where different data files within the same table use slightly different type definitions. (You would perform the schema evolution operation by issuing an ALTER TABLE statement in the Hive shell.) The old and new types for any changed columns must be compatible; for example, a column might start as an int and later change to a bigint or float.
As with any other tables where the definitions are changed or data is added outside of the current impalad node, ensure that Impala loads the latest metadata for the table if the Avro schema is modified through Hive. Issue a REFRESH table_name or INVALIDATE METADATA table_name statement. REFRESH reloads the metadata immediately; INVALIDATE METADATA reloads the metadata the next time the table is accessed.
When Avro data files or columns are not consulted during a query, Impala does not check for consistency. Thus, if you issue SELECT c1, c2 FROM t1, Impala does not return any error if the column c3 changed in an incompatible way. If a query retrieves data from some partitions but not others, Impala does not check the data files for the unused partitions.
In the Hive DDL statements, you can specify an avro.schema.literal table property (if the schema definition is short) or an avro.schema.url property (if the schema definition is long, or to allow convenient editing for the definition).
For example, running the following SQL code in the Hive shell creates a table using the Avro file format and puts some sample data into it:
CREATE TABLE avro_table (a string, b string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='{
"type": "record",
"name": "my_record",
"fields": [
{"name": "a", "type": "int"},
{"name": "b", "type": "string"}
]}');
INSERT OVERWRITE TABLE avro_table SELECT 1, "avro" FROM functional.alltypes LIMIT 1;
Once the Avro table is created and contains data, you can query it through the impala-shell command:
[localhost:21000] > select * from avro_table;
+---+------+
| a | b |
+---+------+
| 1 | avro |
+---+------+
Now in the Hive shell, you change the type of a column and add a new column with a default value:
-- Promote column "a" from INT to FLOAT (no need to update Avro schema)
ALTER TABLE avro_table CHANGE A A FLOAT;
-- Add column "c" with default
ALTER TABLE avro_table ADD COLUMNS (c int);
ALTER TABLE avro_table SET TBLPROPERTIES (
'avro.schema.literal'='{
"type": "record",
"name": "my_record",
"fields": [
{"name": "a", "type": "int"},
{"name": "b", "type": "string"},
{"name": "c", "type": "int", "default": 10}
]}');
Once again in impala-shell, you can query the Avro table based on its latest schema definition. Because the table metadata was changed outside of Impala, you issue a REFRESH statement first so that Impala has up-to-date metadata for the table.
[localhost:21000] > refresh avro_table;
[localhost:21000] > select * from avro_table;
+---+------+----+
| a | b | c |
+---+------+----+
| 1 | avro | 10 |
+---+------+----+
Data Type Considerations for Avro Tables
The Avro format defines a set of data types whose names differ from the names of the corresponding Impala data types. If you are preparing Avro files using other Hadoop components such as Pig or MapReduce, you might need to work with the type names defined by Avro. The following tables list the Avro-defined types and the equivalent types in Impala.
Primitive types:
Avro type | Impala type |
---|---|
STRING | STRING |
STRING | CHAR |
STRING | VARCHAR |
INT | INT |
BOOLEAN | BOOLEAN |
LONG | BIGINT |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
The Avro specification allows string values up to 2**64 bytes in length. Impala queries for Avro tables use 32-bit integers to hold string lengths.
In Impala 2.5 and higher, Impala truncates CHAR and VARCHAR values in Avro tables to (2**31)-1 bytes. If a query encounters a STRING value longer than (2**31)-1 bytes in an Avro table, the query fails. In earlier releases, encountering such long values in an Avro table could cause a crash.
Logical types:
Avro type | Impala type |
---|---|
BYTES annotated with the decimal logical type | DECIMAL |
INT annotated with the date logical type | DATE |
Impala does not support the following Avro data types: RECORD, MAP, ARRAY, UNION, ENUM, FIXED, NULL
Query Performance for Impala Avro Tables
In general, expect query performance with Avro tables to be faster than with tables using text data, but slower than with Parquet tables. See Using the Parquet File Format with Impala Tables for information about using the Parquet file format for high-performance analytic queries.
In Impala 2.6 and higher, Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the file formats Parquet, ORC, RCFile, SequenceFile, Avro, and uncompressed text, the setting fs.s3a.block.size in the core-site.xml configuration file determines how Impala divides the I/O work of reading the data files. This configuration setting is specified in bytes. By default, this value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access Parquet files written by MapReduce or Hive, increase fs.s3a.block.size to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve Parquet files written by Impala, increase fs.s3a.block.size to 268435456 (256 MB) to match the row group size produced by Impala.
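For example, the 128 MB setting described above would look like this in core-site.xml (the surrounding configuration and host names are site-specific and omitted here):

```xml
<!-- core-site.xml fragment: have Impala parallelize S3 reads as if the
     files were made of 128 MB blocks, matching Parquet row groups
     written by MapReduce or Hive -->
<property>
  <name>fs.s3a.block.size</name>
  <value>134217728</value>
</property>
```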