- New Features in Apache Impala
- New Features in Impala 4.0
- New Features in Impala 3.4
- Support for Hive Insert-Only Transactional Tables
- Server-side Spooling of Query Results
- Cookie-based Authentication
- Object Ownership Support
- New Built-in Functions for Fuzzy Matching of Strings
- Capacity Quota for Scratch Disks
- Query Option for Disabling HBase Row Estimation
- Query Option for Controlling Size of Parquet Splits on Non-block Stores
- Query Profile Exported to JSON
- DATE Data Type Supported in Avro Tables
- Primary Key and Foreign Key Constraints
- Enhanced External Kudu Table
- Ranger Column Masking
- BROADCAST_BYTES_LIMIT query option
- Experimental Support for Apache Hudi
- ORC Reads Enabled by Default
- Support for ZSTD and DEFLATE
- New Features in Impala 3.3
- Increased Compatibility with Apache Projects
- Parquet Page Index
- The Remote File Handle Cache Supports S3
- Support for Kudu Integrated with Hive Metastore
- Zstd Compression for Parquet files
- Lz4 Compression for Parquet files
- Data Cache for Remote Reads
- Metadata Performance Improvements
- Scalable Pool Configuration in Admission Controller
- Query Profile
- DATE Data Type and Functions
- Support Hive Insert-Only Transactional Tables
- HiveServer2 HTTP Connection for Clients
- Default File Format Changed to Parquet
- Built-in Function to Process JSON Objects
- Ubuntu 18.04
- New Features in Impala 3.2
- New Features in Impala 3.1
- New Features in Impala 3.0
- New Features in Impala 2.12
- New Features in Impala 2.11
- New Features in Impala 2.10
- New Features in Impala 2.9
- New Features in Impala 2.8
- New Features in Impala 2.7
- New Features in Impala 2.6
- New Features in Impala 2.5
- New Features in Impala 2.4
- New Features in Impala 2.3
- New Features in Impala 2.2
- New Features in Impala 2.1
- New Features in Impala 2.0
- New Features in Impala 1.4
- New Features in Impala 1.3.2
- New Features in Impala 1.3.1
- New Features in Impala 1.3
- New Features in Impala 1.2.4
- New Features in Impala 1.2.3
- New Features in Impala 1.2.2
- New Features in Impala 1.2.1
- New Features in Impala 1.2.0 (Beta)
- New Features in Impala 1.1.1
- New Features in Impala 1.1
- New Features in Impala 1.0.1
- New Features in Impala 1.0
- New Features in Version 0.7 of the Impala Beta Release
- New Features in Version 0.6 of the Impala Beta Release
- New Features in Version 0.5 of the Impala Beta Release
- New Features in Version 0.4 of the Impala Beta Release
- New Features in Version 0.3 of the Impala Beta Release
- New Features in Version 0.2 of the Impala Beta Release
New Features in Apache Impala
This release of Impala contains the following changes and enhancements from previous releases.
New Features in Impala 4.0
For the full list of issues closed in this release, including the issues marked as “new features” or “improvements”, see the release notes or changelog for Impala 4.0.
New Features in Impala 3.4
The following sections describe the noteworthy improvements made in Impala 3.4.
For the full list of issues closed in this release, see the changelog for Impala 3.4.
Support for Hive Insert-Only Transactional Tables
Impala now supports truncating insert-only transactional tables.
By default, Impala creates an insert-only transactional table when you issue the CREATE TABLE statement.
Use Hive compaction to compact small files, which improves the performance and scalability of metadata in transactional tables.
See Impala Transactions for more information.
Server-side Spooling of Query Results
You can use the SPOOL_QUERY_RESULTS query option to control how query results are returned to the client.
By default, Impala produces query results in batches as the client fetches them, until all the result rows are produced. If a client issues a query without fetching all the results, the query fragments continue to hold on to their resources until the query is canceled and unregistered, potentially tying up resources and causing other queries to wait in admission control.
When query result spooling is enabled, the result sets of queries are eagerly fetched and buffered until they are read by the client, and resources are freed up for other queries.
See Spooling Impala Query Results for details about the feature and its query options.
Cookie-based Authentication
Starting in this version, Impala supports cookies for authentication when clients connect via HiveServer2 over HTTP.
You can use the --max_cookie_lifetime_s startup flag to:
- Disable the use of cookies
- Control how long generated cookies are valid for
See Impala Client Access for more information.
Object Ownership Support
Object ownership for tables, views, and databases is enabled by default in Impala. When you create a database, a table, or a view, you become the owner of that object and implicitly have privileges on it. The privileges that owners have are specified in Ranger on the special user, {OWNER}.
The {OWNER} user must be defined in Ranger for the object ownership privileges to work in Impala.
See Impala Authorization for details.
New Built-in Functions for Fuzzy Matching of Strings
Use the new Jaro or Jaro-Winkler functions to perform fuzzy matches on relatively short strings, e.g. to scrub user inputs of names against the records in the database.
- JARO_DISTANCE, JARO_DST
- JARO_SIMILARITY, JARO_SIM
- JARO_WINKLER_DISTANCE, JW_DST
- JARO_WINKLER_SIMILARITY, JW_SIM
See Impala String Functions for details.
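To make the similarity scores concrete, here is a minimal Python sketch of the textbook Jaro similarity. It illustrates the measure only and is not Impala's implementation: the function name `jaro_similarity` is hypothetical, and the built-in functions may handle edge cases differently. The Jaro distance is conventionally 1 minus the similarity.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0.0, 1.0]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and no further
    # apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = True
                matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


if __name__ == "__main__":
    print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
```

A score close to 1.0 suggests that two short strings, such as a typed name and a stored name, are likely variants of each other.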
Capacity Quota for Scratch Disks
When configuring scratch space for intermediate files used in large sorts, joins, aggregations, or analytic function operations, use the --scratch_dirs startup flag to optionally specify a capacity quota per scratch directory, e.g., --scratch_dirs=/dir1:5MB,/dir2.
See How Impala Works with Hadoop File Formats for details.
Query Option for Disabling HBase Row Estimation
During query plan generation, Impala samples underlying HBase tables to estimate row count and row size, but the sampling process can negatively impact the planning time. To alleviate the issue, when the HBase table stats do not change much in a short time, disable the sampling with the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option so that the Impala planner falls back to using Hive Metastore (HMS) table stats instead.
See DISABLE_HBASE_NUM_ROWS_ESTIMATE Query Option.
Query Option for Controlling Size of Parquet Splits on Non-block Stores
To optimize query performance, the Impala planner uses the value of the fs.s3a.block.size startup flag when calculating the split size on non-block-based stores, e.g. S3 and ADLS. Starting in this release, the Impala planner uses the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to get the split size specific to the Parquet file format.
For Parquet files, the fs.s3a.block.size startup flag is no longer used.
The default value of the PARQUET_OBJECT_STORE_SPLIT_SIZE query option is 256 MB.
See Using Impala with Amazon S3 Object Store for tuning Impala query performance for S3.
Query Profile Exported to JSON
On the Query Details page of Impala Daemon Web UI, you have a new option, in addition to the existing Thrift and Text formats, to export the query profile output in the JSON format.
See Impala Web User Interface for Debugging for generating JSON query profile outputs in Web UI.
DATE Data Type Supported in Avro Tables
You can now use the DATE data type to query date values from Avro tables.
See Using the Avro File Format with Impala Tables for details.
Primary Key and Foreign Key Constraints
This release adds support for primary and foreign key constraints, but in this release the constraints are advisory and intended for estimating cardinality during query planning in a future release. There is no attempt to enforce constraints. See CREATE TABLE Statement for details.
Enhanced External Kudu Table
By default, HMS implicitly translates internal Kudu tables to external Kudu tables with the ‘external.table.purge’ property set to true. These tables behave similarly to internal tables. You can also create such external Kudu tables explicitly. See CREATE TABLE Statement for details.
Ranger Column Masking
This release supports Ranger column masking, which hides sensitive columnar data in Impala query output. For example, you can define a policy that reveals only the first or last four characters of column data. Column masking is enabled by default. See Ranger Column Masking for details.
BROADCAST_BYTES_LIMIT query option
You can set the default limit for the size of the broadcast input. Such a limit can prevent possible performance problems.
Experimental Support for Apache Hudi
In this release, you can use Read Optimized Queries on Hudi tables. See Using the Hudi File Format for details.
ORC Reads Enabled by Default
Impala stability and performance have been improved. Consequently, ORC reads are now enabled in Impala by default. To disable, set --enable_orc_scanner to false when starting the cluster. See Using the ORC File Format with Impala Tables for details.
Support for ZSTD and DEFLATE
This release supports ZSTD and DEFLATE compression codecs for text files. See Using bzip2, deflate, gzip, Snappy, or zstd Text Files for details.
New Features in Impala 3.3
The following sections describe the noteworthy improvements made in Impala 3.3.
For the full list of issues closed in this release, see the changelog for Impala 3.3.
Increased Compatibility with Apache Projects
Impala is integrated with the following components:
- Apache Ranger: Use Apache Ranger to manage authorization in Impala. See Impala Authorization for details.
- Apache Atlas: Use Apache Atlas to manage data governance in Impala.
- Hive 3
Parquet Page Index
To improve performance when using Parquet files, Impala can now write page indexes in Parquet files and use those indexes to skip pages for faster scans.
See Query Performance for Impala Parquet Tables for details.
The Remote File Handle Cache Supports S3
Impala can now cache remote HDFS file handles for tables that store their data in Amazon S3 cloud storage.
See Scalability Considerations for File Handle Caching for information about the remote file handle cache.
Support for Kudu Integrated with Hive Metastore
In Impala 3.3 and Kudu 1.10, Kudu is integrated with Hive Metastore (HMS), and from Impala, you can create, update, delete, and query the tables in the Kudu services integrated with HMS.
See Using Kudu with Impala for information on using Kudu tables in Impala.
Zstd Compression for Parquet files
Zstandard (Zstd) is a real-time compression algorithm offering a tradeoff between speed and ratio of compression. Compression levels from 1 up to 22 are supported. The lower the level, the faster the speed at the cost of compression ratio.
Lz4 Compression for Parquet files
Lz4 is a lossless compression algorithm providing extremely fast and scalable compression and decompression.
Data Cache for Remote Reads
To improve performance on multi-cluster HDFS environments as well as on object store environments, Impala now caches data for non-local reads (e.g. S3, ABFS, ADLS) on local storage.
The data cache is enabled with the --data_cache startup flag.
See Impala Remote Data Cache for information and the steps to enable the remote data cache.
Metadata Performance Improvements
The following features to improve metadata performance are enabled by default in this release:
- Incremental stats are now compressed in memory in catalogd, reducing the memory footprint in catalogd.
- impalad coordinators fetch incremental stats from catalogd on demand, reducing the memory footprint and the network requirements for broadcasting metadata.
- Time-based and memory-based automatic invalidation of metadata keeps the size of metadata bounded and reduces the chances of the catalogd cache running out of memory.
- Automatic invalidation of metadata
  With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.
  In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of metadata:
  - INSERT into tables and partitions from Impala or from Spark, on the same cluster or in a multi-cluster configuration
See Metadata Management for information about the above features.
Scalable Pool Configuration in Admission Controller
To offer more dynamic and flexible resource management, Impala supports new configuration parameters that scale with the number of hosts in the resource pool. You can use the parameters to control the number of running queries, the number of queued queries, and the maximum amount of memory allocated for Impala resource pools. See Admission Control and Query Queuing for information about the new parameters and how to use them for admission control.
Query Profile
The following information was added to the Query Profile output for better monitoring and troubleshooting of query performance:
- Network I/O throughput
- System disk I/O throughput
See Impala Query Profile for generating and reading query profile.
DATE Data Type and Functions
You can use the new DATE type to describe a particular year/month/day, in the form YYYY-MM-DD.
This initial DATE type supports the TEXT, Parquet, and HBase file formats.
The support of the DATE data type includes the following features:
- DATE type column as a partitioning key column
- DATE literal
- Implicit casting between DATE and other types: STRING and TIMESTAMP
- Most of the built-in functions for TIMESTAMP now allow DATE type arguments as well.
See DATE Data Type and Impala Date and Time Functions for using the DATE type.
Support Hive Insert-Only Transactional Tables
Impala now supports creating, dropping, querying, and inserting into insert-only transactional tables.
See Impala Transactions for details.
HiveServer2 HTTP Connection for Clients
Client applications can now connect to Impala over HTTP via HiveServer2, with the option to use Kerberos SPNEGO and LDAP for authentication. See Impala Clients for details.
Default File Format Changed to Parquet
When you create a table, the default format for that table data is now Parquet.
For backward compatibility, you can use the DEFAULT_FILE_FORMAT query option to set the default file format to the previous default, text, or other formats.
Built-in Function to Process JSON Objects
The GET_JSON_OBJECT() function extracts a JSON object from a string based on the path specified and returns the extracted JSON object.
See Impala Miscellaneous Functions for details.
Ubuntu 18.04
This version of Impala is certified to run on Ubuntu 18.04.
New Features in Impala 3.2
The following sections describe the noteworthy improvements made in Impala 3.2.
For the full list of issues closed in this release, see the changelog for Impala 3.2.
Multi-cluster Support
Remote File Handle Cache
Impala can now cache remote HDFS file handles when the cache_remote_file_handles impalad flag is set to true. This feature does not apply to non-HDFS tables, such as Kudu or HBase tables, and does not apply to the tables that store their data on cloud services, such as S3 or ADLS. See Scalability Considerations for file handle caching in Impala.
Enhancements in Resource Management and Admission Control
The Admission Debug page is available in the Impala Daemon (impalad) web UI at /admission and provides the following information about Impala resource pools:
- Pool configuration
- Relevant pool stats
- Queued queries in order of being queued (local to the coordinator)
- Running queries (local to this coordinator)
- Histogram of the distribution of peak memory usage by admitted queries
A new query option, NUM_ROWS_PRODUCED_LIMIT, was added to limit the number of rows returned from queries.
Impala will cancel a query if the query produces more rows than the limit specified by this query option. The limit applies only when the results are returned to a client, e.g. for a SELECT query, but not an INSERT query. This query option is a guardrail against users accidentally submitting queries that return a large number of rows.
Metadata Performance Improvements
Automatic Metadata Sync using Hive Metastore Notification Events
When enabled, catalogd polls Hive Metastore (HMS) notification events at a configurable interval and syncs with HMS. You can use the new web UI pages of catalogd to check the state of the automatic invalidate event processor.
Note: This is a preview feature in Impala 3.2.
Compatibility and Usability Enhancements
- Impala can now read the TIMESTAMP_MILLIS and TIMESTAMP_MICROS Parquet types. See Using Parquet File Format for Impala Tables for the Parquet support in Impala.
- Impala can now read the complex types in ORC such as ARRAY, STRUCT, and MAP. See Using ORC File Format for Impala Tables for the ORC support in Impala.
- The LEVENSHTEIN string function is supported.
  The function returns the Levenshtein distance between two input strings, that is, the minimum number of single-character edits required to transform one string into the other.
- The IF NOT EXISTS clause is supported in the ALTER TABLE statement.
- The new DEFAULT_FILE_FORMAT query option allows you to set the default table file format. This removes the need for the STORED AS <format> clause. Set this option if you prefer a value that is not TEXT. The supported formats are:
  - TEXT
  - RC_FILE
  - SEQUENCE_FILE
  - AVRO
  - PARQUET
  - KUDU
  - ORC
- The extended or verbose EXPLAIN output includes the following new information for queries:
  - The text of the analyzed query that may have been rewritten to include various optimizations and implicit casts.
  - The implicit casts and literals shown with the actual types.
- CPU resource utilization (user, system, iowait) metrics were added to the Impala profile output.
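As an illustration of what the Levenshtein distance measures, here is a minimal Python sketch of the standard dynamic-programming computation. It is not Impala's implementation; the function name `levenshtein` is hypothetical and the sketch exists only to show how the edit count is derived.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,  # delete ca
                    current[j - 1] + 1,  # insert cb
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

For example, turning "kitten" into "sitting" takes two substitutions and one insertion, so the distance is 3.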
Security Enhancement
- The REFRESH AUTHORIZATION statement was implemented for refreshing authorization data.
New Features in Impala 3.1
For the full list of issues closed in this release, including the issues marked as “new features” or “improvements”, see the changelog for Impala 3.1.
New Features in Impala 3.0
For the full list of issues closed in this release, including the issues marked as “new features” or “improvements”, see the changelog for Impala 3.0.
New Features in Impala 2.12
For the full list of issues closed in this release, including the issues marked as “new features” or “improvements”, see the changelog for Impala 2.12.
New Features in Impala 2.11
For the full list of issues closed in this release, including the issues marked as “new features” or “improvements”, see the changelog for Impala 2.11.
New Features in Impala 2.10
For the full list of issues closed in this release, including the issues marked as “new features” or “improvements”, see the changelog for Impala 2.10.
New Features in Impala 2.9
For the full list of issues closed in this release, including the issues marked as “new features” or “improvements”, see the changelog for Impala 2.9.
The following are some of the most significant new features in this release:
- A new function, replace(), which is faster than regexp_replace() for simple string substitutions. See Impala String Functions for details.
- Startup flags for the impalad daemon, is_executor and is_coordinator, let you divide the work on a large, busy cluster between a small number of hosts acting as query coordinators, and a larger number of hosts acting as query executors. By default, each host can act in both roles, potentially introducing bottlenecks during heavily concurrent workloads. See How to Configure Impala with Dedicated Coordinators for details.
New Features in Impala 2.8
- Performance and scalability improvements:
  - The COMPUTE STATS statement can take advantage of multithreading.
  - Improved scalability for highly concurrent loads by reducing the possibility of TCP/IP timeouts. A configuration setting, accepted_cnxn_queue_depth, can be adjusted upwards to avoid this type of timeout on large clusters.
  - Several performance improvements were made to the mechanism for generating native code:
    - Some queries involving analytic functions can take better advantage of native code generation.
    - Modules produced during intermediate code generation are organized to be easier to cache and reuse during the lifetime of a long-running or complicated query.
    - The COMPUTE STATS statement is more efficient (less time for the codegen phase) for tables with a large number of columns, especially for tables containing TIMESTAMP columns.
    - The logic for determining whether or not to use a runtime filter is more reliable, and the evaluation process itself is faster because of native code generation.
  - The MT_DOP query option enables multithreading for a number of Impala operations. COMPUTE STATS statements for Parquet tables use a default of MT_DOP=4 to improve the intra-node parallelism and CPU efficiency of this data-intensive operation.
  - The COMPUTE STATS statement is more efficient (less time for the codegen phase) for tables with a large number of columns.
  - A new hint, CLUSTERED, allows Impala INSERT operations on a Parquet table that use dynamic partitioning to process a high number of partitions in a single statement. The data is ordered based on the partition key columns, and each partition is only written by a single host, reducing the amount of memory needed to buffer Parquet data while the data blocks are being constructed.
  - The new configuration setting inc_stats_size_limit_bytes lets you reduce the load on the catalog server when running the COMPUTE INCREMENTAL STATS statement for very large tables.
  - Impala folds many constant expressions within query statements, rather than evaluating them for each row. This optimization is especially useful when using functions to manipulate and format TIMESTAMP values, for example in an expression such as to_date(now() - interval 1 day).
  - Parsing of complicated expressions is faster. This speedup is especially useful for queries containing large CASE expressions.
  - Evaluation is faster for IN operators with many constant arguments. The same performance improvement applies to other functions with many constant arguments.
  - Impala optimizes identical comparison operators within multiple OR blocks.
  - The reporting for wall-clock times and total CPU time in profile output is more accurate.
  - A new query option, SCRATCH_LIMIT, lets you restrict the amount of space used when a query exceeds the memory limit and activates the “spill to disk” mechanism. This option helps to avoid runaway queries or make queries “fail fast” if they require more memory than anticipated. You can prevent runaway queries from using excessive amounts of spill space, without restarting the cluster to turn the spilling feature off entirely. See SCRATCH_LIMIT Query Option for details.
- Integration with Apache Kudu:
  - The experimental Impala support for the Kudu storage layer has been folded into the main Impala development branch. Impala can now directly access Kudu tables, opening up new capabilities such as enhanced DML operations and continuous ingestion.
  - The DELETE statement is a flexible way to remove data from a Kudu table. Previously, removing data from an Impala table involved removing or rewriting the underlying data files, dropping entire partitions, or rewriting the entire table. This Impala statement only works for Kudu tables.
  - The UPDATE statement is a flexible way to modify data within a Kudu table. Previously, updating data in an Impala table involved replacing the underlying data files, dropping entire partitions, or rewriting the entire table. This Impala statement only works for Kudu tables.
  - The UPSERT statement is a flexible way to ingest data, modify data, or both within a Kudu table. Previously, ingesting data that might contain duplicates involved an inefficient multi-stage operation, and there was no built-in protection against duplicate data. The UPSERT statement, in combination with the primary key designation for Kudu tables, lets you add or replace rows in a single operation, and automatically avoids creating any duplicate data.
  - The CREATE TABLE statement gains some new clauses that are specific to Kudu tables: PARTITION BY, PARTITIONS, STORED AS KUDU, and column attributes PRIMARY KEY, NULL and NOT NULL, ENCODING, COMPRESSION, DEFAULT, and BLOCK_SIZE. These clauses replace the explicit TBLPROPERTIES settings that were required in the early experimental phases of integration between Impala and Kudu.
  - The ALTER TABLE statement can change certain attributes of Kudu tables. You can add, drop, or rename columns. You can add or drop range partitions. You can change the TBLPROPERTIES value to rename or point to a different underlying Kudu table, independently from the Impala table name in the metastore database. You cannot change the data type of an existing column in a Kudu table.
  - The SHOW PARTITIONS statement displays information about the distribution of data between partitions in Kudu tables. A new variation, SHOW RANGE PARTITIONS, displays information about the Kudu-specific partitions that apply across ranges of key values.
  - Not all Impala data types are supported in Kudu tables. In particular, currently the Impala TIMESTAMP type is not allowed in a Kudu table. Impala does not recognize the UNIXTIME_MICROS Kudu type when it is present in a Kudu table. (These two representations of date/time data use different units and are not directly compatible.) You cannot create columns of type TIMESTAMP, DECIMAL, VARCHAR, or CHAR within a Kudu table. Within a query, you can cast values in a result set to these types. Certain types, such as BOOLEAN, cannot be used as primary key columns.
  - Currently, Kudu tables are not interchangeable between Impala and Hive the way other kinds of Impala tables are. Although the metadata for Kudu tables is stored in the metastore database, currently Hive cannot access Kudu tables.
  - The INSERT statement works for Kudu tables. The organization of the Kudu data makes it more efficient than with HDFS-backed tables to insert data in small batches, such as with the INSERT ... VALUES syntax.
  - Some audit data is recorded for data governance purposes. All UPDATE, DELETE, and UPSERT statements are characterized as INSERT operations in the audit log. Currently, lineage metadata is not generated for UPDATE and DELETE operations on Kudu tables.
  - Currently, Kudu tables have limited support for Sentry:
    - Access to Kudu tables must be granted to roles as usual.
    - Currently, access to a Kudu table through Sentry is “all or nothing”. You cannot enforce finer-grained permissions such as at the column level, or permissions on certain operations such as INSERT.
    - Only users with ALL privileges on SERVER can create external Kudu tables.
    Because non-SQL APIs can access Kudu data without going through Sentry authorization, currently the Sentry support is considered preliminary.
  - Equality and IN predicates in Impala queries are pushed to Kudu and evaluated efficiently by the Kudu storage layer.
- Security:
  - Impala can take advantage of the S3 encrypted credential store, to avoid exposing the secret key when accessing data stored on S3.
- [IMPALA-1654] Several kinds of DDL operations can now work on a range of partitions. The partitions can be specified using operators such as <, >=, and != rather than just an equality predicate applying to a single partition. This new feature extends the syntax of several clauses of the ALTER TABLE statement (DROP PARTITION, SET [UN]CACHED, SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES), the SHOW FILES statement, and the COMPUTE INCREMENTAL STATS statement. It does not apply to statements that are defined to only apply to a single partition, such as LOAD DATA, ALTER TABLE ... ADD PARTITION, SET LOCATION, and INSERT with a static partitioning clause.
- The instr() function has two additional optional arguments, representing the character position at which to begin searching for the substring, and the Nth occurrence of the substring to find.
- Improved error handling for malformed Avro data. In particular, incorrect precision or scale for DECIMAL types is now handled.
- Impala debug web UI:
  - In addition to “inflight” and “finished” queries, the web UI now also includes a section for “queued” queries.
  - The /sessions tab now clarifies how many of the displayed sessions are active, and lets you sort by Expired status to distinguish active sessions from expired ones.
- Improved stability when DDL operations such as CREATE DATABASE or DROP DATABASE are run in Hive at the same time as an Impala INVALIDATE METADATA statement.
- The “out of memory” error report was made more user-friendly, with additional diagnostic information to help identify the spot where the memory limit was exceeded.
- Improved disk space usage for Java-based UDFs. Temporary copies of the associated JAR files are removed when no longer needed, so that they do not accumulate across restarts of the catalogd daemon and potentially cause an out-of-space condition. These temporary files are also created in the directory specified by the local_library_dir configuration setting, so that the storage for these temporary files can be independent from any capacity limits on the /tmp filesystem.
New Features in Impala 2.7
- Performance improvements:
  - [IMPALA-3206] Speedup for queries against DECIMAL columns in Avro tables. The code that parses DECIMAL values from Avro now uses native code generation.
  - [IMPALA-3674] Improved efficiency in LLVM code generation can reduce codegen time, especially for short queries.
  - [IMPALA-2979] Improvements to scheduling on worker nodes, enabled by the REPLICA_PREFERENCE query option. See REPLICA_PREFERENCE Query Option (Impala 2.7 or higher only) for details.
- [IMPALA-1683] The REFRESH statement can be applied to a single partition, rather than the entire table. See REFRESH Statement and Refreshing a Single Partition for details.
- Improvements to the Impala web user interface:
  - [IMPALA-2767] You can now force a session to expire by clicking a link in the web UI, on the /sessions tab.
  - [IMPALA-3715] The /memz tab includes more information about Impala memory usage.
  - [IMPALA-3716] The Details page for a query now includes a Memory tab.
- [IMPALA-3499] Scalability improvements to the catalog server. Impala handles internal communication more efficiently for tables with large numbers of columns and partitions, where the size of the metadata exceeds 2 GiB.
- [IMPALA-3677] You can send a SIGUSR1 signal to any Impala-related daemon to write a Breakpad minidump. For advanced troubleshooting, you can now produce a minidump without triggering a crash. See Breakpad Minidumps for Impala (Impala 2.6 or higher only) for details about the Breakpad minidump feature.
- [IMPALA-3687] The schema reconciliation rules for Avro tables have changed slightly for CHAR and VARCHAR columns. Now, if the definition of such a column is changed in the Avro schema file, the column retains its CHAR or VARCHAR type as specified in the SQL definition, but the column name and comment from the Avro schema file take precedence. See Creating Avro Tables for details about column definitions in Avro tables.
- [IMPALA-3575] Some network operations now have additional timeout and retry settings. The extra configuration helps to avoid failed queries for transient network problems, to avoid hangs when a sender or receiver fails in the middle of a network transmission, and to make cancellation requests more reliable despite network issues.
New Features in Impala 2.6
- Improvements to Impala support for the Amazon S3 filesystem:
  - Impala can now write to S3 tables through the INSERT or LOAD DATA statements. See Using Impala with Amazon S3 Object Store for general information about using Impala with S3.
  - A new query option, S3_SKIP_INSERT_STAGING, lets you trade off between fast INSERT performance and slower INSERTs that are more consistent if a problem occurs during the statement. The new behavior is enabled by default. See S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only) for details about this option.
Performance improvements for the runtime filtering feature:
The default for the RUNTIME_FILTER_MODE query option is changed to GLOBAL (the highest setting). See RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only) for details about this option.
The RUNTIME_BLOOM_FILTER_SIZE setting is now only used as a fallback if statistics are not available; otherwise, Impala uses the statistics to estimate the appropriate size to use for each filter. See RUNTIME_BLOOM_FILTER_SIZE Query Option (Impala 2.5 or higher only) for details about this option.
New query options RUNTIME_FILTER_MIN_SIZE and RUNTIME_FILTER_MAX_SIZE let you fine-tune the sizes of the Bloom filter structures used for runtime filtering. If the filter size derived from Impala internal estimates or from the RUNTIME_BLOOM_FILTER_SIZE setting falls outside the size range specified by these options, any too-small filter size is adjusted to the minimum, and any too-large filter size is adjusted to the maximum. See RUNTIME_FILTER_MIN_SIZE Query Option (Impala 2.6 or higher only) and RUNTIME_FILTER_MAX_SIZE Query Option (Impala 2.6 or higher only) for details about these options.
Runtime filter propagation now applies to all the operands of UNION and UNION ALL operators.
Runtime filters can now be produced during join queries even when the join processing activates the spill-to-disk mechanism.
See Runtime Filtering for Impala Queries (Impala 2.5 or higher only) for general information about the runtime filtering feature.
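The size adjustment performed by RUNTIME_FILTER_MIN_SIZE and RUNTIME_FILTER_MAX_SIZE amounts to clamping a value into a range. The following Python sketch illustrates only that rule. The function name and the byte values are hypothetical, chosen for illustration, and are not Impala defaults or Impala code.

```python
def clamp_filter_size(estimated_bytes: int, min_bytes: int, max_bytes: int) -> int:
    """Clamp an estimated Bloom filter size into the range [min_bytes, max_bytes]."""
    if min_bytes > max_bytes:
        raise ValueError("min_bytes must not exceed max_bytes")
    # A too-small estimate is raised to the minimum,
    # a too-large estimate is lowered to the maximum.
    return max(min_bytes, min(estimated_bytes, max_bytes))


# Hypothetical bounds, standing in for the two query options.
MIN_SIZE = 1 * 1024 * 1024    # 1 MiB
MAX_SIZE = 16 * 1024 * 1024   # 16 MiB

too_small = clamp_filter_size(64 * 1024, MIN_SIZE, MAX_SIZE)
in_range = clamp_filter_size(4 * 1024 * 1024, MIN_SIZE, MAX_SIZE)
too_large = clamp_filter_size(64 * 1024 * 1024, MIN_SIZE, MAX_SIZE)
print(too_small, in_range, too_large)
```

An estimate already inside the range passes through unchanged, so the two options only affect filters whose estimated size is at either extreme.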
Admission control and dynamic resource pools are enabled by default. See Admission Control and Query Queuing for details about admission control.
You can now manually set column statistics, using the ALTER TABLE statement with a SET COLUMN STATS clause. See impala_perf_stats.html#perf_column_stats_manual for details.
Impala can now write lightweight “minidump” files, rather than large core files, to save diagnostic information when any of the Impala-related daemons crash. This feature uses the open source breakpad framework. See Breakpad Minidumps for Impala (Impala 2.6 or higher only) for details.
New query options improve interoperability with Parquet files:
The PARQUET_FALLBACK_SCHEMA_RESOLUTION query option lets Impala locate columns within Parquet files based on column name rather than ordinal position. This enhancement improves interoperability with applications that write Parquet files with a different order or subset of columns than are used in the Impala table. See PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only) for details.
The PARQUET_ANNOTATE_STRINGS_UTF8 query option makes Impala include the UTF-8 annotation metadata for STRING, CHAR, and VARCHAR columns in Parquet files created by INSERT or CREATE TABLE AS SELECT statements. See PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only) for details.
See Using the Parquet File Format with Impala Tables for general information about working with Parquet files.
Improvements to security and reduction in overhead for secure clusters:
Overall performance improvements for secure clusters. (TPC-H queries on a secure cluster were benchmarked at roughly 3x as fast as the previous release.)
Impala now recognizes the auth_to_local setting, specified through the HDFS configuration setting hadoop.security.auth_to_local. This feature is disabled by default; to enable it, specify --load_auth_to_local_rules=true in the impalad configuration settings. See Mapping Kerberos Principals to Short Names for Impala for details.
Timing improvements in the mechanism for the impalad daemon to acquire Kerberos tickets. This feature spreads out the overhead on the KDC during Impala startup, especially for large clusters.
For Kerberized clusters, the Catalog service now uses the Kerberos principal instead of the operating system user that runs the catalogd daemon. This eliminates the requirement to configure a hadoop.user.group.static.mapping.overrides setting to put the OS user into the Sentry administrative group, on clusters where the principal and the OS user name for this user are different.
Overall performance improvements for join queries, by using a prefetching mechanism while building the in-memory hash table to evaluate join predicates. See PREFETCH_MODE Query Option (Impala 2.6 or higher only) for the query option to control this optimization.
The impala-shell interpreter has a new command, SOURCE, that lets you run a set of SQL statements or other impala-shell commands stored in a file. You can run additional SOURCE commands from inside a file, to set up flexible sequences of statements for use cases such as schema setup, ETL, or reporting. See impala-shell Command Reference for details and Running Commands and SQL Statements in impala-shell for examples.
The millisecond() built-in function lets you extract the fractional seconds part of a TIMESTAMP value. See Impala Date and Time Functions for details.
If an Avro table is created without column definitions in the CREATE TABLE statement, and columns are later added through ALTER TABLE, the resulting table is now queryable. Missing values from the newly added columns now default to NULL. See Using the Avro File Format with Impala Tables for general details about working with Avro files.
The mechanism for interpreting DECIMAL literals is improved, no longer going through an intermediate conversion step to DOUBLE:
Casting a DECIMAL value to TIMESTAMP produces a more precise value for the TIMESTAMP than formerly.
Certain function calls involving DECIMAL literals now succeed, when formerly they failed due to lack of a function signature with a DOUBLE argument.
Faster runtime performance for DECIMAL constant values, through improved native code generation for all combinations of precision and scale.
See DECIMAL Data Type (Impala 3.0 or higher only) for details about the DECIMAL type.
Improved type accuracy for CASE return values. If all WHEN clauses of the CASE expression are of CHAR type, the final result is also CHAR instead of being converted to STRING. See Impala Conditional Functions for details about the CASE function.
Uncorrelated queries using the NOT EXISTS operator are now supported. Formerly, the NOT EXISTS operator was only available for correlated subqueries.
Improved performance for reading Parquet files.
Improved performance for top-N queries, that is, those including both ORDER BY and LIMIT clauses.
Impala optionally skips an arbitrary number of header lines from text input files on HDFS based on the skip.header.line.count value in the TBLPROPERTIES field of the table metadata. See Data Files for Text Tables for details.
Trailing comments are now allowed in queries processed by the impala-shell options -q and -f.
Impala can run COUNT queries for RCFile tables that include complex type columns. See Complex Types (Impala 2.3 or higher only) for general information about working with complex types, and ARRAY Complex Type (Impala 2.3 or higher only), MAP Complex Type (Impala 2.3 or higher only), and STRUCT Complex Type (Impala 2.3 or higher only) for syntax details of each type.
New Features in Impala 2.5
Dynamic partition pruning. When a query refers to a partition key column in a WHERE clause, and the exact set of column values is not known until the query is executed, Impala evaluates the predicate and skips the I/O for entire partitions that are not needed. For example, if a table was partitioned by year, Impala would apply this technique to a query such as SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table). See Dynamic Partition Pruning for details.
The dynamic partition pruning optimization technique lets Impala avoid reading data files from partitions that are not part of the result set, even when that determination cannot be made in advance. This technique is especially valuable when performing join queries involving partitioned tables. For example, if a join query includes an ON clause and a WHERE clause that refer to the same columns, the query can find the set of column values that match the WHERE clause, and only scan the associated partitions when evaluating the ON clause.
Dynamic partition pruning is controlled by the same settings as the runtime filtering feature. By default, this feature is enabled at a medium level, because the maximum setting can use slightly more memory for queries than in previous releases. To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL.
Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries. Using the same technique as with dynamic partition pruning, Impala uses the predicates from WHERE and ON clauses to determine the subset of column values from one of the joined tables that could possibly be part of the result set. Impala sends a compact representation of the filter condition to the hosts in the cluster, instead of the full set of values or the entire table. See Runtime Filtering for Impala Queries (Impala 2.5 or higher only) for details.
By default, this feature is enabled at a medium level, because the maximum setting can use slightly more memory for queries than in previous releases. To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL. See RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only) for details.
This feature involves some new query options: RUNTIME_FILTER_MODE, MAX_NUM_RUNTIME_FILTERS, RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_WAIT_TIME_MS, and DISABLE_ROW_RUNTIME_FILTERING. See RUNTIME_FILTER_MODE, MAX_NUM_RUNTIME_FILTERS, RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_WAIT_TIME_MS, and DISABLE_ROW_RUNTIME_FILTERING for details.
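As a rough illustration of the idea behind dynamic partition pruning, the following Python sketch models the year-partitioned example: the set of needed partition keys is only known after the subquery runs, and only the matching partitions are then scanned. The file names, the dictionary layout, and the prune_partitions function are hypothetical. This is a conceptual model, not how Impala is implemented.

```python
from typing import Dict, List, Set


def prune_partitions(partitions: Dict[int, List[str]], wanted_keys: Set[int]) -> List[str]:
    """Return the data files of only those partitions whose key is wanted."""
    files: List[str] = []
    for key in sorted(partitions):
        if key in wanted_keys:      # partitions with other keys are never read
            files.extend(partitions[key])
    return files


# Hypothetical table partitioned by year: partition key -> data files.
partitions = {
    2014: ["year=2014/a.parq"],
    2015: ["year=2015/a.parq"],
    2016: ["year=2016/a.parq", "year=2016/b.parq"],
}

# Stand-in for the subquery SELECT MAX(year) FROM other_table, whose result
# is only available at run time.
other_table_years = [2013, 2015, 2016]
wanted = {max(other_table_years)}

files_to_scan = prune_partitions(partitions, wanted)
print(files_to_scan)
```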
More efficient use of the HDFS caching feature, to avoid hotspots and bottlenecks that could occur if heavily used cached data blocks were always processed by the same host. By default, Impala now randomizes which host processes each cached HDFS data block, when cached replicas are available on multiple hosts. (Remember to use the WITH REPLICATION clause with the CREATE TABLE or ALTER TABLE statement when enabling HDFS caching for a table or partition, to cache the same data blocks across multiple hosts.) The new query option SCHEDULE_RANDOM_REPLICA lets you fine-tune the interaction with HDFS caching even more. See Using HDFS Caching with Impala (Impala 2.1 or higher only) for details.
The TRUNCATE TABLE statement now accepts an IF EXISTS clause, making TRUNCATE TABLE easier to use in setup or ETL scripts where the table might or might not exist. See TRUNCATE TABLE Statement (Impala 2.3 or higher only) for details.
Improved performance and reliability for the DECIMAL data type:
Using DECIMAL values in a GROUP BY clause now triggers the native code generation optimization, speeding up queries that group by values such as prices.
Checking for overflow in DECIMAL multiplication is now substantially faster, making DECIMAL a more practical data type in some use cases where formerly DECIMAL was much slower than FLOAT or DOUBLE.
Multiplying a mixture of DECIMAL and FLOAT or DOUBLE values now returns DOUBLE rather than DECIMAL. This change avoids some cases where an intermediate value would underflow or overflow and become NULL unexpectedly.
See DECIMAL Data Type (Impala 3.0 or higher only) for details.
For UDFs written in Java, or Hive UDFs reused for Impala, Impala now allows parameters and return values to be primitive types. Formerly, these things were required to be one of the “Writable” object types. See Using Hive UDFs with Impala for details.
Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the overhead of repeatedly opening the same file.
Performance improvements for queries involving nested complex types. Certain basic query types, such as counting the elements of a complex column, now use an optimized code path.
Improvements to the memory reservation mechanism for the Impala admission control feature. You can specify more settings, such as the timeout period and maximum aggregate memory used, for each resource pool instead of globally for the Impala instance. The default limit for concurrent queries (the max requests setting) is now unlimited instead of 200.
Performance improvements related to code generation. Even in queries where code generation is not performed for some phases of execution (such as reading data from Parquet tables), Impala can still use code generation in other parts of the query, such as evaluating functions in the WHERE clause.
Performance improvements for queries using aggregation functions on high-cardinality columns. Formerly, Impala could do unnecessary extra work to produce intermediate results for operations such as DISTINCT or GROUP BY on columns that were unique or had few duplicate values. Now, Impala decides at run time whether it is more efficient to do an initial aggregation phase and pass along a smaller set of intermediate data, or to pass raw intermediate data back to the next phase of query processing to be aggregated there. This feature is known as streaming pre-aggregation. In case of performance regression, this feature can be turned off using the DISABLE_STREAMING_PREAGGREGATIONS query option. See DISABLE_STREAMING_PREAGGREGATIONS Query Option (Impala 2.5 or higher only) for details.
Spill-to-disk feature now always recommended. In earlier releases, the spill-to-disk feature could be turned off using a pair of configuration settings, enable_partitioned_aggregation=false and enable_partitioned_hash_join=false. The latest improvements in the spill-to-disk mechanism, and related features that interact with it, make this feature robust enough that disabling it is no longer needed or supported. In particular, some new features in Impala 2.5 and higher do not work when the spill-to-disk feature is disabled.
Improvements to scripting capability for the impala-shell command, through user-specified substitution variables that can appear in statements processed by impala-shell:
The --var command-line option lets you pass key-value pairs to impala-shell. The shell can substitute the values into queries before executing them, where the query text contains the notation ${var:varname}. For example, you might prepare a SQL file containing a set of DDL statements and queries containing variables for database and table names, and then pass the applicable names as part of the impala-shell -f filename command. See Running Commands and SQL Statements in impala-shell for details.
The SET and UNSET commands within the impala-shell interpreter now work with user-specified substitution variables, as well as the built-in query options. The two kinds of variables are divided in the SET output. As with variables defined by the --var command-line option, you refer to the user-specified substitution variables in queries by using the notation ${var:varname} in the query text. Because the substitution variables are processed by impala-shell instead of the impalad backend, you cannot define your own substitution variables through the SET statement in a JDBC or ODBC application. See SET Statement for details.
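Because substitution happens on the client side, before the query text reaches impalad, it behaves like plain text replacement. The following Python sketch shows one way such a replacement of the ${var:varname} notation could work. It is a simplified model, not the impala-shell source. The substitute function, the regular expression, and the variable names db and tbl are assumptions made for illustration.

```python
import re
from typing import Dict

# Matches the ${var:varname} notation described above.
_VAR_PATTERN = re.compile(r"\$\{var:([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(query: str, variables: Dict[str, str]) -> str:
    """Replace every ${var:name} in the query text with its value."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined substitution variable: {name}")
        return variables[name]

    return _VAR_PATTERN.sub(replace, query)


# Hypothetical key-value pairs, as might be passed with --var=db=sales --var=tbl=orders.
variables = {"db": "sales", "tbl": "orders"}
query = substitute("SELECT COUNT(*) FROM ${var:db}.${var:tbl}", variables)
print(query)
```

A JDBC or ODBC client sends the query text directly to the server, which is why this client-side step is not available there.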
Performance improvements for query startup. Impala better parallelizes certain work when coordinating plan distribution between impalad instances, which improves startup time for queries involving tables with many partitions on large clusters, or complicated queries with many plan fragments.
Performance and scalability improvements for tables with many partitions. The memory requirements on the coordinator node are reduced, making it substantially faster and less resource-intensive to do joins involving several tables with thousands of partitions each.
Whitelisting for access to internal APIs. For applications that need direct access to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can specify a list of Kerberos users who are allowed to call those APIs. By default, the impala and hdfs users are the only ones authorized for this kind of access. Any users not explicitly authorized through the internal_principals_whitelist configuration setting are blocked from accessing the APIs. This setting applies to all the Impala-related daemons, although currently it is primarily used for HDFS to control the behavior of the catalog server.
Improvements to Impala integration and usability for Hue. (The code changes are actually on the Hue side.)
- The list of tables now refreshes dynamically.
Usability improvements for case-insensitive queries. You can now use the operators ILIKE and IREGEXP to perform case-insensitive wildcard matches or regular expression matches, rather than explicitly converting column values with UPPER or LOWER. See ILIKE Operator and IREGEXP Operator for details.
Performance and reliability improvements for DDL and insert operations on partitioned tables with a large number of partitions. Impala only re-evaluates metadata for partitions that are affected by a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress, other Impala statements that attempt to modify metadata for the same table wait until the first one finishes.
Reliability improvements for the LOAD DATA statement. Previously, this statement would fail if the source HDFS directory contained any subdirectories at all. Now, the statement ignores any hidden subdirectories, for example _impala_insert_staging.
A new operator, IS [NOT] DISTINCT FROM, lets you compare values and always get a true or false result, even if one or both of the values are NULL. The IS NOT DISTINCT FROM operator, or its equivalent <=> notation, improves the efficiency of join queries that treat key values that are NULL in both tables as equal. See IS DISTINCT FROM Operator for details.
Security enhancements for the impala-shell command. A new option, --ldap_password_cmd, lets you specify a command to retrieve the LDAP password. The resulting password is then used to authenticate the impala-shell command with the LDAP server. See impala-shell Configuration Options for details.
The CREATE TABLE AS SELECT statement now accepts a PARTITIONED BY clause, which lets you create a partitioned table and insert data into it with a single statement. See CREATE TABLE Statement for details.
User-defined functions (UDFs and UDAFs) written in C++ now persist automatically when the catalogd daemon is restarted. You no longer have to run the CREATE FUNCTION statements again after a restart.
User-defined functions (UDFs) written in Java can now persist when the catalogd daemon is restarted, and can be shared transparently between Impala and Hive. You must do a one-time operation to recreate these UDFs using new CREATE FUNCTION syntax, without a signature for arguments or the return value. Afterwards, you no longer have to run the CREATE FUNCTION statements again after a restart. Although Impala does not have visibility into the UDFs that implement the Hive built-in functions, user-created Hive UDFs are now automatically available for calling through Impala. See CREATE FUNCTION Statement for details.
Reliability enhancements for memory management. Some aggregation and join queries that formerly might have failed with an out-of-memory error due to memory contention, now can succeed using the spill-to-disk mechanism.
The SHOW DATABASES statement now returns two columns rather than one. The second column includes the associated comment string, if any, for each database. Adjust any application code that examines the list of databases and assumes the result set contains only a single column. See SHOW DATABASES for details.
A new optimization speeds up aggregation operations that involve only the partition key columns of partitioned tables. For example, a query such as SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1 can avoid reading any data files if T1 is a partitioned table and K is one of the partition key columns. Because this technique can produce different results in cases where HDFS files in a partition are manually deleted or are empty, you must enable the optimization by setting the query option OPTIMIZE_PARTITION_KEY_SCANS. See OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only) for details.
The DESCRIBE statement can now display metadata about a database, using the syntax DESCRIBE DATABASE db_name. See DESCRIBE Statement for details.
The uuid() built-in function generates an alphanumeric value that you can use as a guaranteed unique identifier. The uniqueness applies even across tables, for cases where an ascending numeric sequence is not suitable. See Impala Miscellaneous Functions for details.
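The difference between ordinary equality and the IS [NOT] DISTINCT FROM operator introduced in this release can be modeled in a few lines of Python, with None standing in for NULL. This sketch follows standard SQL three-valued logic. The function names are hypothetical and it does not reflect Impala internals.

```python
from typing import Any, Optional


def sql_equals(a: Any, b: Any) -> Optional[bool]:
    """Ordinary = comparison: the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def is_not_distinct_from(a: Any, b: Any) -> bool:
    """NULL-safe equality: always True or False, and two NULLs count as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def is_distinct_from(a: Any, b: Any) -> bool:
    """Negation of the NULL-safe equality; also never returns unknown."""
    return not is_not_distinct_from(a, b)


print(sql_equals(None, None), is_not_distinct_from(None, None), is_distinct_from(1, None))
```

In a join, the NULL-safe form lets rows whose keys are NULL on both sides match each other, which the ordinary comparison never does.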
New Features in Impala 2.4
- Impala can be used on the DSSD D5 Storage Appliance. From a user perspective, the Impala features are the same as in Impala 2.3.
New Features in Impala 2.3
The following are the major new features in Impala 2.3.x. This major release contains improvements to SQL syntax (particularly new support for complex types), performance, manageability, and security.
Complex data types: STRUCT, ARRAY, and MAP. These types can encode multiple named fields, positional items, or key-value pairs within a single column. You can combine these types to produce nested types with arbitrarily deep nesting, such as an ARRAY of STRUCT values, a MAP where each key-value pair is an ARRAY of other MAP values, and so on. Currently, complex data types are only supported for the Parquet file format. See Complex Types (Impala 2.3 or higher only) for usage details and ARRAY Complex Type (Impala 2.3 or higher only), STRUCT Complex Type (Impala 2.3 or higher only), and MAP Complex Type (Impala 2.3 or higher only) for syntax.
Column-level authorization lets you define access to particular columns within a table, rather than the entire table. This feature lets you reduce the reliance on creating views to set up authorization schemes for subsets of information. See the documentation for Apache Sentry for background details, and GRANT Statement (Impala 2.0 or higher only) and REVOKE Statement (Impala 2.0 or higher only) for Impala-specific syntax.
The TRUNCATE TABLE statement removes all the data from a table without removing the table itself. See TRUNCATE TABLE Statement (Impala 2.3 or higher only) for details.
Nested loop join queries. Some join queries that formerly required equality comparisons can now use operators such as < or >=. This same join mechanism is used internally to optimize queries that retrieve values from complex type columns. See Joins in Impala SELECT Statements for details about Impala join queries.
Reduced memory usage and improved performance and robustness for spill-to-disk feature. See SQL Operations that Spill to Disk for details about this feature.
Performance improvements for querying Parquet data files containing multiple row groups and multiple data blocks:
For files written by Hive, SparkSQL, and other Parquet MR writers and spanning multiple HDFS blocks, Impala now scans the extra data blocks locally when possible, rather than using remote reads.
Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet files written by Hive, MapReduce, and other components. (Impala itself never writes multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.) These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks. The parquet.writer.max-padding setting specifies the maximum number of bytes, by default 8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block so that the next row group starts at the beginning of the next block. If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space. Include this setting in the hive-site configuration file to influence Parquet files written by Hive, or the hdfs-site configuration file to influence Parquet files written by all non-Impala components.
See Using the Parquet File Format with Impala Tables for instructions about using Parquet data files with Impala.
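The padding rule governed by parquet.writer.max-padding can be sketched as a small calculation: find the gap between the current write position and the next block boundary, and pad only if the gap does not exceed the maximum. The Python function below is a hypothetical model of that decision, not the Parquet writer's actual code. The 128 MiB block size is an assumed example value.

```python
DEFAULT_MAX_PADDING = 8 * 1024 * 1024  # 8 megabytes, the default stated above


def padding_before_next_row_group(position: int, block_size: int,
                                  max_padding: int = DEFAULT_MAX_PADDING) -> int:
    """Bytes of padding to add at `position` so the next row group starts on a
    block boundary, or 0 when no padding is added."""
    gap = (-position) % block_size  # bytes left in the current block (0 on a boundary)
    if gap <= max_padding:
        return gap                  # small gap: fill it with padding
    return 0                        # large gap: the writer tries to fit another row group


BLOCK = 128 * 1024 * 1024  # assumed HDFS block size for this example
MIB = 1024 * 1024

small_gap = padding_before_next_row_group(BLOCK - 1 * MIB, BLOCK)
large_gap = padding_before_next_row_group(100 * MIB, BLOCK)
on_boundary = padding_before_next_row_group(BLOCK, BLOCK)
print(small_gap, large_gap, on_boundary)
```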
Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions.
Math functions (see Impala Mathematical Functions for details):
ATAN2
COSH
COT
DCEIL
DEXP
DFLOOR
DLOG10
DPOW
DROUND
DSQRT
DTRUNC
FACTORIAL, and corresponding ! operator
FPOW
RADIANS
RANDOM
SINH
TANH
String functions (see Impala String Functions for details):
BTRIM
CHR
REGEXP_LIKE
SPLIT_PART
Date and time functions (see Impala Date and Time Functions for details):
INT_MONTHS_BETWEEN
MONTHS_BETWEEN
TIMEOFDAY
TIMESTAMP_CMP
Bit manipulation functions (see Impala Bit Functions for details):
BITAND
BITNOT
BITOR
BITXOR
COUNTSET
GETBIT
ROTATELEFT
ROTATERIGHT
SETBIT
SHIFTLEFT
SHIFTRIGHT
Type conversion functions (see Impala Type Conversion Functions for details):
TYPEOF
The effective_user() function (see Impala Miscellaneous Functions for details).
New built-in analytic functions: PERCENT_RANK, NTILE, CUME_DIST. See Impala Analytic Functions for details.
The DROP DATABASE statement now works for a non-empty database. When you specify the optional CASCADE clause, any tables in the database are dropped before the database itself is removed. See DROP DATABASE Statement for details.
The DROP TABLE and ALTER TABLE DROP PARTITION statements have a new optional keyword, PURGE. This keyword causes Impala to immediately remove the relevant HDFS data files rather than sending them to the HDFS trashcan. This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of a problem with the HDFS trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone than the data files. See DROP TABLE Statement and ALTER TABLE Statement for syntax.
The impala-shell command has a new feature for live progress reporting. This feature is enabled through the --live_progress and --live_summary command-line options, or during a session through the LIVE_SUMMARY and LIVE_PROGRESS query options. See LIVE_PROGRESS Query Option (Impala 2.3 or higher only) and LIVE_SUMMARY Query Option (Impala 2.3 or higher only) for details.
The impala-shell command also now displays a random “tip of the day” when it starts.
The impala-shell option -f now recognizes a special filename - to accept input from stdin. See impala-shell Configuration Options for details about the options for running impala-shell in non-interactive mode.
Format strings for the unix_timestamp() function can now include numeric timezone offsets. See Impala Date and Time Functions for details.
Impala can now run a specified command to obtain the password to decrypt a private-key PEM file, rather than having the private-key file be unencrypted on disk. See Configuring TLS/SSL for Impala for details.
Impala components now can use SSL for more of their internal communication. SSL is used for communication between all three Impala-related daemons when the configuration option ssl_server_certificate is enabled. SSL is used for communication with client applications when the configuration option ssl_client_ca_certificate is enabled. See Configuring TLS/SSL for Impala for details.
Currently, you can only use one of server-to-server TLS/SSL encryption or Kerberos authentication. This limitation is tracked by the issue IMPALA-2598.
Improved flexibility for intermediate data types in user-defined aggregate functions (UDAFs). See Writing User-Defined Aggregate Functions (UDAFs) for details.
In Impala 2.3.2, the bug fix for IMPALA-2598 removes the restriction on using both Kerberos and SSL for internal communication between Impala components.
New Features in Impala 2.2
The following are the major new features in Impala 2.2. This release contains improvements to performance, manageability, security, and SQL syntax.
Several improvements to date and time features enable higher interoperability with Hive and other database systems, provide more flexibility for handling time zones, and future-proof the handling of TIMESTAMP values:
The WITH REPLICATION clause for the CREATE TABLE and ALTER TABLE statements lets you control the replication factor for HDFS caching for a specific table or partition. By default, each cached block is only present on a single host, which can lead to CPU contention if the same host processes each cached block. Increasing the replication factor lets Impala choose different hosts to process different cached blocks, to better distribute the CPU load.
Startup flags for the impalad daemon enable a higher level of compatibility with TIMESTAMP values written by Hive, and more flexibility for working with date and time data using the local time zone instead of UTC. To enable these features, set the impalad startup flags -use_local_tz_for_unix_timestamp_conversions=true and -convert_legacy_hive_parquet_utc_timestamps=true.
The -use_local_tz_for_unix_timestamp_conversions setting controls how the unix_timestamp(), from_unixtime(), and now() functions handle time zones. By default (when this setting is turned off), Impala considers all TIMESTAMP values to be in the UTC time zone when converting to or from Unix time values. When this setting is enabled, Impala treats TIMESTAMP values passed to or returned from these functions to be in the local time zone. When this setting is enabled, take particular care that all hosts in the cluster have the same timezone settings, to avoid inconsistent results depending on which host reads or writes TIMESTAMP data.
The -convert_legacy_hive_parquet_utc_timestamps setting causes Impala to convert TIMESTAMP values to the local time zone when it reads them from Parquet files written by Hive. This setting only applies to data using the Parquet file format, where Impala can use metadata in the files to reliably determine that the files were written by Hive. If in the future Hive changes the way it writes TIMESTAMP data in Parquet, Impala will automatically handle that new TIMESTAMP encoding.
See TIMESTAMP Data Type for details about time zone handling and the configuration options for Impala / Hive compatibility with Parquet format.
In Impala 2.2.0 and higher, built-in functions that accept or return integers representing TIMESTAMP values use the BIGINT type for parameters and return values, rather than INT. This change lets the date and time functions avoid an overflow error that would otherwise occur on January 19th, 2038 (known as the “Year 2038 problem” or “Y2K38 problem”). This change affects the FROM_UNIXTIME() and UNIX_TIMESTAMP() functions. You might need to change application code that interacts with these functions, change the types of columns that store the return values, or add CAST() calls to SQL statements that call these functions.
See Impala Date and Time Functions for the current function signatures.
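The 2038 date follows directly from the range of a signed 32-bit integer counting seconds since the Unix epoch. The short Python check below shows that arithmetic. The fits_in_int32 helper is a hypothetical name used only for this illustration.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit INT can hold


def fits_in_int32(unix_seconds: int) -> bool:
    """True if the Unix time value can be stored in a signed 32-bit integer."""
    return -2**31 <= unix_seconds <= INT32_MAX


# The last moment representable as a 32-bit count of seconds since 1970-01-01 UTC.
last_moment = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00

# One second later no longer fits in INT, but is well within the 64-bit BIGINT range.
print(fits_in_int32(INT32_MAX + 1), INT32_MAX + 1 <= 2**63 - 1)
```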
The SHOW FILES statement lets you view the names and sizes of the files that make up an entire table or a specific partition. See SHOW FILES Statement for details.
Impala can now run queries against Parquet data containing columns with complex or nested types, as long as the query only refers to columns with scalar types.
Performance improvements for queries that include IN() operators and involve partitioned tables.
The new -max_log_files configuration option specifies how many log files to keep at each severity level. The default value is 10, meaning that Impala preserves the latest 10 log files for each severity level (INFO, WARNING, and ERROR) for each Impala-related daemon (impalad, statestored, and catalogd). Impala checks to see if any old logs need to be removed based on the interval specified in the logbufsecs setting, every 5 seconds by default. See Rotating Impala Logs for details.
Redaction of sensitive data from Impala log files. This feature protects details such as credit card numbers or tax IDs from administrators who see the text of SQL statements in the course of monitoring and troubleshooting a Hadoop cluster. See Redacting Sensitive Information from Impala Log Files for background information for Impala users, and the documentation for your Apache Hadoop distribution for usage details.
Lineage information is available for data created or queried by Impala. This feature lets you track who has accessed data through Impala SQL statements, down to the level of specific columns, and how data has been propagated between tables. See Viewing Lineage Information for Impala Data for background information for Impala users, and the documentation for your Apache Hadoop distribution for usage details and how to interpret the lineage information.
Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem, for convenience in cases where data is already located in S3 and you prefer to query it in-place. Queries might have lower performance than when the data files reside on HDFS, because Impala uses some HDFS-specific optimizations. Impala can query data in S3, but cannot write to S3. Therefore, statements such as INSERT and LOAD DATA are not available when the destination table or partition is in S3. See Using Impala with Amazon S3 Object Store for details.
Important: Impala query support for Amazon S3 is included in Impala 2.2, but is not supported or recommended for production use in this version.
Improved support for HDFS encryption. The LOAD DATA statement now works when the source directory and destination table are in different encryption zones. See the documentation for your Apache Hadoop distribution for details about using HDFS encryption with Impala.
Additional arithmetic function mod(). See Impala Mathematical Functions for details.
Flexibility to interpret TIMESTAMP values using the UTC time zone (the traditional Impala behavior) or using the local time zone (for compatibility with TIMESTAMP values produced by Hive).
Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced by these tools (filenames with suffixes .copying and .tmp).
The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now been relaxed.
The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward, Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer required. This relaxed requirement simplifies the upgrade planning from Impala 1.x releases, which also worked on SSSE3-enabled processors.
Enhanced support for CHAR and VARCHAR types in the COMPUTE STATS statement.
The amount of memory required during setup for “spill to disk” operations is greatly reduced. This enhancement reduces the chance of a memory-intensive join or aggregation query failing with an out-of-memory error.
Several new conditional functions provide enhanced compatibility when porting code that uses industry extensions. The new functions are: isfalse(), isnotfalse(), isnottrue(), istrue(), nonnullvalue(), and nullvalue(). See Impala Conditional Functions for details.
The Impala debug web UI now can display a visual representation of the query plan. On the /queries tab, select Details for a particular query. The Details page includes a Plan tab with a plan diagram that you can zoom in or out (using scroll gestures through mouse wheel or trackpad).
New Features in Impala 2.1
This release contains the following enhancements to query performance and system scalability:
Impala can now collect statistics for individual partitions in a partitioned table, rather than processing the entire table for each COMPUTE STATS statement. This feature is known as incremental statistics, and is controlled by the COMPUTE INCREMENTAL STATS syntax. (You can still use the original COMPUTE STATS statement for nonpartitioned tables or partitioned tables that are unchanging or whose contents are entirely replaced all at once.) See COMPUTE STATS Statement and Table and Column Statistics for details.
Optimization for small queries lets Impala process queries that touch very few rows without the unnecessary overhead of parallelizing and generating native code. Reducing this overhead lets Impala clear small queries quickly, keeping YARN resources and admission control slots available for data-intensive queries. The number of rows considered to be a “small” query is controlled by the EXEC_SINGLE_NODE_ROWS_THRESHOLD query option. See EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (Impala 2.1 or higher only) for details.
An enhancement to the statestore component lets it transmit heartbeat information independently of broadcasting metadata updates. This optimization improves reliability of health checking on large clusters with many tables and partitions.
The memory requirement for querying gzip-compressed text is reduced. Now Impala decompresses the data as it is read, rather than reading the entire gzipped file and decompressing it in memory.
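The general idea of decompressing data as it is read, rather than inflating the whole file in memory first, can be sketched as follows. This is only an illustration using Python's standard gzip module; it is not Impala's implementation, and the function names and sample data are hypothetical.

```python
import gzip
import io


def count_rows_streaming(gz_bytes: bytes) -> int:
    """Count text rows while decompressing incrementally, line by line."""
    rows = 0
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as stream:
        for _line in stream:  # only a small buffer is decompressed at a time
            rows += 1
    return rows


def count_rows_buffered(gz_bytes: bytes) -> int:
    """Count text rows after decompressing the entire file into memory."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return len(text.splitlines())


# Hypothetical sample: 1000 comma-separated rows, gzip-compressed in memory.
sample = gzip.compress(
    "".join(f"{i},row{i}\n" for i in range(1000)).encode("utf-8")
)

print(count_rows_streaming(sample))
print(count_rows_buffered(sample))
```

Both functions return the same count; the streaming version never holds the full decompressed text at once, which is the property the enhancement above relies on.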
New Features in Impala 2.0
The following are the major new features in Impala 2.0. This major release contains improvements to performance, scalability, security, and SQL syntax.
Queries with joins or aggregation functions involving high volumes of data can now use temporary work areas on disk, reducing the chance of failure due to out-of-memory errors. When the required memory for the intermediate result set exceeds the amount available on a particular node, the query automatically uses a temporary work area on disk. This “spill to disk” mechanism is similar to the ORDER BY improvement from Impala 1.4. For details, see SQL Operations that Spill to Disk.
Subquery enhancements:
- Subqueries are now allowed in the WHERE clause, for example with the IN operator.
- The EXISTS and NOT EXISTS operators are available. They are always used in conjunction with subqueries.
- The IN and NOT IN queries can now operate on the result set from a subquery, not just a hardcoded list of values.
- Uncorrelated subqueries let you compare against one or more values for equality, IN, and EXISTS comparisons. For example, you might use WHERE clauses such as WHERE column = (SELECT MAX(some_other_column) FROM table) or WHERE column IN (SELECT some_other_column FROM table WHERE conditions).
- Correlated subqueries let you cross-reference values from the outer query block and the subquery.
- Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(), COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.
For details about subqueries, see Subqueries in Impala SELECT Statements. For information about new and improved operators, see EXISTS Operator and IN Operator.
Analytic functions such as RANK(), LAG(), LEAD(), and FIRST_VALUE() let you analyze sequences of rows with flexible ordering and grouping. Existing aggregate functions such as MAX(), SUM(), and COUNT() can also be used in an analytic context. See Impala Analytic Functions for details. See Impala Aggregate Functions for enhancements to existing aggregate functions.
New data types provide greater compatibility with source code from traditional database systems:
VARCHAR is like the STRING data type, but with a maximum length. See VARCHAR Data Type (Impala 2.0 or higher only) for details.
CHAR is like the STRING data type, but with a precise length. Short values are padded with spaces on the right. See CHAR Data Type (Impala 2.0 or higher only) for details.
Security enhancements:
- Formerly, Impala was restricted to using either Kerberos or LDAP / Active Directory authentication within a cluster. Now, Impala can freely accept either kind of authentication request, allowing you to set up some hosts with Kerberos authentication and others with LDAP or Active Directory. See Using Multiple Authentication Methods with Impala for details.
- GRANT statement. See GRANT Statement (Impala 2.0 or higher only) for details.
- REVOKE statement. See REVOKE Statement (Impala 2.0 or higher only) for details.
- CREATE ROLE statement. See CREATE ROLE Statement (Impala 2.0 or higher only) for details.
- DROP ROLE statement. See DROP ROLE Statement (Impala 2.0 or higher only) for details.
- SHOW ROLES and SHOW ROLE GRANT statements. See SHOW Statement for details.
- To complement the HDFS encryption feature, a new Impala configuration option, --disk_spill_encryption, secures sensitive data from being observed or tampered with when temporarily stored on disk.
The new security-related SQL statements work along with the Sentry authorization framework. See Enabling Sentry Authorization for Impala for details.
Impala can now read text files compressed by gzip, bzip2, or Snappy. These files do not require any special table settings to work in an Impala text table. Impala recognizes the compression type automatically based on file extensions of .gz, .bz2, and .snappy respectively. These types of compressed text files are intended for convenience with existing ETL pipelines. Their non-splittable nature means they are not optimal for high-performance parallel queries. See Using bzip2, deflate, gzip, Snappy, or zstd Text Files for details.
Query hints can now use comment notation, /* +hint_name */ or -- +hint_name, at the same places in the query where the hints enclosed by [ ] are recognized. This enhancement makes it easier to reuse Impala queries on other database systems. See Optimizer Hints for details.
A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries.
The working of the --idle_query_timeout configuration option is extended. If no QUERY_TIMEOUT_S query option is in effect, --idle_query_timeout works the same as before, setting the timeout interval. When the QUERY_TIMEOUT_S query option is specified, its maximum value is capped by the value of the --idle_query_timeout option.
That is, the system administrator sets the default and maximum timeout through the --idle_query_timeout startup option, and then individual users or applications can set a lower timeout value if desired through the QUERY_TIMEOUT_S query option. See Setting Timeout Periods for Daemons, Queries, and Sessions and QUERY_TIMEOUT_S Query Option (Impala 2.0 or higher only) for details.
New functions VAR_SAMP() and VAR_POP() are aliases for the existing VARIANCE_SAMP() and VARIANCE_POP() functions.
A new date and time function, DATE_PART(), provides similar functionality to EXTRACT(). You can also call the EXTRACT() function using the SQL-99 syntax, EXTRACT(unit FROM timestamp). These enhancements simplify the porting process for date-related code from other systems. See Impala Date and Time Functions for details.
New approximation features provide a fast way to get results when absolute precision is not required:
- The APPX_COUNT_DISTINCT query option lets Impala rewrite COUNT(DISTINCT) calls to use NDV() instead, which speeds up the operation and allows multiple COUNT(DISTINCT) operations in a single query. See APPX_COUNT_DISTINCT Query Option (Impala 2.0 or higher only) for details.
- The APPX_MEDIAN() aggregate function produces an estimate for the median value of a column by using sampling. See APPX_MEDIAN Function for details.
Impala now supports a DECODE() function. This function works as a shorthand for a CASE() expression, and improves compatibility with SQL code containing vendor extensions. See Impala Conditional Functions for details.
The STDDEV(), STDDEV_POP(), STDDEV_SAMP(), VARIANCE(), VARIANCE_POP(), VARIANCE_SAMP(), and NDV() aggregate functions now all return DOUBLE results rather than STRING. Formerly, you were required to CAST() the result to a numeric type before using it in arithmetic operations.
The default settings for Parquet block size, and the associated PARQUET_FILE_SIZE query option, are changed. Now, Impala writes Parquet files with a size of 256 MB and an HDFS block size of 256 MB. Previously, Impala attempted to write Parquet files with a size of 1 GB and an HDFS block size of 1 GB. In practice, Impala used a conservative estimate of the disk space needed for each Parquet block, leading to files that were typically 512 MB anyway. Thus, this change will make the file size more accurate if you specify a value for the PARQUET_FILE_SIZE query option. It also reduces the amount of memory reserved during INSERT into Parquet tables, potentially avoiding out-of-memory errors and improving scalability when inserting data into Parquet tables.
Anti-joins are now supported, expressed using the LEFT ANTI JOIN and RIGHT ANTI JOIN clauses. These clauses return results from one table that have no match in the other table. You might use this type of join in the same sorts of use cases as the NOT EXISTS and NOT IN operators. See Joins in Impala SELECT Statements for details.
The SET command in impala-shell has been promoted to a real SQL statement. You can now set query options such as PARQUET_FILE_SIZE, MEM_LIMIT, and SYNC_DDL within JDBC, ODBC, or any other kind of application that submits SQL without going through the impala-shell interpreter. See SET Statement for details.
The impala-shell interpreter now reads settings from an optional configuration file, named $HOME/.impalarc by default. See impala-shell Configuration File for details.
The library used for regular expression parsing has changed from Boost to Google RE2. This implementation change adds support for non-greedy matches using the .*? notation. This and other changes in the way regular expressions are interpreted mean you might need to re-test queries that use functions such as regexp_extract() or regexp_replace(), or operators such as REGEXP or RLIKE. See Incompatible Changes and Limitations in Apache Impala for those details.
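The difference between greedy and non-greedy matching can be illustrated outside Impala. The following minimal sketch uses Python's standard re module, which is an assumption made purely for illustration: it is neither the RE2 library nor Impala's regexp_extract(). It shares the .*? notation, but other syntax details differ between engines. The sample string is hypothetical.

```python
import re

# Hypothetical sample value; not taken from any Impala table.
text = "[first][second]"

# Greedy: .* consumes as much as possible, so the match runs to the last ']'.
greedy = re.search(r"\[.*\]", text).group(0)

# Non-greedy: .*? consumes as little as possible, so the match stops at the first ']'.
non_greedy = re.search(r"\[.*?\]", text).group(0)

print(greedy)      # [first][second]
print(non_greedy)  # [first]
```

A query whose pattern relied on one behavior or the other is the kind of case worth re-testing after the library change.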
New Features in Impala 1.4
The following are the major new features in Impala 1.4:
The DECIMAL data type lets you store fixed-precision values, for working with currency or other fractional values where it is important to represent values exactly and avoid rounding errors. This feature includes enhancements to built-in functions, numeric literals, and arithmetic expressions. See DECIMAL Data Type (Impala 3.0 or higher only) for details.
Where the underlying HDFS support exists, Impala can take advantage of the HDFS caching feature to “pin” entire tables or individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU overhead of memory-to-memory copying. When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory. Other Hadoop components that read the same data files also experience a performance benefit.
For background information about HDFS caching, see the documentation for your Apache Hadoop distribution. For performance information about using this feature with Impala, see Using HDFS Caching with Impala (Impala 2.1 or higher only). For the SET CACHED and SET UNCACHED clauses that let you control cached table data through DDL statements, see CREATE TABLE Statement and ALTER TABLE Statement.
Impala can now use Sentry-based authorization based either on the original policy file, or on rules defined by GRANT and REVOKE statements issued through Hive. See Enabling Sentry Authorization for Impala for details.
For interoperability with Parquet files created through other Hadoop components, such as Pig or MapReduce jobs, you can create an Impala table that automatically sets up the column definitions based on the layout of an existing Parquet data file. See CREATE TABLE Statement for the syntax, and Creating Parquet Tables in Impala for usage information.
ORDER BY queries no longer require a LIMIT clause. If the size of the result set to be sorted exceeds the memory available to Impala, Impala uses a temporary work space on disk to perform the sort operation. See ORDER BY Clause for details.
LDAP connections can be secured through either SSL or TLS. See Enabling LDAP Authentication for Impala for details.
The following new built-in scalar and aggregate functions are available:
A new built-in function, EXTRACT(), returns one date or time field from a TIMESTAMP value. See Impala Date and Time Functions for details.
A new built-in function, TRUNC(), truncates date/time values to a particular granularity, such as year, month, day, hour, and so on. See Impala Date and Time Functions for details.
ADD_MONTHS() built-in function, an alias for the existing MONTHS_ADD() function. See Impala Date and Time Functions for details.
A new built-in function, ROUND(), rounds DECIMAL values to a specified number of fractional digits. See Impala Mathematical Functions for details.
Several built-in aggregate functions for computing properties for statistical distributions: STDDEV(), STDDEV_SAMP(), STDDEV_POP(), VARIANCE(), VARIANCE_SAMP(), and VARIANCE_POP(). See STDDEV, STDDEV_SAMP, STDDEV_POP Functions and VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP Functions for details.
Several new built-in functions, such as MAX_INT(), MIN_SMALLINT(), and so on, let you conveniently check whether data values are in an expected range. You might be able to switch a column to a smaller type, saving memory during processing. See Impala Mathematical Functions for details.
New built-in functions, IS_INF() and IS_NAN(), check for the special values infinity and “not a number”. These values could be specified as inf or nan in text data files, or be produced by certain arithmetic expressions. See Impala Mathematical Functions for details.
The SHOW PARTITIONS statement displays information about the structure of a partitioned table. See SHOW Statement for details.
New configuration options for the impalad daemon let you specify initial memory usage for all queries. The initial resource requests handled by Llama and YARN can be expanded later if needed, avoiding unnecessary over-allocation and reducing the chance of out-of-memory conditions. See Resource Management for details.
The Impala CREATE TABLE statement now has a STORED AS AVRO clause, allowing you to create Avro tables through Impala. See Using the Avro File Format with Impala Tables for details and examples.
New impalad configuration options let you fine-tune the calculations Impala makes to estimate resource requirements for each query. These options can help avoid overconsumption caused by too-low estimates, or underutilization caused by too-high estimates. See Resource Management for details.
A new SUMMARY command in the impala-shell interpreter provides a high-level summary of the work performed at each stage of the explain plan. The summary is also included in output from the PROFILE command. See impala-shell Command Reference and Using the SUMMARY Report for Performance Tuning for details.
Performance improvements for the COMPUTE STATS statement:
- The NDV function is sped up through native code generation.
- Because the NULL count is not currently used by the Impala query planner, in Impala 1.4.0 and higher, COMPUTE STATS does not count the NULL values for each column. (The #Nulls field of the stats table is left as -1, signifying that the value is unknown.)
See COMPUTE STATS Statement for general details about the COMPUTE STATS statement, and Table and Column Statistics for how to use the statistics to improve query performance.
Performance improvements for partition pruning. This feature reduces the time spent in query planning, for partitioned tables with thousands of partitions. Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, now Impala can comfortably handle tables with tens of thousands of partitions. See Partition Pruning for Queries for information about partition pruning.
The documentation provides additional guidance for planning tasks. See Planning for Impala Deployment.
The impala-shell interpreter now supports UTF-8 characters for input and output. You can control whether impala-shell ignores invalid Unicode code points through the --strict_unicode option. (This option is removed in Impala 2.0.)
New Features in Impala 1.3.2
No new features. This point release is exclusively a bug fix release for the IMPALA-1019 issue related to HDFS caching.
New Features in Impala 1.3.1
This point release is primarily a vehicle to deliver bug fixes. Any new features are minor changes resulting from fixes for performance, reliability, or usability issues.
A new impalad startup option, --insert_inherit_permissions, causes Impala INSERT statements to create each new partition with the same HDFS permissions as its parent directory. By default, INSERT statements create directories for new partitions using default HDFS permissions. See INSERT Statement for examples of INSERT statements for partitioned tables.
The SHOW FUNCTIONS statement now displays the return type of each function, in addition to the types of its arguments. See SHOW Statement for examples.
You can now specify the clause FIELDS TERMINATED BY '\0' with a CREATE TABLE statement to use text data files that use ASCII 0 (nul) characters as a delimiter. See Using Text Data Files with Impala Tables for details.
In Impala 1.3.1 and higher, the REGEXP and RLIKE operators now match a regular expression string that occurs anywhere inside the target string, the same as if the regular expression was enclosed on each side by .*. See REGEXP Operator for examples. Previously, these operators only succeeded when the regular expression matched the entire target string. This change improves compatibility with the regular expression support for popular database systems. There is no change to the behavior of the regexp_extract() and regexp_replace() built-in functions.
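The change in REGEXP and RLIKE matching can be modeled with a short sketch. It uses Python's standard re module as a stand-in, which is an assumption for illustration only and not Impala's regular expression engine. The helper names and the sample value are hypothetical.

```python
import re


def matches_whole_string(pattern: str, target: str) -> bool:
    """Models the earlier behavior: the pattern must cover the entire target."""
    return re.fullmatch(pattern, target) is not None


def matches_anywhere(pattern: str, target: str) -> bool:
    """Models the 1.3.1 behavior: the pattern may occur anywhere in the target."""
    return re.search(pattern, target) is not None


target = "impala"  # hypothetical column value

print(matches_whole_string("pal", target))      # False
print(matches_anywhere("pal", target))          # True
print(matches_whole_string(".*pal.*", target))  # True
```

The last call shows the equivalence described above: wrapping the pattern in .* under whole-string matching gives the same answer as matching anywhere.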
New Features in Impala 1.3
The admission control feature lets you control and prioritize the volume and resource consumption of concurrent queries. This mechanism reduces spikes in resource usage, helping Impala to run alongside other kinds of workloads on a busy cluster. It also provides more user-friendly conflict resolution when multiple memory-intensive queries are submitted concurrently, avoiding resource contention that formerly resulted in out-of-memory errors. See Admission Control and Query Queuing for details.
Enhanced EXPLAIN plans provide more detail in an easier-to-read format. Now there are four levels of verbosity: the EXPLAIN_LEVEL option can be set from 0 (most concise) to 3 (most verbose). See EXPLAIN Statement for syntax and Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles for usage information.
The TIMESTAMP data type accepts more kinds of input string formats through the UNIX_TIMESTAMP function, and produces more varieties of string formats through the FROM_UNIXTIME function. The documentation now also lists more functions for date arithmetic, used for adding and subtracting INTERVAL expressions from TIMESTAMP values. See Impala Date and Time Functions for details.
New conditional functions, NULLIF(), NULLIFZERO(), and ZEROIFNULL(), simplify porting SQL containing vendor extensions to Impala. See Impala Conditional Functions for details.
New utility function, CURRENT_DATABASE(). See Impala Miscellaneous Functions for details.
Integration with the YARN resource management framework. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for full details.
On the Impala side, this feature involves some new startup options for the impalad daemon:
-enable_rm
-llama_host
-llama_port
-llama_callback_port
-cgroup_hierarchy_path
For details of these startup options, see Modifying Impala Startup Options.
This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:
MEM_LIMIT: the function of this existing option changes when Impala resource management is enabled.
REQUEST_POOL: a new option. (Renamed to RESOURCE_POOL in Impala 1.3.0.)
V_CPU_CORES: a new option.
RESERVATION_REQUEST_TIMEOUT: a new option.
For details of these query options, see impala_resource_management.html#rm_query_options.
New Features in Impala 1.2.4
Note: Impala 1.2.4 is primarily a bug fix release for Impala 1.2.3, plus some performance enhancements for the catalog server to minimize startup and DDL wait times for Impala deployments with large numbers of databases, tables, and partitions.
On Impala startup, the metadata loading and synchronization mechanism has been improved and optimized, to give more responsiveness when starting Impala on a system with a large number of databases, tables, or partitions. The initial metadata loading happens in the background, allowing queries to be run before the entire process is finished. When a query refers to a table whose metadata is not yet loaded, the query waits until the metadata for that table is loaded, and the load operation for that table is prioritized to happen first.
Formerly, if you created a new table in Hive, you had to issue the INVALIDATE METADATA statement (with no table name), which was an expensive operation that reloaded metadata for all tables. Impala did not recognize the name of the Hive-created table, so you could not do INVALIDATE METADATA new_table to get the metadata for just that one table. Now, when you issue INVALIDATE METADATA table_name, Impala checks to see if that name represents a table created in Hive, and if so recognizes the new table and loads the metadata for it. Additionally, if the new table is in a database that was newly created in Hive, Impala also recognizes the new database.
If you issue INVALIDATE METADATA table_name and the table has been dropped through Hive, Impala will recognize that the table no longer exists.
New startup options let you control the parallelism of the metadata loading during startup for the catalogd daemon:
--load_catalog_in_background makes Impala load and cache metadata using background threads after startup. It is true by default. Previously, a system with a large number of databases, tables, or partitions could be unresponsive or even time out during startup.
--num_metadata_loading_threads determines how much parallelism Impala devotes to loading metadata in the background. The default is 16. You might increase this value for systems with huge numbers of databases, tables, or partitions. You might lower this value for busy systems that are CPU-constrained due to jobs from components other than Impala.
New Features in Impala 1.2.3
Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional fix for compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or MapReduce. If you are upgrading from Impala 1.2.1 or earlier, see New Features in Impala 1.2.2 for the latest added features.
New Features in Impala 1.2.2
Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over 1.2.1 are performance related, primarily for join queries.
New user-visible features include:
Join order optimizations. This feature automatically distributes and parallelizes the work for a join query to minimize disk I/O and network traffic. The automatic optimization reduces the need to use query hints or to rewrite join queries with the tables in a specific order based on size or cardinality. The new COMPUTE STATS statement gathers statistical information about each table that is crucial for enabling the join optimizations. See Performance Considerations for Join Queries for details.
COMPUTE STATS statement to collect both table statistics and column statistics with a single statement. Intended to be more comprehensive, efficient, and reliable than the corresponding Hive ANALYZE TABLE statement, which collects statistics in multiple phases through MapReduce jobs. These statistics are important for query planning for join queries, queries on partitioned tables, and other types of data-intensive operations. For optimal planning of join queries, you need to collect statistics for each table involved in the join. See COMPUTE STATS Statement for details.
Reordering of tables in a join query can be overridden by the STRAIGHT_JOIN operator, allowing you to fine-tune the planning of the join query if necessary, by using the original technique of ordering the joined tables in descending order of size. See Overriding Join Reordering with STRAIGHT_JOIN for details.
The CROSS JOIN clause in the SELECT statement allows Cartesian products in queries, that is, joins without an equality comparison between columns in both tables. Because such queries must be carefully checked to avoid accidental overconsumption of memory, you must use the CROSS JOIN operator to explicitly select this kind of join. See Cross Joins and Cartesian Products with the CROSS JOIN Operator for examples.
The ALTER TABLE statement has new clauses that let you fine-tune table statistics. You can use this technique as a less-expensive way to update specific statistics, in case the statistics become stale, or to experiment with the effects of different data distributions on query planning.
LDAP username/password authentication in JDBC/ODBC. See Enabling LDAP Authentication for Impala for details.
GROUP_CONCAT() aggregate function to concatenate column values across all rows of a result set.
The INSERT statement now accepts hints, [SHUFFLE] and [NOSHUFFLE], to influence the way work is redistributed during INSERT...SELECT operations. The hints are primarily useful for inserting into partitioned Parquet tables, where using the [SHUFFLE] hint can avoid problems due to memory consumption and simultaneous open files in HDFS, by collecting all the new data for each partition on a specific node.
Several built-in functions and operators are now overloaded for more numeric data types, to reduce the requirement to use CAST() for type coercion in INSERT statements. For example, the expression 2+2 in an INSERT statement formerly produced a BIGINT result, requiring a CAST() to be stored in an INT variable. Now, addition, subtraction, and multiplication only produce a result that is one step “bigger” than their arguments, and numeric and conditional functions can return SMALLINT, FLOAT, and other smaller types rather than always BIGINT or DOUBLE.
New fnv_hash() built-in function for constructing hashed values. See Impala Mathematical Functions for details.
The clause STORED AS PARQUET is accepted as an equivalent for STORED AS PARQUETFILE. This more concise form is recommended for new code.
Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older 1.1.x release straight to 1.2.2, also review New Features in Impala 1.2.1 to see features such as the SHOW TABLE STATS and SHOW COLUMN STATS statements, and user-defined functions (UDFs).
New Features in Impala 1.2.1
Note: The Impala 1.2.1 feature set is a superset of features in the Impala 1.2.0 beta, with the exception of resource management, which relies on resource management infrastructure in the underlying Hadoop distribution.
Impala 1.2.1 includes new features for security, performance, and flexibility.
New user-visible features include:
SHOW TABLE STATS table_name and SHOW COLUMN STATS table_name statements, to verify that statistics are available and to see the values used during query planning.
CREATE TABLE AS SELECT syntax, to create a new table and transfer data into it in a single operation.
OFFSET clause, for use with the ORDER BY and LIMIT clauses to produce “paged” result sets such as items 1-10, then 11-20, and so on.
NULLS FIRST and NULLS LAST clauses to ensure consistent placement of NULL values in ORDER BY queries.
New built-in functions: least(), greatest(), initcap().
New aggregate function: ndv(), a fast alternative to COUNT(DISTINCT col) returning an approximate result.
The LIMIT clause can now accept a numeric expression as an argument, rather than only a literal constant.
The SHOW CREATE TABLE statement displays the end result of all the CREATE TABLE and ALTER TABLE statements for a particular table. You can use the output to produce a simplified setup script for a schema.
The --idle_query_timeout and --idle_session_timeout options for impalad control the time intervals after which idle queries are cancelled, and idle sessions expire. See Setting Timeout Periods for Daemons, Queries, and Sessions for details.
User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
  You create UDFs through the `CREATE FUNCTION` statement and drop them through the `DROP FUNCTION` statement. See User-Defined Functions (UDFs) for instructions about coding, building, and deploying UDFs, and CREATE FUNCTION Statement and DROP FUNCTION Statement for related SQL syntax.
- A new service automatically propagates changes to table data and metadata made by one Impala node, sending the new or updated metadata to all the other Impala nodes. The automatic synchronization mechanism eliminates the need to use the `INVALIDATE METADATA` and `REFRESH` statements after issuing Impala statements such as `CREATE TABLE`, `ALTER TABLE`, `DROP TABLE`, `INSERT`, and `LOAD DATA`.
  For even more precise synchronization, you can enable the [SYNC_DDL]($57721813ed7d14b3.md#sync_ddl) query option before issuing a DDL, `INSERT`, or `LOAD DATA` statement. This option causes the statement to wait, returning only after the catalog service has broadcast the applicable changes to all Impala nodes in the cluster.
  Note: Because the catalog service only monitors operations performed through Impala, `INVALIDATE METADATA` and `REFRESH` are still needed on the Impala side after creating new tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because the catalog service broadcasts the result of the `REFRESH` and `INVALIDATE METADATA` statements to all Impala nodes, when you do need to use those statements, you can do so a single time rather than on every Impala node.
  This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
- The `CREATE TABLE` and `ALTER TABLE` statements have new clauses `TBLPROPERTIES` and `WITH SERDEPROPERTIES`. The `TBLPROPERTIES` clause lets you associate arbitrary items of metadata with a particular table as key-value pairs. The `WITH SERDEPROPERTIES` clause lets you specify the serializer/deserializer (SerDes) classes that read and write data for a table; although Impala does not make use of these properties, sometimes particular values are needed for Hive compatibility. See CREATE TABLE Statement and ALTER TABLE Statement for details.
- Delegation support lets you authorize certain OS users associated with applications (for example, `hue`) to submit requests using the credentials of other users. See Configuring Impala Delegation for Clients for details.
- Enhancements to `EXPLAIN` output. In particular, when you enable the new `EXPLAIN_LEVEL` query option, the `EXPLAIN` and `PROFILE` statements produce more verbose output showing estimated resource requirements and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN Statement for details.
- `SHOW CREATE TABLE` summarizes the effects of the original `CREATE TABLE` statement and any subsequent `ALTER TABLE` statements, giving you a `CREATE TABLE` statement that will re-create the current structure and layout for a table.
- The `LIMIT` clause for queries now accepts an arithmetic expression, in addition to numeric literals.
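
As a rough illustration of the query-level additions listed for Impala 1.2.1, the statements below exercise the new syntax in sequence. This is only a sketch: the tables `sales` and `sales_copy` and the columns `id`, `region`, and `amount` are hypothetical names chosen for the example, not objects defined anywhere in these release notes.

```sql
-- Hypothetical table and column names, for illustration only.

-- Create a table and populate it in a single operation.
CREATE TABLE sales_copy AS SELECT id, region, amount FROM sales;

-- Check whether table and column statistics are available.
SHOW TABLE STATS sales_copy;
SHOW COLUMN STATS sales_copy;

-- "Paged" results: skip the first 10 rows and return rows 11-20.
SELECT id, amount FROM sales_copy ORDER BY id LIMIT 10 OFFSET 10;

-- Place NULL amounts at the end regardless of sort direction.
SELECT id, amount FROM sales_copy ORDER BY amount DESC NULLS LAST LIMIT 10;

-- Approximate count of distinct values.
SELECT ndv(region) FROM sales_copy;

-- New built-in functions.
SELECT least(3, 7), greatest(3, 7), initcap('hello world');

-- LIMIT with an arithmetic expression instead of a literal.
SELECT id FROM sales_copy ORDER BY id LIMIT 5 * 2;

-- Show a CREATE TABLE statement that re-creates the current table layout.
SHOW CREATE TABLE sales_copy;
```

Every `ORDER BY` query here carries an explicit `LIMIT`, which keeps the sketch valid whether or not a given release requires one.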
New Features in Impala 1.2.0 (Beta)
The Impala 1.2.0 beta includes new features for security, performance, and flexibility.
New user-visible features include:
- User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
  You create UDFs through the `CREATE FUNCTION` statement and drop them through the `DROP FUNCTION` statement. See User-Defined Functions (UDFs) for instructions about coding, building, and deploying UDFs, and CREATE FUNCTION Statement and DROP FUNCTION Statement for related SQL syntax.
- A new service automatically propagates changes to table data and metadata made by one Impala node, sending the new or updated metadata to all the other Impala nodes. The automatic synchronization mechanism eliminates the need to use the `INVALIDATE METADATA` and `REFRESH` statements after issuing Impala statements such as `CREATE TABLE`, `ALTER TABLE`, `DROP TABLE`, `INSERT`, and `LOAD DATA`.
  Note: Because this service only monitors operations performed through Impala, `INVALIDATE METADATA` and `REFRESH` are still needed on the Impala side after creating new tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because the catalog service broadcasts the result of the `REFRESH` and `INVALIDATE METADATA` statements to all Impala nodes, when you do need to use those statements, you can do so a single time rather than on every Impala node.
  This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
- Integration with the YARN resource management framework. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for full details.
  On the Impala side, this feature involves some new startup options for the impalad daemon:
  - `-enable_rm`
  - `-llama_host`
  - `-llama_port`
  - `-llama_callback_port`
  - `-cgroup_hierarchy_path`
  For details of these startup options, see Modifying Impala Startup Options.
  This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:
  - `MEM_LIMIT`: the function of this existing option changes when Impala resource management is enabled.
  - `YARN_POOL`: a new option. (Renamed to `RESOURCE_POOL` in Impala 1.3.0.)
  - `V_CPU_CORES`: a new option.
  - `RESERVATION_REQUEST_TIMEOUT`: a new option.
  For details of these query options, see impala_resource_management.html#rm_query_options.
- `CREATE TABLE ... AS SELECT` syntax, to create a table and copy data into it in a single operation. See CREATE TABLE Statement for details.
- The `CREATE TABLE` and `ALTER TABLE` statements have a new `TBLPROPERTIES` clause that lets you associate arbitrary items of metadata with a particular table as key-value pairs. See CREATE TABLE Statement and ALTER TABLE Statement for details.
- Delegation support lets you authorize certain OS users associated with applications (for example, `hue`) to submit requests using the credentials of other users. See Configuring Impala Delegation for Clients for details.
- Enhancements to `EXPLAIN` output. In particular, when you enable the new `EXPLAIN_LEVEL` query option, the `EXPLAIN` and `PROFILE` statements produce more verbose output showing estimated resource requirements and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN Statement for details.
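
The UDF, metadata, and `TBLPROPERTIES` items above might be used roughly as follows. This is a sketch under assumptions: the function name `my_lower`, the HDFS path, the library file, the symbol `MyLower`, the table `customers`, and the property key are all hypothetical and stand in for whatever your own deployment uses.

```sql
-- Hypothetical function, path, symbol, and table names, for illustration only.

-- Register a scalar UDF implemented in a C++ shared library stored in HDFS.
CREATE FUNCTION my_lower(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libudfsample.so' SYMBOL='MyLower';

-- Call the UDF like any built-in function.
SELECT my_lower(name) FROM customers LIMIT 5;

-- Remove the UDF when it is no longer needed.
DROP FUNCTION my_lower(STRING);

-- Still needed after data files for an existing table change outside Impala.
REFRESH customers;

-- Still needed after a table is created through the Hive shell.
INVALIDATE METADATA;

-- Attach arbitrary key-value metadata to a table.
ALTER TABLE customers SET TBLPROPERTIES ('owner_team' = 'analytics');
```

Because the catalog service broadcasts the results, the `REFRESH` and `INVALIDATE METADATA` statements in this sketch would be issued once, on any one node.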
New Features in Impala 1.1.1
Impala 1.1.1 includes new features for security and stability.
New user-visible features include:
- Additional security feature: auditing. New startup options for impalad let you capture information about Impala queries that succeed or are blocked due to insufficient privileges. For details, see Impala Security.
- Parquet data files generated by Impala 1.1.1 are now compatible with the Parquet support in Hive. See Incompatible Changes and Limitations in Apache Impala for the procedure to update older Impala-created Parquet files to be compatible with the Hive Parquet support.
- Additional improvements to stability and resource utilization for Impala queries.
- Additional enhancements for compatibility with existing file formats.
New Features in Impala 1.1
Impala 1.1 includes new features for security, performance, and usability.
New user-visible features include:
- Extensive new security features, built on top of the Sentry open source project. Impala now supports fine-grained authorization based on roles. A policy file determines which privileges on which schema objects (servers, databases, tables, and HDFS paths) are available to users based on their membership in groups. By assigning privileges for views, you can control access to table data at the column level. For details, see Impala Security.
- Impala can now create, alter, drop, and query views. Views provide a flexible way to set up simple aliases for complex queries; hide query details from applications and users; and simplify maintenance as you rename or reorganize databases, tables, and columns. See the overview section Overview of Impala Views and the statements CREATE VIEW Statement, ALTER VIEW Statement, and DROP VIEW Statement.
- Performance is improved through a number of automatic optimizations. Resource consumption is also reduced for Impala queries. These improvements apply broadly across all kinds of workloads and file formats. The major areas of performance enhancement include:
- Improved disk and thread scheduling, which applies to all queries.
- Improved hash join and aggregation performance, which applies to queries with large build tables or a large number of groups.
- Dictionary encoding with Parquet, which applies to Parquet tables with short string columns.
- Improved performance on systems with SSDs, which applies to all queries and file formats.
- Some new built-in functions are implemented: translate() to substitute characters within strings, user() to check the login ID of the connected user.
- The new `WITH` clause for `SELECT` statements lets you simplify complicated queries in a way similar to creating a view. The effects of the `WITH` clause only last for the duration of one query, unlike views, which are persistent schema objects that can be used by multiple sessions or applications. See WITH Clause.
- An enhancement to the `DESCRIBE` statement, `DESCRIBE FORMATTED table_name`, displays more detailed information about the table. This information includes the file format, location, delimiter, ownership, whether the table is external or internal, creation and access times, and partitions. The information is returned as a result set that can be interpreted and used by a management or monitoring application. See DESCRIBE Statement.
- You can now insert a subset of columns for a table, with other columns being left as all `NULL` values. Or you can specify the columns in any order in the destination table, rather than having to match the order of the corresponding columns in the source query or `VALUES` clause. This feature is known as “column permutation”. See INSERT Statement.
- The new `LOAD DATA` statement lets you load data into a table directly from an HDFS data file. This technique lets you minimize the number of steps in your ETL process, and provides more flexibility. For example, you can bring data into an Impala table in one step. Formerly, you might have created an external table where the data files are not entirely under your control, or copied the data files to Impala data directories manually, or loaded the original data into one table and then used the `INSERT` statement to copy it to a new table with a different file format, partitioning scheme, and so on. See LOAD DATA Statement.
- Improvements to Impala-HBase integration:
  - New query options for HBase performance: [HBASE_CACHE_BLOCKS]($746b3df57da415f1.md#hbase_cache_blocks) and [HBASE_CACHING]($18d89c04cd6f83b2.md#hbase_caching).
  - Support for binary data types in HBase tables. See Supported Data Types for HBase Columns for details.
- You can issue `REFRESH` as a SQL statement through any of the programming interfaces that Impala supports. `REFRESH` formerly had to be issued as a command through the impala-shell interpreter, and was not available through a JDBC or ODBC API call. As part of this change, the functionality of the `REFRESH` statement is divided between two statements. In Impala 1.1, `REFRESH` requires a table name argument and immediately reloads the metadata; the new `INVALIDATE METADATA` statement works the same as the Impala 1.0 `REFRESH` did: the table name argument is optional, and the metadata for one or all tables is marked as stale, but not actually reloaded until the table is queried. When you create a new table in the Hive shell or through a different Impala node, you must enter `INVALIDATE METADATA` with no table parameter before you can see the new table in impala-shell. See REFRESH Statement and INVALIDATE METADATA Statement.
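
The SQL additions in Impala 1.1 can be sketched with a few statements. All object names here (`orders`, `recent_orders`, `big_orders`, `orders_staging`, `t1`, its columns, and the HDFS path) are hypothetical and exist only for the example.

```sql
-- Hypothetical table, view, column, and path names, for illustration only.

-- A view is a persistent alias for a query.
CREATE VIEW recent_orders AS
  SELECT id, customer_id, total FROM orders WHERE total > 0;

-- A WITH clause defines a similar alias that lasts for one query only.
WITH big_orders AS (SELECT id, total FROM recent_orders WHERE total > 1000)
SELECT count(*) FROM big_orders;

-- Detailed table information returned as a result set.
DESCRIBE FORMATTED orders;

-- Column permutation: t1 has columns c1, c2, c3; c2 is left NULL,
-- and the listed columns could appear in any order.
INSERT INTO t1 (c3, c1) VALUES ('a', 1);

-- Move an HDFS data file directly into a table.
LOAD DATA INPATH '/user/etl/staging/orders.csv' INTO TABLE orders_staging;

-- Reload metadata immediately for one table.
REFRESH orders;

-- Mark metadata for all tables as stale, for example after creating
-- a table in the Hive shell.
INVALIDATE METADATA;
```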
New Features in Impala 1.0.1
New user-visible features include:
- The `VALUES` clause lets you `INSERT` one or more rows using literals, function return values, or other expressions. For performance and scalability, you should still use `INSERT ... SELECT` for bringing large quantities of data into an Impala table. The `VALUES` clause is a convenient way to set up small tables, particularly for initial testing of SQL features that do not require large amounts of data. See VALUES Clause for details.
- The `-B` and `-o` options of the `impala-shell` command can turn query results into delimited text files and store them in an output file. The plain text results are useful for using with other Hadoop components or Unix tools. In benchmark tests, it is also faster to produce plain rather than pretty-printed results, and write to a file rather than to the screen, giving a more accurate picture of the actual query time.
- Several bug fixes. See Issues Fixed in the 1.0.1 Release for details.
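
A small sketch of the `VALUES` clause, assuming a hypothetical two-column test table named `val_test`:

```sql
-- Hypothetical table name, for illustration only.
CREATE TABLE val_test (c1 INT, c2 STRING);

-- Insert several rows of literals in one statement.
INSERT INTO val_test VALUES (1, 'one'), (2, 'two');

-- Values can also be expressions or function return values.
INSERT INTO val_test VALUES (1 + 2, upper('three')), (length('four'), concat('fo', 'ur'));
```

This style suits small test tables; for large volumes of data, `INSERT ... SELECT` remains the recommended route.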
New Features in Impala 1.0
This version has multiple performance improvements and adds the following functionality:
- Several bug fixes. See Issues Fixed in the 1.0 GA Release.
- [ALTER TABLE]($990fa0038acb5944.md#alter_table) statement.
- Hints to allow specifying a particular join strategy.
- [REFRESH]($73b83fc2164747ca.md#refresh) for a single table.
- Dynamic resource management, allowing high concurrency for Impala queries.
New Features in Version 0.7 of the Impala Beta Release
This version has multiple performance improvements and adds the following functionality:
- Several bug fixes. See Issues Fixed in Version 0.7 of the Beta Release.
- Support for the Parquet file format. For more information on file formats, see How Impala Works with Hadoop File Formats.
- Added support for Avro.
- Support for memory limits. For more information, see the example on modifying memory limits in Modifying Impala Startup Options.
- Bigger and faster joins through the addition of partitioned joins to the already supported broadcast joins.
- Fully distributed aggregations.
- Fully distributed top-n computation.
- Support for creating and altering tables.
- Support for GROUP BY with floats and doubles.
New Features in Version 0.6 of the Impala Beta Release
- Several bug fixes. See Issues Fixed in Version 0.6 of the Beta Release.
- Added support for Impala on SUSE and Debian/Ubuntu. Impala is now supported on:
- RHEL5.7/6.2 and CentOS5.7/6.2
- SUSE 11 with Service Pack 1 or higher
- Ubuntu 10.04/12.04 and Debian 6.03
- Support for the RCFile file format. For more information on file formats, see Understanding File Formats.
New Features in Version 0.5 of the Impala Beta Release
- Several bug fixes. See Issues Fixed in Version 0.5 of the Beta Release.
- Added support for a JDBC driver that allows you to access Impala from a Java client. To use this feature, follow the instructions in Configuring Impala to Work with JDBC to install the JDBC driver JARs on the client machine and modify the `CLASSPATH` on the client to include the JARs.
New Features in Version 0.4 of the Impala Beta Release
- Several bug fixes. See Issues Fixed in Version 0.4 of the Beta Release.
- Added support for Impala on RHEL5.7/CentOS5.7. Impala is now supported on RHEL5.7/6.2 and CentOS5.7/6.2.
- The Impala debug webserver now has the ability to serve static files from `${IMPALA_HOME}/www`. This can be disabled by setting `--enable_webserver_doc_root=false` on the command line. As a result, Impala now uses the Twitter Bootstrap library to style its debug webpages, and the `/queries` page now tracks the last 25 queries run by each Impala daemon.
- Additional metrics available on the Impala Debug Webpage.
New Features in Version 0.3 of the Impala Beta Release
- Several bug fixes. See Issues Fixed in Version 0.3 of the Beta Release.
- The `state-store-service` binary has been renamed `statestored`.
- The location of the Impala configuration files has changed from the `/usr/lib/impala/conf` directory to the `/etc/impala/conf` directory.
New Features in Version 0.2 of the Impala Beta Release
- Several bug fixes. See Issues Fixed in Version 0.2 of the Beta Release.
- Added Default Query Options. Default query options override all default QueryOption values when starting `impalad`. The format is: `-default_query_options='key=value;key=value'`