INVALIDATE METADATA Statement
Marks the metadata for one or all tables as stale. Required after a table is created through the Hive shell, before the table is available for Impala queries. The next time the current Impala node performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the REFRESH statement, so in the common scenario of adding new data files to an existing table, prefer REFRESH rather than INVALIDATE METADATA. If you are not familiar with the way Impala uses metadata and how it shares the same metastore database as Hive, see Overview of Impala Metadata and the Metastore for background information.
Syntax:
INVALIDATE METADATA [[db_name.]table_name]
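For instance, the two forms of the statement look like this in impala-shell (the database and table names here are purely illustrative):

[impalad-host:21000] > invalidate metadata;                -- flush cached metadata for all tables
[impalad-host:21000] > invalidate metadata store.sales;    -- flush cached metadata for one table only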
By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for that one table is flushed. Even for a single table, INVALIDATE METADATA is more expensive than REFRESH, so prefer REFRESH in the common case where you add new data files for an existing table.
Internal details:
To accurately respond to queries, Impala must have current metadata about those databases and tables that clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean that all metadata updates require an Impala update.
Note:
In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA
after the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive, especially during Impala startup. See New Features in Impala 1.2.4 for details.
In Impala 1.2 and higher, a dedicated daemon (catalogd) broadcasts DDL changes made through Impala to all Impala nodes. Formerly, after you created a database or table while connected to one Impala node, you needed to issue an INVALIDATE METADATA
statement on another Impala node before accessing the new database or table from the other node. Now, newly created or altered objects are picked up automatically by all Impala nodes. You must still use the INVALIDATE METADATA
technique after creating or altering objects through Hive. See The Impala Catalog Service for more information on the catalog service.
The INVALIDATE METADATA
statement is new in Impala 1.1 and higher, and takes over some of the use cases of the Impala 1.0 REFRESH
statement. Because REFRESH
now requires a table name parameter, to flush the metadata for all tables at once, use the INVALIDATE METADATA
statement.
Because REFRESH table_name
only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table
before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name
after you add data files for that table.
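Under these rules, a typical Hive-then-Impala workflow for a new table might look like the following sketch (new_table is a hypothetical name used for illustration):

hive> create table new_table (x int);
hive> quit;
$ impala-shell
[impalad-host:21000] > invalidate metadata new_table;   -- make the Hive-created table visible to Impala
[impalad-host:21000] > select count(*) from new_table;
[impalad-host:21000] > refresh new_table;               -- later, cheaper incremental reload after new data files arrive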
INVALIDATE METADATA
and REFRESH
are counterparts: INVALIDATE METADATA
waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH
reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA
to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA
statement works just like the Impala 1.0 REFRESH
statement did, while the Impala 1.1 REFRESH
is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.
Usage notes:
A metadata update for an impalad
instance is required if:
- A metadata change occurs.
- and the change is made from another impalad instance in your cluster, or through Hive.
- and the change is made to a metastore database to which clients such as the Impala shell or ODBC directly connect.
A metadata update for an Impala node is not required when you issue queries from the same Impala node where you ran ALTER TABLE, INSERT, or other table-modifying statement.
Database and table metadata is typically modified by:
- Hive - via ALTER, CREATE, DROP, or INSERT operations.
- Impalad - via CREATE TABLE, ALTER TABLE, and INSERT operations.
INVALIDATE METADATA
causes the metadata for that table to be marked as stale, and reloaded the next time the table is referenced. For a huge table, that process could take a noticeable amount of time; thus you might prefer to use REFRESH
where practical, to avoid an unpredictable delay later, for example if the next reference to the table is during a benchmark test.
Examples:
The following example shows how you might use the INVALIDATE METADATA
statement after creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the INVALIDATE METADATA
statement was issued, Impala would give a “table not found” error if you tried to refer to those table names. The DESCRIBE
statements cause the latest metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried.
[impalad-host:21000] > invalidate metadata;
[impalad-host:21000] > describe t1;
...
[impalad-host:21000] > describe t2;
...
For more examples of using REFRESH
and INVALIDATE METADATA
with a combination of Impala and Hive operations, see Switching Back and Forth Between Impala and Hive.
HDFS permissions:
The user ID that the impalad daemon runs under, typically the impala
user, must have execute permissions for all the relevant directories holding table data. (A table could have data spread across multiple directories, or in unexpected paths, if it uses partitioning or specifies a LOCATION
attribute for individual partitions or the entire table.) Issues with permissions might not cause an immediate error for this statement, but subsequent statements such as SELECT
or SHOW TABLE STATS
could fail.
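To check up front that the relevant directories are accessible, you might list the table's data directory as seen by HDFS (the warehouse path and table name here are illustrative, not fixed):

$ hdfs dfs -ls /user/hive/warehouse/t1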
HDFS considerations:
By default, the INVALIDATE METADATA
command checks HDFS permissions of the underlying data files and directories, caching this information so that a statement can be cancelled immediately if for example the impala
user does not have permission to write to the data directory for the table. (This checking does not apply when the catalogd configuration option --load_catalog_in_background
is set to false
, which it is by default.) Impala reports any lack of write permissions as an INFO
message in the log file, in case that represents an oversight. If you change HDFS permissions to make data readable or writeable by the Impala user, issue another INVALIDATE METADATA
to make Impala aware of the change.
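For example, after loosening the permissions on a table's data directory, issue the statement again so Impala picks up the new permissions (the path and table name here are illustrative):

$ hdfs dfs -chmod -R 775 /user/hive/warehouse/t1
$ impala-shell
[localhost:21000] > invalidate metadata t1;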
Examples:
This example illustrates creating a new database and new table in Hive, then doing an INVALIDATE METADATA
statement in Impala using the fully qualified table name, after which both the new table and the new database are visible to Impala. The ability to specify INVALIDATE METADATA table_name
for a table created in Hive is a new capability in Impala 1.2.4. In earlier releases, that statement would have returned an error indicating an unknown table, requiring you to do INVALIDATE METADATA
with no table name, a more expensive operation that reloaded metadata for all tables and databases.
$ hive
hive> create database new_db_from_hive;
OK
Time taken: 4.118 seconds
hive> create table new_db_from_hive.new_table_from_hive (x int);
OK
Time taken: 0.618 seconds
hive> quit;
$ impala-shell
[localhost:21000] > show databases like 'new*';
[localhost:21000] > refresh new_db_from_hive.new_table_from_hive;
ERROR: AnalysisException: Database does not exist: new_db_from_hive
[localhost:21000] > invalidate metadata new_db_from_hive.new_table_from_hive;
[localhost:21000] > show databases like 'new*';
+--------------------+
| name |
+--------------------+
| new_db_from_hive |
+--------------------+
[localhost:21000] > show tables in new_db_from_hive;
+---------------------+
| name |
+---------------------+
| new_table_from_hive |
+---------------------+
Amazon S3 considerations:
The REFRESH
and INVALIDATE METADATA
statements also cache metadata for tables where the data resides in the Amazon Simple Storage Service (S3). In particular, issue a REFRESH
for a table after adding or removing files in the associated S3 data directory. See Using Impala with the Amazon S3 Filesystem for details about working with S3 tables.
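For example, after copying new data files into the S3 location of a table (the bucket name, path, and table name below are illustrative):

$ aws s3 cp new_data.parq s3://impala-demo-bucket/s3_table/
$ impala-shell
[localhost:21000] > refresh s3_table;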
Cancellation: Cannot be cancelled.
Kudu considerations:
Much of the metadata for Kudu tables is handled by the underlying storage layer. Kudu tables have less reliance on the metastore database, and require less metadata caching on the Impala side. For example, information about partitions in Kudu tables is managed by Kudu, and Impala does not cache any block locality metadata for Kudu tables.
The REFRESH
and INVALIDATE METADATA
statements are needed less frequently for Kudu tables than for HDFS-backed tables. Neither statement is needed when data is added to, removed, or updated in a Kudu table, even if the changes are made directly to Kudu through a client program using the Kudu API. Run REFRESH table_name
or INVALIDATE METADATA table_name
for a Kudu table only after making a change to the Kudu table schema, such as adding or dropping a column, by a mechanism other than Impala.
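For example, if a column is added to a Kudu table through the Kudu client API rather than through Impala, a per-table statement is enough to pick up the new schema (kudu_tbl is a hypothetical table name):

[localhost:21000] > refresh kudu_tbl;
[localhost:21000] > invalidate metadata kudu_tbl;   -- equivalent alternative for an out-of-band schema change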
Related information:
Overview of Impala Metadata and the Metastore, REFRESH Statement
Parent topic: Impala SQL Statements