INVALIDATE METADATA Statement

The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.

INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive client, such as SparkSQL:

  • Metadata of existing tables changes.
  • New tables are added, and Impala will use the tables.
  • The SERVER or DATABASE level Ranger privileges are changed.
  • Block metadata changes, but the files remain the same (HDFS rebalance).
  • UDF jars change.
  • Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.

No INVALIDATE METADATA is needed when the changes are made by impalad.

See Overview of Impala Metadata and the Metastore for the information about the way Impala uses metadata and how it shares the same metastore database as Hive.

Once issued, the INVALIDATE METADATA statement cannot be cancelled.

Syntax:

  1. INVALIDATE METADATA [[db_name.]table_name]

If there is no table specified, the cached metadata for all tables is flushed and synced with Hive Metastore (HMS). If tables were dropped from the HMS, they will be removed from the catalog, and if new tables were added, they will show up in the catalog.

If you specify a table name, only the metadata for that one table is flushed and synced with the HMS.

Usage notes:

To return accurate query results, Impala need to keep the metadata current for the databases and tables queried. Therefore, if some other entity modifies information used by Impala in the metastore, the information cached by Impala must be updated via INVALIDATE METADATA or REFRESH.

INVALIDATE METADATA and REFRESH are counterparts:

  • INVALIDATE METADATA is an asynchronous operations that simply discards the loaded metadata from the catalog and coordinator caches. After that operation, the catalog and all the Impala coordinators only know about the existence of databases and tables and nothing more. Metadata loading for tables is triggered by any subsequent queries.
  • REFRESH reloads the metadata synchronously. REFRESH is more lightweight than doing a full metadata load after a table has been invalidated. REFRESH cannot detect changes in block locations triggered by operations like HDFS balancer, hence causing remote reads during query execution with negative performance implications.

Use REFRESH after invalidating a specific table to separate the metadata load from the first query that’s run against that table.

Examples:

This example illustrates creating a new database and new table in Hive, then doing an INVALIDATE METADATA statement in Impala using the fully qualified table name, after which both the new table and the new database are visible to Impala.

Before the INVALIDATE METADATA statement was issued, Impala would give a “not found” error if you tried to refer to those database or table names.

  1. $ hive
  2. hive> CREATE DATABASE new_db_from_hive;
  3. hive> CREATE TABLE new_db_from_hive.new_table_from_hive (x INT);
  4. hive> quit;
  5. $ impala-shell
  6. > REFRESH new_db_from_hive.new_table_from_hive;
  7. ERROR: AnalysisException: Database does not exist: new_db_from_hive
  8. > INVALIDATE METADATA new_db_from_hive.new_table_from_hive;
  9. > SHOW DATABASES LIKE 'new*';
  10. +--------------------+
  11. | new_db_from_hive |
  12. +--------------------+
  13. > SHOW TABLES IN new_db_from_hive;
  14. +---------------------+
  15. | new_table_from_hive |
  16. +---------------------+

Use the REFRESH statement for incremental metadata update.

  1. > REFRESH new_table_from_hive;

For more examples of using INVALIDATE METADATA with a combination of Impala and Hive operations, see Switching Back and Forth Between Impala and Hive.

HDFS considerations:

By default, the INVALIDATE METADATA command checks HDFS permissions of the underlying data files and directories, caching this information so that a statement can be cancelled immediately if for example the impala user does not have permission to write to the data directory for the table. (This checking does not apply when the catalogd configuration option --load_catalog_in_background is set to false, which it is by default.) Impala reports any lack of write permissions as an INFO message in the log file.

If you change HDFS permissions to make data readable or writeable by the Impala user, issue another INVALIDATE METADATA to make Impala aware of the change.

Kudu considerations:

By default, much of the metadata for Kudu tables is handled by the underlying storage layer. Kudu tables have less reliance on the Metastore database, and require less metadata caching on the Impala side. For example, information about partitions in Kudu tables is managed by Kudu, and Impala does not cache any block locality metadata for Kudu tables. If the Kudu service is not integrated with the Hive Metastore, Impala will manage Kudu table metadata in the Hive Metastore.

The REFRESH and INVALIDATE METADATA statements are needed less frequently for Kudu tables than for HDFS-backed tables. Neither statement is needed when data is added to, removed, or updated in a Kudu table, even if the changes are made directly to Kudu through a client program using the Kudu API. Run REFRESH table_name or INVALIDATE METADATA table_name for a Kudu table only after making a change to the Kudu table schema, such as adding or dropping a column.

Related information:

Overview of Impala Metadata and the Metastore, REFRESH Statement

Parent topic: Impala SQL Statements