Hudi connector

Overview

The Hudi connector enables querying Hudi tables synced to a Hive metastore. The connector uses the metastore only to track partition locations, and relies on the underlying Hudi filesystem and input formats to list data files. For details on the connector's design, see RFC-40.

Requirements

To use the Hudi connector, you need:

  • Network access from the Presto coordinator and workers to the distributed object storage.

  • Access to a Hive metastore service (HMS).

  • Network access from the Presto coordinator to the HMS. Hive metastore access with the Thrift protocol defaults to using port 9083.
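
The network requirements above can be spot-checked before configuring the catalog. The following is a minimal sketch, assuming it is run on the coordinator host; the function name can_reach is hypothetical, and the check only confirms that a TCP connection opens, not that the Thrift service behind it is healthy.

```python
import socket


def can_reach(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (hms.host is the placeholder host from the configuration below):
#     can_reach("hms.host", 9083)
```

The same helper could be pointed at the object storage endpoint from each worker, with the host and port adjusted to that service.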

Configuration

The Hudi connector supports the same metastore configuration properties as the Hive connector. At a minimum, the following connector properties must be set in the hudi.properties file inside the <presto_install_dir>/etc/catalog directory:

    connector.name=hudi
    hive.metastore.uri=thrift://hms.host:9083

Additionally, the following session properties can be set depending on the use case.

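Property Name               | Description                                                                            | Default
----------------------------+----------------------------------------------------------------------------------------+--------
hudi.metadata-table-enabled | Fetch the list of file names and sizes from Hudi’s metadata table rather than storage. | false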

SQL Support

Currently, the connector provides read-only access to Hudi tables that have been synced to the Hive metastore. Once the catalog has been configured as described above, these tables can be queried in the same way as Hive tables.

Supported Query Types

Table Type    | Supported Query Types
--------------+------------------------------------------
Copy On Write | Snapshot Queries
Merge On Read | Snapshot Queries + Read Optimized Queries

Example Queries

stock_ticks_cow is a Hudi copy-on-write (COW) table that is created by following the Hudi quickstart document.

Here are some sample queries:

    USE hudi.default;
    select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';

      symbol   |        _col1         |
    -----------+----------------------+
     GOOG      | 2018-08-31 10:59:00  |
    (1 rows)

    select dt, symbol from stock_ticks_cow where symbol = 'GOOG';

         dt     | symbol |
    ------------+--------+
     2018-08-31 |  GOOG  |
    (1 rows)

    select dt, count(*) from stock_ticks_cow group by dt;

         dt     | _col1  |
    ------------+--------+
     2018-08-31 |   99   |
    (1 rows)
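
The first sample query can also be issued from application code. The sketch below assumes a Python DB-API style cursor (for instance one obtained from the presto-python-client package via prestodb.dbapi.connect(...).cursor(), which is not part of this connector) and a catalog named hudi with schema default, as in the USE statement above. The helper name latest_ts_for_symbol is hypothetical.

```python
def latest_ts_for_symbol(cursor, symbol: str):
    """Run the max(ts) sample query for one symbol; return (symbol, ts) or None."""
    # Double any single quote so the value stays inside the SQL string literal.
    literal = symbol.replace("'", "''")
    cursor.execute(
        "select symbol, max(ts) from hudi.default.stock_ticks_cow "
        f"group by symbol HAVING symbol = '{literal}'"
    )
    rows = cursor.fetchall()
    # The query groups by symbol, so at most one row is expected.
    return tuple(rows[0]) if rows else None
```

The table is referenced by its fully qualified name so that the helper does not depend on a preceding USE statement in the same session.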