Hive

By connecting to Hive Metastore, or a metadata service compatible with Hive Metastore, Doris can automatically obtain Hive database and table information and query the data.

In addition to Hive, many other systems also use Hive Metastore to store metadata. Through the Hive Catalog, we can therefore access not only Hive, but also systems that use Hive Metastore as their metadata storage, such as Iceberg and Hudi.

Usage Notes

  1. Place core-site.xml, hdfs-site.xml, and hive-site.xml in the conf directory of both FE and BE. Doris reads the Hadoop configuration files in the conf directory first, and then the configuration files under the path specified by the HADOOP_CONF_DIR environment variable.
  2. Hive versions 1, 2, and 3 are supported.
  3. Managed tables, external tables, and some Hive views are supported.
  4. Hive, Iceberg, and Hudi metadata stored in Hive Metastore can be recognized.

Create Catalog

Hive On HDFS

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083'
    );

In addition to the two required parameters, type and hive.metastore.uris, more parameters can be passed to provide the information required for the connection.
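Once created, the catalog can be browsed and queried like any other Doris catalog. A minimal usage sketch (db1 and tbl1 are placeholder names for an existing Hive database and table):

    -- switch to the new catalog and explore it
    SWITCH hive;
    SHOW DATABASES;
    USE db1;
    SELECT * FROM tbl1 LIMIT 10;

    -- or query via a fully qualified name without switching
    SELECT * FROM hive.db1.tbl1 LIMIT 10;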

If HDFS HA information is provided, the example is as follows:

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
        'hadoop.username' = 'hive',
        'dfs.nameservices'='your-nameservice',
        'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
        'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:8088',
        'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:8088',
        'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );

To provide both HDFS HA information and Kerberos authentication information, an example is as follows:

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
        'hive.metastore.sasl.enabled' = 'true',
        'hive.metastore.kerberos.principal' = 'your-hms-principal',
        'dfs.nameservices'='your-nameservice',
        'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
        'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:8088',
        'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:8088',
        'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
        'hadoop.security.authentication' = 'kerberos',
        'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
        'hadoop.kerberos.principal' = 'your-principal@YOUR.COM',
        'yarn.resourcemanager.principal' = 'your-rm-principal'
    );

Please place the krb5.conf file and the keytab authentication file on all BE and FE nodes. The path of the keytab file must match the configuration. By default, the krb5.conf file is expected at /etc/krb5.conf.

The value of hive.metastore.kerberos.principal must match the property of the same name in the connected Hive Metastore, which can be obtained from its hive-site.xml.

Hive On VIEWFS

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
        'hadoop.username' = 'hive',
        'dfs.nameservices'='your-nameservice',
        'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
        'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:8088',
        'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:8088',
        'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
        'fs.defaultFS' = 'viewfs://your-cluster',
        'fs.viewfs.mounttable.your-cluster.link./ns1' = 'hdfs://your-nameservice/',
        'fs.viewfs.mounttable.your-cluster.homedir' = '/ns1'
    );

The viewfs-related parameters can be added to the catalog configuration as above, or added to conf/core-site.xml.

For how ViewFs works and how to configure its parameters, please refer to the relevant Hadoop documentation, for example: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ViewFs.html

Hive On JuiceFS

If data is stored in JuiceFS, an example is as follows (you need to put juicefs-hadoop-x.x.x.jar under fe/lib/ and apache_hdfs_broker/lib/):

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
        'hadoop.username' = 'root',
        'fs.jfs.impl' = 'io.juicefs.JuiceFileSystem',
        'fs.AbstractFileSystem.jfs.impl' = 'io.juicefs.JuiceFS',
        'juicefs.meta' = 'xxx'
    );

Hive On S3

    CREATE CATALOG hive PROPERTIES (
        "type"="hms",
        "hive.metastore.uris" = "thrift://172.0.0.1:9083",
        "s3.endpoint" = "s3.us-east-1.amazonaws.com",
        "s3.region" = "us-east-1",
        "s3.access_key" = "ak",
        "s3.secret_key" = "sk",
        "use_path_style" = "true"
    );

Options:

  • s3.connection.maximum: maximum number of S3 connections, default 50
  • s3.connection.request.timeout: S3 request timeout, default 3000 ms
  • s3.connection.timeout: S3 connection timeout, default 1000 ms

Hive On OSS

    CREATE CATALOG hive PROPERTIES (
        "type"="hms",
        "hive.metastore.uris" = "thrift://172.0.0.1:9083",
        "oss.endpoint" = "oss.oss-cn-beijing.aliyuncs.com",
        "oss.access_key" = "ak",
        "oss.secret_key" = "sk"
    );

Hive On OBS

    CREATE CATALOG hive PROPERTIES (
        "type"="hms",
        "hive.metastore.uris" = "thrift://172.0.0.1:9083",
        "obs.endpoint" = "obs.cn-north-4.myhuaweicloud.com",
        "obs.access_key" = "ak",
        "obs.secret_key" = "sk"
    );

Hive On COS

    CREATE CATALOG hive PROPERTIES (
        "type"="hms",
        "hive.metastore.uris" = "thrift://172.0.0.1:9083",
        "cos.endpoint" = "cos.ap-beijing.myqcloud.com",
        "cos.access_key" = "ak",
        "cos.secret_key" = "sk"
    );

Hive With Glue

    CREATE CATALOG hive PROPERTIES (
        "type"="hms",
        "hive.metastore.type" = "glue",
        "glue.endpoint" = "https://glue.us-east-1.amazonaws.com",
        "glue.access_key" = "ak",
        "glue.secret_key" = "sk"
    );

Metadata Cache & Refresh

For Hive Catalog, 4 types of metadata are cached in Doris:

  1. Table structure: caches table column information, etc.
  2. Partition values: caches the partition value information of all partitions of a table.
  3. Partition information: caches information about each partition, such as the partition data format, partition storage location, and partition values.
  4. File information: caches the file information corresponding to each partition, such as file paths.

The above cached information is not persisted in Doris, so operations such as restarting Doris's FE node or switching masters may invalidate the cache. After the cache expires, Doris will access the Hive Metastore directly to obtain the information and refill the cache.

The metadata cache can be updated automatically or manually, or configured with a TTL (Time-to-Live), according to user needs.

Default behavior and TTL

By default, the metadata cache expires 10 minutes after it is first accessed. This time is determined by the configuration parameter external_cache_expire_time_minutes_after_access in fe.conf. (Note that in versions 2.0.1 and earlier, the default value for this parameter was 1 day).

For example, if the user accesses the metadata of table A for the first time at 10:00, the metadata will be cached and will automatically expire at 10:10. If the user accesses the same metadata again at 10:11, Doris will directly access the Hive Metastore to obtain the information and refill the cache.

external_cache_expire_time_minutes_after_access affects all 4 types of caches under the Catalog.
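To check the value currently in effect on the FE, you can query the FE configuration; a sketch:

    ADMIN SHOW FRONTEND CONFIG LIKE "external_cache_expire_time_minutes_after_access";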

For the INSERT INTO OVERWRITE PARTITION operation commonly used in Hive, you can also keep the File Information Cache up to date by configuring its TTL:

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
        'file.meta.cache.ttl-second' = '60'
    );

In the above example, file.meta.cache.ttl-second is set to 60, so the file information cache expires after 60 seconds. This parameter affects only the file information cache.

You can also set this value to 0 to disable file caching, which will fetch file information directly from the Hive MetaStore every time.
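If the catalog already exists, the property can also be changed without recreating it; a sketch, assuming your Doris version supports modifying this property via ALTER CATALOG:

    -- disable the file information cache on the existing catalog
    ALTER CATALOG hive SET PROPERTIES ('file.meta.cache.ttl-second' = '0');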

Manual refresh

Users can manually refresh the metadata through the REFRESH command.

  1. REFRESH CATALOG: Refresh the specified Catalog.

    REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "true");

    This command will refresh the database list, table list, and all cache information of the specified Catalog.

    invalid_cache indicates whether to invalidate the cache. Defaults to true. If false, only the database and table list of the catalog will be refreshed, but the cached information will not be invalidated. This parameter is applicable when the user only wants to synchronize newly added or deleted database/table information (see the example after this list).

  2. REFRESH DATABASE: Refresh the specified Database.

    REFRESH DATABASE [ctl.]db1 PROPERTIES("invalid_cache" = "true");

    This command will refresh the table list of the specified Database and all cached information under the Database.

    The meaning of invalid_cache is the same as above. Defaults to true. If false, only the Database's table list will be refreshed, not the cached information. This parameter is suitable when the user only wants to synchronize newly added or deleted table information.

  3. REFRESH TABLE: Refresh the specified Table.

    REFRESH TABLE [ctl.][db.]tbl1;

    This command will refresh all cache information under the specified Table.
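For example, to pick up newly created or dropped Hive tables without invalidating caches that are still valid (a sketch reusing the catalog name ctl1 from above):

    -- re-sync only the database and table lists; keep existing caches
    REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "false");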

Regular refresh

Users can set up a scheduled refresh of the Catalog when creating the Catalog.

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
        'metadata_refresh_interval_sec' = '600'
    );

In the above example, metadata_refresh_interval_sec causes the Catalog to refresh every 600 seconds, which is equivalent to automatically executing the following every 600 seconds:

    REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "true");

The scheduled refresh interval must not be less than 5 seconds.

Auto Refresh

Currently, Doris only supports automatic updates of metadata in Hive Metastore (HMS). It perceives metadata changes through the FE node, which periodically reads notification events from HMS. The supported events are as follows:

| Event | Corresponding Update Operation |
| --- | --- |
| CREATE DATABASE | Create a database in the corresponding catalog. |
| DROP DATABASE | Delete a database in the corresponding catalog. |
| ALTER DATABASE | Such alterations mainly include changes in properties, comments, or storage location of databases. They do not affect Doris' queries in External Catalogs, so they are not synchronized. |
| CREATE TABLE | Create a table in the corresponding database. |
| DROP TABLE | Delete a table in the corresponding database, and invalidate the cache of that table. |
| ALTER TABLE | If it is a rename, delete the table with the old name and create a new table with the new name; otherwise, invalidate the cache of that table. |
| ADD PARTITION | Add a partition to the cached partition list of the corresponding table. |
| DROP PARTITION | Delete a partition from the cached partition list of the corresponding table, and invalidate the cache of that partition. |
| ALTER PARTITION | If it is a rename, delete the partition with the old name and create a new partition with the new name; otherwise, invalidate the cache of that partition. |

After data ingestion, changes in partitioned tables follow the ALTER PARTITION logic, while those in non-partitioned tables follow the ALTER TABLE logic.

If changes are conducted on the file system directly instead of through the HMS, the HMS will not generate an event. As a result, such changes will not be perceived by Doris.
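In that case, fall back to a manual refresh of the affected object; a sketch (hive.db1.tbl1 is a placeholder):

    -- invalidate cached partition and file information after direct file-system changes
    REFRESH TABLE hive.db1.tbl1;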

The automatic update feature involves the following parameters in fe.conf (a verification example follows the list):

  1. enable_hms_events_incremental_sync: whether to enable automatic incremental synchronization of metadata; disabled by default.
  2. hms_events_polling_interval_ms: the interval between two reads, 10000 by default (unit: milliseconds).
  3. hms_events_batch_size_per_rpc: the maximum number of events read at a time, 500 by default.
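After editing fe.conf and restarting the FE, you can verify the values in effect; a sketch:

    ADMIN SHOW FRONTEND CONFIG LIKE "%hms_events%";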

To enable automatic updates (excluding Huawei MRS), you need to modify the hive-site.xml of HMS and then restart HMS and HiveServer2:

    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.dml.events</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.transactional.event.listeners</name>
        <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
    </property>

Huawei's MRS requires changing hivemetastore-site.xml instead, and then restarting HMS and HiveServer2:

    <property>
        <name>metastore.transactional.event.listeners</name>
        <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
    </property>

Note: The value must be appended to the original value, separated by a comma, rather than overwriting it. For example, the default configuration for MRS 3.1.0 is:

    <property>
        <name>metastore.transactional.event.listeners</name>
        <value>com.huawei.bigdata.hive.listener.TableKeyFileManagerListener,org.apache.hadoop.hive.metastore.listener.FileAclListener</value>
    </property>

We need to change it to:

    <property>
        <name>metastore.transactional.event.listeners</name>
        <value>com.huawei.bigdata.hive.listener.TableKeyFileManagerListener,org.apache.hadoop.hive.metastore.listener.FileAclListener,org.apache.hive.hcatalog.listener.DbNotificationListener</value>
    </property>

Note: To enable automatic updates, whether for existing Catalogs or newly created Catalogs, all you need to do is set enable_hms_events_incremental_sync to true and then restart the FE node. You don't need to manually update the metadata before or after the restart.

Hive Version

Doris can correctly access Hive Metastores of different Hive versions. By default, Doris accesses the Hive Metastore through a Hive 2.3-compatible interface. You can also specify the Hive version when creating the Catalog. For example, to access Hive 1.1.0:

    CREATE CATALOG hive PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
        'hive.version' = '1.1.0'
    );

Column type mapping

For Hive/Iceberg/Hudi

| HMS Type | Doris Type | Comment |
| --- | --- | --- |
| boolean | boolean | |
| tinyint | tinyint | |
| smallint | smallint | |
| int | int | |
| bigint | bigint | |
| date | date | |
| timestamp | datetime | |
| float | float | |
| double | double | |
| char | char | |
| varchar | varchar | |
| decimal | decimal | |
| array<type> | array<type> | supports nested types, for example array<array<int>> |
| map<KeyType, ValueType> | map<KeyType, ValueType> | supports nested types, for example map<string, array<int>> |
| struct<col1: Type1, col2: Type2, ...> | struct<col1: Type1, col2: Type2, ...> | supports nested types, for example struct<col1: array<int>, col2: map<int, date>> |
| other | unsupported | |
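To inspect how the columns of a concrete table are mapped, describe the table through the catalog; a sketch (hive.db1.tbl1 is a placeholder):

    DESC hive.db1.tbl1;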

Whether to truncate char or varchar columns according to the schema of the Hive table

If the session variable truncate_char_or_varchar_columns is enabled, then when the maximum length of a char or varchar column in the Hive table schema is inconsistent with the schema in the underlying Parquet or ORC file, the column will be truncated to the maximum length defined in the Hive table schema.

The variable defaults to false.
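A sketch of enabling it for the current session (SET GLOBAL would make it the default for new sessions):

    SET truncate_char_or_varchar_columns = true;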

Access HMS with broker

Add the following setting when creating an HMS catalog; file splitting and scanning for Hive external tables will then be completed by the broker named test_broker:

  1. "broker.name" = "test_broker"

Integrate with Apache Ranger

Apache Ranger is a security framework for monitoring and managing comprehensive data security access on the Hadoop platform.

Currently, Doris supports Ranger permissions at the database, table, and column level, but does not support features such as encryption or row-level permissions.

Settings

To connect to a Hive Metastore with Ranger permission verification enabled, you need to add the following configuration and configure the environment:

  1. When creating a Catalog, add:

    "access_controller.properties.ranger.service.name" = "hive",
    "access_controller.class" = "org.apache.doris.catalog.authorizer.RangerHiveAccessControllerFactory",

  2. Configure all FE environments:

    1. Copy the configuration files ranger-hive-audit.xml, ranger-hive-security.xml, and ranger-policymgr-ssl.xml under the HMS conf directory to the FE conf directory.

    2. Modify the properties of ranger-hive-security.xml; a reference configuration is as follows:

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
          <!-- The directory for caching permission data; must be writable -->
          <property>
              <name>ranger.plugin.hive.policy.cache.dir</name>
              <value>/mnt/datadisk0/zhangdong/rangerdata</value>
          </property>
          <!-- The interval for periodically pulling permission data -->
          <property>
              <name>ranger.plugin.hive.policy.pollIntervalMs</name>
              <value>30000</value>
          </property>
          <property>
              <name>ranger.plugin.hive.policy.rest.client.connection.timeoutMs</name>
              <value>60000</value>
          </property>
          <property>
              <name>ranger.plugin.hive.policy.rest.client.read.timeoutMs</name>
              <value>60000</value>
          </property>
          <property>
              <name>ranger.plugin.hive.policy.rest.ssl.config.file</name>
              <value></value>
          </property>
          <property>
              <name>ranger.plugin.hive.policy.rest.url</name>
              <value>http://172.21.0.32:6080</value>
          </property>
          <property>
              <name>ranger.plugin.hive.policy.source.impl</name>
              <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
          </property>
          <property>
              <name>ranger.plugin.hive.service.name</name>
              <value>hive</value>
          </property>
          <property>
              <name>xasecure.hive.update.xapolicies.on.grant.revoke</name>
              <value>true</value>
          </property>
      </configuration>
    3. In order to obtain the logs of Ranger authentication itself, add a log4j.properties configuration file to the <doris_home>/conf directory.

    4. Restart FE.

Best Practices

  1. Create a user user1 on the Ranger side and authorize it to query db1.table1.col1.

  2. Create a role role1 on the Ranger side and authorize it to query db1.table1.col2.

  3. Create a user with the same name, user1, in Doris; user1 will then directly have permission to query db1.table1.col1.

  4. Create a role with the same name, role1, in Doris, and assign role1 to user1; user1 will then have permission to query both db1.table1.col1 and col2.
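On the Doris side, steps 3 and 4 correspond to ordinary account statements; a minimal sketch (the password and host pattern are placeholders):

    -- step 3: a user with the same name as on the Ranger side
    CREATE USER 'user1'@'%' IDENTIFIED BY 'password';

    -- step 4: a role with the same name, assigned to the user
    CREATE ROLE 'role1';
    GRANT 'role1' TO 'user1'@'%';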