Iceberg

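Limitations

  1. Iceberg V1 and V2 table formats are supported.
  2. For V2 tables, only Position Delete is supported; Equality Delete is not.

Creating a Catalog

Creating a Catalog Based on Hive Metastore

This is largely the same as for a Hive Catalog, so only a simple example is given here. For further examples, see Hive Catalog.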

    CREATE CATALOG iceberg PROPERTIES (
        'type' = 'hms',
        'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
        'hadoop.username' = 'hive',
        'dfs.nameservices' = 'your-nameservice',
        'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2',
        'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.2:4007',
        'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.3:4007',
        'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );
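
Once a catalog has been created, it can be browsed like any other Doris catalog. The following is a minimal sketch, assuming the catalog is named iceberg as in the example above; the database name db1 and the table name tbl1 are hypothetical placeholders, not names from this document.

```sql
-- Switch the current session to the Iceberg catalog created above.
SWITCH iceberg;

-- List the databases exposed through the metastore.
SHOW DATABASES;

-- Query a table. db1 and tbl1 are hypothetical names used for illustration.
SELECT * FROM db1.tbl1 LIMIT 10;
```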

Creating a Catalog Based on the Iceberg API

When metadata is accessed through the Iceberg API, services such as Hadoop File System, Hive, REST, Glue, and DLF can serve as the Iceberg catalog.

Hadoop Catalog

    -- Warehouse on HDFS, addressed through a single host
    CREATE CATALOG iceberg_hadoop PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'hadoop',
        'warehouse' = 'hdfs://your-host:8020/dir/key'
    );

    -- Warehouse on HDFS with high availability (HA) enabled
    CREATE CATALOG iceberg_hadoop_ha PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'hadoop',
        'warehouse' = 'hdfs://your-nameservice/dir/key',
        'dfs.nameservices' = 'your-nameservice',
        'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2',
        'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.2:4007',
        'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.3:4007',
        'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );

    -- Warehouse on S3
    CREATE CATALOG iceberg_s3 PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'hadoop',
        'warehouse' = 's3://bucket/dir/key',
        's3.endpoint' = 's3.us-east-1.amazonaws.com',
        's3.access_key' = 'ak',
        's3.secret_key' = 'sk'
    );

Hive Metastore

    CREATE CATALOG iceberg PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'hms',
        'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
        'hadoop.username' = 'hive',
        'dfs.nameservices' = 'your-nameservice',
        'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2',
        'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.2:4007',
        'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.3:4007',
        'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );

AWS Glue

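When connecting to Glue from a non-EC2 environment, copy the ~/.aws directory from an EC2 environment into the current environment. Alternatively, download and configure the AWS CLI, which also creates a .aws directory under the current user's home directory.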

    CREATE CATALOG glue PROPERTIES (
        "type" = "iceberg",
        "iceberg.catalog.type" = "glue",
        "glue.endpoint" = "https://glue.us-east-1.amazonaws.com",
        "glue.access_key" = "ak",
        "glue.secret_key" = "sk"
    );

  1. For details on the Iceberg properties, see Iceberg Glue Catalog.

  2. When running on an AWS service such as EC2, if the credential properties (glue.access_key and glue.secret_key) are omitted, Doris uses the default DefaultAWSCredentialsProviderChain, which reads the properties configured in system environment variables or in the InstanceProfile.
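
To illustrate the second note, a Glue catalog created on an AWS service such as EC2 can leave out the credential properties. This is a minimal sketch rather than a verified configuration; the catalog name glue_ec2 is a hypothetical name chosen to avoid clashing with the example above, and the endpoint is the one used there.

```sql
-- Sketch: glue.access_key and glue.secret_key are intentionally omitted,
-- so credentials are resolved by the default AWS credentials provider chain.
-- glue_ec2 is a hypothetical catalog name.
CREATE CATALOG glue_ec2 PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "glue",
    "glue.endpoint" = "https://glue.us-east-1.amazonaws.com"
);
```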

Alibaba Cloud DLF

See the Alibaba Cloud DLF Catalog configuration.

REST Catalog

This approach requires a REST service to be available in advance; users must implement the REST interface that serves Iceberg metadata.

    CREATE CATALOG iceberg PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'uri' = 'http://172.21.0.1:8181'
    );

If the data is stored on HDFS with high availability enabled, the HDFS HA configuration must also be added to the catalog:

    CREATE CATALOG iceberg PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'uri' = 'http://172.21.0.1:8181',
        'dfs.nameservices' = 'your-nameservice',
        'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2',
        'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.1:8020',
        'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.2:8020',
        'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );

Google Dataproc Metastore

    CREATE CATALOG iceberg PROPERTIES (
        "type" = "iceberg",
        "iceberg.catalog.type" = "hms",
        "hive.metastore.uris" = "thrift://172.21.0.1:9083",
        "gs.endpoint" = "https://storage.googleapis.com",
        "gs.region" = "us-east-1",
        "gs.access_key" = "ak",
        "gs.secret_key" = "sk",
        "use_path_style" = "true"
    );

hive.metastore.uris: the endpoint exposed by the Dataproc Metastore service. It can be obtained from the Metastore management page: Dataproc Metastore Services.

Iceberg On Object Storage

If the data is stored on S3, the following parameters can be used in the properties:

    "s3.access_key" = "ak"
    "s3.secret_key" = "sk"
    "s3.endpoint" = "s3.us-east-1.amazonaws.com"
    "s3.region" = "us-east-1"

If the data is stored on Alibaba Cloud OSS:

    "oss.access_key" = "ak"
    "oss.secret_key" = "sk"
    "oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
    "oss.region" = "oss-cn-beijing"

If the data is stored on Tencent Cloud COS:

    "cos.access_key" = "ak"
    "cos.secret_key" = "sk"
    "cos.endpoint" = "cos.ap-beijing.myqcloud.com"
    "cos.region" = "ap-beijing"

If the data is stored on Huawei Cloud OBS:

    "obs.access_key" = "ak"
    "obs.secret_key" = "sk"
    "obs.endpoint" = "obs.cn-north-4.myhuaweicloud.com"
    "obs.region" = "cn-north-4"
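
The storage parameters above are fragments that go inside a catalog's PROPERTIES clause. As a sketch of how they might combine with a catalog definition, the following example assumes a Hive Metastore backed Iceberg catalog whose data lives on S3. The catalog name iceberg_hms_s3 is hypothetical, and the metastore address is reused from the earlier examples.

```sql
-- Sketch: Hive Metastore provides the metadata, S3 holds the data files.
-- iceberg_hms_s3 is a hypothetical catalog name.
CREATE CATALOG iceberg_hms_s3 PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hms",
    "hive.metastore.uris" = "thrift://172.21.0.1:7004",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1"
);
```

For OSS, COS, or OBS, the s3.* lines would be swapped for the corresponding oss.*, cos.*, or obs.* parameters listed above.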

Column Type Mapping

The mapping is the same as for Hive Catalog; see the Column Type Mapping section of Hive Catalog.

Time Travel

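Reading a specific snapshot of an Iceberg table is supported.

Every write operation on an Iceberg table produces a new snapshot.

By default, a read request reads only the latest snapshot.

The FOR TIME AS OF and FOR VERSION AS OF clauses can be used to read historical data by the time a snapshot was created or by snapshot ID, respectively. For example: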

SELECT * FROM iceberg_tbl FOR TIME AS OF "2022-10-07 17:20:37";

SELECT * FROM iceberg_tbl FOR VERSION AS OF 868895038966572;

In addition, the iceberg_meta table function can be used to query the snapshot information of a given table.
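
A call to iceberg_meta might look like the following sketch. It assumes the catalog is named iceberg and the table lives in a database named db1 (a hypothetical name); the "table" and "query_type" parameters follow the form used in the Doris documentation for this function and should be checked against the version in use.

```sql
-- Sketch: list the snapshots of an Iceberg table.
-- "iceberg" is the catalog name from the examples above; db1 is hypothetical.
SELECT * FROM iceberg_meta(
    "table" = "iceberg.db1.iceberg_tbl",
    "query_type" = "snapshots"
);
```

A snapshot ID taken from the result can then be passed to FOR VERSION AS OF to read the table as of that snapshot.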