AWS Glue Data Catalog

Hudi tables can sync to AWS Glue Data Catalog directly via AWS SDK. Piggyback on HiveSyncTool , org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool makes use of all the configurations that are taken by HiveSyncTool and send them to AWS Glue.

Configurations

Most of the configurations for AwsGlueCatalogSyncTool are shared with HiveSyncTool. The example showed in Sync to Hive Metastore can be used as is for sync with Glue Data Catalog, provided that the hive metastore URL (either JDBC or thrift URI) can proxied to Glue Data Catalog, which is usually done within AWS EMR or Glue job environment.

For Hudi streamer, users can set

  1. --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool

For Spark data source writers, users can set

  1. hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool

Avoid creating excessive versions

Tables stored in Glue Data Catalog are versioned. And by default, every Hudi commit triggers a sync operation if enabled, regardless of having relevant metadata changes. This can lead to too many versions kept in the catalog and eventually failing the sync operation.

Meta-sync can be set to conditional - only sync when there are schema change or partition change. This can avoid creating excessive versions in the catalog. Users can enable it by setting

  1. hoodie.datasource.meta_sync.condition.sync=true

Glue Data Catalog specific configs

Sync to Glue Data Catalog can be optimized with other configs like

  1. hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism
  2. hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism
  3. hoodie.datasource.meta.sync.glue.partition_change_parallelism

Partition indexes can also be used by setting

  1. hoodie.datasource.meta.sync.glue.partition_index_fields.enable
  2. hoodie.datasource.meta.sync.glue.partition_index_fields

Other references

Running AWS Glue Catalog Sync for Spark DataSource

To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, you can use the options mentioned in the AWS documentation

Running AWS Glue Catalog Sync from EMR

If you’re running HiveSyncTool on an EMR cluster backed by Glue Data Catalog as external metastore, you can simply run the sync from command line like below:

  1. cd /usr/lib/hudi/bin
  2. ./run_sync_tool.sh --base-path s3://<bucket_name>/<prefix>/<table_name> --database <database_name> --table <table_name> --partitioned-by <column_name>