Key Generation

Hudi needs some way to point to records in the table, so that base/log files can be merged efficiently for updates/deletes, index entries can reference these rows, and records can move around within the table during clustering without side effects. In fact, most databases adopt similar techniques. Every record in Hudi is uniquely identified by a pair of record key and an optional partition path that can limit the scope of the key’s uniqueness (non-global indexing). For tables with a global index, records are identified by just the record key, so uniqueness is enforced across partitions.

Using keys, Hudi can impose a partition-level or table-level uniqueness constraint and enable fast updates and deletes on records. Record keys are materialized in a special _hoodie_record_key field in the table, to ensure key uniqueness is maintained even when the record key generation is changed during the table’s lifetime. Without materialization, there are no guarantees that the past data written for a new key is unique across the table.

Hudi offers many ways to generate record keys from the input data during writes.

  • For Java client/Spark/Flink writers, Hudi provides built-in key generator classes (described below) as well as an interface to write custom implementations.

  • SQL engines offer options to pass in key fields and use PARTITIONED BY clauses to control partitioning.

By default, Hudi auto-generates keys for INSERT and BULK_INSERT write operations. These keys are efficient for compute, storage and reads while meeting the uniqueness requirements of the primary key. Auto-generated keys are highly compressible compared to UUIDs, which matters when cloud storage costs about $0.023 per GB, and are 3-10x computationally lighter to generate than base64/UUID-encoded keys.

Key Generators

Hudi provides several key generators out of the box that JVM users can choose from based on their needs, along with a pluggable interface for users to implement and use their own.

Before diving into different types of key generators, let’s go over some of the common configs relevant to key generators.

Config Name | Default | Description

hoodie.datasource.write.recordkey.field | N/A (Optional) | Record key field. Value to be used as the recordKey component of HoodieKey.
  • When configured, the actual value is obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g. a.b.c.
  • When not configured, the record key is automatically generated by Hudi. This is handy for use cases like log ingestion that do not have a naturally present record key.
  Config Param: RECORDKEY_FIELD_NAME

hoodie.datasource.write.partitionpath.field | N/A (Optional) | Partition path field. Value to be used as the partitionPath component of HoodieKey. This needs to be specified if a partitioned table is desired. The actual value is obtained by invoking .toString().
  Config Param: PARTITIONPATH_FIELD_NAME

hoodie.datasource.write.keygenerator.type | SIMPLE | String representing the key generator type.
  Config Param: KEYGENERATOR_TYPE

hoodie.datasource.write.keygenerator.class | N/A (Optional) | Key generator class that implements org.apache.hudi.keygen.KeyGenerator to extract a key out of incoming records.
  • When set, the configured value takes precedence and automatic inference is not triggered.
  • When not configured, if hoodie.datasource.write.keygenerator.type is set, the configured type is used; otherwise automatic inference is triggered.
  • In case of auto-generated record keys, if neither the key generator class nor the type is configured, Hudi also infers the partitioning automatically. For example, if the partition field is not configured, Hudi assumes the table is non-partitioned.
  Config Param: KEYGENERATOR_CLASS_NAME

hoodie.datasource.write.hive_style_partitioning | false (Optional) | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values).
  Config Param: HIVE_STYLE_PARTITIONING_ENABLE

hoodie.datasource.write.partitionpath.urlencode | false (Optional) | Whether to URL-encode the partition path value before creating the folder structure.
  Config Param: URL_ENCODE_PARTITIONING

For all advanced configs, refer to the Hudi configurations reference.
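As a rough illustration of how these configs fit together, the sketch below collects them into an options dictionary of the kind typically passed to a Spark DataFrame writer. Only the config names come from the table above; the field names uuid and city are hypothetical, and the commented-out write call assumes a PySpark session with the Hudi bundle available (other required options, such as the table name, are omitted).

```python
# Hypothetical key-related write options for a table keyed on `uuid`
# and partitioned by `city`. The field names are made up for illustration.
hudi_key_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.keygenerator.type": "SIMPLE",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.partitionpath.urlencode": "false",
}

# Assumed usage with PySpark (not executed here):
# df.write.format("hudi").options(**hudi_key_options).mode("append").save(base_path)

for name, value in sorted(hudi_key_options.items()):
    print(f"{name}={value}")
```

Leaving out the recordkey.field entry would, per the table above, let Hudi auto-generate record keys instead.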

SIMPLE

This is the most commonly used option. The key is generated from two fields in the schema: one for the record key and one for the partition path. Values are taken as-is from the DataFrame and converted to strings.
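The behaviour described above can be pictured with a small Python sketch. This is an illustration only, not Hudi's implementation: the function name simple_key and the sample record are hypothetical, and null handling and nested fields are ignored.

```python
from typing import Any, Mapping, Tuple


def simple_key(record: Mapping[str, Any], key_field: str, partition_field: str) -> Tuple[str, str]:
    """Return (record_key, partition_path), each read from one field and stringified."""
    return str(record[key_field]), str(record[partition_field])


# Hypothetical input row
row = {"uuid": 334, "city": "sao_paulo", "fare": 27.7}
print(simple_key(row, "uuid", "city"))  # ('334', 'sao_paulo')
```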

COMPLEX

Both the record key and the partition path can comprise one or more fields, referenced by name. Fields are expected to be comma-separated in the config value. For example: "hoodie.datasource.write.recordkey.field" : "col1,col4"
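A minimal sketch of how a comma-separated field list might translate into a key follows. It assumes the record key is rendered as name:value pairs joined by commas and the partition path as values joined by "/"; the helper names and the sample record are hypothetical, and this is not Hudi's actual code.

```python
from typing import Any, List, Mapping, Sequence, Tuple


def split_fields(config_value: str) -> List[str]:
    """Split a comma-separated config value such as 'col1,col4' into field names."""
    return [name.strip() for name in config_value.split(",") if name.strip()]


def complex_key(record: Mapping[str, Any],
                key_fields: Sequence[str],
                partition_fields: Sequence[str]) -> Tuple[str, str]:
    # Assumed layout: "field1:value1,field2:value2" for the record key
    record_key = ",".join(f"{name}:{record[name]}" for name in key_fields)
    # Assumed layout: "value1/value2" for the partition path
    partition_path = "/".join(str(record[name]) for name in partition_fields)
    return record_key, partition_path


# Hypothetical input row
row = {"col1": "a1", "col4": 42, "region": "emea", "day": "2024-01-15"}
key, path = complex_key(row, split_fields("col1,col4"), split_fields("region,day"))
print(key)   # col1:a1,col4:42
print(path)  # emea/2024-01-15
```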

NON_PARTITION

If your Hudi dataset is not partitioned, you can use NonpartitionedKeyGenerator, which returns an empty partition for all records. In other words, all records go to the same partition (the empty string "").

CUSTOM

This is a generic implementation of KeyGenerator where users are able to leverage the benefits of SimpleKeyGenerator, ComplexKeyGenerator and TimestampBasedKeyGenerator all at the same time. One can configure record key and partition paths as a single field or a combination of fields.

  1. hoodie.datasource.write.recordkey.field
  2. hoodie.datasource.write.partitionpath.field
  3. hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator

This key generator is particularly useful if you want to define complex partition paths involving regular fields and timestamp-based fields. It expects the value of the property "hoodie.datasource.write.partitionpath.field" in a specific format: "field1:PartitionKeyType1,field2:PartitionKeyType2…"

The complete partition path is created as <value for field1 based on PartitionKeyType1>/<value for field2 based on PartitionKeyType2> and so on. Each partition key type can be either SIMPLE or TIMESTAMP.

Example config value: “field_3:simple,field_5:timestamp”

The record key config value is either a single field, as with SimpleKeyGenerator, or a comma-separated list of field names, as with ComplexKeyGenerator. Example:

  1. hoodie.datasource.write.recordkey.field=field1,field2

This creates the record key in the format field1:value1,field2:value2 and so on. Alternatively, specify a single field for simple record keys.

The CustomKeyGenerator class defines an enum, PartitionKeyType, for configuring partition paths. It can take two possible values: SIMPLE and TIMESTAMP. For partitioned tables, the value of the hoodie.datasource.write.partitionpath.field property needs to be provided in the format field1:PartitionKeyType1,field2:PartitionKeyType2 and so on. For example, to create a partition path from two fields, country and date, where the latter holds timestamp-based values that need to be rendered in a given format, specify the following:

  1. hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP

This will create the partition path in the format <country_name>/<date> or country=<country_name>/date=<date> depending on whether you want hive style partitioning or not.
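To make the field:PartitionKeyType format concrete, here is a hedged Python sketch that parses such a config value and assembles a partition path, with and without Hive-style naming. The function custom_partition_path, the to_day formatter and the sample record are hypothetical; the real CustomKeyGenerator delegates timestamp rendering to the timestamp-based key generator configs rather than to a callback.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Mapping


def custom_partition_path(record: Mapping[str, Any],
                          partition_config: str,
                          format_timestamp: Callable[[Any], str],
                          hive_style: bool = False) -> str:
    """Build a partition path from a 'field1:TYPE1,field2:TYPE2' config value."""
    parts = []
    for entry in partition_config.split(","):
        field, key_type = entry.strip().split(":")
        key_type = key_type.upper()
        if key_type == "SIMPLE":
            rendered = str(record[field])
        elif key_type == "TIMESTAMP":
            rendered = format_timestamp(record[field])
        else:
            raise ValueError(f"unsupported partition key type: {key_type}")
        parts.append(f"{field}={rendered}" if hive_style else rendered)
    return "/".join(parts)


def to_day(epoch_millis: int) -> str:
    """Hypothetical timestamp formatter: epoch milliseconds to a UTC day string."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (epoch + timedelta(milliseconds=epoch_millis)).strftime("%Y-%m-%d")


row = {"country": "IN", "date": 1578283932000}
config = "country:SIMPLE,date:TIMESTAMP"
print(custom_partition_path(row, config, to_day))                   # IN/2020-01-06
print(custom_partition_path(row, config, to_day, hive_style=True))  # country=IN/date=2020-01-06
```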

TIMESTAMP

This key generator relies on timestamps for the partition field. The field values are interpreted as timestamps rather than simply converted to strings when generating the partition path value for records. The record key is the same as before, chosen by field name. Users are expected to set a few more configs to use this key generator.

Configs to be set:

Config Name | Default | Description
hoodie.keygen.timebased.timestamp.type | N/A (Required) | Required only when the key generator is TimestampBasedKeyGenerator. One of the supported timestamp types (UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR).
hoodie.keygen.timebased.output.dateformat | "" (Optional) | Output date format, such as yyyy-MM-dd'T'HH:mm:ss.SSSZ.
hoodie.keygen.timebased.timezone | "UTC" (Optional) | Timezone of both input and output timestamps if they are the same, such as UTC. Use hoodie.keygen.timebased.input.timezone and hoodie.keygen.timebased.output.timezone instead if the input and output timezones differ.
hoodie.keygen.timebased.input.dateformat | "" (Optional) | Input date format, such as yyyy-MM-dd'T'HH:mm:ss.SSSZ.

Let’s go over some example values for TimestampBasedKeyGenerator.

Timestamp is GMT

Config Name | Value
hoodie.streamer.keygen.timebased.timestamp.type | "EPOCHMILLISECONDS"
hoodie.streamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh"
hoodie.streamer.keygen.timebased.timezone | "GMT+8:00"

Input Field value: “1578283932000L”
Partition path generated from key generator: “2020-01-06 12”

If input field value is null for some rows.
Partition path generated from key generator: “1970-01-01 08”
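The arithmetic behind this example can be checked with a short Python sketch. It is not Hudi code: the function name is hypothetical, Python's "%Y-%m-%d %I" stands in for the pattern "yyyy-MM-dd hh" (both use the 12-hour clock for the hour), and the null case is modelled as epoch 0, which is an assumption.

```python
from datetime import datetime, timedelta, timezone


def partition_from_epoch_millis(millis: int, utc_offset_hours: int, fmt: str) -> str:
    """Render epoch milliseconds in a fixed-offset timezone using a strftime pattern."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    local = (epoch + timedelta(milliseconds=millis)).astimezone(
        timezone(timedelta(hours=utc_offset_hours)))
    return local.strftime(fmt)


# 1578283932000 ms is 2020-01-06 04:12:12 UTC, i.e. 12:12:12 at GMT+8
print(partition_from_epoch_millis(1578283932000, 8, "%Y-%m-%d %I"))  # 2020-01-06 12
# Null input modelled as epoch 0 (assumption): 08:00 at GMT+8
print(partition_from_epoch_millis(0, 8, "%Y-%m-%d %I"))              # 1970-01-01 08
```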

Timestamp is DATE_STRING

Config Name | Value
hoodie.streamer.keygen.timebased.timestamp.type | "DATE_STRING"
hoodie.streamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh"
hoodie.streamer.keygen.timebased.timezone | "GMT+8:00"
hoodie.streamer.keygen.timebased.input.dateformat | "yyyy-MM-dd hh:mm:ss"

Input field value: “2020-01-06 12:12:12”
Partition path generated from key generator: “2020-01-06 12”

If input field value is null for some rows.
Partition path generated from key generator: “1970-01-01 12:00:00”

Scalar examples

Config Name | Value
hoodie.streamer.keygen.timebased.timestamp.type | "SCALAR"
hoodie.streamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh"
hoodie.streamer.keygen.timebased.timezone | "GMT"
hoodie.streamer.keygen.timebased.timestamp.scalar.time.unit | "days"

Input field value: “20000L”
Partition path generated from key generator: “2024-10-04 12”

If input field value is null.
Partition path generated from key generator: “1970-01-02 12”
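A scalar value is a count of time units since the epoch, so 20000 days lands on 4 October 2024. The sketch below reproduces that in Python; the function name is hypothetical and "%Y-%m-%d %I" stands in for "yyyy-MM-dd hh". The null-input result shown above is consistent with a fallback scalar of 1, which is an inference from the output rather than something this page states.

```python
from datetime import datetime, timedelta, timezone

# Time units that datetime.timedelta accepts directly as keyword arguments
_SUPPORTED_UNITS = {"milliseconds", "seconds", "minutes", "hours", "days"}


def partition_from_scalar(value: int, time_unit: str, fmt: str) -> str:
    """Interpret value as a count of time_unit since the Unix epoch (GMT) and format it."""
    if time_unit not in _SUPPORTED_UNITS:
        raise ValueError(f"unsupported time unit: {time_unit}")
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (epoch + timedelta(**{time_unit: value})).strftime(fmt)


# Midnight renders as "12" on the 12-hour clock
print(partition_from_scalar(20000, "days", "%Y-%m-%d %I"))  # 2024-10-04 12
print(partition_from_scalar(1, "days", "%Y-%m-%d %I"))      # 1970-01-02 12
```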

ISO8601WithMsZ with Single Input format

Config Name | Value
hoodie.streamer.keygen.timebased.timestamp.type | "DATE_STRING"
hoodie.streamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | ""
hoodie.streamer.keygen.timebased.input.timezone | ""
hoodie.streamer.keygen.timebased.output.dateformat | "yyyyMMddHH"
hoodie.streamer.keygen.timebased.output.timezone | "GMT"

Input field value: “2020-04-01T13:01:33.428Z”
Partition path generated from key generator: “2020040113”

ISO8601WithMsZ with Multiple Input formats

Config Name | Value
hoodie.streamer.keygen.timebased.timestamp.type | "DATE_STRING"
hoodie.streamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ"
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | ""
hoodie.streamer.keygen.timebased.input.timezone | ""
hoodie.streamer.keygen.timebased.output.dateformat | "yyyyMMddHH"
hoodie.streamer.keygen.timebased.output.timezone | "UTC"

Input field value: “2020-04-01T13:01:33.428Z”
Partition path generated from key generator: “2020040113”

ISO8601NoMs with offset using multiple input formats

Config Name | Value
hoodie.streamer.keygen.timebased.timestamp.type | "DATE_STRING"
hoodie.streamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ"
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | ""
hoodie.streamer.keygen.timebased.input.timezone | ""
hoodie.streamer.keygen.timebased.output.dateformat | "yyyyMMddHH"
hoodie.streamer.keygen.timebased.output.timezone | "UTC"

Input field value: “2020-04-01T13:01:33-05:00”
Partition path generated from key generator: “2020040118”

Input as short date string and expect date in date format

Config Name | Value
hoodie.streamer.keygen.timebased.timestamp.type | "DATE_STRING"
hoodie.streamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ,yyyyMMdd"
hoodie.streamer.keygen.timebased.input.dateformat.list.delimiter.regex | ""
hoodie.streamer.keygen.timebased.input.timezone | "UTC"
hoodie.streamer.keygen.timebased.output.dateformat | "MM/dd/yyyy"
hoodie.streamer.keygen.timebased.output.timezone | "UTC"

Input field value: “20200401”
Partition path generated from key generator: “04/01/2020”
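The DATE_STRING examples with several input formats boil down to trying each format in turn, normalising to the output timezone, and rendering the output pattern. The Python sketch below imitates that with strptime patterns that roughly correspond to the Java patterns above. It is an approximation for illustration: the function name and the first-match-wins strategy are assumptions, not a description of Hudi's parser.

```python
from datetime import datetime, timezone
from typing import Sequence

# Rough strptime equivalents of:
#   yyyy-MM-dd'T'HH:mm:ssZ, yyyy-MM-dd'T'HH:mm:ss.SSSZ, yyyyMMdd
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z", "%Y%m%d"]


def partition_from_date_string(value: str,
                               input_formats: Sequence[str],
                               output_format: str,
                               input_tz: timezone = timezone.utc,
                               output_tz: timezone = timezone.utc) -> str:
    """Parse value with the first matching input format and render it in output_tz."""
    for fmt in input_formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        if parsed.tzinfo is None:
            # No offset in the input string: fall back to the configured input timezone
            parsed = parsed.replace(tzinfo=input_tz)
        return parsed.astimezone(output_tz).strftime(output_format)
    raise ValueError(f"no input format matched {value!r}")


print(partition_from_date_string("2020-04-01T13:01:33.428Z", INPUT_FORMATS, "%Y%m%d%H"))   # 2020040113
print(partition_from_date_string("2020-04-01T13:01:33-05:00", INPUT_FORMATS, "%Y%m%d%H"))  # 2020040118
print(partition_from_date_string("20200401", INPUT_FORMATS, "%m/%d/%Y"))                   # 04/01/2020
```

The second call shows why the output timezone matters: 13:01 at an offset of -05:00 is 18:01 in UTC, so the record lands in the hour-18 partition.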