Exporter

Introduction

HoodieSnapshotExporter allows you to copy data from one location to another for backups or other purposes. You can write the data out as Hudi, JSON, or Parquet. In addition to copying data, you can also repartition the output with a provided field, or implement custom repartitioning by extending a class shown in detail below.

Arguments

HoodieSnapshotExporter accepts a reference to a source path and a destination path. The utility scans the source dataset, applies any requested transformation and repartitioning, and writes the result in Hudi, Parquet, or JSON format.

| Argument | Description | Required | Note |
|----------|-------------|----------|------|
| --source-base-path | Base path for the source Hudi dataset to be snapshotted | required | |
| --target-output-path | Output path for storing a particular snapshot | required | |
| --output-format | Output format for the exported dataset; accepts these values: json, parquet, hudi | required | |
| --output-partition-field | A field to be used by Spark repartitioning | optional | Ignored when output-format is "hudi" or when --output-partitioner is specified. The output dataset's default partition field will inherit from the source Hudi dataset. |
| --output-partitioner | A class to facilitate custom repartitioning | optional | Ignored when using output-format "hudi" |
| --transformer-class | A subclass of org.apache.hudi.utilities.transform.Transformer. Allows transforming the raw source Dataset to a target Dataset (conforming to the target schema) before writing. | optional | Ignored when using output-format "hudi". Available transformers: org.apache.hudi.utilities.transform.SqlQueryBasedTransformer, org.apache.hudi.utilities.transform.SqlFileBasedTransformer, org.apache.hudi.utilities.transform.FlatteningTransformer, org.apache.hudi.utilities.transform.AWSDmsTransformer. |
| --transformer-sql | SQL query template to be used to transform the source before writing. The query should reference the source as a table named "<SRC>". | optional | Required for the SqlQueryBasedTransformer transformer class; ignored in other cases |
| --transformer-sql-file | File with a SQL query to be executed during write. The query should reference the source as a table named "<SRC>". | optional | Required for SqlFileBasedTransformer; ignored in other cases |

Examples

Copy a Hudi dataset

Exporter scans the source dataset and then makes a copy of it to the target output path.

  spark-submit \
    --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
    --deploy-mode "client" \
    --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
    packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
    --source-base-path "/tmp/" \
    --target-output-path "/tmp/exported/hudi/" \
    --output-format "hudi"

Export to json or parquet dataset

The Exporter can also convert the source dataset into other formats. Currently only "json" and "parquet" are supported.

  spark-submit \
    --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
    --deploy-mode "client" \
    --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
    packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
    --source-base-path "/tmp/" \
    --target-output-path "/tmp/exported/json/" \
    --output-format "json" # or "parquet"

Export to json or parquet dataset with transformation/filtering

The Exporter supports custom transformation/filtering of records before writing them to a JSON or Parquet dataset. This is done by supplying an implementation of org.apache.hudi.utilities.transform.Transformer via the --transformer-class option.

  spark-submit \
    --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
    --deploy-mode "client" \
    --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
    packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
    --source-base-path "/tmp/" \
    --target-output-path "/tmp/exported/json/" \
    --transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
    --transformer-sql "SELECT substr(rider,1,10) as rider, trip_type as tripType FROM <SRC> WHERE trip_type = 'BLACK' LIMIT 10" \
    --output-format "json" # or "parquet"
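
To run the same kind of transformation from a file instead of an inline query, use SqlFileBasedTransformer together with --transformer-sql-file (see the arguments table above). A minimal sketch, assuming a hypothetical query file /tmp/transform.sql that references the source as a table named "<SRC>":

  spark-submit \
    --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
    --deploy-mode "client" \
    --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
    packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
    --source-base-path "/tmp/" \
    --target-output-path "/tmp/exported/json/" \
    --transformer-class "org.apache.hudi.utilities.transform.SqlFileBasedTransformer" \
    --transformer-sql-file "/tmp/transform.sql" \
    --output-format "json" # or "parquet"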

Re-partitioning

When exporting to a different format, the Exporter takes the --output-partition-field parameter to do custom re-partitioning. Note: all _hoodie_* metadata fields will be stripped during export, so make sure to use an existing non-metadata field as the output partition field.

By default, if no partitioning parameters are given, the output dataset will not be partitioned.

Example:

  spark-submit \
    --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
    --deploy-mode "client" \
    --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
    packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
    --source-base-path "/tmp/" \
    --target-output-path "/tmp/exported/json/" \
    --output-format "json" \
    --output-partition-field "symbol" # assume the source dataset contains a field `symbol`

The output directory will look like this:

  _SUCCESS symbol=AMRS symbol=AYX symbol=CDMO symbol=CRC symbol=DRNA ...
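
Since the export is plain JSON laid out under symbol=... directories, it can be read back like any partitioned Spark datasource to verify the result. A minimal sketch in Java; the class name VerifyExport and the local master setting are illustrative, not part of the Exporter:

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class VerifyExport {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
          .appName("verify-export")
          .master("local[*]")
          .getOrCreate();
      // Spark discovers the symbol=... directories as a partition column
      Dataset<Row> exported = spark.read().json("/tmp/exported/json/");
      exported.groupBy("symbol").count().show();
      spark.stop();
    }
  }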

Custom Re-partitioning

The --output-partitioner parameter takes the fully-qualified name of a class that implements HoodieSnapshotExporter.Partitioner. This parameter takes precedence over --output-partition-field, which is ignored when this one is provided.

An example implementation is shown below:

MyPartitioner.java

  package com.foo.bar;

  import org.apache.hudi.common.model.HoodieRecord;
  import org.apache.hudi.utilities.HoodieSnapshotExporter;

  import org.apache.spark.sql.Column;
  import org.apache.spark.sql.DataFrameWriter;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;

  public class MyPartitioner implements HoodieSnapshotExporter.Partitioner {

    private static final String PARTITION_NAME = "date";

    @Override
    public DataFrameWriter<Row> partition(Dataset<Row> source) {
      // use the current hoodie partition path as the output partition
      return source
          .withColumnRenamed(HoodieRecord.PARTITION_PATH_METADATA_FIELD, PARTITION_NAME)
          .repartition(new Column(PARTITION_NAME))
          .write()
          .partitionBy(PARTITION_NAME);
    }
  }

After putting this class in my-custom.jar, which is then placed on the job classpath, the submit command will look like this:

  spark-submit \
    --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar,my-custom.jar" \
    --deploy-mode "client" \
    --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
    packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
    --source-base-path "/tmp/" \
    --target-output-path "/tmp/exported/json/" \
    --output-format "json" \
    --output-partitioner "com.foo.bar.MyPartitioner"