Batch Reads

Spark DataSource API

The hudi-spark module offers the DataSource API to read a Hudi table into a Spark DataFrame.

A time-travel query example:

  val tripsDF = spark.read.
    option("as.of.instant", "2021-07-28 14:11:08.000").
    format("hudi").
    load(basePath)
  tripsDF.where(tripsDF("fare") > 20.0).show()
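The `as.of.instant` option also accepts other timestamp formats. The sketch below is illustrative, assuming the same `basePath` and a SparkSession with the Hudi bundle on the classpath; the exact set of accepted formats is described in the Hudi time-travel documentation:

```scala
// Equivalent time-travel reads using alternative instant formats
// (a sketch; assumes commits exist at or before the given instants).
val byInstant = spark.read.
  option("as.of.instant", "20210728141108000"). // compact commit-time form
  format("hudi").
  load(basePath)

val byDate = spark.read.
  option("as.of.instant", "2021-07-28"). // date-only form
  format("hudi").
  load(basePath)
```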

Daft

Daft supports reading Hudi tables using the daft.read_hudi() function.

  # Read an Apache Hudi table into a Daft DataFrame.
  import daft
  df = daft.read_hudi("some-table-uri")
  df = df.where(df["foo"] > 5)
  df.show()

Check out the Daft documentation for details on its Hudi integration.