Data Quality

Data quality refers to the overall accuracy, completeness, consistency, and validity of data. Ensuring data quality is vital for accurate analysis and reporting, as well as for compliance with regulations and maintaining trust in your organization’s data infrastructure.

Hudi offers Pre-Commit Validators that allow you to ensure that your data meets certain data quality expectations as you are writing with Hudi Streamer or Spark Datasource writers.

To configure pre-commit validators, set hoodie.precommit.validators=<comma-separated list of validator class names>.

Example:

  // df is the DataFrame being written
  df.write.format("hudi")
    .option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator")
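Because the setting takes a comma-separated list, more than one validator can be named in a single write. The sketch below only assembles that option value in plain Java; the class ValidatorOptions and the method build are hypothetical helpers for illustration, while the option key and the validator class names are the ones listed on this page. Each validator still needs its own query setting, as described in the sections that follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Hudi): builds the validator option as a plain map.
final class ValidatorOptions {
    static final String PKG = "org.apache.hudi.client.validator.";

    static Map<String, String> build() {
        Map<String, String> opts = new LinkedHashMap<>();
        // Join the fully qualified class names with commas, as the setting expects.
        opts.put("hoodie.precommit.validators", String.join(",",
            PKG + "SqlQuerySingleResultPreCommitValidator",
            PKG + "SqlQueryEqualityPreCommitValidator"));
        return opts;
    }

    public static void main(String[] args) {
        // Print each key=value pair so it can be copied into writer options.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting entries can be passed to the writer one by one with option(key, value).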

You can use any of the following built-in validators, or write your own by extending the base validator class:

SQL Query Single Result

org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator

The SQL Query Single Result validator checks that a query on the table returns a specific value. It runs a SQL query and aborts the commit if the result does not match the expected output.

Separate multiple queries with the ; delimiter. Append the expected result to each query, separated by #.

Syntax: query1#result1;query2#result2

Example:

  // In this example, the validator expects that no row has a null value in column `col`
  import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
  import org.apache.spark.sql.SaveMode.Overwrite

  df.write.format("hudi").mode(Overwrite).
    option("hoodie.table.name", tableName).
    option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
    option("hoodie.precommit.validators.single.value.sql.queries", "select count(*) from <TABLE_NAME> where col is null#0").
    save(basePath)
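To make the query#result;query#result syntax concrete, here is a minimal sketch of how such a string breaks down into query and expected-result pairs. It is not Hudi's parser: the class SingleResultQuerySpec and its parse method are hypothetical, and the column id in the second query is an assumed example column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "query1#result1;query2#result2" syntax.
final class SingleResultQuerySpec {
    final String query;
    final String expected;

    SingleResultQuerySpec(String query, String expected) {
        this.query = query;
        this.expected = expected;
    }

    // Splits the setting on ';' into entries, then each entry on '#'.
    static List<SingleResultQuerySpec> parse(String config) {
        List<SingleResultQuerySpec> specs = new ArrayList<>();
        for (String entry : config.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank entries between delimiters
            }
            String[] parts = trimmed.split("#");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected query#result, got: " + trimmed);
            }
            specs.add(new SingleResultQuerySpec(parts[0].trim(), parts[1].trim()));
        }
        return specs;
    }

    public static void main(String[] args) {
        // The `id` column is an assumption made for this example.
        String config = "select count(*) from <TABLE_NAME> where col is null#0;"
            + "select count(distinct id) from <TABLE_NAME>#100";
        for (SingleResultQuerySpec spec : parse(config)) {
            System.out.println(spec.query + " => expects " + spec.expected);
        }
    }
}
```

Since ; and # act as delimiters in this syntax, the sketch assumes neither character appears inside a query or an expected value.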

SQL Query Equality

org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator

The SQL Query Equality validator runs a query before ingesting the data, then runs the same query after ingesting the data and confirms that both outputs match. This allows you to validate for equality of rows before and after the commit.

This validator is useful when you want to verify that a write does not change a specific subset of the data. Some examples:

  • Validate that the number of null fields is the same before and after the write
  • Validate that there are no duplicate records after the write
  • Validate that you are only updating existing data, and no inserts slip through

Example:

  // In this example, the validator expects the number of rows with a null `col` to be unchanged by the new commit
  import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
  import org.apache.spark.sql.SaveMode.Overwrite

  df.write.format("hudi").mode(Overwrite).
    option("hoodie.table.name", tableName).
    option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator").
    option("hoodie.precommit.validators.equality.sql.queries", "select count(*) from <TABLE_NAME> where col is null").
    save(basePath)

SQL Query Inequality

org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator

The SQL Query Inequality validator runs a query before ingesting the data, then runs the same query after ingesting the data and confirms that the two outputs DO NOT match. This allows you to confirm that the rows changed between the state before the commit and the state after it.

Example:

  // In this example, the validator expects the number of rows with a null `col` to change with the new commit
  import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
  import org.apache.spark.sql.SaveMode.Overwrite

  df.write.format("hudi").mode(Overwrite).
    option("hoodie.table.name", tableName).
    option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator").
    option("hoodie.precommit.validators.inequality.sql.queries", "select count(*) from <TABLE_NAME> where col is null").
    save(basePath)
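The equality and inequality validators run the same before and after queries and differ only in which outcome they accept. The sketch below models that decision in plain Java under a simplifying assumption: each query result is represented as an ordered list of rendered rows. The class BeforeAfterCheck and its members are hypothetical and do not reflect how Hudi compares Spark datasets internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the pass/fail rule for the two validators.
final class BeforeAfterCheck {
    enum Mode { EQUALITY, INEQUALITY }

    // EQUALITY passes when the results match; INEQUALITY passes when they differ.
    static boolean passes(Mode mode, List<String> before, List<String> after) {
        boolean same = before.equals(after);
        return mode == Mode.EQUALITY ? same : !same;
    }

    public static void main(String[] args) {
        // Example: the null-row count was 3 before the commit and 5 after it.
        List<String> before = Arrays.asList("3");
        List<String> after = Arrays.asList("5");
        System.out.println("equality passes: " + passes(Mode.EQUALITY, before, after));
        System.out.println("inequality passes: " + passes(Mode.INEQUALITY, before, after));
    }
}
```

In this model a failed check corresponds to the case where the validator would abort the commit.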

Extend Custom Validator

Users can also provide their own implementations by extending the abstract class SparkPreCommitValidator and overriding this method:

  void validateRecordsBeforeAndAfter(Dataset<Row> before,
                                     Dataset<Row> after,
                                     Set<String> partitionsAffected)

Additional Monitoring with Notifications

Hudi offers a commit notification service that can be configured to trigger notifications about write commits.

The commit notification service can be combined with pre-commit validators to send a notification when a commit fails a validation. This is possible by passing details about the validation as a custom value to the HTTP endpoint.
