4.9. Presto Verifier

Presto Verifier is a tool to run queries and verify correctness. It can be used to test whether anew Presto version produces the correct query results, or to test if pairs of Presto queries havethe same semantics.

During each Presto release, Verifier is run to ensure that there is no correctness regression.

Using Verifier

In a MySQL database, create the following table and load it with the queries you would like to run:

  1. CREATE TABLE verifier_queries (
  2. id int(11) unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  3. suite varchar(256) NOT NULL,
  4. name varchar(256) DEFAULT NULL,
  5. control_catalog varchar(256) NOT NULL,
  6. control_schema varchar(256) NOT NULL,
  7. control_query text NOT NULL,
  8. control_username varchar(256) DEFAULT NULL,
  9. control_password varchar(256) DEFAULT NULL,
  10. control_session_properties text DEFAULT NULL,
  11. test_catalog varchar(256) NOT NULL,
  12. test_schema varchar(256) NOT NULL,
  13. test_query text NOT NULL,
  14. test_username varchar(256) DEFAULT NULL,
  15. test_password varchar(256) DEFAULT NULL,
  16. test_session_properties text DEFAULT NULL)

Next, create a config.properties file:

  1. source-query.suite=my_suite
  2. source-query.database=jdbc:mysql://localhost:3306/my_database?user=my_username&password=my_password
  3. control.gateway=jdbc:presto://localhost:8080
  4. test.gateway=jdbc:presto://localhost:8081
  5. test-id=1

Download presto-verifier-0.234-executable.jar and rename it to verifier. To run the Verifier:

  1. chmod +x verifier
  2. ./verifier verify config.properties

Verifier Procedures

The following steps summarize the workflow of Verifier.

    • Importing Source Queries
      • Reads the list of source queries (query pairs with configuration) from the MySQL table.
    • Query Pre-processing and Filtering
      • Applies overrides to the catalog, schema, username, and password of each query.
      • Filters queries according to whitelist and blacklist. Whitelist is applied before blacklist.
      • Filters out queries with invalid syntax.
      • Filters out queries not supported for validation. Select, Insert, andCreateTableAsSelect are supported.
    • Query rewriting
      • Rewrites queries before execution to ensure that production data is not modified.
        • Rewrites Select queries to CreateTableAsSelect
          • Column names are determined by running the Select query with LIMIT 0.
          • Artificial names are used for unnamed columns.
        • Rewrites Insert and CreateTableAsSelect queries to have their table names replaced.
          • Constructs a setup query to create the table necessary for an Insert query.
    • Query Execution
      • Conceptually, Verifier is configured with a control cluster and a test cluster. However, theymay be pointed to the same Presto cluster for certain tests.
        • For each source query, executes the following queries in order.
          • Control setup queries
          • Control query
          • Test setup queries
          • Test query
          • Control and test teardown queries
        • Queries are subject to timeouts and retries.
          • Cluster connection failures and transient Presto failures are retried.
          • Query retries may conceal reliability issues, and therefore Verifier records alloccurred Presto query failures, including the retries.
      • Certain query failures are automatically submitted for re-validation, such as partitiondropped or table dropped during query.
      • See Failure Resolution for auto-resolving of query failures.
    • Results Comparison
      • For Select, Insert, and CreateTableAsSelect queries, results are written intotemporary tables.
      • Constructs and runs the checksum queries for both control and test.
      • Verifies table schema and row count are the same for the control and the test result table.
      • Verifies checksums are matching for each column. See Column Checksums for special handlingof different column types.
      • See Determinism for handling of non-deterministic queries.
    • Emitting Results
      • Verification results can be exported as JSON, or human readable text.

Column Checksums

For each column in the control/test query, one or more columns are generated in the checksumqueries.

    • Floating Point Columns
        • For DOUBLE and REAL columns, 4 columns are generated for verification:
          • Sum of the finite values of the column
          • NAN count of the column
          • Positive infinity count of the column
          • Negative infinity count of the column
      • Checks if NAN count, positive and negative infinity count matches.
      • Checks the nullity of control sum and test sum.
      • If either control mean or test mean very close 0, checks if both are close to 0.
      • Checks the relative error between control sum and test sum.
    • Array Columns
        • 2 columns are generated for verification:
          • Sum of the cardinality
          • Array checksum
        • For an array column arr of type array(E):
          • If E is not orderable, array checksum is checksum(arr).
          • If E is orderable, array checksum coalesce(checksum(try(array_sort(arr))), checksum(arr)).
    • Map Columns
        • 4 columns are generated for verification:
          • Sum of the cardinality
          • Checksum of the map
          • Array checksum of the key set
          • Array checksum of the value set
    • Row Columns
      • Checksums row fields recursively according to the type of the fields.
  • For all other column types, generates a simple checksum using the checksum() function.

Determinism

A result mismatch, either a row count mismatch or a column mismatch, can be caused bynon-deterministic query features. To avoid false alerts, we perform determinism analysisfor the control query. If a query is found non-deterministic, we skip the verification as itdoes not provide insights.

Determinism analysis follows the following steps. If a query is found non-deterministic at anypoint, the analysis will conclude.

  • Non-deterministic catalogs can be specified with determinism.non-deterministic-catalog.If a query references any table from those catalogs, the query is considered non-deterministic.
  • Runs the control query again and compares the results with the initial control query run.
    • If a query has a LIMIT n clause but no ORDER BY clause at the top level:
      • Runs a query to count the number of rows produced by the control query without the LIMITclause.
      • If the resulting row count is greater than n, treats the control query asnon-deterministic.

Failure Resolution

The differences in configuration, including cluster size, can cause a query to succeed on thecontrol cluster but fail on the test cluster. A checksum query can also fail, which may be due tolimitation of Presto or Presto Verifier. Thus, we allow Verifier to automatically resolve certainquery failures.

  • EXCEEDED_GLOBAL_MEMORY_LIMIT: Resolves if the control query uses more memory than the testquery.
  • EXCEEDED_TIME_LIMIT: Resolves unconditionally.
  • TOO_MANY_HIVE_PARTITIONS: Resolves if the test cluster does not have enough workers to makesure the number of partitions assigned to each worker stays within the limit.
  • COMPILER_ERROR: Resolves if checksum fails with this error. If a control query has too manycolumns, generated checksum query might be too large in certain cases.

Extending Verifier

Verifier can be extended for further behavioral changes in addition to configuration properties.

AbstractVerifyCommandshows the components that be extended. Implement the abstract class and create a command line wrapper similar toPrestoVerifier.

Configuration Reference

General Configuration

NameDescription
whitelistA comma-separated list specifying the names of the queries within the suiteto verify.
blacklistA comma-separated list specifying the names of the queries to be excludedfrom the suite. blacklist is applied after whitelist.
source-query-supplierThe name of the source query supplier. Supports mysql.
source-query.table-nameThe name of the table that holds verifier queries. Available only whensource-query-supplier is mysql.
event-clientsA comma-separated list specifying where the output events should be emitted.Supports json and human-readable.
json.log-fileThe output files of JSON events. If not set, JSON events are emitted tostdout.
human-readable.log-fileThe output files for human-readable events. If not set, human-readable eventsare emitted to stdout.
control.table-prefixThe table prefix to be appended to the control target table.
test.table-prefixThe table prefix to be appended to the test target table.
test-idA string to be attached to output events.
max-concurrencyMaximum number of concurrent verifications.
suite-repetitionHow many times a suite is verified.
query-repetitionHow many times a source query is verified.
relative-error-marginMaximum tolerable relative error between control sum and test sum of afloating point column.
absolute-error-marginFloating point averages that are below this threshold are treated as 0.
run-teardown-on-result-mismatchWhether to run teardown query in case of result mismatch.
verification-resubmission.limitA limit on how many times a source query can be re-submitted for verification.

Query Override Configuration

The following configurations control the behavior of query metadata modification before verification starts.Counterparts are also available for test queries with prefix control being replaced with test.

NameDescription
control.catalog-overrideThe catalog to be applied to all queries if specified.
control.schema-overrideThe schema to be applied to all queries if specified.
control.username-overrideThe username to be applied to all queries if specified.
control.password-overrideThe password to be applied to all queries if specified.
control.session-properties-override-strategySupports 3 values. NO_ACTION: Use the session properties as specified foreach query. OVERRIDE: Merge the session properties of each query with theoverride, with override being the dominant. SUBSTITUTE, The sessionproperties of each query is replaced with the override.
control.session-properties-overrideThe session property to be applied to all queries.

Query Execution Configuration

The following configurations control the behavior of query execution on the control cluster.Counterparts are also available for test clusters with prefix control being replaced with test.

NameDescription
control.hostHost of the control cluster.
control.jdbc-portJDBC port of the control cluster.
control.http-hostHTTP port of the control cluster.
control.jdbc-url-parametersA JSON map representing the additional URL parameters for control JDBC.
control.query-timeoutThe execution time limit of the control and the test queries.
control.metadata-timeoutThe execution time limit of DESC queries and LIMIT 0 queries.
control.checksum-timeoutThe execution time limit of checksum queries.

Determinism Analyzer Configuration

NameDescription
determinism.run-teardownWhether to run teardown queries for tables produced in determinism analysis.
determinism.max-analysis-runsMaximum number of additional control runs to check for the determinism of thecontrol query.
determinism.handle-limit-queryWhether to enable the special handling for queries with a top level LIMITclause.
determinism.non-deterministic-catalogsA comma-separated list of non-deterministic catalogs. Queries referencing tablefrom those catalogs are treated as non-deterministic.

Failure Resolution Configuration

NameDescription
exceeded-global-memory-limit.failure-resolver.enabledWhether to enable the failure resolver for test query failures withEXCEEDED_GLOBAL_MEMORY_LIMIT.
exceeded-time-limit.failure-resolver.enabledWhether to enable the failure resolver for test query failures withEXCEEDED_TIME_LIMIT.
verifier-limitation.failure-resolver.enabledWhether to enable the failure resolver for failures due to Verifierlimitations.
too-many-open-partitions.failure-resolver.enabledWhether to enable the failure resolver for test query failures withHIVE_TOO_MANY_OPEN_PARTITIONS.
too-many-open-partitions.max-buckets-per-writerThe maximum buckets count per writer configured on the control and thetest cluster.
too-many-open-partitions.cluster-size-expirationThe time limit of the test cluster size being cached.