Continuous Data Validation in DM

This document describes how to use continuous data validation in DM, its working principles, and its limitations.

User scenario

In the process of incrementally migrating data from the upstream database to the downstream database, there is a small probability that the flow of data leads to data corruption or data loss. For scenarios where data consistency is required, such as the credit and securities industries, after the migration is complete, you can perform full data validation to ensure data consistency.

However, in incremental migration scenarios, the upstream and downstream are continuously writing data. Because data is constantly changing in the upstream and downstream, it is difficult to perform full data validation (for example, use sync-diff-inspector) to all data in the tables.

In incremental migration scenarios, you can use the continuous data validation feature in DM. This feature ensures data integrity and consistency during incremental migration where data is continuously written into the downstream.

Enable continuous data validation

You can enable continuous data validation using either of the following methods:

  • Enable in the task configuration file.
  • Enable using dmctl.

Method 1: Enable in the task configuration file

To enable continuous data validation, add the following configuration items to the task configuration file:

  1. # Add the following configuration items to the upstream database that needs to be validated:
  2. mysql-instances:
  3. - source-id: "mysql1"
  4. block-allow-list: "bw-rule-1"
  5. validator-config-name: "global"
  6. validators:
  7. global:
  8. mode: full # "fast" is also allowed. "none" is the default mode, which means no validation is performed.
  9. worker-count: 4 # The number of validation workers in the background. The default value is 4.
  10. row-error-delay: 30m # If a row cannot pass the validation within the specified time, it will be marked as an error row. The default value is 30m, which means 30 minutes.

The configuration items are described as follows:

  • mode: validation mode. The possible values are none, full, and fast.
    • none: the default value, which means no validation is performed.
    • full: compares the changed row and the row obtained in the downstream database.
    • fast: only checks if the changed row exists in the downstream database.
  • worker-count: the number of validation workers in the background. Each worker is a goroutine.
  • row-error-delay: if a row cannot pass the validation within the specified time, it will be marked as an error row. The default value is 30 minutes.

For the complete configuration, refer to DM Advanced Task Configuration File.

Method 2: Enable using dmctl

To enable continuous data validation, run the dmctl validation start command:

  1. Usage:
  2. dmctl validation start [--all-task] [task-name] [flags]
  3. Flags:
  4. --all-task whether applied to all tasks
  5. -h, --help help for start
  6. --mode string specify the mode of validation: full (default), fast; this flag will be ignored if the validation task has been ever enabled but currently paused (default "full")
  7. --start-time string specify the start time of binlog for validation, e.g. '2021-10-21 00:01:00' or 2021-10-21T00:01:00
  • --mode: specify the validation mode. The possible values are fast and full.
  • --start-time: specify the start time for validation. The format follows 2021-10-21 00:01:00 or 2021-10-21T00:01:00.
  • task: specify the name of the task to enable continuous validation for. You can use --all-task to enable validation for all tasks.

For example:

  1. dmctl --master-addr=127.0.0.1:8261 validation start --start-time 2021-10-21T00:01:00 --mode full my_dm_task

Use continuous data validation

When you use continuous data validation, you can use dmctl to view the status of the validation and to handle the error rows. “Error rows” refers to the rows that are found to be inconsistent between the upstream and downstream databases.

View the validation status

You can view the validation status using either of the following methods:

Method 1: run the dmctl query-status <task-name> command. If continuous data validation is enabled, the validation result is displayed in the validation field of each subtask. Example output:

  1. "subTaskStatus": [
  2. {
  3. "name": "test",
  4. "stage": "Running",
  5. "unit": "Sync",
  6. "result": null,
  7. "unresolvedDDLLockID": "",
  8. "sync": {
  9. ...
  10. },
  11. "validation": {
  12. "task": "test", // Task name
  13. "source": "mysql-01", // Source id
  14. "mode": "full", // Validation mode
  15. "stage": "Running", // Current stage. "Running" or "Stopped".
  16. "validatorBinlog": "(mysql-bin.000001, 5989)", // The binlog position of the validation
  17. "validatorBinlogGtid": "1642618e-cf65-11ec-9e3d-0242ac110002:1-30", // The GTID position of the validation
  18. "cutoverBinlogPos": "", // The specified binlog position for cutover
  19. "cutoverBinlogGTID": "1642618e-cf65-11ec-9e3d-0242ac110002:1-30", // The specified GTID position for cutover
  20. "result": null, // When the validation is abnormal, show the error message
  21. "processedRowsStatus": "insert/update/delete: 0/0/0", // Statistics of the processed binlog rows.
  22. "pendingRowsStatus": "insert/update/delete: 0/0/0", // Statistics of the binlog rows that are not validated yet or that fail to be validated but are not marked as "error rows"
  23. "errorRowsStatus": "new/ignored/resolved: 0/0/0" // Statistics of the error rows. The three statuses are explained in the next section.
  24. }
  25. }
  26. ]

Method 2: run the dmctl validation status <taskname> command.

  1. dmctl validation status [--table-stage stage] <task-name> [flags]
  2. Flags:
  3. -h, --help help for status
  4. --table-stage string filter validation tables by stage: running/stopped

In the preceding command, you can use --table-stage to filter the tables that are being validated or stop validation. Example output:

  1. {
  2. "result": true,
  3. "msg": "",
  4. "validators": [
  5. {
  6. "task": "test",
  7. "source": "mysql-01",
  8. "mode": "full",
  9. "stage": "Running",
  10. "validatorBinlog": "(mysql-bin.000001, 6571)",
  11. "validatorBinlogGtid": "",
  12. "cutoverBinlogPos": "(mysql-bin.000001, 6571)",
  13. "cutoverBinlogGTID": "",
  14. "result": null,
  15. "processedRowsStatus": "insert/update/delete: 2/0/0",
  16. "pendingRowsStatus": "insert/update/delete: 0/0/0",
  17. "errorRowsStatus": "new/ignored/resolved: 0/0/0"
  18. }
  19. ],
  20. "tableStatuses": [
  21. {
  22. "source": "mysql-01", // Source id
  23. "srcTable": "`db`.`test1`", // Source table name
  24. "dstTable": "`db`.`test1`", // Target table name
  25. "stage": "Running", // Validation status
  26. "message": "" // Error message
  27. }
  28. ]
  29. }

If you want to view the details of the error rows, such as error types and error time, run the dmctl validation show-error command:

  1. Usage:
  2. dmctl validation show-error [--error error-state] <task-name> [flags]
  3. Flags:
  4. --error string filtering type of error: all, ignored, or unprocessed (default "unprocessed")
  5. -h, --help help for show-error

Example output:

  1. {
  2. "result": true,
  3. "msg": "",
  4. "error": [
  5. {
  6. "id": "1", // Error row id, which will be used in processing error rows
  7. "source": "mysql-replica-01", // Source id
  8. "srcTable": "`validator_basic`.`test`", // Source table of the error row
  9. "srcData": "[0, 0]", // Data of the error row in the source table
  10. "dstTable": "`validator_basic`.`test`", // Target table of the error row
  11. "dstData": "[]", // Data of the error row in the target table
  12. "errorType": "Expected rows not exist", // Error type
  13. "status": "NewErr", // Error status
  14. "time": "2022-07-04 13:33:02", // Discovery time of the error row
  15. "message": "" // Additional information
  16. }
  17. ]
  18. }

Handle error rows

After continuous data validation returns error rows, you need to manually handle the error rows.

When continuous data validation finds error rows, the validation does not stop immediately. Instead, it records the error rows for you to handle. Before the error rows are processed, the default status is unprocessed. If you manually correct the error rows in the downstream, the validation does not automatically retrieve the latest status of the corrected data. The error rows are still recorded in the error field.

If you do not want to see an error row in the validation status, or if you want to mark an error row as resolved, you can locate the error row id using the validation show-error command and subsequently handle it with the given error id:

dmctl provides three error handling commands:

  • clear-error: clear the error row. The show-error command does not show the error row anymore.

    1. Usage:
    2. dmctl validation clear-error <task-name> <error-id|--all> [flags]
    3. Flags:
    4. --all all errors
    5. -h, --help help for clear-error
  • ignore-error: ignore the error row. This error row is marked as “ignored”.

    1. Usage:
    2. dmctl validation ignore-error <task-name> <error-id|--all> [flags]
    3. Flags:
    4. --all all errors
    5. -h, --help help for ignore-error
  • resolve-error: the error row is manually handled and marked as “resolved”.

    1. Usage:
    2. dmctl validation resolve-error <task-name> <error-id|--all> [flags]
    3. Flags:
    4. --all all errors
    5. -h, --help help for resolve-error

Stop continuous data validation

To stop the continuous data validation, run the validation stop command:

  1. Usage:
  2. dmctl validation stop [--all-task] [task-name] [flags]
  3. Flags:
  4. --all-task whether applied to all tasks
  5. -h, --help help for stop

For detailed usage, refer to dmctl validation start.

Set the cutover point for continuous data validation

Before switching the application to another database, you might need to perform continuous data validation immediately after the data is replicated to a specific position to ensure data integrity. To achieve this, you can set this specific position as the cutover point for continuous validation.

To set the cutover point for continuous data validation, use the validation update command:

  1. Usage:
  2. dmctl validation update <task-name> [flags]
  3. Flags:
  4. --cutover-binlog-gtid string specify the cutover binlog gtid for validation, only valid when source config's gtid is enabled, e.g. '1642618e-cf65-11ec-9e3d-0242ac110002:1-30'
  5. --cutover-binlog-pos string specify the cutover binlog name for validation, should include binlog name and pos in brackets, e.g. '(mysql-bin.000001, 5989)'
  6. -h, --help help for update
  • --cutover-binlog-gtid: specifies the cutover position for validation, in the format of 1642618e-cf65-11ec-9e3d-0242ac110002:1-30. Only valid when GTID is enabled in the upstream cluster.
  • --cutover-binlog-pos: specifies the cutover position for validation, in the format of (mysql-bin.000001, 5989).
  • task-name: the name of the task for continuous data validation. This parameter is required.

Implementation

The architecture of continuous data validation (validator) in DM is as follows:

validator summary

The lifecycle of continuous data validation is as follows:

validator lifecycle

The detailed implementation of continuous data validation is as follows:

  1. The validator pulls a binlog event from the upstream and gets the changed rows:
    • The validator only checks a event that has been incrementally migrated by the syncer. If the event has not been processed by the syncer, the validator pauses and waits for the syncer to complete processing.
    • If the event has been processed by the syncer, the validator moves on to the following steps.
  2. The validator parses the binlog event and filters out the rows based on the block and allow lists, the table filters, and table routing. After that, the validator submits the changed rows to the validation worker that runs in the background.
  3. The validation worker merges the changed rows that affect the same table and the same primary key to avoid validating “expired” data. The changed rows are cached in memory.
  4. When the validation worker has accumulated a certain number of changed rows or when a certain time interval is passed, the validation worker queries the downstream database using the primary keys to get the current data and compares it with the changed rows.
  5. The validation worker performs the data validation. If the validation mode is full, the validation worker compares data of the changed rows with data of the downstream database. if the validation mode is fast, the validation worker only checks the existence of the changed rows.
    • If the changed rows pass the validation, the changed row is removed from the memory.
    • If the changed rows fail the validation, the validator does not report an error immediately but waits for a certain time interval before validating the row again.
    • If a changed row cannot pass the validation within the specified time (specified by the user), the validator marks the row as an error row and writes it to the meta database in the downstream. You can view the information of error rows by querying the migration task. For details, refer to View the validation status and Handle error rows.

Limitations

  • The source table to be validated must have a primary key or a not-null unique key.
  • When DM migrates DDL from the upstream database, the following limitations apply:
    • The DDL must not change the primary key, or change the order of columns, or delete existing columns.
    • The table must not be dropped.
  • Does not support tasks that use expressions to filter events.
  • The precision of floating-point numbers is different between TiDB and MySQL. Differences smaller than 10^-6 are considered equal.
  • Does not support the following data types:
    • JSON
    • Binary data