TiDB Lightning Error Resolution

Starting from v5.4.0, you can configure TiDB Lightning to skip errors such as invalid type conversions and unique key conflicts, and to continue processing data as if the erroneous rows did not exist. A report is generated so that you can review and manually fix the errors afterward. This is ideal for importing from a slightly dirty data source, where locating the errors manually is difficult and restarting TiDB Lightning after every error is costly.

This document introduces how to use the type error feature (lightning.max-error) and the duplicate resolution feature (tikv-importer.duplicate-resolution). It also introduces the database where these errors are stored (lightning.task-info-schema-name). At the end of this document, an example is provided.

Type error

You can use the lightning.max-error configuration to increase the tolerance of errors related to data types. If this configuration is set to N, TiDB Lightning allows and skips up to N errors from the data source before it exits. The default value 0 means that no error is allowed.

These errors are recorded in a database. After the import is completed, you can view the errors in the database and process them manually. For more information, see Error Report.

    [lightning]
    max-error = 0

The above configuration covers the following errors:

  • Invalid values (example: set 'Text' to an INT column).
  • Numeric overflow (example: set 500 to a TINYINT column).
  • String overflow (example: set 'Very Long Text' to a VARCHAR(5) column).
  • Zero date-time (namely '0000-00-00' and '2021-12-00').
  • Set NULL to a NOT NULL column.
  • Failed to evaluate a generated column expression.
  • Column count mismatch. The number of values in the row does not match the number of columns of the table.
  • Unique/Primary key conflict in TiDB-backend, when on-duplicate = "error".
  • Any other SQL errors.

The following errors are always fatal, and cannot be skipped by changing max-error:

  • Syntax error (such as unclosed quotation marks) in the original CSV, SQL or Parquet file.
  • I/O, network or system permission errors.

Unique/Primary key conflict in the physical import mode is handled separately and explained in the next section.

Error report

If TiDB Lightning encounters errors during the import, it outputs a statistics summary about these errors in both your terminal and the log file when it exits.

  • The error report in the terminal is similar to the following table:

    | # | ERROR TYPE | ERROR COUNT | ERROR DATA TABLE                  |
    | 1 | Data Type  | 1000        | lightning_task_info.type_error_v1 |

  • The error report in the TiDB Lightning log file is as follows:

    [2022/03/13 05:33:57.736 +08:00] [WARN] [errormanager.go:459] ["Detect 1000 data type errors in total, please refer to table `lightning_task_info`.`type_error_v1` for more details"]
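If you want the same summary without scrolling through the log, the warning line can be picked out with sed. The following is a minimal sketch, assuming the message wording matches the excerpt above. The file name sample-lightning.log and the function name summarize_errors are placeholders, not part of TiDB Lightning.

```shell
# Sample line copied from the log excerpt above. In practice, point the
# function at your real TiDB Lightning log file instead of this placeholder.
cat > sample-lightning.log <<'EOF'
[2022/03/13 05:33:57.736 +08:00] [WARN] [errormanager.go:459] ["Detect 1000 data type errors in total, please refer to table `lightning_task_info`.`type_error_v1` for more details"]
EOF

# summarize_errors FILE: print "<count> <error kind> -> <schema>.<table>"
# for every summary line found in FILE (hypothetical helper).
summarize_errors() {
    sed -n 's/.*"Detect \([0-9][0-9]*\) \(.*\) errors in total, please refer to table `\([^`]*\)`\.`\([^`]*\)`.*/\1 \2 -> \3.\4/p' "$1"
}

summarize_errors sample-lightning.log
```

Lines that do not match the pattern are dropped, so the function can be run against a full log file.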

All errors are written to tables in the lightning_task_info database in the downstream TiDB cluster. After the import is completed, if the error data is collected, you can view the errors in the database and process them manually.

You can change the database name by configuring lightning.task-info-schema-name.

    [lightning]
    task-info-schema-name = 'lightning_task_info'

TiDB Lightning creates 3 tables in this database:

    CREATE TABLE syntax_error_v1 (
        task_id     bigint NOT NULL,
        create_time datetime(6) NOT NULL DEFAULT now(6),
        table_name  varchar(261) NOT NULL,
        path        varchar(2048) NOT NULL,
        offset      bigint NOT NULL,
        error       text NOT NULL,
        context     text
    );

    CREATE TABLE type_error_v1 (
        task_id     bigint NOT NULL,
        create_time datetime(6) NOT NULL DEFAULT now(6),
        table_name  varchar(261) NOT NULL,
        path        varchar(2048) NOT NULL,
        offset      bigint NOT NULL,
        error       text NOT NULL,
        row_data    text NOT NULL
    );

    CREATE TABLE conflict_error_v1 (
        task_id     bigint NOT NULL,
        create_time datetime(6) NOT NULL DEFAULT now(6),
        table_name  varchar(261) NOT NULL,
        index_name  varchar(128) NOT NULL,
        key_data    text NOT NULL,
        row_data    text NOT NULL,
        raw_key     mediumblob NOT NULL,
        raw_value   mediumblob NOT NULL,
        raw_handle  mediumblob NOT NULL,
        raw_row     mediumblob NOT NULL,
        KEY (task_id, table_name)
    );

type_error_v1 records all type errors managed by the max-error configuration. There is one row per error.

conflict_error_v1 records all unique/primary key conflicts in the Local-backend. There are two rows per pair of conflicts.

| Column      | Syntax | Type | Conflict | Description |
|-------------|--------|------|----------|-------------|
| task_id     | ✓      | ✓    | ✓        | The TiDB Lightning task ID that generates this error. |
| create_time | ✓      | ✓    | ✓        | The time at which the error is recorded. |
| table_name  | ✓      | ✓    | ✓        | The name of the table that contains the error, in the form of db.tbl. |
| path        | ✓      | ✓    |          | The path of the file that contains the error. |
| offset      | ✓      | ✓    |          | The byte position in the file where the error is found. |
| error       | ✓      | ✓    |          | The error message. |
| context     | ✓      |      |          | The text that surrounds the error. |
| index_name  |        |      | ✓        | The name of the unique key in conflict. It is 'PRIMARY' for primary key conflicts. |
| key_data    |        |      | ✓        | The formatted key handle of the row that causes the error. The content is for human reference only, and not intended to be machine-readable. |
| row_data    |        | ✓    | ✓        | The formatted row data that causes the error. The content is for human reference only, and not intended to be machine-readable. |
| raw_key     |        |      | ✓        | The key of the conflicted KV pair. |
| raw_value   |        |      | ✓        | The value of the conflicted KV pair. |
| raw_handle  |        |      | ✓        | The row handle of the conflicted row. |
| raw_row     |        |      | ✓        | The encoded value of the conflicted row. |


Note

The error report records the byte offset in the file rather than the line and column numbers, which are inefficient to obtain. You can quickly jump to the neighborhood of a byte position (183 in this example) using the following commands:

  • shell, printing the several lines that precede the position:

    head -c 183 file.csv | tail
  • shell, printing the next several lines:

    tail -c +183 file.csv | head
  • vim: :goto 183 or 183go
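If you do need a line number, for example to open the file in an editor that cannot jump to a byte position, you can derive it by counting the newlines that precede the offset. The following is a minimal sketch. The sample file offset-demo.csv and the helper line_of_offset are made up for illustration, and the sketch assumes that the offset is counted from 0 at the start of the file.

```shell
# Build a small sample file (hypothetical content, for illustration only).
printf 'a,b\n1,one\n2,two\n3,three\n' > offset-demo.csv

# line_of_offset FILE OFFSET: print the 1-based number of the line that
# contains the byte at the given 0-based OFFSET (hypothetical helper).
line_of_offset() {
    # Count the newlines in the bytes before the offset, then add one.
    echo $(( $(head -c "$2" "$1" | wc -l) + 1 ))
}

# Byte 10 is the first character of "2,two", which is on the third line.
line_of_offset offset-demo.csv 10
```

This still reads the file up to the offset, so on very large files it is slower than the head and tail commands above.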

Example

In this example, a data source is prepared with some known errors.

  1. Prepare the database and table schema.

    mkdir example && cd example
    echo 'CREATE SCHEMA example;' > example-schema-create.sql
    echo 'CREATE TABLE t(a TINYINT PRIMARY KEY, b VARCHAR(12) NOT NULL UNIQUE);' > example.t-schema.sql
  2. Prepare the data.

    cat <<EOF > example.t.1.sql
    INSERT INTO t (a, b) VALUES
    (0, NULL), -- column is NOT NULL
    (1, 'one'),
    (2, 'two'),
    (40, 'forty'), -- conflicts with the other 40 below
    (54, 'fifty-four'), -- conflicts with the other 'fifty-four' below
    (77, 'seventy-seven'), -- the string is longer than 12 characters
    (600, 'six hundred'), -- the number overflows TINYINT
    (40, 'forty'), -- conflicts with the other 40 above
    (42, 'fifty-four'); -- conflicts with the other 'fifty-four' above
    EOF
  3. Configure TiDB Lightning to enable strict SQL mode, use the Local-backend to import data, delete duplicates, and skip up to 10 errors.

    cat <<EOF > config.toml
    [lightning]
    max-error = 10

    [tikv-importer]
    backend = 'local'
    sorted-kv-dir = '/tmp/lightning-tmp/'
    duplicate-resolution = 'remove'

    [mydumper]
    data-source-dir = '.'

    [tidb]
    host = '127.0.0.1'
    port = 4000
    user = 'root'
    password = ''
    sql-mode = 'STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE'
    EOF
  4. Run TiDB Lightning. This command will exit successfully because all errors are skipped.

    tiup tidb-lightning -c config.toml
  5. Verify that the imported table only contains the two normal rows:

    $ mysql -u root -h 127.0.0.1 -P 4000 -e 'select * from example.t'
    +---+-----+
    | a | b   |
    +---+-----+
    | 1 | one |
    | 2 | two |
    +---+-----+
  6. Check whether the type_error_v1 table has caught the three rows involving type conversion:

    $ mysql -u root -h 127.0.0.1 -P 4000 -e 'select * from lightning_task_info.type_error_v1;' -E
    *************************** 1. row ***************************
        task_id: 1635888701843303564
    create_time: 2021-11-02 21:31:42.620090
     table_name: `example`.`t`
           path: example.t.1.sql
         offset: 46
          error: failed to cast value as varchar(12) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin for column `b` (#2): [table:1048]Column 'b' cannot be null
       row_data: (0,NULL)
    *************************** 2. row ***************************
        task_id: 1635888701843303564
    create_time: 2021-11-02 21:31:42.627496
     table_name: `example`.`t`
           path: example.t.1.sql
         offset: 183
          error: failed to cast value as varchar(12) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin for column `b` (#2): [types:1406]Data Too Long, field len 12, data len 13
       row_data: (77,'seventy-seven')
    *************************** 3. row ***************************
        task_id: 1635888701843303564
    create_time: 2021-11-02 21:31:42.629929
     table_name: `example`.`t`
           path: example.t.1.sql
         offset: 253
          error: failed to cast value as tinyint(4) for column `a` (#1): [types:1690]constant 600 overflows tinyint
       row_data: (600,'six hundred')
  7. Check whether the conflict_error_v1 table has caught the four rows that have unique/primary key conflicts:

    $ mysql -u root -h 127.0.0.1 -P 4000 -e 'select * from lightning_task_info.conflict_error_v1;' --binary-as-hex -E
    *************************** 1. row ***************************
        task_id: 1635888701843303564
    create_time: 2021-11-02 21:31:42.669601
     table_name: `example`.`t`
     index_name: PRIMARY
       key_data: 40
       row_data: (40, "forty")
        raw_key: 0x7480000000000000C15F728000000000000028
      raw_value: 0x800001000000020500666F727479
     raw_handle: 0x7480000000000000C15F728000000000000028
        raw_row: 0x800001000000020500666F727479
    *************************** 2. row ***************************
        task_id: 1635888701843303564
    create_time: 2021-11-02 21:31:42.674798
     table_name: `example`.`t`
     index_name: PRIMARY
       key_data: 40
       row_data: (40, "forty")
        raw_key: 0x7480000000000000C15F728000000000000028
      raw_value: 0x800001000000020600666F75727479
     raw_handle: 0x7480000000000000C15F728000000000000028
        raw_row: 0x800001000000020600666F75727479
    *************************** 3. row ***************************
        task_id: 1635888701843303564
    create_time: 2021-11-02 21:31:42.680332
     table_name: `example`.`t`
     index_name: b
       key_data: 54
       row_data: (54, "fifty-four")
        raw_key: 0x7480000000000000C15F6980000000000000010166696674792D666FFF7572000000000000F9
      raw_value: 0x0000000000000036
     raw_handle: 0x7480000000000000C15F728000000000000036
        raw_row: 0x800001000000020A0066696674792D666F7572
    *************************** 4. row ***************************
        task_id: 1635888701843303564
    create_time: 2021-11-02 21:31:42.681073
     table_name: `example`.`t`
     index_name: b
       key_data: 42
       row_data: (42, "fifty-four")
        raw_key: 0x7480000000000000C15F6980000000000000010166696674792D666FFF7572000000000000F9
      raw_value: 0x000000000000002A
     raw_handle: 0x7480000000000000C15F72800000000000002A
        raw_row: 0x800001000000020A0066696674792D666F7572
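The raw_handle values above can be read by hand. The following is a minimal sketch, assuming the usual TiDB row key layout for tables with an integer handle: the byte 't', an 8-byte table ID, the bytes '_r', and an 8-byte row handle, with both integers stored big-endian and the sign bit flipped. The function names decode_int and decode_row_key are made up for illustration. The sketch does not cover index keys (which contain '_i' instead of '_r') or non-integer handles.

```shell
# decode_int HEX16: decode 16 hex digits holding a sign-flipped big-endian
# 64-bit integer. The top hex digit is handled separately to avoid overflow.
decode_int() {
    top=$(printf '%s' "$1" | cut -c1)
    low=$(printf '%s' "$1" | cut -c2-16)
    # 1152921504606846976 is 2^60, the weight of the top hex digit.
    echo $(( (0x$top - 8) * 1152921504606846976 + 0x$low ))
}

# decode_row_key HEX: split a 19-byte row key, given as 38 hex digits without
# the 0x prefix, into its table ID and integer row handle.
decode_row_key() {
    case "$1" in
        74????????????????5[Ff]72????????????????) ;;
        *) echo "not a row key with an integer handle" >&2; return 1 ;;
    esac
    table_id=$(decode_int "$(printf '%s' "$1" | cut -c3-18)")
    handle=$(decode_int "$(printf '%s' "$1" | cut -c23-38)")
    echo "table_id=$table_id handle=$handle"
}

# raw_handle of the first conflict row above, without the 0x prefix.
decode_row_key 7480000000000000C15F728000000000000028
```

Under these assumptions, the decoded handle of the first row is 40, which agrees with its key_data column. The table ID depends on the cluster and differs between runs.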