Troubleshooting
Overview
Here we will cover some common issues seen during a migration — how to avoid them, how to detect them, and how to address them.
This guide follows on from the Get Started guides. Please make sure that you have a Kubernetes Operator or local installation ready. Make sure you have run the “101” and “201” steps of the examples, for example the “101” step is 101_initial_cluster.sh
in the local example. The commands in this guide also assume you have setup the shell aliases from the example, e.g. env.sh
in the local example.
General and Precautionary Info
Execute a Dry Run
The SwitchTraffic
/ReverseTraffic
and Complete
actions support a dry run using the --dry_run
flag where no actual steps are taken but instead the command logs all the steps that would be taken. This command will also verify that the cluster is generally in a state where it can perform the action successfully without potentially timing out along the way. Given that traffic cutovers can potentially cause read/write pauses or outages this can be particularly helpful during the final cutover stage.
DDL Handling
If you expect DDL to be executed on the source table(s) while the workflow runs and you want those DDL statements to be replicated to the target keyspace then you will need to use one of the EXEC*
options for the workflow’s on-ddl flag. Please see the on-ddl flag documentation for additional details and related considerations.
Running a Diff
In most cases you should run a VDiff before switching traffic to ensure that nothing unexpected happened which caused the data to diverge during the migration.
Performance Notes
- VReplication workflows (including VDiff) can have a major impact on the tablet so it’s recommended to use non-PRIMARY tablets whenever possible to limit any impact on production traffic
- You can see the key related tablet flags and when/why you may want to set them in the VReplication tablet flag docs
- You can further control any impact on the source and target tablets using the tablet throttler
- VReplication workflows can generate a lot of network traffic
- You should strive to keep the source and target tablets in the same cell whenever possible to limit performance and cost impacts
Monitoring
It’s important to properly monitor your VReplication workflows in order to detect any issues. Your primary tools for this are:
- The Workflow show command
- The
Progress
/Show
action (e.g. MoveTables — Progress) - The VReplication related metrics
- Note that in most production systems the tablet endpoints would be scraped and stored in something like Prometheus where you can build dashboards and alerting on the data
Save Routing Rules
The Create
, SwitchTraffic
/ReverseTraffic
, and Cancel
/Complete
actions modify the routing rules. You may want to save the routing rules before taking an action just in case you want to restore them for any reason (note that e.g. the ReverseTraffic
action will automatically revert the routing rules):
$ vtctldclient GetRoutingRules > /tmp/routingrules.backup.json
Those can later be applied this way:
$ vtctldclient ApplyRoutingRules --rules-file=/tmp/routingrules.backup.json
Specific Errors and Issues
Stream Never Starts
This can be exhibited in one of two ways:
- This error is shown in the
Progress
/Show
action output or theWorkflow show
output:Error picking tablet: context has expired
- The stream never starts, which can be seen in the following ways:
- The
Workflow show
output is showing an empty value in thePos
field for the stream - The
Progress
/Show
action output is showingVStream has not started
for the stream
- The
When a VReplication workflow starts or restarts the tablet selection process runs to find a viable source tablet for the stream. The cells
and tablet_types
play a key role in this process and if we cannot ever find a viable source tablet for the stream then you may want to expand the cells and/or tablet types made available for the selection process.
Corrective Action
If the workflow was only created and has not yet made any progress then you should Cancel
the workflow and Create
a new one using different values for the --cells
and --tablet_types
flags. If, however, this workflow has made significant progress that you do not wish you lose, you can update the underlying workflow record directly to modify either of those values. For example:
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:
id=1 on 0/zone1-0000000200: Status: Running. VStream has not started.
$ vtctlclient Workflow -- --tablet-types="replica,primary" customer.commerce2customer update
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:
id=1 on 0/zone1-0000000201: Status: Running. VStream Lag: 0s.
Workflow Has SQL Errors
We can encounter persistent SQL errors when applying replicated events on the target for a variety of reasons, but the most common cause is incompatible DDL having been executed against the source table while the workflow is running. You would see this error in the Show
/Progress
or Workflow show
output. For example:
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:
id=1 on 0/zone1-0000000201: Status: Error: Unknown column 'notes' in 'field list' (errno 1054) (sqlstate 42S22) during query: insert into customer(customer_id,email,notes) values (100,'test@tester.com','Lots of notes').
# OR a variant
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:
id=1 on 0/zone1-0000000201: Status: Error: vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ a2d90338-916d-11ed-820a-498bdfbb0b03:1-90: cannot determine table columns for customer: event has [8 15 15], schema has [name:"customer_id" type:INT64 table:"customer" org_table:"customer" database:"vt_commerce" org_name:"customer_id" column_length:20 charset:63 flags:49667 name:"email" type:VARBINARY table:"customer" org_table:"customer" database:"vt_commerce" org_name:"email" column_length:128 charset:63 flags:128].
This can be caused by a DDL executed on the source table as by default — controlled by the on-ddl flag value — DDL is ignored in the stream.
Corrective Action
If you want the same or similar DDL to be applied on the target then you can apply that DDL on the target keyspace and then restart the workflow. For example, using the example above:
$ vtctlclient ApplySchema -- --allow_long_unavailability --ddl_strategy=direct --sql="alter table customer add notes varchar(100) not null" customer
$ vtctlclient Workflow -- customer.commerce2customer start
If the tables are not very large or the workflow has not made much progress, you can alternatively Cancel
the current worfklow and Create
another. For example:
$ vtctlclient MoveTables -- Cancel customer.commerce2customer
Cancel was successful for workflow customer.commerce2customer
Start State: Reads Not Switched. Writes Not Switched
Current State: Workflow Not Found
$ vtctlclient MoveTables -- --source commerce --tables 'customer,corder' Create customer.commerce2customer
Waiting for workflow to start:
Workflow started successfully with 1 stream(s)
The following vreplication streams exist for workflow customer.commerce2customer:
id=2 on 0/zone1-0000000201: Status: Copying. VStream Lag: 0s.
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:
id=2 on 0/zone1-0000000201: Status: Running. VStream Lag: 0s.
Switching Traffic Fails
You can encounter a variety of failures during the SwitchTraffic
/ReverseTraffic
step as a number of operations are performed. To demonstrate that we can look at an example dry run output:
$ vtctlclient MoveTables -- --dry_run SwitchTraffic customer.commerce2customer
Dry Run results for SwitchTraffic run at 11 Jan 23 08:51 EST
Parameters: --dry_run SwitchTraffic customer.commerce2customer
Lock keyspace commerce
Switch reads for tables [corder,customer] to keyspace customer for tablet types [RDONLY,REPLICA]
Routing rules for tables [corder,customer] will be updated
Unlock keyspace commerce
Lock keyspace commerce
Lock keyspace customer
Stop writes on keyspace commerce, tables [corder,customer]:
Keyspace commerce, Shard 0 at Position MySQL56/a2d90338-916d-11ed-820a-498bdfbb0b03:1-94
Wait for VReplication on stopped streams to catchup for up to 30s
Create reverse replication workflow commerce2customer_reverse
Create journal entries on source databases
Enable writes on keyspace customer tables [corder,customer]
Switch routing from keyspace commerce to keyspace customer
Routing rules for tables [corder,customer] will be updated
Switch writes completed, freeze and delete vreplication streams on:
tablet 201
Start reverse replication streams on:
tablet 101
Mark vreplication streams frozen on:
Keyspace customer, Shard 0, Tablet 201, Workflow commerce2customer, DbName vt_customer
Unlock keyspace customer
Unlock keyspace commerce
disallowed due to rule: enforce denied tables
If your queries start failing with this error then you most likely had some leftover artifacts from a previous MoveTables
operation that were not properly cleaned up by running MoveTables — Cancel. For MoveTables
operations, shard query serving control records (denied tables lists) are used in addition to routing rules to ensure that all query traffic is managed by the correct keyspace as you are often only moving some tables from one keyspace to another. If those control records are not properly cleaned up then queries may be incorrectly denied when traffic is switched. If you e.g. were to see the following error for queries after switching traffic for the customer table from the commerce keyspace to the customer keyspace:
code = FailedPrecondition desc = disallowed due to rule: enforce denied tables (CallerID: matt) for query SELECT * FROM customer WHERE customer_id = 1
Then you can remove those unwanted/errant denied table rules from the customer keyspace this way:
$ for type in primary replica rdonly; do
vtctldclient SetShardTabletControl --remove customer/0 ${type}
done
# Ensure that these changes are in place everywhere
$ vtctldclient RefreshStateByShard customer/0
Completion and Cleanup Failures
The completion action performs a number of steps that could potentially fail. We can again use the dry run output to demonstrate the various actions that are taken:
$ vtctlclient MoveTables -- --dry_run Complete customer.commerce2customer
Dry Run results for Complete run at 11 Jan 23 10:22 EST
Parameters: --dry_run Complete customer.commerce2customer
Lock keyspace commerce
Lock keyspace customer
Dropping these tables from the database and removing them from the vschema for keyspace commerce:
Keyspace commerce Shard 0 DbName vt_commerce Tablet 101 Table corder
Keyspace commerce Shard 0 DbName vt_commerce Tablet 101 Table customer
Denied tables [corder,customer] will be removed from:
Keyspace commerce Shard 0 Tablet 101
Delete reverse vreplication streams on source:
Keyspace commerce Shard 0 Workflow commerce2customer_reverse DbName vt_commerce Tablet 101
Delete vreplication streams on target:
Keyspace customer Shard 0 Workflow commerce2customer DbName vt_customer Tablet 201
Routing rules for participating tables will be deleted
Unlock keyspace customer
Unlock keyspace commerce