Scale a TiDB Cluster Using TiUP

The capacity of a TiDB cluster can be increased or decreased without interrupting the online services.

This document describes how to scale the TiDB, TiKV, PD, TiCDC, or TiFlash nodes of a cluster using TiUP. If you have not installed TiUP, refer to the steps in "Step 2. Deploy TiUP on the control machine".

To view the current cluster name list, run tiup cluster list.

For example, assume that the original topology of the cluster is as follows:

Host IP     Service
10.0.1.3    TiDB + TiFlash
10.0.1.4    TiDB + PD
10.0.1.5    TiKV + Monitor
10.0.1.1    TiKV
10.0.1.2    TiKV

Scale out a TiDB/PD/TiKV cluster

This section exemplifies how to add a TiDB node to the 10.0.1.5 host.

Note

You can take similar steps to add a PD node. Before you add a TiKV node, it is recommended that you adjust the PD scheduling parameters in advance according to the cluster load.

  1. Configure the scale-out topology:

    Note

    • The port and directory information is not required by default.
    • If multiple instances are deployed on a single machine, you need to allocate different ports and directories for them. If the ports or directories have conflicts, you will receive a notification during deployment or scaling.
    • Since TiUP v1.0.0, the scale-out configuration inherits the global configuration of the original cluster.

    Add the scale-out topology configuration in the scale-out.yml file:

    vi scale-out.yml

    tidb_servers:
      - host: 10.0.1.5
        ssh_port: 22
        port: 4000
        status_port: 10080
        deploy_dir: /tidb-deploy/tidb-4000
        log_dir: /tidb-deploy/tidb-4000/log

    Here is a TiKV configuration file template:

    tikv_servers:
      - host: 10.0.1.5
        ssh_port: 22
        port: 20160
        status_port: 20180
        deploy_dir: /tidb-deploy/tikv-20160
        data_dir: /tidb-data/tikv-20160
        log_dir: /tidb-deploy/tikv-20160/log

    Here is a PD configuration file template:

    pd_servers:
      - host: 10.0.1.5
        ssh_port: 22
        name: pd-1
        client_port: 2379
        peer_port: 2380
        deploy_dir: /tidb-deploy/pd-2379
        data_dir: /tidb-data/pd-2379
        log_dir: /tidb-deploy/pd-2379/log

    To view the configuration of the current cluster, run tiup cluster edit-config <cluster-name>. The global and server_configs settings shown there are inherited by scale-out.yml, so they also take effect for the nodes defined in scale-out.yml.

  2. Run the scale-out command:

    Before you run the scale-out command, use the check and check --apply commands to detect and automatically repair potential risks in the cluster:

    1. Check for potential risks:

      tiup cluster check <cluster-name> scale-out.yml --cluster --user root [-p] [-i /home/root/.ssh/gcp_rsa]

    2. Enable automatic repair:

      tiup cluster check <cluster-name> scale-out.yml --cluster --apply --user root [-p] [-i /home/root/.ssh/gcp_rsa]

    3. Run the scale-out command:

      tiup cluster scale-out <cluster-name> scale-out.yml [-p] [-i /home/root/.ssh/gcp_rsa]

    In the preceding commands:

    • scale-out.yml is the scale-out configuration file.
    • --user root indicates logging in to the target machine as the root user to complete the cluster scale-out. The root user is expected to have ssh and sudo privileges on the target machine. Alternatively, you can use another user with ssh and sudo privileges to complete the deployment.
    • [-i] and [-p] are optional. If you have configured passwordless login to the target machine, these parameters are not required. If not, choose one of the two parameters. [-i] specifies the private key of the root user (or of the user specified by --user) that has access to the target machine. [-p] is used to input the user password interactively.

    If you see Scaled cluster <cluster-name> out successfully, the scale-out operation succeeds.

  3. Refresh the cluster configuration.

    Note

    This operation is only required after you add PD nodes. If you only add TiDB or TiKV nodes, this operation is unnecessary.

    1. Refresh the cluster configuration:

      tiup cluster reload <cluster-name> --skip-restart

    2. Refresh the Prometheus configuration and restart Prometheus:

      Note

      If you are using TiUP v1.15.0 or a later version, skip this step. If you are using a TiUP version earlier than v1.15.0, execute the following command to update the Prometheus configuration and restart Prometheus.

      tiup cluster reload <cluster-name> -R prometheus

  4. Check the cluster status:

    tiup cluster display <cluster-name>

    Access the monitoring platform at http://10.0.1.5:3000 using your browser to monitor the status of the cluster and the new node.

After the scale-out, the cluster topology is as follows:

Host IP     Service
10.0.1.3    TiDB + TiFlash
10.0.1.4    TiDB + PD
10.0.1.5    TiDB + TiKV + Monitor
10.0.1.1    TiKV
10.0.1.2    TiKV
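
The scale-out steps in this section can be collected into a small script. The following is a minimal sketch rather than an official tool: the helper names write_tidb_topology and run, and the DRY_RUN and CLUSTER_NAME variables, are hypothetical, and the host, ports, and directories are the example values used above. It assumes passwordless login to the target machine, so -p and -i are omitted. By default it only prints the tiup commands so that you can review them before running anything.

```shell
#!/usr/bin/env bash
# Sketch: generate a scale-out topology for one TiDB node, then print (or run)
# the check, scale-out, and display commands described in this section.
set -eu

# Hypothetical helper: emit a tidb_servers entry for the given host and ports.
write_tidb_topology() {
  local host="$1" port="${2:-4000}" status_port="${3:-10080}"
  cat <<EOF
tidb_servers:
  - host: ${host}
    ssh_port: 22
    port: ${port}
    status_port: ${status_port}
    deploy_dir: /tidb-deploy/tidb-${port}
    log_dir: /tidb-deploy/tidb-${port}/log
EOF
}

# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# and execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

cluster="${CLUSTER_NAME:-<cluster-name>}"
topology="$(mktemp)"   # temporary stand-in for scale-out.yml
write_tidb_topology 10.0.1.5 > "$topology"

run tiup cluster check "$cluster" "$topology" --cluster --user root
run tiup cluster check "$cluster" "$topology" --cluster --apply --user root
run tiup cluster scale-out "$cluster" "$topology"
run tiup cluster display "$cluster"
```

To execute the commands instead of printing them, set CLUSTER_NAME to the real cluster name and DRY_RUN=0. Because of set -e, the script stops at the first command that fails, so scale-out does not run if the check fails.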

Scale out a TiFlash cluster

This section exemplifies how to add a TiFlash node to the 10.0.1.4 host.

Note

When adding a TiFlash node to an existing TiDB cluster, note the following:

  • Confirm that the current TiDB version supports using TiFlash. Otherwise, upgrade your TiDB cluster to v5.0 or later versions.
  • Run the tiup ctl:v<CLUSTER_VERSION> pd -u http://<pd_ip>:<pd_port> config set enable-placement-rules true command to enable the Placement Rules feature. Alternatively, run the corresponding command in pd-ctl.

  1. Add the node information to the scale-out.yml file:

    Create the scale-out.yml file to add the TiFlash node information.

    tiflash_servers:
      - host: 10.0.1.4

    Currently, you can only add IP addresses but not domain names.

  2. Run the scale-out command:

    tiup cluster scale-out <cluster-name> scale-out.yml

    Note

    The preceding command assumes that SSH mutual trust has been configured between the user who runs the command and the new machine. If mutual trust cannot be configured, use the -p option to enter the password of the new machine, or use the -i option to specify the private key file.

  3. View the cluster status:

    tiup cluster display <cluster-name>

    Access the monitoring platform at http://10.0.1.5:3000 using your browser, and view the status of the cluster and the new node.

After the scale-out, the cluster topology is as follows:

Host IP     Service
10.0.1.3    TiDB + TiFlash
10.0.1.4    TiDB + PD + TiFlash
10.0.1.5    TiDB + TiKV + Monitor
10.0.1.1    TiKV
10.0.1.2    TiKV
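
Because only IP addresses can be used for TiFlash hosts, it can help to validate the host before writing it to scale-out.yml. The following is a minimal sketch under stated assumptions: the helper names is_ipv4 and write_tiflash_topology are hypothetical, and the check covers dotted IPv4 addresses only.

```shell
#!/usr/bin/env bash
# Sketch: refuse to generate a TiFlash topology entry for a host that is not
# a dotted IPv4 address (for example, a domain name).

# Hypothetical helper: succeed only if the argument has four numeric parts,
# each in the range 0-255.
is_ipv4() {
  printf '%s\n' "$1" | awk -F. '
    NF != 4 { exit 1 }
    { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }
  '
}

# Hypothetical helper: print a tiflash_servers block for one validated host.
write_tiflash_topology() {
  if ! is_ipv4 "$1"; then
    echo "error: '$1' is not an IPv4 address" >&2
    return 1
  fi
  printf 'tiflash_servers:\n  - host: %s\n' "$1"
}

write_tiflash_topology 10.0.1.4
```

Redirect the output to scale-out.yml to use it with tiup cluster scale-out as described in step 2.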

Scale out a TiCDC cluster

This section exemplifies how to add two TiCDC nodes to the 10.0.1.3 and 10.0.1.4 hosts.

  1. Add the node information to the scale-out.yml file:

    Create the scale-out.yml file to add the TiCDC node information.

    cdc_servers:
      - host: 10.0.1.3
        gc-ttl: 86400
        data_dir: /tidb-data/cdc-8300
      - host: 10.0.1.4
        gc-ttl: 86400
        data_dir: /tidb-data/cdc-8300

  2. Run the scale-out command:

    tiup cluster scale-out <cluster-name> scale-out.yml

    Note

    The preceding command assumes that SSH mutual trust has been configured between the user who runs the command and the new machine. If mutual trust cannot be configured, use the -p option to enter the password of the new machine, or use the -i option to specify the private key file.

  3. View the cluster status:

    tiup cluster display <cluster-name>

    Access the monitoring platform at http://10.0.1.5:3000 using your browser, and view the status of the cluster and the new nodes.

After the scale-out, the cluster topology is as follows:

Host IP     Service
10.0.1.3    TiDB + TiFlash + TiCDC
10.0.1.4    TiDB + PD + TiFlash + TiCDC
10.0.1.5    TiDB + TiKV + Monitor
10.0.1.1    TiKV
10.0.1.2    TiKV
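
When several TiCDC nodes share the same settings, the topology file can be generated instead of typed by hand. The following is a minimal sketch: the helper name write_cdc_topology is hypothetical, and the gc-ttl and data_dir values are the example values from this section.

```shell
#!/usr/bin/env bash
# Sketch: print a cdc_servers block with one entry per host argument.

# Hypothetical helper: every host gets the same example gc-ttl and data_dir.
write_cdc_topology() {
  local host
  echo "cdc_servers:"
  for host in "$@"; do
    printf '  - host: %s\n' "$host"
    printf '    gc-ttl: 86400\n'
    printf '    data_dir: /tidb-data/cdc-8300\n'
  done
}

write_cdc_topology 10.0.1.3 10.0.1.4
```

The output has the same structure as the scale-out.yml content shown in step 1, with each list entry indented under cdc_servers.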

Scale in a TiDB/PD/TiKV cluster

This section exemplifies how to remove a TiKV node from the 10.0.1.5 host.

Note

  • You can take similar steps to remove a TiDB or PD node.
  • Because the TiKV, TiFlash, and TiDB Binlog components are taken offline asynchronously and the stopping process takes a long time, TiUP handles them differently from other components. For details, see "Particular handling of components' offline process".
  • The PD Client in TiKV caches the list of PD nodes. The current version of TiKV automatically and regularly updates this list, which helps mitigate the issue of an expired PD node list cached by TiKV. However, after scaling out PD, avoid removing all the PD nodes that existed before the scale-out at once. If you must take all the previously existing PD nodes offline, first switch the PD leader to a newly added PD node.

  1. View the node ID information:

    tiup cluster display <cluster-name>

    Starting /root/.tiup/components/cluster/v1.12.3/cluster display <cluster-name>
    TiDB Cluster: <cluster-name>
    TiDB Version: v8.1.0
    ID              Role          Host      Ports                            Status   Data Dir                Deploy Dir
    --              ----          ----      -----                            ------   --------                ----------
    10.0.1.3:8300   cdc           10.0.1.3  8300                             Up       data/cdc-8300           deploy/cdc-8300
    10.0.1.4:8300   cdc           10.0.1.4  8300                             Up       data/cdc-8300           deploy/cdc-8300
    10.0.1.4:2379   pd            10.0.1.4  2379/2380                        Healthy  data/pd-2379            deploy/pd-2379
    10.0.1.1:20160  tikv          10.0.1.1  20160/20180                      Up       data/tikv-20160         deploy/tikv-20160
    10.0.1.2:20160  tikv          10.0.1.2  20160/20180                      Up       data/tikv-20160         deploy/tikv-20160
    10.0.1.5:20160  tikv          10.0.1.5  20160/20180                      Up       data/tikv-20160         deploy/tikv-20160
    10.0.1.3:4000   tidb          10.0.1.3  4000/10080                       Up       -                       deploy/tidb-4000
    10.0.1.4:4000   tidb          10.0.1.4  4000/10080                       Up       -                       deploy/tidb-4000
    10.0.1.5:4000   tidb          10.0.1.5  4000/10080                       Up       -                       deploy/tidb-4000
    10.0.1.3:9000   tiflash       10.0.1.3  9000/8123/3930/20170/20292/8234  Up       data/tiflash-9000       deploy/tiflash-9000
    10.0.1.4:9000   tiflash       10.0.1.4  9000/8123/3930/20170/20292/8234  Up       data/tiflash-9000       deploy/tiflash-9000
    10.0.1.5:9090   prometheus    10.0.1.5  9090                             Up       data/prometheus-9090    deploy/prometheus-9090
    10.0.1.5:3000   grafana       10.0.1.5  3000                             Up       -                       deploy/grafana-3000
    10.0.1.5:9093   alertmanager  10.0.1.5  9093/9294                        Up       data/alertmanager-9093  deploy/alertmanager-9093

  2. Run the scale-in command:

    tiup cluster scale-in <cluster-name> --node 10.0.1.5:20160

    The --node parameter is the ID of the node to be taken offline.

    If you see Scaled cluster <cluster-name> in successfully, the scale-in operation succeeds.

  3. Refresh the cluster configuration.

    Note

    This operation is only required after you remove PD nodes. If you only remove TiDB or TiKV nodes, this operation is unnecessary.

    1. Refresh the cluster configuration:

      tiup cluster reload <cluster-name> --skip-restart

    2. Refresh the Prometheus configuration and restart Prometheus:

      Note

      If you are using TiUP v1.15.0 or a later version, skip this step. If you are using a TiUP version earlier than v1.15.0, execute the following command to update the Prometheus configuration and restart Prometheus.

      tiup cluster reload <cluster-name> -R prometheus

  4. Check the cluster status:

    The scale-in process takes some time. You can run the following command to check the scale-in status:

    tiup cluster display <cluster-name>

    If the status of the node to be scaled in becomes Tombstone, the scale-in operation succeeds.

    Access the monitoring platform at http://10.0.1.5:3000 using your browser, and view the status of the cluster.

The current topology is as follows:

Host IP     Service
10.0.1.3    TiDB + TiFlash + TiCDC
10.0.1.4    TiDB + PD + TiFlash + TiCDC
10.0.1.5    TiDB + Monitor (TiKV is deleted)
10.0.1.1    TiKV
10.0.1.2    TiKV
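
Because the scale-in takes some time, the status check in step 4 can be repeated by a script. The following is a minimal sketch: the helper names node_status and wait_for_tombstone are hypothetical, the 30-second interval is an arbitrary choice, and the parsing assumes the column layout of the display output shown in step 1 (a header row that starts with ID and contains a Status column). A status that contains a space is reported by its first word only.

```shell
#!/usr/bin/env bash
# Sketch: extract a node's status from `tiup cluster display` output and
# poll until the node becomes Tombstone.

# Hypothetical helper: read display output on stdin and print the Status
# column of the node whose ID is given as the first argument.
node_status() {
  awk -v id="$1" '
    $1 == "ID" { for (i = 1; i <= NF; i++) if ($i == "Status") col = i; next }
    col && $1 == id { print $col; exit }
  '
}

# Hypothetical helper: poll every 30 seconds until the node is Tombstone.
# Returns 1 if the node is not listed in the display output.
wait_for_tombstone() {
  local cluster="$1" node="$2" status
  while :; do
    status="$(tiup cluster display "$cluster" | node_status "$node")"
    if [ -z "$status" ]; then
      echo "$node is not listed in the display output" >&2
      return 1
    fi
    echo "status of $node: $status"
    if [ "$status" = "Tombstone" ]; then
      return 0
    fi
    sleep 30
  done
}

# Demo on sample text, so that the sketch runs without a cluster.
node_status 10.0.1.5:20160 <<'EOF'
ID              Role  Host      Ports        Status     Data Dir         Deploy Dir
--              ----  ----      -----        ------     --------         ----------
10.0.1.5:20160  tikv  10.0.1.5  20160/20180  Tombstone  data/tikv-20160  deploy/tikv-20160
EOF
```

Against a real cluster, call wait_for_tombstone <cluster-name> 10.0.1.5:20160 after running the scale-in command.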

Scale in a TiFlash cluster

This section exemplifies how to remove a TiFlash node from the 10.0.1.4 host.

1. Adjust the number of replicas of the tables according to the number of remaining TiFlash nodes

  1. Query whether any table has more TiFlash replicas than the number of TiFlash nodes after scale-in. In the following statement, replace tobe_left_nodes with the number of TiFlash nodes after scale-in. If the query result is empty, you can start scaling in TiFlash. If the query result is not empty, you need to modify the number of TiFlash replicas of the related table(s).

    SELECT * FROM information_schema.tiflash_replica WHERE REPLICA_COUNT > <tobe_left_nodes>;

  2. Execute the following statement for all tables that have more TiFlash replicas than the number of TiFlash nodes after scale-in. Replace new_replica_num with a number less than or equal to tobe_left_nodes:

    ALTER TABLE <db-name>.<table-name> SET TIFLASH REPLICA <new_replica_num>;

  3. Perform step 1 again and make sure that no table has more TiFlash replicas than the number of TiFlash nodes after scale-in.
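
If many tables are affected, the ALTER TABLE statements can be generated from the query result. The following is a minimal sketch: the helper name gen_alter_statements is hypothetical, and it expects tab-separated schema and table names on standard input, such as the TABLE_SCHEMA and TABLE_NAME columns of information_schema.tiflash_replica. Review the generated statements before executing them.

```shell
#!/usr/bin/env bash
# Sketch: turn "schema<TAB>table" lines into ALTER TABLE statements that set
# the TiFlash replica count to the number given as the first argument.

# Hypothetical helper: lines that do not have exactly two fields are skipped.
gen_alter_statements() {
  awk -F '\t' -v n="$1" '
    NF == 2 { printf "ALTER TABLE `%s`.`%s` SET TIFLASH REPLICA %d;\n", $1, $2, n }
  '
}

# Demo on one sample row, so that the sketch runs without a cluster.
printf 'test\tt1\n' | gen_alter_statements 1

# Against a real cluster, the input could come from the mysql client
# (connection parameters are placeholders):
#   mysql -h <tidb_ip> -P 4000 -u root -N -B -e \
#     "SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.tiflash_replica WHERE REPLICA_COUNT > 1" \
#     | gen_alter_statements 1
```

The -N and -B options make the mysql client print tab-separated rows without a header line, which is the input format the helper expects.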

2. Perform the scale-in operation

Perform the scale-in operation with one of the following solutions.

Solution 1. Use TiUP to remove a TiFlash node

  1. Confirm the name of the node to be taken down:

    tiup cluster display <cluster-name>

  2. Remove the TiFlash node (assume that the node name is 10.0.1.4:9000 from Step 1):

    tiup cluster scale-in <cluster-name> --node 10.0.1.4:9000

Solution 2. Manually remove a TiFlash node

In special cases (such as when a node needs to be forcibly taken down), or if the TiUP scale-in operation fails, you can manually remove a TiFlash node with the following steps.

  1. Use the store command of pd-ctl to view the store ID corresponding to this TiFlash node.

    • Enter the store command in pd-ctl (the binary file is under resources/bin in the tidb-ansible directory).

    • If you use TiUP deployment, replace pd-ctl with tiup ctl:v<CLUSTER_VERSION> pd:

      tiup ctl:v<CLUSTER_VERSION> pd -u http://<pd_ip>:<pd_port> store

    Note

    If multiple PD instances exist in the cluster, you only need to specify the IP address:port of an active PD instance in the above command.

  2. Remove the TiFlash node in pd-ctl:

    • Enter store delete <store_id> in pd-ctl (<store_id> is the store ID of the TiFlash node found in the previous step).

    • If you use TiUP deployment, replace pd-ctl with tiup ctl:v<CLUSTER_VERSION> pd:

      tiup ctl:v<CLUSTER_VERSION> pd -u http://<pd_ip>:<pd_port> store delete <store_id>

    Note

    If multiple PD instances exist in the cluster, you only need to specify the IP address:port of an active PD instance in the above command.

  3. Wait for the store of the TiFlash node to disappear or for the state_name to become Tombstone before you stop the TiFlash process.

  4. Manually delete TiFlash data files (the location can be found in the data_dir directory under the TiFlash configuration of the cluster topology file).

  5. Delete the information about the removed TiFlash node from the cluster topology using the following command, where --node is the ID of the TiFlash node (for example, 10.0.1.4:9000):

    tiup cluster scale-in <cluster-name> --node <tiflash_ip>:<tiflash_port> --force

Note

If any table still has TiFlash replicas when you stop all TiFlash nodes in the cluster, you need to manually clean up the replication rules in PD first. Otherwise, the TiFlash nodes cannot be taken down successfully.

To manually clean up the replication rules in PD, take the following steps:

  1. View all data replication rules related to TiFlash in the current PD instance:

    curl http://<pd_ip>:<pd_port>/pd/api/v1/config/rules/group/tiflash

    [
      {
        "group_id": "tiflash",
        "id": "table-45-r",
        "override": true,
        "start_key": "7480000000000000FF2D5F720000000000FA",
        "end_key": "7480000000000000FF2E00000000000000F8",
        "role": "learner",
        "count": 1,
        "label_constraints": [
          {
            "key": "engine",
            "op": "in",
            "values": [
              "tiflash"
            ]
          }
        ]
      }
    ]

  2. Remove all data replication rules related to TiFlash. Take the rule whose id is table-45-r as an example. Delete it by the following command:

    curl -v -X DELETE http://<pd_ip>:<pd_port>/pd/api/v1/config/rule/tiflash/table-45-r

  3. View the cluster status:

    tiup cluster display <cluster-name>

    Access the monitoring platform at http://10.0.1.5:3000 using your browser, and view the status of the cluster.
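
When the rule list contains many entries, the DELETE commands can be generated from the JSON response. The following is a minimal sketch: the helper name rule_ids and the PD_ADDR variable are hypothetical, and the text-based extraction assumes that rule IDs contain no double quotes (a JSON-aware tool is more robust if one is available). The script only prints the curl commands so that you can review them first.

```shell
#!/usr/bin/env bash
# Sketch: list the rule IDs in a PD rule-group response and print one
# DELETE command per rule.

# Hypothetical helper: read the JSON rule list on stdin and print each "id"
# value. The "group_id" key is not matched because the pattern requires a
# double quote directly before id.
rule_ids() {
  grep -o '"id": *"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

pd_addr="${PD_ADDR:-<pd_ip>:<pd_port>}"

# Abridged sample response from this section, so that the sketch runs without
# a cluster. Against a real cluster, pipe the output of
#   curl http://<pd_ip>:<pd_port>/pd/api/v1/config/rules/group/tiflash
# into rule_ids instead.
sample='[{"group_id": "tiflash", "id": "table-45-r", "role": "learner", "count": 1}]'

printf '%s\n' "$sample" | rule_ids | while read -r id; do
  echo "curl -v -X DELETE http://${pd_addr}/pd/api/v1/config/rule/tiflash/${id}"
done
```

For the sample response, the script prints the same DELETE command as in step 2.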

After the scale-in, the cluster topology is as follows:

Host IP     Service
10.0.1.3    TiDB + TiFlash + TiCDC
10.0.1.4    TiDB + PD + TiCDC (TiFlash is deleted)
10.0.1.5    TiDB + Monitor
10.0.1.1    TiKV
10.0.1.2    TiKV

Scale in a TiCDC cluster

This section exemplifies how to remove the TiCDC node from the 10.0.1.4 host.

  1. Take the node offline:

    tiup cluster scale-in <cluster-name> --node 10.0.1.4:8300

  2. View the cluster status:

    tiup cluster display <cluster-name>

    Access the monitoring platform at http://10.0.1.5:3000 using your browser, and view the status of the cluster.

The current topology is as follows:

Host IP     Service
10.0.1.3    TiDB + TiFlash + TiCDC
10.0.1.4    TiDB + PD (TiCDC is deleted)
10.0.1.5    TiDB + Monitor
10.0.1.1    TiKV
10.0.1.2    TiKV