Schedule Replicas by Topology Labels

Note

TiDB v5.3.0 introduces Placement Rules in SQL. This offers a more convenient way to configure the placement of tables and partitions. Placement Rules in SQL might replace placement configuration with PD in future releases.

To improve the high availability and disaster recovery capability of TiDB clusters, it is recommended that TiKV nodes be physically scattered as much as possible. For example, TiKV nodes can be distributed on different racks or even in different data centers. Based on the topology information of TiKV, the PD scheduler automatically performs scheduling in the background to isolate the replicas of each Region from one another as much as possible, which maximizes the capability of disaster recovery.

To make this mechanism effective, you need to properly configure TiKV and PD so that the topology information of the cluster, especially the TiKV location information, is reported to PD during deployment. Before you begin, see Deploy TiDB Using TiUP first.

Configure labels based on the cluster topology

Configure labels for TiKV and TiFlash

You can use the command-line flag or set the TiKV or TiFlash configuration file to bind some attributes in the form of key-value pairs. These attributes are called labels. After TiKV and TiFlash are started, they report their labels to PD so users can identify the location of TiKV and TiFlash nodes.

Assume that the topology has four layers: zone > data center (dc) > rack > host. You can use these labels (zone, dc, rack, host) to set the location of TiKV and TiFlash. To set labels for TiKV, use one of the following methods:

  • Use the command-line flag to start a TiKV instance:

    tikv-server --labels zone=<zone>,dc=<dc>,rack=<rack>,host=<host>

  • Configure in the TiKV configuration file:

    [server]
    [server.labels]
    zone = "<zone>"
    dc = "<dc>"
    rack = "<rack>"
    host = "<host>"

To set labels for TiFlash, you can use the tiflash-learner.toml file, which is the configuration file of tiflash-proxy:

  [server]
  [server.labels]
  zone = "<zone>"
  dc = "<dc>"
  rack = "<rack>"
  host = "<host>"

(Optional) Configure labels for TiDB

When Follower Read is enabled, if you want TiDB to prefer reading data from replicas in the same geographic region, you need to configure labels for TiDB nodes.

You can set labels for TiDB using the configuration file:

  [labels]
  zone = "<zone>"
  dc = "<dc>"
  rack = "<rack>"
  host = "<host>"

Note

Currently, TiDB depends on the zone label to match and select replicas that are in the same geographic region. To use this feature, you need to include zone when configuring location-labels for PD, and configure zone when configuring labels for TiDB, TiKV, and TiFlash. For more details, see Configure labels for TiKV and TiFlash.

Configure location-labels for PD

As described above, a label can be any key-value pair used to describe TiKV attributes. However, PD cannot tell on its own which labels are location-related or how those labels are layered. Therefore, you need to make the following configuration so that PD understands the TiKV node topology.

location-labels is a PD configuration item defined as an array of strings. Each item corresponds to the key of a TiKV label, and the order of the keys represents the layer relationship of the labels (the isolation levels decrease from left to right).

Because location-labels has no default value, you can customize it with any label keys, such as zone, rack, or host. The number of label levels is not restricted (three levels are not mandatory) as long as the keys match the TiKV server labels.
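The left-to-right ordering can be illustrated with a small Python sketch. It is a toy model, not PD code: the helper name first_differing_level and the sample store labels are hypothetical. It reports the leftmost label on which two stores differ, which is the strongest isolation that the pair of stores can offer.

```python
# Toy illustration only: the helper and the sample stores are hypothetical, not PD APIs.
from typing import Dict, List, Optional


def first_differing_level(
    location_labels: List[str],
    store_a: Dict[str, str],
    store_b: Dict[str, str],
) -> Optional[str]:
    """Return the leftmost location label whose value differs between two stores.

    None means the stores match on every level, so they offer no isolation.
    """
    for key in location_labels:
        if store_a.get(key) != store_b.get(key):
            return key
    return None


# Isolation decreases from left to right: zone > rack > host.
location_labels = ["zone", "rack", "host"]
s1 = {"zone": "z1", "rack": "r1", "host": "h1"}
s2 = {"zone": "z1", "rack": "r2", "host": "h3"}
s3 = {"zone": "z2", "rack": "r1", "host": "h5"}

print(first_differing_level(location_labels, s1, s2))  # rack
print(first_differing_level(location_labels, s1, s3))  # zone
print(first_differing_level(location_labels, s1, s1))  # None
```

In this model, s1 and s3 are isolated at the zone level even though their rack values happen to be equal, because a label to the left always takes precedence.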

Note

  • To make configurations take effect, you must configure location-labels for PD and labels for TiKV at the same time. Otherwise, PD does not perform scheduling according to the topology.
  • If you use Placement Rules in SQL, you only need to configure labels for TiKV. Currently, Placement Rules in SQL is incompatible with the location-labels configuration of PD and ignores this configuration. It is not recommended to use location-labels and Placement Rules in SQL at the same time; otherwise, unexpected results might occur.

To configure location-labels, choose one of the following methods according to your cluster situation:

  • If the PD cluster is not initialized, configure location-labels in the PD configuration file:

    [replication]
    location-labels = ["zone", "rack", "host"]

  • If the PD cluster is already initialized, use the pd-ctl tool to make online changes:

    pd-ctl config set location-labels zone,rack,host

Configure isolation-level for PD

If location-labels has been configured, you can further enhance the topological isolation requirements on TiKV clusters by configuring isolation-level in the PD configuration file.

Assume that you have built a three-layer cluster topology (zone -> rack -> host) by configuring location-labels according to the instructions above. You can then set isolation-level to zone as follows:

  [replication]
  isolation-level = "zone"

If the PD cluster is already initialized, you need to use the pd-ctl tool to make online changes:

  pd-ctl config set isolation-level zone

The isolation-level configuration is a string that must match one of the keys in location-labels. This parameter sets the minimum, mandatory isolation level for the TiKV cluster topology.

Note

isolation-level is empty by default, which means there is no mandatory restriction on the isolation level. To set it, you need to configure location-labels for PD and ensure that the value of isolation-level is one of the keys listed in location-labels.

When using TiUP to deploy a cluster, you can configure the TiKV location in the initialization configuration file. TiUP will generate the corresponding configuration files for TiKV, PD, and TiFlash during deployment.

In the following example, a two-layer topology of zone/host is defined. The TiKV nodes and TiFlash nodes of the cluster are distributed among three zones, z1, z2, and z3.

  • In each zone, two hosts have TiKV instances deployed. In z1, each host runs two TiKV instances. In z2 and z3, each host runs a single TiKV instance.
  • In each zone, there are two hosts that have TiFlash instances deployed, and each host has a separate TiFlash instance deployed.

In the following example, tikv-host-machine-n represents the IP address of the nth TiKV node, and tiflash-host-machine-n represents the IP address of the nth TiFlash node.

  server_configs:
    pd:
      replication.location-labels: ["zone", "host"]
  tikv_servers:
    # z1
    # machine-1 on z1 (two instances on one host, so each uses its own port)
    - host: tikv-host-machine-1
      port: 20160
      config:
        server.labels:
          zone: z1
          host: tikv-host-machine-1
    - host: tikv-host-machine-1
      port: 20161
      config:
        server.labels:
          zone: z1
          host: tikv-host-machine-1
    # machine-2 on z1
    - host: tikv-host-machine-2
      port: 20160
      config:
        server.labels:
          zone: z1
          host: tikv-host-machine-2
    - host: tikv-host-machine-2
      port: 20161
      config:
        server.labels:
          zone: z1
          host: tikv-host-machine-2
    # z2
    - host: tikv-host-machine-3
      config:
        server.labels:
          zone: z2
          host: tikv-host-machine-3
    - host: tikv-host-machine-4
      config:
        server.labels:
          zone: z2
          host: tikv-host-machine-4
    # z3
    - host: tikv-host-machine-5
      config:
        server.labels:
          zone: z3
          host: tikv-host-machine-5
    - host: tikv-host-machine-6
      config:
        server.labels:
          zone: z3
          host: tikv-host-machine-6
  tiflash_servers:
    # z1
    - host: tiflash-host-machine-1
      learner_config:
        server.labels:
          zone: z1
          host: tiflash-host-machine-1
    - host: tiflash-host-machine-2
      learner_config:
        server.labels:
          zone: z1
          host: tiflash-host-machine-2
    # z2
    - host: tiflash-host-machine-3
      learner_config:
        server.labels:
          zone: z2
          host: tiflash-host-machine-3
    - host: tiflash-host-machine-4
      learner_config:
        server.labels:
          zone: z2
          host: tiflash-host-machine-4
    # z3
    - host: tiflash-host-machine-5
      learner_config:
        server.labels:
          zone: z3
          host: tiflash-host-machine-5
    - host: tiflash-host-machine-6
      learner_config:
        server.labels:
          zone: z3
          host: tiflash-host-machine-6

For details, see Geo-distributed Deployment topology.

Note

If replication.location-labels is not configured in the configuration file, deploying a cluster with this topology file might fail with an error. Confirm that replication.location-labels is configured before you deploy the cluster.

PD schedules based on topology label

PD schedules replicas according to the label layer to make sure that different replicas of the same data are scattered as much as possible.

Take the topology in the previous section as an example.

Assume that the number of cluster replicas is 3 (max-replicas=3). Because there are 3 zones in total, PD ensures that the 3 replicas of each Region are respectively placed in z1, z2, and z3. In this way, the TiDB cluster is still available when one zone fails.

Then, assume that the number of cluster replicas is 5 (max-replicas=5). Because there are only 3 zones in total, PD cannot guarantee the isolation of each replica at the zone level. In this situation, the PD scheduler will ensure replica isolation at the host level. In other words, multiple replicas of a Region might be distributed in the same zone but not on the same host.
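The counting argument behind these two cases can be sketched in Python. This is a simplified illustration, not PD's scheduling logic: the function name strictest_full_isolation_level is hypothetical, and the store list simply mirrors the TiKV labels from the topology file in the previous section. The function returns the leftmost label level that has at least max-replicas distinct locations, which is the strictest level at which every replica can be kept apart.

```python
# Toy counting model only; PD's real scheduler is more involved.
from typing import Dict, List, Optional


def strictest_full_isolation_level(
    stores: List[Dict[str, str]],
    location_labels: List[str],
    max_replicas: int,
) -> Optional[str]:
    """Return the leftmost label level that can hold one replica per distinct location."""
    for depth, key in enumerate(location_labels, start=1):
        # A location at this level is the label path from the top level down to `key`.
        paths = {tuple(s.get(k) for k in location_labels[:depth]) for s in stores}
        if len(paths) >= max_replicas:
            return key
    return None


location_labels = ["zone", "host"]
# TiKV stores from the example topology: two instances on each z1 host, one elsewhere.
stores = [
    {"zone": "z1", "host": "tikv-host-machine-1"},
    {"zone": "z1", "host": "tikv-host-machine-1"},
    {"zone": "z1", "host": "tikv-host-machine-2"},
    {"zone": "z1", "host": "tikv-host-machine-2"},
    {"zone": "z2", "host": "tikv-host-machine-3"},
    {"zone": "z2", "host": "tikv-host-machine-4"},
    {"zone": "z3", "host": "tikv-host-machine-5"},
    {"zone": "z3", "host": "tikv-host-machine-6"},
]

print(strictest_full_isolation_level(stores, location_labels, 3))  # zone
print(strictest_full_isolation_level(stores, location_labels, 5))  # host
print(strictest_full_isolation_level(stores, location_labels, 7))  # None
```

With 7 replicas, the sketch returns None because the example has only 6 distinct hosts, so at least two replicas would have to share a host.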

In the case of the 5-replica configuration, if z3 fails or is isolated as a whole and cannot be recovered within a period of time (controlled by max-store-down-time), PD makes up the 5 replicas through scheduling. At this point, only 4 hosts are available, so host-level isolation cannot be guaranteed and multiple replicas might be scheduled to the same host.

If isolation-level is set to zone instead of being left empty, it specifies the minimum physical isolation requirement for Region replicas. In this case, PD always ensures that replicas of the same Region are placed in different zones, and it does not schedule a replica that would violate this restriction, even if the Region is then left with fewer replicas than max-replicas requires.

For example, take a TiKV cluster distributed across three zones (z1, z2, and z3). If each Region requires three replicas, PD places one replica of each Region in each of the three zones. If a power outage occurs in z1 and it cannot be recovered within a period of time (30 minutes by default, controlled by max-store-down-time), PD determines that the Region replicas in z1 are no longer available. However, because isolation-level is set to zone, PD must strictly guarantee that different replicas of the same Region are not scheduled to the same zone. Because both z2 and z3 already hold replicas, PD does not perform any scheduling under this minimum isolation restriction, even though only two replicas remain.

Similarly, when isolation-level is set to rack, the minimum isolation level applies to different racks in the same data center. With this configuration, PD guarantees isolation at the zone level first if possible. When zone-level isolation cannot be guaranteed, PD tries to avoid scheduling different replicas to the same rack in the same zone. Scheduling works the same way when isolation-level is set to host: PD first tries to guarantee isolation at the rack level, and then at the host level.
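The effect of isolation-level on the power-outage scenario described above can be modeled with a short Python sketch. It is a toy model under stated assumptions, not PD's implementation: the function can_place_replica and the short host names (h3 to h6) are hypothetical. A candidate store is accepted only if its label path, from the top level down to isolation-level, differs from that of every surviving replica.

```python
# Toy model of the mandatory isolation check; names and data are hypothetical.
from typing import Dict, List, Tuple

Store = Dict[str, str]


def can_place_replica(
    candidate: Store,
    existing: List[Store],
    location_labels: List[str],
    isolation_level: str,
) -> bool:
    """Return True if placing a replica on `candidate` respects isolation_level."""
    if not isolation_level:
        return True  # An empty isolation-level means no mandatory restriction.
    depth = location_labels.index(isolation_level) + 1
    keys = location_labels[:depth]

    def path(store: Store) -> Tuple:
        return tuple(store.get(k) for k in keys)

    return all(path(candidate) != path(replica) for replica in existing)


location_labels = ["zone", "host"]
# z1 is down; the two surviving replicas of a Region are in z2 and z3.
surviving = [{"zone": "z2", "host": "h3"}, {"zone": "z3", "host": "h5"}]
# Remaining healthy stores that do not yet hold a replica of this Region.
candidates = [{"zone": "z2", "host": "h4"}, {"zone": "z3", "host": "h6"}]

for level in ("zone", "host", ""):
    allowed = [
        c["host"]
        for c in candidates
        if can_place_replica(c, surviving, location_labels, level)
    ]
    print(repr(level), allowed)
# 'zone' []
# 'host' ['h4', 'h6']
# '' ['h4', 'h6']
```

With isolation-level set to zone, no candidate qualifies, which matches the described behavior of leaving the Region with two replicas. With host or an empty value, the third replica can be restored inside z2 or z3.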

In summary, PD maximizes the disaster recovery capability of the cluster according to the current topology. Therefore, to achieve a certain level of disaster recovery, deploy machines across more sites at that topology level than the value of max-replicas. TiDB also provides mandatory configuration items such as isolation-level so that you can control the topological isolation level of data more flexibly for different scenarios.