Daily Check

As a distributed database, TiDB is more complicated than a standalone database in terms of its mechanisms and monitoring items. To help you operate and maintain TiDB more conveniently, this document introduces some key performance indicators.

Key indicators of TiDB Dashboard

Starting from v4.0, TiDB provides a new operation and maintenance management tool, TiDB Dashboard. This tool is integrated into the PD component. You can access TiDB Dashboard at the default address http://${pd-ip}:${pd-port}/dashboard.

TiDB Dashboard simplifies the operation and maintenance of the TiDB database. You can view the running status of the entire TiDB cluster through one interface. The following are descriptions of some performance indicators.
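
If you are not sure which address PD is listening on, one way to find it is to query the cluster topology from a SQL client. The following is a minimal sketch; the instance address and port shown depend on your deployment:

    -- List PD instances; TiDB Dashboard is served by one of them.
    SELECT INSTANCE, STATUS_ADDRESS, VERSION
    FROM INFORMATION_SCHEMA.CLUSTER_INFO
    WHERE TYPE = 'pd';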

Instance panel

  • Status: This indicator is used to check whether the status is normal. For an online node, this can be ignored.
  • Up Time: A key indicator. If you find that the Up Time has changed, you need to locate the reason why the component was restarted.
  • Version, Deployment Directory, Git Hash: Check these indicators to avoid inconsistent or even incorrect versions and deployment directories. See the example query after this list.
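
As a cross-check outside TiDB Dashboard, the same information can also be queried through SQL. The following is a sketch; the exact columns available depend on your TiDB version:

    -- Compare version, Git hash, and uptime across all instances.
    SELECT TYPE, INSTANCE, VERSION, GIT_HASH, START_TIME, UPTIME
    FROM INFORMATION_SCHEMA.CLUSTER_INFO;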

Host panel

You can view the usage of CPU, memory, and disk. When the usage of any resource exceeds 80%, it is recommended to scale out the capacity accordingly.
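
If you prefer checking host metrics from a SQL client, the INFORMATION_SCHEMA.CLUSTER_LOAD table exposes them. The following is a rough sketch; the exact DEVICE_TYPE and NAME values vary by TiDB version and platform:

    -- Memory load reported by the host of each instance.
    SELECT INSTANCE, DEVICE_NAME, NAME, VALUE
    FROM INFORMATION_SCHEMA.CLUSTER_LOAD
    WHERE DEVICE_TYPE = 'memory';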

SQL analysis panel

You can locate the slow SQL statements executed in the cluster and then optimize the specific statements.
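
In addition to this panel, slow statements can be inspected through the slow query system tables. The following is a sketch; CLUSTER_SLOW_QUERY reads the slow query logs of all TiDB instances, so it is best to restrict the time range:

    -- Ten slowest statements recorded in the last 30 minutes.
    SELECT Time, Query_time, Query
    FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
    WHERE Time > NOW() - INTERVAL 30 MINUTE
    ORDER BY Query_time DESC
    LIMIT 10;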

Region panel

  • down-peer-region-count: The number of Regions with an unresponsive peer reported by the Raft leader.
  • empty-region-count: The number of empty Regions, each with a size smaller than 1 MiB. These Regions are generated by executing the TRUNCATE TABLE/DROP TABLE statement. If this number is large, you can consider enabling Region Merge to merge Regions across tables.
  • extra-peer-region-count: The number of Regions with extra replicas. These Regions are generated during the scheduling process.
  • learner-peer-region-count: The number of Regions with the learner peer. The sources of learner peers can be various, for example, the learner peers in TiFlash, and the learner peers included in the configured Placement Rules.
  • miss-peer-region-count: The number of Regions without enough replicas. This value should not stay greater than 0 for a long time.
  • offline-peer-region-count: The number of Regions during the peer offline process.
  • oversized-region-count: The number of Regions with a size larger than region-max-size or region-max-keys.
  • pending-peer-region-count: The number of Regions with outdated Raft logs. It is normal that a few pending peers are generated in the scheduling process. However, it is not normal if this value is large for a period of time (longer than 30 minutes).
  • undersized-region-count: The number of Regions with a size smaller than max-merge-region-size or max-merge-region-keys.

Generally, it is normal for these values to be non-zero for a short period. However, it is not normal if they stay non-zero for quite a long time.
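
For example, if empty-region-count is high, you can get a rough idea of which tables the small Regions belong to through SQL. The following is a sketch; APPROXIMATE_SIZE is an estimate reported in MiB:

    -- Count Regions whose estimated size is below 1 MiB, grouped by table.
    SELECT DB_NAME, TABLE_NAME, COUNT(*) AS small_regions
    FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS
    WHERE APPROXIMATE_SIZE < 1
    GROUP BY DB_NAME, TABLE_NAME
    ORDER BY small_regions DESC;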

KV Request Duration

The 99th percentile duration of KV requests in TiKV. If you find nodes with a long duration, check whether there are hotspots or nodes with poor performance.
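
One way to check for hotspots from a SQL client is to look at the hot Region statistics. The following is a sketch; the TYPE column distinguishes read and write hotspots:

    -- Tables currently reported as read or write hotspots.
    SELECT DB_NAME, TABLE_NAME, TYPE, MAX_HOT_DEGREE, FLOW_BYTES
    FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS;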

PD TSO Wait Duration

The time it takes for TiDB to obtain the TSO from PD. Possible reasons for a long wait duration are as follows:

  • High network latency from TiDB to PD. You can manually execute the ping command to test the network latency.
  • High load for the TiDB server.
  • High load for the PD server.

Overview panel

You can view the load, available memory, network traffic, and I/O utilization. When a bottleneck is found, it is recommended to scale out capacity, or to optimize the cluster topology, SQL statements, or cluster parameters.

Exceptions

You can view the errors triggered by the execution of SQL statements on each TiDB instance, such as syntax errors and primary key conflicts.
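
To see which statements are producing errors, you can also query the statement summary tables. The following is a sketch; it assumes the statement summary feature is enabled (the default in recent versions):

    -- Statement digests with errors, most error-prone first.
    SELECT SCHEMA_NAME, DIGEST_TEXT, EXEC_COUNT, SUM_ERRORS
    FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY
    WHERE SUM_ERRORS > 0
    ORDER BY SUM_ERRORS DESC
    LIMIT 10;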

GC status

You can check whether the GC (Garbage Collection) status is normal by viewing the time when the last GC happened. If GC is abnormal, excessive historical data might accumulate and decrease access efficiency.
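
You can also check the GC status directly through SQL; for example, the following minimal sketch reads the GC-related variables stored in the mysql.tidb table:

    -- tikv_gc_last_run_time and tikv_gc_safe_point show when GC last ran.
    SELECT VARIABLE_NAME, VARIABLE_VALUE
    FROM mysql.tidb
    WHERE VARIABLE_NAME LIKE 'tikv_gc%';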