Metrics Schema

The METRICS_SCHEMA is a set of views on top of TiDB metrics that are stored in Prometheus. The source of the PromQL (Prometheus Query Language) for each of the tables is available in INFORMATION_SCHEMA.METRICS_TABLES.

  USE metrics_schema;
  SELECT * FROM uptime;
  SELECT * FROM information_schema.metrics_tables WHERE table_name='uptime'\G

  +----------------------------+-----------------+------------+--------------------+
  | time                       | instance        | job        | value              |
  +----------------------------+-----------------+------------+--------------------+
  | 2020-07-06 15:26:26.203000 | 127.0.0.1:10080 | tidb       | 123.60300016403198 |
  | 2020-07-06 15:27:26.203000 | 127.0.0.1:10080 | tidb       | 183.60300016403198 |
  | 2020-07-06 15:26:26.203000 | 127.0.0.1:20180 | tikv       | 123.60300016403198 |
  | 2020-07-06 15:27:26.203000 | 127.0.0.1:20180 | tikv       | 183.60300016403198 |
  | 2020-07-06 15:26:26.203000 | 127.0.0.1:2379  | pd         | 123.60300016403198 |
  | 2020-07-06 15:27:26.203000 | 127.0.0.1:2379  | pd         | 183.60300016403198 |
  | 2020-07-06 15:26:26.203000 | 127.0.0.1:9090  | prometheus | 123.72300004959106 |
  | 2020-07-06 15:27:26.203000 | 127.0.0.1:9090  | prometheus | 183.72300004959106 |
  +----------------------------+-----------------+------------+--------------------+
  8 rows in set (0.00 sec)
  *************************** 1. row ***************************
  TABLE_NAME: uptime
      PROMQL: (time() - process_start_time_seconds{$LABEL_CONDITIONS})
      LABELS: instance,job
    QUANTILE: 0
     COMMENT: TiDB uptime since last restart(second)
  1 row in set (0.00 sec)

  SHOW TABLES;

  +---------------------------------------------------+
  | Tables_in_metrics_schema                          |
  +---------------------------------------------------+
  | abnormal_stores                                   |
  | etcd_disk_wal_fsync_rate                          |
  | etcd_wal_fsync_duration                           |
  | etcd_wal_fsync_total_count                        |
  | etcd_wal_fsync_total_time                         |
  | go_gc_count                                       |
  | go_gc_cpu_usage                                   |
  | go_gc_duration                                    |
  | go_heap_mem_usage                                 |
  | go_threads                                        |
  | goroutines_count                                  |
  | node_cpu_usage                                    |
  | node_disk_available_size                          |
  | node_disk_io_util                                 |
  | node_disk_iops                                    |
  | node_disk_read_latency                            |
  | node_disk_size                                    |
  ..
  | tikv_storage_async_request_total_time             |
  | tikv_storage_async_requests                       |
  | tikv_storage_async_requests_total_count           |
  | tikv_storage_command_ops                          |
  | tikv_store_size                                   |
  | tikv_thread_cpu                                   |
  | tikv_thread_nonvoluntary_context_switches         |
  | tikv_thread_voluntary_context_switches            |
  | tikv_threads_io                                   |
  | tikv_threads_state                                |
  | tikv_total_keys                                   |
  | tikv_wal_sync_duration                            |
  | tikv_wal_sync_max_duration                        |
  | tikv_worker_handled_tasks                         |
  | tikv_worker_handled_tasks_total_num               |
  | tikv_worker_pending_tasks                         |
  | tikv_worker_pending_tasks_total_num               |
  | tikv_write_stall_avg_duration                     |
  | tikv_write_stall_max_duration                     |
  | tikv_write_stall_reason                           |
  | up                                                |
  | uptime                                            |
  +---------------------------------------------------+
  626 rows in set (0.00 sec)

The METRICS_SCHEMA is used as a data source for monitoring-related summary tables such as metrics_summary, metrics_summary_by_label, and inspection_summary.

Additional Examples

Taking the tidb_query_duration monitoring table in metrics_schema as an example, this section illustrates how to use this monitoring table and how it works. The working principles of other monitoring tables are similar to tidb_query_duration.

Query information_schema.metrics_tables for the information related to the tidb_query_duration table:

  SELECT * FROM information_schema.metrics_tables WHERE table_name='tidb_query_duration';

  +---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+
  | TABLE_NAME          | PROMQL                                                                                                                                                   | LABELS            | QUANTILE | COMMENT                                      |
  +---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+
  | tidb_query_duration | histogram_quantile($QUANTILE, sum(rate(tidb_server_handle_query_duration_seconds_bucket{$LABEL_CONDITIONS}[$RANGE_DURATION])) by (le,sql_type,instance)) | instance,sql_type | 0.9      | The quantile of TiDB query durations(second) |
  +---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+

Field description:

  • TABLE_NAME: Corresponds to the table name in metrics_schema. In this example, the table name is tidb_query_duration.
  • PROMQL: The PromQL expression template. A monitoring table works by mapping the SQL statement to PromQL, requesting the data from Prometheus, and converting the Prometheus result into a SQL query result. When you query the monitoring table, the query conditions are used to rewrite the variables in this template and generate the final query expression.
  • LABELS: The labels of the monitoring item. tidb_query_duration has two labels: instance and sql_type.
  • QUANTILE: The percentile. For monitoring data of the histogram type, a default percentile is specified. A value of 0 means that the monitoring item corresponding to the monitoring table is not a histogram.
  • COMMENT: The description of the monitoring table. The tidb_query_duration table is used to query a percentile of the TiDB query execution time, such as P999, P99, or P90. The unit is seconds.
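
The template rewriting described for the PROMQL field can be sketched in Python. This is only an illustration under stated assumptions, not TiDB's implementation: the function name expand_promql is hypothetical, the label condition is left empty because the example query filters on no label, and the range duration is assumed to be rendered as a number of seconds followed by "s".

```python
from string import Template

# PROMQL template copied from information_schema.metrics_tables
# for the tidb_query_duration table.
PROMQL_TEMPLATE = (
    "histogram_quantile($QUANTILE, sum(rate("
    "tidb_server_handle_query_duration_seconds_bucket{$LABEL_CONDITIONS}"
    "[$RANGE_DURATION])) by (le,sql_type,instance))"
)


def expand_promql(template, quantile, label_conditions, range_duration_s):
    """Fill the three template variables (hypothetical helper).

    quantile          -> $QUANTILE
    label_conditions  -> $LABEL_CONDITIONS (empty string means no filter)
    range_duration_s  -> $RANGE_DURATION, rendered as "<n>s" (assumption)
    """
    return Template(template).substitute(
        QUANTILE=quantile,
        LABEL_CONDITIONS=label_conditions,
        RANGE_DURATION=f"{range_duration_s}s",
    )


# quantile=0.99, no label filter, 60-second range duration
print(expand_promql(PROMQL_TEMPLATE, 0.99, "", 60))
```

With a quantile of 0.99, no label condition, and a 60-second range duration, the expanded expression is the same PromQL that appears in the execution plan later in this document.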

To query the schema of the tidb_query_duration table, execute the following statement:

  SHOW CREATE TABLE metrics_schema.tidb_query_duration;

  +---------------------+--------------------------------------------------------------------------------------------------------------------+
  | Table               | Create Table                                                                                                       |
  +---------------------+--------------------------------------------------------------------------------------------------------------------+
  | tidb_query_duration | CREATE TABLE `tidb_query_duration` (                                                                               |
  |                     |   `time` datetime unsigned DEFAULT CURRENT_TIMESTAMP,                                                              |
  |                     |   `instance` varchar(512) DEFAULT NULL,                                                                            |
  |                     |   `sql_type` varchar(512) DEFAULT NULL,                                                                            |
  |                     |   `quantile` double unsigned DEFAULT '0.9',                                                                        |
  |                     |   `value` double unsigned DEFAULT NULL                                                                             |
  |                     | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT='The quantile of TiDB query durations(second)' |
  +---------------------+--------------------------------------------------------------------------------------------------------------------+

  • time: The time of the monitoring item.
  • instance and sql_type: The labels of the tidb_query_duration monitoring item. instance means the monitoring address. sql_type means the type of the executed SQL statement.
  • quantile: The percentile. The monitoring item of the histogram type has this column, which indicates the percentile time of the query. For example, quantile = 0.9 means to query the time of P90.
  • value: The value of the monitoring item.

The following statement queries the P99 time within the range of [2020-03-25 23:40:00, 2020-03-25 23:42:00].

  SELECT * FROM metrics_schema.tidb_query_duration WHERE value is not null AND time>='2020-03-25 23:40:00' AND time <= '2020-03-25 23:42:00' AND quantile=0.99;

  +---------------------+-------------------+----------+----------+----------------+
  | time                | instance          | sql_type | quantile | value          |
  +---------------------+-------------------+----------+----------+----------------+
  | 2020-03-25 23:40:00 | 172.16.5.40:10089 | Insert   | 0.99     | 0.509929485256 |
  | 2020-03-25 23:41:00 | 172.16.5.40:10089 | Insert   | 0.99     | 0.494690793986 |
  | 2020-03-25 23:42:00 | 172.16.5.40:10089 | Insert   | 0.99     | 0.493460506934 |
  | 2020-03-25 23:40:00 | 172.16.5.40:10089 | Select   | 0.99     | 0.152058493415 |
  | 2020-03-25 23:41:00 | 172.16.5.40:10089 | Select   | 0.99     | 0.152193879678 |
  | 2020-03-25 23:42:00 | 172.16.5.40:10089 | Select   | 0.99     | 0.140498483232 |
  | 2020-03-25 23:40:00 | 172.16.5.40:10089 | internal | 0.99     | 0.47104        |
  | 2020-03-25 23:41:00 | 172.16.5.40:10089 | internal | 0.99     | 0.11776        |
  | 2020-03-25 23:42:00 | 172.16.5.40:10089 | internal | 0.99     | 0.11776        |
  +---------------------+-------------------+----------+----------+----------------+

The first row of the query result above means that at the time of 2020-03-25 23:40:00, on the TiDB instance 172.16.5.40:10089, the P99 execution time of the Insert type statement is 0.509929485256 seconds. The meanings of other rows are similar. Other values of the sql_type column are described as follows:

  • Select: The select type statement is executed.
  • internal: The internal SQL statement of TiDB, which is used to update the statistical information and get the global variables.

To view the execution plan of the statement above, execute the following statement:

  DESC SELECT * FROM metrics_schema.tidb_query_duration WHERE value is not null AND time>='2020-03-25 23:40:00' AND time <= '2020-03-25 23:42:00' AND quantile=0.99;

  +------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  | id               | estRows  | task | access object             | operator info                                                                                                                                                                                          |
  +------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  | Selection_5      | 8000.00  | root |                           | not(isnull(Column#5))                                                                                                                                                                                  |
  | └─MemTableScan_6 | 10000.00 | root | table:tidb_query_duration | PromQL:histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket{}[60s])) by (le,sql_type,instance)), start_time:2020-03-25 23:40:00, end_time:2020-03-25 23:42:00, step:1m0s |
  +------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

From the result above, you can see that PromQL, start_time, end_time, and step are in the execution plan. During the execution process, TiDB calls the query_range HTTP API of Prometheus to query the monitoring data.
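
To make the mapping from the execution plan to the Prometheus request concrete, the sketch below builds a query_range URL from the four plan fields. It relies on the Prometheus HTTP API endpoint /api/v1/query_range and its query, start, end, and step parameters. Everything else is an assumption for illustration: the function name build_query_range_url is hypothetical, the Prometheus address is taken from the uptime output earlier in this document, and the timestamps are treated as UTC. How TiDB itself builds the request is not shown in the source.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_query_range_url(base_url, promql, start_time, end_time, step_s):
    """Build (but do not send) a Prometheus range-query URL.

    Hypothetical helper; start_time and end_time use the
    'YYYY-MM-DD HH:MM:SS' format shown in the execution plan.
    """
    fmt = "%Y-%m-%d %H:%M:%S"

    def to_unix(text):
        # Assumption: the timestamps are interpreted as UTC.
        parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        return int(parsed.timestamp())

    params = {
        "query": promql,
        "start": to_unix(start_time),
        "end": to_unix(end_time),
        "step": f"{step_s}s",
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


promql = (
    "histogram_quantile(0.99, sum(rate("
    "tidb_server_handle_query_duration_seconds_bucket{}[60s])) "
    "by (le,sql_type,instance))"
)
print(build_query_range_url(
    "http://127.0.0.1:9090",  # assumed Prometheus address
    promql,
    "2020-03-25 23:40:00",
    "2020-03-25 23:42:00",
    60,
))
```

The function only constructs the URL, so it can be run without a Prometheus server. Sending the request and decoding the JSON response are left out on purpose.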

You might find that in the range of [2020-03-25 23:40:00, 2020-03-25 23:42:00], each label only has three time values. In the execution plan, the value of step is 1 minute, which means that the interval of these values is 1 minute. step is determined by the following two session variables:

  • tidb_metric_query_step: The query resolution step width. To get the query_range data from Prometheus, you need to specify start_time, end_time, and step. step uses the value of this variable.
  • tidb_metric_query_range_duration: When the monitoring data is queried, the $RANGE_DURATION variable in the PROMQL template is replaced with the value of this variable. The default value is 60 seconds.
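
The relationship between the time range, step, and the number of values per label can be checked with a short calculation. The sketch below lists the timestamps that a range query returns when it samples from start_time to end_time (both inclusive) at a fixed step. The helper name sample_times is hypothetical, and it only models the sampling grid, not the values.

```python
from datetime import datetime, timedelta


def sample_times(start_time, end_time, step_s):
    """Return the sample timestamps from start_time to end_time, inclusive."""
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    fmt = "%Y-%m-%d %H:%M:%S"
    current = datetime.strptime(start_time, fmt)
    stop = datetime.strptime(end_time, fmt)
    times = []
    while current <= stop:
        times.append(current.strftime(fmt))
        current += timedelta(seconds=step_s)
    return times


# A 60-second step over [23:40:00, 23:42:00] gives three timestamps per label.
print(sample_times("2020-03-25 23:40:00", "2020-03-25 23:42:00", 60))
# A 30-second step over the same range gives five.
print(len(sample_times("2020-03-25 23:40:00", "2020-03-25 23:42:00", 30)))
```

Because both endpoints are included, a range of N seconds sampled every S seconds yields floor(N / S) + 1 timestamps.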

To view the values of monitoring items with different granularities, you can modify the two session variables above before querying the monitoring table. For example:

  1. Modify the values of the two session variables and set the time granularity to 30 seconds.

    Note

    The minimum granularity supported by Prometheus is 30 seconds.

    set @@tidb_metric_query_step=30;
    set @@tidb_metric_query_range_duration=30;

  2. Query the tidb_query_duration monitoring item as follows. From the result, you can see that within the same 2-minute time range, each label now has 5 time values, and the interval between consecutive values is 30 seconds.

    select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99;

    +---------------------+-------------------+----------+----------+-----------------+
    | time                | instance          | sql_type | quantile | value           |
    +---------------------+-------------------+----------+----------+-----------------+
    | 2020-03-25 23:40:00 | 172.16.5.40:10089 | Insert   | 0.99     | 0.483285651924  |
    | 2020-03-25 23:40:30 | 172.16.5.40:10089 | Insert   | 0.99     | 0.484151462113  |
    | 2020-03-25 23:41:00 | 172.16.5.40:10089 | Insert   | 0.99     | 0.504576        |
    | 2020-03-25 23:41:30 | 172.16.5.40:10089 | Insert   | 0.99     | 0.493577384561  |
    | 2020-03-25 23:42:00 | 172.16.5.40:10089 | Insert   | 0.99     | 0.49482474311   |
    | 2020-03-25 23:40:00 | 172.16.5.40:10089 | Select   | 0.99     | 0.189253402185  |
    | 2020-03-25 23:40:30 | 172.16.5.40:10089 | Select   | 0.99     | 0.184224951851  |
    | 2020-03-25 23:41:00 | 172.16.5.40:10089 | Select   | 0.99     | 0.151673410553  |
    | 2020-03-25 23:41:30 | 172.16.5.40:10089 | Select   | 0.99     | 0.127953838989  |
    | 2020-03-25 23:42:00 | 172.16.5.40:10089 | Select   | 0.99     | 0.127455434547  |
    | 2020-03-25 23:40:00 | 172.16.5.40:10089 | internal | 0.99     | 0.0624          |
    | 2020-03-25 23:40:30 | 172.16.5.40:10089 | internal | 0.99     | 0.12416         |
    | 2020-03-25 23:41:00 | 172.16.5.40:10089 | internal | 0.99     | 0.0304          |
    | 2020-03-25 23:41:30 | 172.16.5.40:10089 | internal | 0.99     | 0.06272         |
    | 2020-03-25 23:42:00 | 172.16.5.40:10089 | internal | 0.99     | 0.0629333333333 |
    +---------------------+-------------------+----------+----------+-----------------+

  3. View the execution plan. From the result, you can also see that the values of PromQL and step in the execution plan have been changed to 30 seconds.

    desc select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99;

    +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | id               | estRows  | task | access object             | operator info                                                                                                                                                                                         |
    +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Selection_5      | 8000.00  | root |                           | not(isnull(Column#5))                                                                                                                                                                                 |
    | └─MemTableScan_6 | 10000.00 | root | table:tidb_query_duration | PromQL:histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket{}[30s])) by (le,sql_type,instance)), start_time:2020-03-25 23:40:00, end_time:2020-03-25 23:42:00, step:30s |
    +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+