TiFlash 集群监控

使用 TiUP 部署 TiDB 集群时，一键部署监控系统 (Prometheus & Grafana)，监控架构参见 TiDB 监控框架概述。

目前 Grafana Dashboard 整体分为 PD、TiDB、TiKV、Node_exporter、Overview 等。

TiFlash 面板一共包括 TiFlash-Summary、TiFlash-Proxy-Summary、TiFlash-Proxy-Details。通过面板上的指标，可以了解 TiFlash 当前的状态。其中 TiFlash-Proxy-Summary、TiFlash-Proxy-Details 主要为 TiFlash 的 Raft 层信息，其监控指标信息可参考 TiKV 监控指标详解。

注意：

低版本的 TiFlash 监控信息较不完善，如有需要推荐使用 v4.0.5 或更高版本的 TiDB 集群。

以下为 TiFlash-Summary 默认的监控信息：

Server

Store size：每个 TiFlash 实例的使用的存储空间的大小。
Available size：每个 TiFlash 实例的可用的存储空间的大小。
Capacity size：每个 TiFlash 实例的存储容量的大小。
Uptime：自上次重启以来 TiFlash 正常运行的时间。
Memory：每个 TiFlash 实例内存的使用情况。
CPU Usage：每个 TiFlash 实例 CPU 的使用率。
FSync OPS：每个 TiFlash 实例每秒进行 fsync 操作的次数。
File Open OPS：每个 TiFlash 实例每秒进行 open 操作的次数。
Opened File Count：当前每个 TiFlash 实例打开的文件句柄数。

注意：

Store size、FSync OPS、File Open OPS、Opened File Count 目前仅包含了 TiFlash 存储层的统计指标，未包括 TiFlash-Proxy 内的信息。

Coprocessor

Request QPS：所有 TiFlash 实例收到的 coprocessor 请求数量。其中 batch 是 batch 请求数量，batch_cop 是 batch 请求中的 coprocessor 请求数量，cop 是直接通过 coprocessor 接口发送的 coprocessor 请求数量，cop_dag 是所有 coprocessor 请求中 dag 请求数量，super_batch 是开启 super batch 特性的请求数量。
Executor QPS：所有 TiFlash 实例收到的请求中，每种 dag 算子的数量，其中 table_scan 是扫表算子，selection 是过滤算子，aggregation 是聚合算子，top_n 是 TopN 算子，limit 是 limit 算子。
Request Duration：所有 TiFlash 实例处理 coprocessor request 总时间，总时间为接收到该 coprocessor 请求至请求应答完毕的时间。
Error QPS：所有 TiFlash 实例处理 coprocessor 请求的错误数量。其中 meet_lock 为读取的数据有锁，region_not_found 为 Region 不存在，epoch_not_match 为读取的 Region epoch 与本地不一致，kv_client_error 为与 TiKV 通信产生的错误，internal_error 为 TiFlash 内部系统错误，other 为其他错误。
Request Handle Duration：所有 TiFlash 实例处理 coprocessor 请求处理时间，处理时间为该 coprocessor 请求开始执行到执行结束的时间。
Response Bytes/Seconds：所有 TiFlash 实例应答总字节数。
Cop task memory usage：所有 TiFlash 实例处理 coprocessor 请求占用的总内存。
Handling Request Number：所有 TiFlash 实例正在处理的 coprocessor 请求数量之和。请求的分类与 Request QPS 中的相同.

DDL

Schema Version：每个 TiFlash 实例目前缓存的 schema 版本。
Schema Apply OPM：所有 TiFlash 实例每分钟 apply 同步 TiDB schema diff 的次数。diff apply 是正常的单次 apply 过程，如果 diff apply 失败，则 failed apply +1，并回退到 full apply，拉取最新的 schema 信息以更新 TiFlash 的 schema 版本。
Schema Internal DDL OPM：所有 TiFlash 实例每分钟执行的内部 DDL 次数。
Schema Apply Duration：所有 TiFlash 实例 apply schema 消耗的时间。

Storage

Write Command OPS：所有 TiFlash 实例存储层每秒收到的写请求数量。
Write Amplification：每个 TiFlash 实例写放大倍数（实际磁盘写入量/逻辑数据写入量）。total 为自此次启动以来的写放大倍数，5min 为最近 5 分钟内的写放大倍数。
Read Tasks OPS：每个 TiFlash 实例每秒存储层内部读取任务的数量。
Rough Set Filter Rate：每个 TiFlash 实例最近 1 分钟内读取的 packet 数被存储层粗糙索引过滤的比例。
Internal Tasks OPS：所有 TiFlash 实例每秒进行内部数据整理任务的次数。
Internal Tasks Duration：所有 TiFlash 实例进行内部数据整理任务消耗的时间。
Page GC Tasks OPM：所有 TiFlash 实例每分钟进行 Delta 部分数据整理任务的次数。
Page GC Tasks Duration：所有 TiFlash 实例进行 Delta 部分数据整理任务消耗的时间分布。
Disk Write OPS：所有 TiFlash 实例每秒进行磁盘写入的次数。
Disk Read OPS：所有 TiFlash 实例每秒进行磁盘读取的次数。
Write flow：所有 TiFlash 实例磁盘写操作的流量。
Read flow：所有 TiFlash 实例磁盘读操作的流量。

注意：

目前这部分监控指标仅包含了 TiFlash 存储层的统计指标，未包括 TiFlash-Proxy 内的信息。

Storage Write Stall

Write & Delta Management Throughput：所有实例写入及数据整理的吞吐量。
- throughput_write 表示通过 Raft 进行数据同步的吞吐量。
- throughput_delta-management 表示数据整理的吞吐量。
- total_write 表示自上次启动以来的总写入字节数。
- total_delta-management 表示自上次启动以来数据整理的总字节数。
Write Stall Duration：每个实例写入和移除 Region 数据产生的卡顿时长。
Write Throughput By Instance：每个实例写入数据的吞吐量，包括 apply Raft 数据日志以及 Raft 快照的写入吞吐量。
Write Command OPS By Instance：每个实例收到各种命令的总计数。
- write block 表示通过 Raft 同步数据日志。
- delete_range 表示从该实例中删除一些 Region 或移动一些 Region 到该实例中。
- ingest 表示这些 Region 的快照被应用到这个实例中。

Raft

Read Index OPS：每个 TiFlash 实例每秒触发 read_index 请求的次数，等于请求触发的 Region 总数。
Read Index Duration：所有 TiFlash 实例在进行 read_index 消耗的时间，主要消耗在于和 Region leader 的交互和重试时间。
Wait Index Duration：所有 TiFlash 实例在进行 wait_index 消耗的时间，即拿到 read_index 请求后，等待本地的 Region index >= read_index 所花费的时间。