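Health Check and Resource Monitoring

A MatrixOne distributed cluster comprises multiple components and objects. To keep the cluster running properly and to troubleshoot failures, a series of health checks and resource monitoring tasks is needed.

The environment used for health checks and resource monitoring in this document is the one set up in MatrixOne Distributed Cluster Deployment.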

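Objects to Check

  • Physical resource layer: the CPU, memory, and disk resources of the three virtual machines. Mature solutions exist for monitoring these resources (see the monitoring solution reference), so they are not covered in detail here.

  • Logical resource layer: MinIO capacity usage; CPU and memory usage of each Kubernetes node and Pod; and the overall status of MatrixOne together with the status of each component (such as LogService, CN, and TN).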

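Resource Monitoring

MinIO Capacity Monitoring

MinIO ships with a management console that shows capacity usage visually, including the amount of free space remaining. For details, see the official documentation.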

Health Check and Resource Monitoring - Figure 1
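
For unattended checks, the same capacity figure can be derived from MinIO's Prometheus-format metrics instead of the console. The sketch below only parses metrics text from standard input. The two metric names are assumptions based on MinIO's v2 cluster metrics and may differ in your MinIO version, and the function name minio_used_percent is hypothetical.

```shell
#!/bin/sh
# Compute the percentage of usable MinIO capacity currently in use from
# Prometheus-format metrics read on stdin.
# ASSUMPTION: the default metric names below match your MinIO version;
# pass different names as $1 (total) and $2 (free) if they do not.
minio_used_percent() {
  awk -v total_name="${1:-minio_cluster_capacity_usable_total_bytes}" \
      -v free_name="${2:-minio_cluster_capacity_usable_free_bytes}" '
    /^#/ { next }                         # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/\{.*/, "", name)               # drop any {label="..."} suffix
      if (name == total_name) total = $NF
      if (name == free_name)  free  = $NF
    }
    END {
      if (total <= 0) { print "unknown"; exit 1 }
      printf "%.1f\n", (total - free) * 100 / total
    }'
}
```

A possible invocation, assuming the metrics endpoint is reachable from where you run it and does not require a bearer token, is `curl -s http://minio.mostorage:9000/minio/v2/metrics/cluster | minio_used_percent`. Both the endpoint path and the unauthenticated access are assumptions to verify against your MinIO deployment.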

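Node/Pod Resource Monitoring

To decide whether the MatrixOne service needs to be scaled out or in, users typically monitor the resources used by the Nodes that host the MatrixOne cluster and by the Pods of each component.

You can do this with the kubectl top command, which relies on the Metrics Server being installed in the cluster. For the full command reference, see the official Kubernetes documentation for your version.

Node Monitoring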

  1. Use the following command to list the nodes of the MatrixOne cluster:

    kubectl get node

    [root@master0 ~]# kubectl get node
    NAME      STATUS   ROLES                  AGE   VERSION
    master0   Ready    control-plane,master   22h   v1.23.17
    node0     Ready    <none>                 22h   v1.23.17
  2. Based on the result above, use the following commands to view the resource usage of a specific node. According to the earlier deployment plan, the MatrixOne cluster runs on the node named node0:

    NODE="[node to monitor]" # Depending on the result above, this may be an IP, a hostname, or an alias, e.g. 10.0.0.1, host-10-0-0-1, node01
    kubectl top node ${NODE}

    [root@master0 ~]# kubectl top node
    NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    master0   179m         9%     4632Mi          66%
    node0     292m         15%    4115Mi          56%
    [root@master0 ~]# kubectl top node node0
    NAME    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    node0   299m         15%    4079Mi          56%
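  3. You can also view a node's resource allocation and resource limits. Note that allocated resources are not the same as resources in use. Running kubectl describe node without a node name, as in the session below, describes every node in the cluster.

  1. [root@master0 ~]# kubectl describe node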
  2. Name: master0
  3. Roles: control-plane,master
  4. Labels: beta.kubernetes.io/arch=amd64
  5. beta.kubernetes.io/os=linux
  6. kubernetes.io/arch=amd64
  7. kubernetes.io/hostname=master0
  8. kubernetes.io/os=linux
  9. node-role.kubernetes.io/control-plane=
  10. node-role.kubernetes.io/master=
  11. node.kubernetes.io/exclude-from-external-load-balancers=
  12. Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
  13. node.alpha.kubernetes.io/ttl: 0
  14. projectcalico.org/IPv4Address: 10.206.134.8/24
  15. projectcalico.org/IPv4VXLANTunnelAddr: 10.234.166.0
  16. volumes.kubernetes.io/controller-managed-attach-detach: true
  17. CreationTimestamp: Sun, 07 May 2023 12:28:57 +0800
  18. Taints: node-role.kubernetes.io/master:NoSchedule
  19. Unschedulable: false
  20. Lease:
  21. HolderIdentity: master0
  22. AcquireTime: <unset>
  23. RenewTime: Mon, 08 May 2023 10:56:08 +0800
  24. Conditions:
  25. Type Status LastHeartbeatTime LastTransitionTime Reason Message
  26. ---- ------ ----------------- ------------------ ------ -------
  27. NetworkUnavailable False Sun, 07 May 2023 12:30:08 +0800 Sun, 07 May 2023 12:30:08 +0800 CalicoIsUp Calico is running on this node
  28. MemoryPressure False Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 12:28:55 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
  29. DiskPressure False Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 12:28:55 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
  30. PIDPressure False Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 12:28:55 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
  31. Ready True Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 20:47:39 +0800 KubeletReady kubelet is posting ready status
  32. Addresses:
  33. InternalIP: 10.206.134.8
  34. Hostname: master0
  35. Capacity:
  36. cpu: 2
  37. ephemeral-storage: 51473868Ki
  38. hugepages-1Gi: 0
  39. hugepages-2Mi: 0
  40. memory: 7782436Ki
  41. pods: 110
  42. Allocatable:
  43. cpu: 1800m
  44. ephemeral-storage: 47438316671
  45. hugepages-1Gi: 0
  46. hugepages-2Mi: 0
  47. memory: 7155748Ki
  48. pods: 110
  49. System Info:
  50. Machine ID: fb436be013b5415799d27abf653585d3
  51. System UUID: FB436BE0-13B5-4157-99D2-7ABF653585D3
  52. Boot ID: 552bd576-56c8-4d22-9549-d950069a5a77
  53. Kernel Version: 3.10.0-1160.88.1.el7.x86_64
  54. OS Image: CentOS Linux 7 (Core)
  55. Operating System: linux
  56. Architecture: amd64
  57. Container Runtime Version: docker://20.10.23
  58. Kubelet Version: v1.23.17
  59. Kube-Proxy Version: v1.23.17
  60. PodCIDR: 10.234.0.0/23
  61. PodCIDRs: 10.234.0.0/23
  62. Non-terminated Pods: (12 in total)
  63. Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
  64. --------- ---- ------------ ---------- --------------- ------------- ---
  65. default netchecker-agent-7xnwb 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
  66. default netchecker-agent-hostnet-bw85f 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
  67. kruise-system kruise-daemon-xvl8t 0 (0%) 50m (2%) 0 (0%) 128Mi (1%) 20h
  68. kube-system calico-node-sbzfc 150m (8%) 300m (16%) 64M (0%) 500M (6%) 22h
  69. kube-system dns-autoscaler-7874cf6bcf-l55q4 20m (1%) 0 (0%) 10Mi (0%) 0 (0%) 22h
  70. kube-system kube-apiserver-master0 250m (13%) 0 (0%) 0 (0%) 0 (0%) 22h
  71. kube-system kube-controller-manager-master0 200m (11%) 0 (0%) 0 (0%) 0 (0%) 22h
  72. kube-system kube-proxy-lfkhk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22h
  73. kube-system kube-scheduler-master0 100m (5%) 0 (0%) 0 (0%) 0 (0%) 22h
  74. kube-system metrics-server-7bd47f88c4-knh9b 100m (5%) 100m (5%) 200Mi (2%) 200Mi (2%) 22h
  75. kube-system nodelocaldns-dcffl 100m (5%) 0 (0%) 70Mi (1%) 170Mi (2%) 14h
  76. kuboard kuboard-v3-master0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22h
  77. Allocated resources:
  78. (Total limits may be over 100 percent, i.e., overcommitted.)
  79. Resource Requests Limits
  80. -------- -------- ------
  81. cpu 950m (52%) 510m (28%)
  82. memory 485601280 (6%) 1222190848 (16%)
  83. ephemeral-storage 0 (0%) 0 (0%)
  84. hugepages-1Gi 0 (0%) 0 (0%)
  85. hugepages-2Mi 0 (0%) 0 (0%)
  86. Events: <none>
  87. Name: node0
  88. Roles: <none>
  89. Labels: beta.kubernetes.io/arch=amd64
  90. beta.kubernetes.io/os=linux
  91. kubernetes.io/arch=amd64
  92. kubernetes.io/hostname=node0
  93. kubernetes.io/os=linux
  94. Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
  95. node.alpha.kubernetes.io/ttl: 0
  96. projectcalico.org/IPv4Address: 10.206.134.14/24
  97. projectcalico.org/IPv4VXLANTunnelAddr: 10.234.60.0
  98. volumes.kubernetes.io/controller-managed-attach-detach: true
  99. CreationTimestamp: Sun, 07 May 2023 12:29:46 +0800
  100. Taints: <none>
  101. Unschedulable: false
  102. Lease:
  103. HolderIdentity: node0
  104. AcquireTime: <unset>
  105. RenewTime: Mon, 08 May 2023 10:56:06 +0800
  106. Conditions:
  107. Type Status LastHeartbeatTime LastTransitionTime Reason Message
  108. ---- ------ ----------------- ------------------ ------ -------
  109. NetworkUnavailable False Sun, 07 May 2023 12:30:08 +0800 Sun, 07 May 2023 12:30:08 +0800 CalicoIsUp Calico is running on this node
  110. MemoryPressure False Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 12:29:46 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
  111. DiskPressure False Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 12:29:46 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
  112. PIDPressure False Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 12:29:46 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
  113. Ready True Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 20:48:36 +0800 KubeletReady kubelet is posting ready status
  114. Addresses:
  115. InternalIP: 10.206.134.14
  116. Hostname: node0
  117. Capacity:
  118. cpu: 2
  119. ephemeral-storage: 51473868Ki
  120. hugepages-1Gi: 0
  121. hugepages-2Mi: 0
  122. memory: 7782444Ki
  123. pods: 110
  124. Allocatable:
  125. cpu: 1900m
  126. ephemeral-storage: 47438316671
  127. hugepages-1Gi: 0
  128. hugepages-2Mi: 0
  129. memory: 7417900Ki
  130. pods: 110
  131. System Info:
  132. Machine ID: a6600151884b44fb9f0bc9af490e44b7
  133. System UUID: A6600151-884B-44FB-9F0B-C9AF490E44B7
  134. Boot ID: b7f3357f-44e6-425e-8c90-6ada14e92703
  135. Kernel Version: 3.10.0-1160.88.1.el7.x86_64
  136. OS Image: CentOS Linux 7 (Core)
  137. Operating System: linux
  138. Architecture: amd64
  139. Container Runtime Version: docker://20.10.23
  140. Kubelet Version: v1.23.17
  141. Kube-Proxy Version: v1.23.17
  142. PodCIDR: 10.234.2.0/23
  143. PodCIDRs: 10.234.2.0/23
  144. Non-terminated Pods: (20 in total)
  145. Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
  146. --------- ---- ------------ ---------- --------------- ------------- ---
  147. default netchecker-agent-6v8rl 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
  148. default netchecker-agent-hostnet-fb2jn 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
  149. default netchecker-server-645d759b79-v4bqm 150m (7%) 300m (15%) 192M (2%) 512M (6%) 22h
  150. kruise-system kruise-controller-manager-74847d59cf-295rk 100m (5%) 200m (10%) 256Mi (3%) 512Mi (7%) 20h
  151. kruise-system kruise-controller-manager-74847d59cf-854sq 100m (5%) 200m (10%) 256Mi (3%) 512Mi (7%) 20h
  152. kruise-system kruise-daemon-rz9pj 0 (0%) 50m (2%) 0 (0%) 128Mi (1%) 20h
  153. kube-system calico-kube-controllers-74df5cd99c-n9qsn 30m (1%) 1 (52%) 64M (0%) 256M (3%) 22h
  154. kube-system calico-node-brqrk 150m (7%) 300m (15%) 64M (0%) 500M (6%) 22h
  155. kube-system coredns-76b4fb4578-9cqc7 100m (5%) 0 (0%) 70Mi (0%) 170Mi (2%) 14h
  156. kube-system kube-proxy-rpxb5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22h
  157. kube-system nginx-proxy-node0 25m (1%) 0 (0%) 32M (0%) 0 (0%) 22h
  158. kube-system nodelocaldns-qkxhv 100m (5%) 0 (0%) 70Mi (0%) 170Mi (2%) 14h
  159. local-path-storage local-path-storage-local-path-provisioner-d5bb7f8c9-qfp8h 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
  160. mo-hn matrixone-operator-f8496ff5c-fp6zm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 20h
  161. mo-hn mo-tn-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
  162. mo-hn mo-log-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
  163. mo-hn mo-log-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
  164. mo-hn mo-log-2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
  165. mo-hn mo-tp-cn-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
  166. mostorage minio-674ccf54f7-tdglh 0 (0%) 0 (0%) 512Mi (7%) 0 (0%) 20h
  167. Allocated resources:
  168. (Total limits may be over 100 percent, i.e., overcommitted.)
  169. Resource Requests Limits
  170. -------- -------- ------
  171. cpu 785m (41%) 2110m (111%)
  172. memory 1700542464 (22%) 3032475392 (39%)
  173. ephemeral-storage 0 (0%) 0 (0%)
  174. hugepages-1Gi 0 (0%) 0 (0%)
  175. hugepages-2Mi 0 (0%) 0 (0%)
  176. Events: <none>
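
To turn the kubectl top node output from step 2 into a quick alert, the usage percentages can be compared against a threshold. The following is a minimal sketch rather than a complete monitoring solution. The function name nodes_over_threshold and the default threshold of 80 are assumptions chosen for illustration.

```shell
#!/bin/sh
# Read `kubectl top node` output on stdin and print "<node> <cpu%> <mem%>"
# for every node whose CPU% or MEMORY% is at or above the threshold.
# The threshold is passed as $1 and defaults to 80 (an illustrative value).
nodes_over_threshold() {
  awk -v limit="${1:-80}" '
    NR == 1 { next }                      # skip the header row
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
        print $1, cpu, mem
    }'
}
```

Typical usage would be `kubectl top node | nodes_over_threshold 60`. An empty result means no node has reached the threshold.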

Pod Monitoring

  1. Use the following commands to list the Pods of the MatrixOne cluster:

    NS="mo-hn"
    kubectl get pod -n${NS}
  2. Based on the result above, use the following commands to view the resource usage of a specific Pod:

    POD="[name of the pod to monitor]" # Based on the result above, e.g. TN is mo-tn-0; CN is mo-tp-cn-0, mo-tp-cn-1, ...; LogService is mo-log-0, mo-log-1, ...
    kubectl top pod ${POD} -n${NS}

    The command shows the CPU and memory usage of the specified Pod, similar to the following output:

    [root@master0 ~]# kubectl top pod mo-tp-cn-0 -nmo-hn
    NAME         CPU(cores)   MEMORY(bytes)
    mo-tp-cn-0   20m          214Mi
    [root@master0 ~]# kubectl top pod mo-tn-0 -nmo-hn
    NAME      CPU(cores)   MEMORY(bytes)
    mo-tn-0   36m          161Mi
  3. In addition, you can view the resources declared for a specific Pod and compare them with the resources actually in use.

    kubectl describe pod ${POD} -n${NS}
    kubectl get pod ${POD} -n${NS} -oyaml
  1. [root@master0 ~]# kubectl describe pod mo-tp-cn-0 -nmo-hn
  2. Name: mo-tp-cn-0
  3. Namespace: mo-hn
  4. Priority: 0
  5. Node: node0/10.206.134.14
  6. Start Time: Sun, 07 May 2023 21:01:50 +0800
  7. Labels: controller-revision-hash=mo-tp-cn-8666cdfb56
  8. lifecycle.apps.kruise.io/state=Normal
  9. matrixorigin.io/cluster=mo
  10. matrixorigin.io/component=CNSet
  11. matrixorigin.io/instance=mo-tp
  12. matrixorigin.io/namespace=mo-hn
  13. statefulset.kubernetes.io/pod-name=mo-tp-cn-0
  14. Annotations: apps.kruise.io/runtime-containers-meta:
  15. {"containers":[{"name":"main","containerID":"docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f","restartCount":0,"...
  16. cni.projectcalico.org/containerID: 80b286789a2d6fa9e615c3edee79b57edb452eaeafddb9b7b82ec5fb2e339409
  17. cni.projectcalico.org/podIP: 10.234.60.53/32
  18. cni.projectcalico.org/podIPs: 10.234.60.53/32
  19. kruise.io/related-pub: mo
  20. lifecycle.apps.kruise.io/timestamp: 2023-05-07T13:01:50Z
  21. matrixone.cloud/cn-label: null
  22. matrixone.cloud/dns-based-identity: False
  23. Status: Running
  24. IP: 10.234.60.53
  25. IPs:
  26. IP: 10.234.60.53
  27. Controlled By: StatefulSet/mo-tp-cn
  28. Containers:
  29. main:
  30. Container ID: docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f
  31. Image: matrixorigin/matrixone:nightly-144f3be4
  32. Image ID: docker-pullable://matrixorigin/matrixone@sha256:288fe3d626c6aa564684099e4686a9d4b28e16fdd16512bd968a67bb41d5aaa3
  33. Port: <none>
  34. Host Port: <none>
  35. Command:
  36. /bin/sh
  37. /etc/matrixone/config/start.sh
  38. Args:
  39. -debug-http=:6060
  40. State: Running
  41. Started: Sun, 07 May 2023 21:01:54 +0800
  42. Ready: True
  43. Restart Count: 0
  44. Environment:
  45. POD_NAME: mo-tp-cn-0 (v1:metadata.name)
  46. NAMESPACE: mo-hn (v1:metadata.namespace)
  47. HEADLESS_SERVICE_NAME: mo-tp-cn-headless
  48. AWS_ACCESS_KEY_ID: <set to the key 'AWS_ACCESS_KEY_ID' in secret 'minio'> Optional: false
  49. AWS_SECRET_ACCESS_KEY: <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 'minio'> Optional: false
  50. AWS_REGION: us-west-2
  51. Mounts:
  52. /etc/matrixone/config from config (ro)
  53. /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ngpcs (ro)
  54. Readiness Gates:
  55. Type Status
  56. InPlaceUpdateReady True
  57. KruisePodReady True
  58. Conditions:
  59. Type Status
  60. KruisePodReady True
  61. InPlaceUpdateReady True
  62. Initialized True
  63. Ready True
  64. ContainersReady True
  65. PodScheduled True
  66. Volumes:
  67. config:
  68. Type: ConfigMap (a volume populated by a ConfigMap)
  69. Name: mo-tp-cn-config-5abf454
  70. Optional: false
  71. kube-api-access-ngpcs:
  72. Type: Projected (a volume that contains injected data from multiple sources)
  73. TokenExpirationSeconds: 3607
  74. ConfigMapName: kube-root-ca.crt
  75. ConfigMapOptional: <nil>
  76. DownwardAPI: true
  77. QoS Class: BestEffort
  78. Node-Selectors: <none>
  79. Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
  80. node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
  81. Events: <none>
  82. [root@master0 ~]# kubectl get pod mo-tp-cn-0 -nmo-hn -oyaml
  83. apiVersion: v1
  84. kind: Pod
  85. metadata:
  86. annotations:
  87. apps.kruise.io/runtime-containers-meta: '{"containers":[{"name":"main","containerID":"docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f","restartCount":0,"hashes":{"plainHash":1670287891}}]}'
  88. cni.projectcalico.org/containerID: 80b286789a2d6fa9e615c3edee79b57edb452eaeafddb9b7b82ec5fb2e339409
  89. cni.projectcalico.org/podIP: 10.234.60.53/32
  90. cni.projectcalico.org/podIPs: 10.234.60.53/32
  91. kruise.io/related-pub: mo
  92. lifecycle.apps.kruise.io/timestamp: "2023-05-07T13:01:50Z"
  93. matrixone.cloud/cn-label: "null"
  94. matrixone.cloud/dns-based-identity: "False"
  95. creationTimestamp: "2023-05-07T13:01:50Z"
  96. generateName: mo-tp-cn-
  97. labels:
  98. controller-revision-hash: mo-tp-cn-8666cdfb56
  99. lifecycle.apps.kruise.io/state: Normal
  100. matrixorigin.io/cluster: mo
  101. matrixorigin.io/component: CNSet
  102. matrixorigin.io/instance: mo-tp
  103. matrixorigin.io/namespace: mo-hn
  104. statefulset.kubernetes.io/pod-name: mo-tp-cn-0
  105. name: mo-tp-cn-0
  106. namespace: mo-hn
  107. ownerReferences:
  108. - apiVersion: apps.kruise.io/v1beta1
  109. blockOwnerDeletion: true
  110. controller: true
  111. kind: StatefulSet
  112. name: mo-tp-cn
  113. uid: 891e0453-89a5-45d5-ad12-16ef048c804f
  114. resourceVersion: "72625"
  115. uid: 1e3e2df3-f1c2-4444-8694-8d23e7125d35
  116. spec:
  117. containers:
  118. - args:
  119. - -debug-http=:6060
  120. command:
  121. - /bin/sh
  122. - /etc/matrixone/config/start.sh
  123. env:
  124. - name: POD_NAME
  125. valueFrom:
  126. fieldRef:
  127. apiVersion: v1
  128. fieldPath: metadata.name
  129. - name: NAMESPACE
  130. valueFrom:
  131. fieldRef:
  132. apiVersion: v1
  133. fieldPath: metadata.namespace
  134. - name: HEADLESS_SERVICE_NAME
  135. value: mo-tp-cn-headless
  136. - name: AWS_ACCESS_KEY_ID
  137. valueFrom:
  138. secretKeyRef:
  139. key: AWS_ACCESS_KEY_ID
  140. name: minio
  141. - name: AWS_SECRET_ACCESS_KEY
  142. valueFrom:
  143. secretKeyRef:
  144. key: AWS_SECRET_ACCESS_KEY
  145. name: minio
  146. - name: AWS_REGION
  147. value: us-west-2
  148. image: matrixorigin/matrixone:nightly-144f3be4
  149. imagePullPolicy: Always
  150. name: main
  151. resources: {}
  152. terminationMessagePath: /dev/termination-log
  153. terminationMessagePolicy: File
  154. volumeMounts:
  155. - mountPath: /etc/matrixone/config
  156. name: config
  157. readOnly: true
  158. - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  159. name: kube-api-access-ngpcs
  160. readOnly: true
  161. dnsPolicy: ClusterFirst
  162. enableServiceLinks: true
  163. hostname: mo-tp-cn-0
  164. nodeName: node0
  165. preemptionPolicy: PreemptLowerPriority
  166. priority: 0
  167. readinessGates:
  168. - conditionType: InPlaceUpdateReady
  169. - conditionType: KruisePodReady
  170. restartPolicy: Always
  171. schedulerName: default-scheduler
  172. securityContext: {}
  173. serviceAccount: default
  174. serviceAccountName: default
  175. subdomain: mo-tp-cn-headless
  176. terminationGracePeriodSeconds: 30
  177. tolerations:
  178. - effect: NoExecute
  179. key: node.kubernetes.io/not-ready
  180. operator: Exists
  181. tolerationSeconds: 300
  182. - effect: NoExecute
  183. key: node.kubernetes.io/unreachable
  184. operator: Exists
  185. tolerationSeconds: 300
  186. volumes:
  187. - configMap:
  188. defaultMode: 420
  189. name: mo-tp-cn-config-5abf454
  190. name: config
  191. - name: kube-api-access-ngpcs
  192. projected:
  193. defaultMode: 420
  194. sources:
  195. - serviceAccountToken:
  196. expirationSeconds: 3607
  197. path: token
  198. - configMap:
  199. items:
  200. - key: ca.crt
  201. path: ca.crt
  202. name: kube-root-ca.crt
  203. - downwardAPI:
  204. items:
  205. - fieldRef:
  206. apiVersion: v1
  207. fieldPath: metadata.namespace
  208. path: namespace
  209. status:
  210. conditions:
  211. - lastProbeTime: null
  212. lastTransitionTime: "2023-05-07T13:01:50Z"
  213. status: "True"
  214. type: KruisePodReady
  215. - lastProbeTime: null
  216. lastTransitionTime: "2023-05-07T13:01:50Z"
  217. status: "True"
  218. type: InPlaceUpdateReady
  219. - lastProbeTime: null
  220. lastTransitionTime: "2023-05-07T13:01:50Z"
  221. status: "True"
  222. type: Initialized
  223. - lastProbeTime: null
  224. lastTransitionTime: "2023-05-07T13:01:54Z"
  225. status: "True"
  226. type: Ready
  227. - lastProbeTime: null
  228. lastTransitionTime: "2023-05-07T13:01:54Z"
  229. status: "True"
  230. type: ContainersReady
  231. - lastProbeTime: null
  232. lastTransitionTime: "2023-05-07T13:01:50Z"
  233. status: "True"
  234. type: PodScheduled
  235. containerStatuses:
  236. - containerID: docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f
  237. image: matrixorigin/matrixone:nightly-144f3be4
  238. imageID: docker-pullable://matrixorigin/matrixone@sha256:288fe3d626c6aa564684099e4686a9d4b28e16fdd16512bd968a67bb41d5aaa3
  239. lastState: {}
  240. name: main
  241. ready: true
  242. restartCount: 0
  243. started: true
  244. state:
  245. running:
  246. startedAt: "2023-05-07T13:01:54Z"
  247. hostIP: 10.206.134.14
  248. phase: Running
  249. podIP: 10.234.60.53
  250. podIPs:
  251. - ip: 10.234.60.53
  252. qosClass: BestEffort
  253. startTime: "2023-05-07T13:01:50Z"
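
The full describe and YAML output is long when only the resource declaration is of interest. A JSONPath query can narrow it to each container's resources field and the Pod's QoS class. This is a sketch under the assumption that kubectl is on the PATH. The KUBECTL variable and the function name pod_resources are hypothetical helpers, not part of MatrixOne or kubectl.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. to point at a wrapper

# Print each container's declared resources and the Pod's QoS class.
# $1 is the Pod name, $2 is the namespace.
pod_resources() {
  pod="$1"; ns="$2"
  "$KUBECTL" get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}{"qosClass: "}{.status.qosClass}{"\n"}'
}
```

For example, `pod_resources mo-tp-cn-0 mo-hn` queries the CN Pod shown above. A Pod with an empty resources field and a BestEffort QoS class, as in the sample output, declares no requests or limits, so kubectl top is the only indication of what it consumes.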

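MatrixOne Monitoring

Checking Cluster Status

During Operator deployment, we defined a MatrixOneCluster custom resource that represents the whole cluster. Inspecting this MatrixOneCluster object tells us whether the cluster is running properly. Use the following commands to check it:

    MO_NAME="mo"
    NS="mo-hn"
    kubectl get matrixonecluster -n${NS} ${MO_NAME}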

If the PHASE column shows “Ready”, the cluster is healthy. If it shows “NotReady”, further troubleshooting is needed.

    [root@master0 ~]# MO_NAME="mo"
    [root@master0 ~]# NS="mo-hn"
    [root@master0 ~]# kubectl get matrixonecluster -n${NS} ${MO_NAME}
    NAME   LOG   TN   TP   AP   VERSION            PHASE   AGE
    mo     3     1    1         nightly-144f3be4   Ready   13h
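
For scripted health checks, the phase can be read directly and turned into an exit code. The sketch below assumes the phase is exposed at .status.phase of the MatrixOneCluster object, which is consistent with the Status section of the describe output in this document. The KUBECTL variable and the function name cluster_is_ready are hypothetical.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. to point at a wrapper

# Succeed (exit status 0) only when the MatrixOneCluster reports phase "Ready".
# $1 is the cluster name, $2 is the namespace.
cluster_is_ready() {
  name="$1"; ns="$2"
  phase=$("$KUBECTL" get matrixonecluster "$name" -n "$ns" -o jsonpath='{.status.phase}')
  echo "phase: ${phase:-unknown}"
  [ "$phase" = "Ready" ]
}
```

A cron job or CI step could then run `cluster_is_ready mo mo-hn || echo "MatrixOne cluster needs attention"`.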

To view detailed status information for the MatrixOne cluster, run the following command:

    kubectl describe matrixonecluster -n${NS} ${MO_NAME}
  1. [root@master0 ~]# kubectl describe matrixonecluster -n${NS} ${MO_NAME}
  2. Name: mo
  3. Namespace: mo-hn
  4. Labels: <none>
  5. Annotations: <none>
  6. API Version: core.matrixorigin.io/v1alpha1
  7. Kind: MatrixOneCluster
  8. Metadata:
  9. Creation Timestamp: 2023-05-07T12:54:17Z
  10. Finalizers:
  11. matrixorigin.io/matrixonecluster
  12. Generation: 2
  13. Managed Fields:
  14. API Version: core.matrixorigin.io/v1alpha1
  15. Fields Type: FieldsV1
  16. fieldsV1:
  17. f:metadata:
  18. f:annotations:
  19. .:
  20. f:kubectl.kubernetes.io/last-applied-configuration:
  21. f:spec:
  22. .:
  23. f:Tn:
  24. .:
  25. f:config:
  26. f:replicas:
  27. f:imagePullPolicy:
  28. f:imageRepository:
  29. f:logService:
  30. .:
  31. f:config:
  32. f:pvcRetentionPolicy:
  33. f:replicas:
  34. f:sharedStorage:
  35. .:
  36. f:s3:
  37. .:
  38. f:endpoint:
  39. f:secretRef:
  40. f:type:
  41. f:volume:
  42. .:
  43. f:size:
  44. f:tp:
  45. .:
  46. f:config:
  47. f:nodePort:
  48. f:replicas:
  49. f:serviceType:
  50. f:version:
  51. Manager: kubectl-client-side-apply
  52. Operation: Update
  53. Time: 2023-05-07T12:54:17Z
  54. API Version: core.matrixorigin.io/v1alpha1
  55. Fields Type: FieldsV1
  56. fieldsV1:
  57. f:metadata:
  58. f:finalizers:
  59. .:
  60. v:"matrixorigin.io/matrixonecluster":
  61. Manager: manager
  62. Operation: Update
  63. Time: 2023-05-07T12:54:17Z
  64. API Version: core.matrixorigin.io/v1alpha1
  65. Fields Type: FieldsV1
  66. fieldsV1:
  67. f:spec:
  68. f:logService:
  69. f:sharedStorage:
  70. f:s3:
  71. f:path:
  72. Manager: kubectl-edit
  73. Operation: Update
  74. Time: 2023-05-07T13:00:53Z
  75. API Version: core.matrixorigin.io/v1alpha1
  76. Fields Type: FieldsV1
  77. fieldsV1:
  78. f:status:
  79. .:
  80. f:cnGroups:
  81. .:
  82. f:desiredGroups:
  83. f:readyGroups:
  84. f:syncedGroups:
  85. f:conditions:
  86. f:credentialRef:
  87. f:Tn:
  88. .:
  89. f:availableStores:
  90. f:conditions:
  91. f:logService:
  92. .:
  93. f:availableStores:
  94. f:conditions:
  95. f:discovery:
  96. .:
  97. f:address:
  98. f:port:
  99. f:phase:
  100. Manager: manager
  101. Operation: Update
  102. Subresource: status
  103. Time: 2023-05-07T13:02:12Z
  104. Resource Version: 72671
  105. UID: be2355c0-0c69-4f0f-95bb-9310224200b6
  106. Spec:
  107. Tn:
  108. Config:
  109. [dn]
  110. [dn.Ckp]
  111. flush-interval = "60s"
  112. global-interval = "100000s"
  113. incremental-interval = "60s"
  114. min-count = 100
  115. scan-interval = "5s"
  116. [dn.Txn]
  117. [dn.Txn.Storage]
  118. backend = "TAE"
  119. log-backend = "logservice"
  120. [log]
  121. format = "json"
  122. level = "error"
  123. max-size = 512
  124. Replicas: 1
  125. Resources:
  126. Service Args:
  127. -debug-http=:6060
  128. Shared Storage Cache:
  129. Memory Cache Size: 0
  130. Image Pull Policy: Always
  131. Image Repository: matrixorigin/matrixone
  132. Log Service:
  133. Config:
  134. [log]
  135. format = "json"
  136. level = "error"
  137. max-size = 512
  138. Initial Config:
  139. TN Shards: 1
  140. Log Shard Replicas: 3
  141. Log Shards: 1
  142. Pvc Retention Policy: Retain
  143. Replicas: 3
  144. Resources:
  145. Service Args:
  146. -debug-http=:6060
  147. Shared Storage:
  148. s3:
  149. Endpoint: http://minio.mostorage:9000
  150. Path: minio-mo
  151. s3RetentionPolicy: Retain
  152. Secret Ref:
  153. Name: minio
  154. Type: minio
  155. Store Failure Timeout: 10m0s
  156. Volume:
  157. Size: 1Gi
  158. Tp:
  159. Config:
  160. [cn]
  161. [cn.Engine]
  162. type = "distributed-tae"
  163. [log]
  164. format = "json"
  165. level = "debug"
  166. max-size = 512
  167. Node Port: 31474
  168. Replicas: 1
  169. Resources:
  170. Service Args:
  171. -debug-http=:6060
  172. Service Type: NodePort
  173. Shared Storage Cache:
  174. Memory Cache Size: 0
  175. Version: nightly-144f3be4
  176. Status:
  177. Cn Groups:
  178. Desired Groups: 1
  179. Ready Groups: 1
  180. Synced Groups: 1
  181. Conditions:
  182. Last Transition Time: 2023-05-07T13:02:14Z
  183. Message: the object is synced
  184. Reason: empty
  185. Status: True
  186. Type: Synced
  187. Last Transition Time: 2023-05-07T13:02:14Z
  188. Message:
  189. Reason: AllSetsReady
  190. Status: True
  191. Type: Ready
  192. Credential Ref:
  193. Name: mo-credential
  194. Tn:
  195. Available Stores:
  196. Last Transition: 2023-05-07T13:01:48Z
  197. Phase: Up
  198. Pod Name: mo-tn-0
  199. Conditions:
  200. Last Transition Time: 2023-05-07T13:01:48Z
  201. Message: the object is synced
  202. Reason: empty
  203. Status: True
  204. Type: Synced
  205. Last Transition Time: 2023-05-07T13:01:48Z
  206. Message:
  207. Reason: empty
  208. Status: True
  209. Type: Ready
  210. Log Service:
  211. Available Stores:
  212. Last Transition: 2023-05-07T13:01:25Z
  213. Phase: Up
  214. Pod Name: mo-log-0
  215. Last Transition: 2023-05-07T13:01:25Z
  216. Phase: Up
  217. Pod Name: mo-log-1
  218. Last Transition: 2023-05-07T13:01:25Z
  219. Phase: Up
  220. Pod Name: mo-log-2
  221. Conditions:
  222. Last Transition Time: 2023-05-07T13:01:25Z
  223. Message: the object is synced
  224. Reason: empty
  225. Status: True
  226. Type: Synced
  227. Last Transition Time: 2023-05-07T13:01:25Z
  228. Message:
  229. Reason: empty
  230. Status: True
  231. Type: Ready
  232. Discovery:
  233. Address: mo-log-discovery.mo-hn.svc
  234. Port: 32001
  235. Phase: Ready
  236. Events:
  237. Type Reason Age From Message
  238. ---- ------ ---- ---- -------
  239. Normal ReconcileSuccess 29m (x2 over 13h) matrixonecluster object is synced
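
The Conditions list in the output above includes a condition of type Ready. When a script needs to block until the cluster becomes healthy, for example right after a configuration change, kubectl wait can poll that condition. This is a sketch that assumes kubectl wait evaluates the Ready condition of this custom resource in your kubectl version. The KUBECTL variable, the function name wait_cluster_ready, and the 120s default timeout are illustrative choices.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. to point at a wrapper

# Block until the cluster's Ready condition is True, or fail after the timeout.
# $1 is the cluster name, $2 is the namespace, $3 is an optional timeout.
wait_cluster_ready() {
  name="$1"; ns="$2"; timeout="${3:-120s}"
  "$KUBECTL" wait --for=condition=Ready "matrixonecluster/$name" -n "$ns" --timeout="$timeout"
}
```

For example, `wait_cluster_ready mo mo-hn 300s` returns a non-zero exit status if the cluster is still not Ready after five minutes.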

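Checking Component Status

The MatrixOne cluster currently contains the following components: TN, CN, and Log Service. They correspond to the custom resource types TNSet, CNSet, and LogSet respectively, and these objects are generated by the MatrixOneCluster controller.

To check whether each component is healthy, taking TN as an example, run the following commands:

    SET_TYPE="tnset"
    NS="mo-hn"
    kubectl get ${SET_TYPE} -n${NS}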

This displays the status of the TN component. The sample session below also runs the same command for cnset and logset:

    [root@master0 ~]# SET_TYPE="tnset"
    [root@master0 ~]# NS="mo-hn"
    [root@master0 ~]# kubectl get ${SET_TYPE} -n${NS}
    NAME   IMAGE                                     REPLICAS   AGE
    mo     matrixorigin/matrixone:nightly-144f3be4   1          13h
    [root@master0 ~]# SET_TYPE="cnset"
    [root@master0 ~]# kubectl get ${SET_TYPE} -n${NS}
    NAME    IMAGE                                     REPLICAS   AGE
    mo-tp   matrixorigin/matrixone:nightly-144f3be4   1          13h
    [root@master0 ~]# SET_TYPE="logset"
    [root@master0 ~]# kubectl get ${SET_TYPE} -n${NS}
    NAME   IMAGE                                     REPLICAS   AGE
    mo     matrixorigin/matrixone:nightly-144f3be4   3          13h
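
Rather than resetting SET_TYPE by hand for each component, the three queries can be run in one loop. The sketch below uses the same resource names as the session above. The KUBECTL variable and the function name list_component_sets are hypothetical.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. to point at a wrapper

# List every component set (TN, CN, LogService) in the given namespace.
list_component_sets() {
  ns="$1"
  for set_type in tnset cnset logset; do
    echo "== ${set_type} =="
    "$KUBECTL" get "$set_type" -n "$ns"
  done
}
```

Calling `list_component_sets mo-hn` prints one labelled table per component type.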

Checking Pod Status

You can also inspect the native Kubernetes objects generated for the MO cluster to confirm its health. In most cases, querying the Pods is sufficient.

    NS="mo-hn"
    kubectl get pod -n${NS}

This displays the status of the Pods.

    [root@master0 ~]# NS="mo-hn"
    [root@master0 ~]# kubectl get pod -n${NS}
    NAME                                 READY   STATUS    RESTARTS   AGE
    matrixone-operator-f8496ff5c-fp6zm   1/1     Running   0          19h
    mo-tn-0                              1/1     Running   0          13h
    mo-log-0                             1/1     Running   0          13h
    mo-log-1                             1/1     Running   0          13h
    mo-log-2                             1/1     Running   0          13h
    mo-tp-cn-0                           1/1     Running   0          13h
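
When the namespace holds many Pods, it helps to print only those that need attention. The sketch below filters kubectl get pod output for Pods that are not Running or whose READY count is incomplete. The function name unhealthy_pods is hypothetical, and Pods in a Completed state would also be listed, which may or may not be what you want.

```shell
#!/bin/sh
# Read `kubectl get pod` output on stdin and print "<name> <ready> <status>"
# for Pods that are not Running or that have fewer ready containers than expected.
unhealthy_pods() {
  awk '
    NR == 1 { next }                      # skip the header row
    {
      split($2, ready, "/")               # READY looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2])
        print $1, $2, $3
    }'
}
```

Typical usage would be `kubectl get pod -n mo-hn | unhealthy_pods`. No output means every Pod is Running with all containers ready.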

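Normally, the Running status means the Pod is working properly. In some cases, however, a Pod is Running while the MatrixOne cluster is not actually healthy, for example when a MySQL client cannot connect to the cluster. In that situation, inspect the Pod's logs for abnormal output.

    NS="mo-hn"
    POD_NAME="[pod name from the output above]" # e.g. mo-tp-cn-0
    kubectl logs ${POD_NAME} -n${NS}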
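
Pod logs can be long, so a first pass often just searches for obviously abnormal lines. The following is a rough sketch. The keyword list (error, panic, fatal) is an assumption about what counts as abnormal and is not taken from MatrixOne's log format, and the function name scan_log is hypothetical.

```shell
#!/bin/sh
# Read log text on stdin and print the last N lines (default 20) that mention
# error, panic, or fatal, matched case-insensitively.
# ASSUMPTION: these keywords are a useful first filter for your logs.
scan_log() {
  grep -iE 'error|panic|fatal' | tail -n "${1:-20}"
}
```

For example, `kubectl logs "${POD_NAME}" -n "${NS}" --tail=500 | scan_log` checks the most recent 500 log lines. If the container has restarted, adding `--previous` to kubectl logs shows the log of the previous instance.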

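If a Pod is in a non-Running state, such as Pending, check the Pod's events (Events) to identify the cause. For example, when cluster resources cannot satisfy the request of mo-tp-cn-3, that Pod cannot be scheduled and stays in the Pending state. In that case, you can resolve the problem by adding node resources.

    kubectl describe pod ${POD_NAME} -n${NS}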

Health Check and Resource Monitoring - Figure 2
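
If only the events are needed, they can be queried directly instead of reading them at the end of the describe output. The sketch below filters events by the Pod they refer to and sorts them by time. The KUBECTL variable and the function name pod_events are hypothetical.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. to point at a wrapper

# List events that reference the given Pod, oldest first.
# $1 is the Pod name, $2 is the namespace.
pod_events() {
  pod="$1"; ns="$2"
  "$KUBECTL" get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

For a Pod stuck in Pending, `pod_events mo-tp-cn-3 mo-hn` would be expected to show the scheduler's reason for not placing it, provided the events have not yet expired from the cluster.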