Health check and resource monitoring
A MatrixOne distributed cluster contains multiple components and objects. To keep it running normally and to make troubleshooting easier, we need to perform a series of health checks and monitor its resources.
The health check and resource monitoring procedures introduced in this document are based on the environment described in MatrixOne Distributed Cluster Deployment.
Objects to check
Physical resource layer: including the three virtual machines’ CPU, memory, and disk resources. For a mature solution to monitor these resources, see Monitoring Solution.
Logical resource layer: including the capacity usage of MinIO, the CPU and memory resource usage of each node and Pod of Kubernetes, the overall status of MatrixOne, and the status of each component (such as LogService, CN, TN).
Resource monitoring
MinIO capacity usage monitoring
MinIO provides a management console through which you can visually monitor its capacity usage, including the remaining space. For more information, see the official MinIO documentation.
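As a command-line alternative, the MinIO client (mc) can report the same capacity information. A minimal sketch, assuming mc is installed on the host; the endpoint below is the in-cluster address shown later in this document, while the alias name mo-minio and the credential placeholders are illustrative:

# Register the MinIO endpoint under a local alias (use the credentials stored in the minio secret)
mc alias set mo-minio http://minio.mostorage:9000 <ACCESS_KEY> <SECRET_KEY>
# Print server status, including used and total capacity per drive
mc admin info mo-minio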
Node/Pod resource monitoring
To determine whether the MatrixOne service needs to be scaled up or down, users often need to monitor the resources used by the Node where the MatrixOne cluster resides and the pods corresponding to the components.
You can use the kubectl top command to do this. For more information, see the official Kubernetes documentation for kubectl top.
Node Monitoring
Use the following command to view the details of MatrixOne cluster nodes:
kubectl get node
[root@master0 ~]# kubectl get node
NAME      STATUS   ROLES                  AGE   VERSION
master0   Ready    control-plane,master   22h   v1.23.17
node0     Ready    <none>                 22h   v1.23.17
Based on the results returned above, use the following command to view the resource usage of a specific node. From the deployment scheme described earlier, the MatrixOne cluster is located on the node named node0:
NODE="[node to be monitored]" # Based on the results above, this may be an IP address, hostname, or alias, for example 10.0.0.1, host-10-0-0-1, or node01
kubectl top node ${NODE}
[root@master0 ~]# kubectl top node
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master0   179m         9%     4632Mi          66%
node0     292m         15%    4115Mi          56%
[root@master0 ~]# kubectl top node node0
NAME    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node0   299m         15%    4079Mi          56%
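To observe the trend rather than a single snapshot, you can refresh the command periodically. A minimal sketch, assuming the watch utility is available on the host:

# Refresh node resource usage every 5 seconds; press Ctrl+C to stop
watch -n 5 kubectl top node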
You can also view a node's resource allocation and limits. Note that allocated resources are not equal to used resources.
[root@master0 ~]# kubectl describe node
Name: master0
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=master0
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.206.134.8/24
projectcalico.org/IPv4VXLANTunnelAddr: 10.234.166.0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 07 May 2023 12:28:57 +0800
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: master0
AcquireTime: <unset>
RenewTime: Mon, 08 May 2023 10:56:08 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sun, 07 May 2023 12:30:08 +0800 Sun, 07 May 2023 12:30:08 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 12:28:55 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 12:28:55 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 12:28:55 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 08 May 2023 10:56:07 +0800 Sun, 07 May 2023 20:47:39 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.206.134.8
Hostname: master0
Capacity:
cpu: 2
ephemeral-storage: 51473868Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7782436Ki
pods: 110
Allocatable:
cpu: 1800m
ephemeral-storage: 47438316671
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7155748Ki
pods: 110
System Info:
Machine ID: fb436be013b5415799d27abf653585d3
System UUID: FB436BE0-13B5-4157-99D2-7ABF653585D3
Boot ID: 552bd576-56c8-4d22-9549-d950069a5a77
Kernel Version: 3.10.0-1160.88.1.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.23
Kubelet Version: v1.23.17
Kube-Proxy Version: v1.23.17
PodCIDR: 10.234.0.0/23
PodCIDRs: 10.234.0.0/23
Non-terminated Pods: (12 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default netchecker-agent-7xnwb 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
default netchecker-agent-hostnet-bw85f 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
kruise-system kruise-daemon-xvl8t 0 (0%) 50m (2%) 0 (0%) 128Mi (1%) 20h
kube-system calico-node-sbzfc 150m (8%) 300m (16%) 64M (0%) 500M (6%) 22h
kube-system dns-autoscaler-7874cf6bcf-l55q4 20m (1%) 0 (0%) 10Mi (0%) 0 (0%) 22h
kube-system kube-apiserver-master0 250m (13%) 0 (0%) 0 (0%) 0 (0%) 22h
kube-system kube-controller-manager-master0 200m (11%) 0 (0%) 0 (0%) 0 (0%) 22h
kube-system kube-proxy-lfkhk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22h
kube-system kube-scheduler-master0 100m (5%) 0 (0%) 0 (0%) 0 (0%) 22h
kube-system metrics-server-7bd47f88c4-knh9b 100m (5%) 100m (5%) 200Mi (2%) 200Mi (2%) 22h
kube-system nodelocaldns-dcffl 100m (5%) 0 (0%) 70Mi (1%) 170Mi (2%) 14h
kuboard kuboard-v3-master0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 950m (52%) 510m (28%)
memory 485601280 (6%) 1222190848 (16%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
Name: node0
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=node0
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.206.134.14/24
projectcalico.org/IPv4VXLANTunnelAddr: 10.234.60.0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 07 May 2023 12:29:46 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: node0
AcquireTime: <unset>
RenewTime: Mon, 08 May 2023 10:56:06 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sun, 07 May 2023 12:30:08 +0800 Sun, 07 May 2023 12:30:08 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 12:29:46 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 12:29:46 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 12:29:46 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 08 May 2023 10:56:12 +0800 Sun, 07 May 2023 20:48:36 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.206.134.14
Hostname: node0
Capacity:
cpu: 2
ephemeral-storage: 51473868Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7782444Ki
pods: 110
Allocatable:
cpu: 1900m
ephemeral-storage: 47438316671
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7417900Ki
pods: 110
System Info:
Machine ID: a6600151884b44fb9f0bc9af490e44b7
System UUID: A6600151-884B-44FB-9F0B-C9AF490E44B7
Boot ID: b7f3357f-44e6-425e-8c90-6ada14e92703
Kernel Version: 3.10.0-1160.88.1.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.23
Kubelet Version: v1.23.17
Kube-Proxy Version: v1.23.17
PodCIDR: 10.234.2.0/23
PodCIDRs: 10.234.2.0/23
Non-terminated Pods: (20 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default netchecker-agent-6v8rl 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
default netchecker-agent-hostnet-fb2jn 15m (0%) 30m (1%) 64M (0%) 100M (1%) 22h
default netchecker-server-645d759b79-v4bqm 150m (7%) 300m (15%) 192M (2%) 512M (6%) 22h
kruise-system kruise-controller-manager-74847d59cf-295rk 100m (5%) 200m (10%) 256Mi (3%) 512Mi (7%) 20h
kruise-system kruise-controller-manager-74847d59cf-854sq 100m (5%) 200m (10%) 256Mi (3%) 512Mi (7%) 20h
kruise-system kruise-daemon-rz9pj 0 (0%) 50m (2%) 0 (0%) 128Mi (1%) 20h
kube-system calico-kube-controllers-74df5cd99c-n9qsn 30m (1%) 1 (52%) 64M (0%) 256M (3%) 22h
kube-system calico-node-brqrk 150m (7%) 300m (15%) 64M (0%) 500M (6%) 22h
kube-system coredns-76b4fb4578-9cqc7 100m (5%) 0 (0%) 70Mi (0%) 170Mi (2%) 14h
kube-system kube-proxy-rpxb5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22h
kube-system nginx-proxy-node0 25m (1%) 0 (0%) 32M (0%) 0 (0%) 22h
kube-system nodelocaldns-qkxhv 100m (5%) 0 (0%) 70Mi (0%) 170Mi (2%) 14h
local-path-storage local-path-storage-local-path-provisioner-d5bb7f8c9-qfp8h 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
mo-hn matrixone-operator-f8496ff5c-fp6zm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 20h
mo-hn mo-tn-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
mo-hn mo-log-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
mo-hn mo-log-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
mo-hn mo-log-2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
mo-hn mo-tp-cn-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13h
mostorage minio-674ccf54f7-tdglh 0 (0%) 0 (0%) 512Mi (7%) 0 (0%) 20h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 785m (41%) 2110m (111%)
memory 1700542464 (22%) 3032475392 (39%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
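If you only need specific figures from the output above for scripting, a jsonpath query avoids parsing the full describe text. A minimal sketch:

# Print all allocatable resources of node0 as JSON
kubectl get node node0 -o jsonpath='{.status.allocatable}'
# Print a single field, for example the allocatable memory
kubectl get node node0 -o jsonpath='{.status.allocatable.memory}'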
Pod Monitoring
Use the following command to view the Pods of the MatrixOne cluster:
NS="mo-hn"
kubectl get pod -n ${NS}
Based on the results returned above, use the following command to view the resource usage of a specific Pod:
POD="[pod name to be monitored]" # Based on the results above; for example, the TN Pod is mo-tn-0, the CN Pods are mo-tp-cn-0, mo-tp-cn-1, ..., and the Log Service Pods are mo-log-0, mo-log-1, ...
kubectl top pod ${POD} -n ${NS}
The command will display the CPU and memory usage for the specified Pod, similar to the following output:
[root@master0 ~]# kubectl top pod mo-tp-cn-0 -nmo-hn
NAME         CPU(cores)   MEMORY(bytes)
mo-tp-cn-0   20m          214Mi
[root@master0 ~]# kubectl top pod mo-tn-0 -nmo-hn
NAME      CPU(cores)   MEMORY(bytes)
mo-tn-0   36m          161Mi
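To check every Pod in the namespace in one call instead of one Pod at a time, omit the Pod name. A minimal sketch:

# Show CPU and memory usage of all Pods in the mo-hn namespace
kubectl top pod -n mo-hn
# Sort by memory to surface the heaviest consumer first
kubectl top pod -n mo-hn --sort-by=memory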
You can also view the resource declaration of a specific Pod and compare it with the actual resource usage.
kubectl describe pod ${POD_NAME} -n ${NS}
kubectl get pod ${POD_NAME} -n ${NS} -o yaml
[root@master0 ~]# kubectl describe pod mo-tp-cn-0 -nmo-hn
Name: mo-tp-cn-0
Namespace: mo-hn
Priority: 0
Node: node0/10.206.134.14
Start Time: Sun, 07 May 2023 21:01:50 +0800
Labels: controller-revision-hash=mo-tp-cn-8666cdfb56
lifecycle.apps.kruise.io/state=Normal
matrixorigin.io/cluster=mo
matrixorigin.io/component=CNSet
matrixorigin.io/instance=mo-tp
matrixorigin.io/namespace=mo-hn
statefulset.kubernetes.io/pod-name=mo-tp-cn-0
Annotations: apps.kruise.io/runtime-containers-meta:
{"containers":[{"name":"main","containerID":"docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f","restartCount":0,"...
cni.projectcalico.org/containerID: 80b286789a2d6fa9e615c3edee79b57edb452eaeafddb9b7b82ec5fb2e339409
cni.projectcalico.org/podIP: 10.234.60.53/32
cni.projectcalico.org/podIPs: 10.234.60.53/32
kruise.io/related-pub: mo
lifecycle.apps.kruise.io/timestamp: 2023-05-07T13:01:50Z
matrixone.cloud/cn-label: null
matrixone.cloud/dns-based-identity: False
Status: Running
IP: 10.234.60.53
IPs:
IP: 10.234.60.53
Controlled By: StatefulSet/mo-tp-cn
Containers:
main:
Container ID: docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f
Image: matrixorigin/matrixone:nightly-144f3be4
Image ID: docker-pullable://matrixorigin/matrixone@sha256:288fe3d626c6aa564684099e4686a9d4b28e16fdd16512bd968a67bb41d5aaa3
Port: <none>
Host Port: <none>
Command:
/bin/sh
/etc/matrixone/config/start.sh
Args:
-debug-http=:6060
State: Running
Started: Sun, 07 May 2023 21:01:54 +0800
Ready: True
Restart Count: 0
Environment:
POD_NAME: mo-tp-cn-0 (v1:metadata.name)
NAMESPACE: mo-hn (v1:metadata.namespace)
HEADLESS_SERVICE_NAME: mo-tp-cn-headless
AWS_ACCESS_KEY_ID: <set to the key 'AWS_ACCESS_KEY_ID' in secret 'minio'> Optional: false
AWS_SECRET_ACCESS_KEY: <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 'minio'> Optional: false
AWS_REGION: us-west-2
Mounts:
/etc/matrixone/config from config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ngpcs (ro)
Readiness Gates:
Type Status
InPlaceUpdateReady True
KruisePodReady True
Conditions:
Type Status
KruisePodReady True
InPlaceUpdateReady True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: mo-tp-cn-config-5abf454
Optional: false
kube-api-access-ngpcs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
[root@master0 ~]# kubectl get pod mo-tp-cn-0 -nmo-hn -oyaml
apiVersion: v1
kind: Pod
metadata:
annotations:
apps.kruise.io/runtime-containers-meta: '{"containers":[{"name":"main","containerID":"docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f","restartCount":0,"hashes":{"plainHash":1670287891}}]}'
cni.projectcalico.org/containerID: 80b286789a2d6fa9e615c3edee79b57edb452eaeafddb9b7b82ec5fb2e339409
cni.projectcalico.org/podIP: 10.234.60.53/32
cni.projectcalico.org/podIPs: 10.234.60.53/32
kruise.io/related-pub: mo
lifecycle.apps.kruise.io/timestamp: "2023-05-07T13:01:50Z"
matrixone.cloud/cn-label: "null"
matrixone.cloud/dns-based-identity: "False"
creationTimestamp: "2023-05-07T13:01:50Z"
generateName: mo-tp-cn-
labels:
controller-revision-hash: mo-tp-cn-8666cdfb56
lifecycle.apps.kruise.io/state: Normal
matrixorigin.io/cluster: mo
matrixorigin.io/component: CNSet
matrixorigin.io/instance: mo-tp
matrixorigin.io/namespace: mo-hn
statefulset.kubernetes.io/pod-name: mo-tp-cn-0
name: mo-tp-cn-0
namespace: mo-hn
ownerReferences:
- apiVersion: apps.kruise.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: mo-tp-cn
uid: 891e0453-89a5-45d5-ad12-16ef048c804f
resourceVersion: "72625"
uid: 1e3e2df3-f1c2-4444-8694-8d23e7125d35
spec:
containers:
- args:
- -debug-http=:6060
command:
- /bin/sh
- /etc/matrixone/config/start.sh
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: HEADLESS_SERVICE_NAME
value: mo-tp-cn-headless
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
key: AWS_ACCESS_KEY_ID
name: minio
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
key: AWS_SECRET_ACCESS_KEY
name: minio
- name: AWS_REGION
value: us-west-2
image: matrixorigin/matrixone:nightly-144f3be4
imagePullPolicy: Always
name: main
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/matrixone/config
name: config
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-ngpcs
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: mo-tp-cn-0
nodeName: node0
preemptionPolicy: PreemptLowerPriority
priority: 0
readinessGates:
- conditionType: InPlaceUpdateReady
- conditionType: KruisePodReady
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
subdomain: mo-tp-cn-headless
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- configMap:
defaultMode: 420
name: mo-tp-cn-config-5abf454
name: config
- name: kube-api-access-ngpcs
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-05-07T13:01:50Z"
status: "True"
type: KruisePodReady
- lastProbeTime: null
lastTransitionTime: "2023-05-07T13:01:50Z"
status: "True"
type: InPlaceUpdateReady
- lastProbeTime: null
lastTransitionTime: "2023-05-07T13:01:50Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-05-07T13:01:54Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-05-07T13:01:54Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-05-07T13:01:50Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f
image: matrixorigin/matrixone:nightly-144f3be4
imageID: docker-pullable://matrixorigin/matrixone@sha256:288fe3d626c6aa564684099e4686a9d4b28e16fdd16512bd968a67bb41d5aaa3
lastState: {}
name: main
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-05-07T13:01:54Z"
hostIP: 10.206.134.14
phase: Running
podIP: 10.234.60.53
podIPs:
- ip: 10.234.60.53
qosClass: BestEffort
startTime: "2023-05-07T13:01:50Z"
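If you only want the resource requests and limits from the YAML above, a jsonpath query keeps the output short. A minimal sketch:

# Print the resources stanza of each container in the Pod
kubectl get pod mo-tp-cn-0 -n mo-hn -o jsonpath='{.spec.containers[*].resources}'

For this Pod the stanza is empty ({}), which is consistent with the BestEffort QoS class shown above.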
MatrixOne Monitoring
View cluster status
During Operator deployment, we defined a MatrixOneCluster custom resource to represent the entire cluster. By checking the MatrixOneCluster object, we can tell whether the cluster is functioning correctly. You can check it with the following command:
MO_NAME="mo"
NS="mo-hn"
kubectl get matrixonecluster -n ${NS} ${MO_NAME}
If the status is “Ready”, the cluster is healthy. If the status is “NotReady”, further troubleshooting is required.
[root@master0 ~]# MO_NAME="mo"
[root@master0 ~]# NS="mo-hn"
[root@master0 ~]# kubectl get matrixonecluster -n${NS} ${MO_NAME}
NAME   LOG   TN   TP   AP   VERSION            PHASE   AGE
mo     3     1    1         nightly-144f3be4   Ready   13h
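For automated health checks, you can read the phase field directly instead of parsing the table output. A minimal sketch:

MO_NAME="mo"
NS="mo-hn"
PHASE=$(kubectl get matrixonecluster ${MO_NAME} -n ${NS} -o jsonpath='{.status.phase}')
if [ "${PHASE}" = "Ready" ]; then
    echo "MatrixOne cluster is healthy"
else
    echo "MatrixOne cluster phase is ${PHASE}; further troubleshooting is required"
fi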
To view details of the MatrixOne cluster status, you can run the following command:
kubectl describe matrixonecluster -n ${NS} ${MO_NAME}
[root@master0 ~]# kubectl describe matrixonecluster -n${NS} ${MO_NAME}
Name: mo
Namespace: mo-hn
Labels: <none>
Annotations: <none>
API Version: core.matrixorigin.io/v1alpha1
Kind: MatrixOneCluster
Metadata:
Creation Timestamp: 2023-05-07T12:54:17Z
Finalizers:
matrixorigin.io/matrixonecluster
Generation: 2
Managed Fields:
API Version: core.matrixorigin.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:Tn:
.:
f:config:
f:replicas:
f:imagePullPolicy:
f:imageRepository:
f:logService:
.:
f:config:
f:pvcRetentionPolicy:
f:replicas:
f:sharedStorage:
.:
f:s3:
.:
f:endpoint:
f:secretRef:
f:type:
f:volume:
.:
f:size:
f:tp:
.:
f:config:
f:nodePort:
f:replicas:
f:serviceType:
f:version:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2023-05-07T12:54:17Z
API Version: core.matrixorigin.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"matrixorigin.io/matrixonecluster":
Manager: manager
Operation: Update
Time: 2023-05-07T12:54:17Z
API Version: core.matrixorigin.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:spec:
f:logService:
f:sharedStorage:
f:s3:
f:path:
Manager: kubectl-edit
Operation: Update
Time: 2023-05-07T13:00:53Z
API Version: core.matrixorigin.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:cnGroups:
.:
f:desiredGroups:
f:readyGroups:
f:syncedGroups:
f:conditions:
f:credentialRef:
f:Tn:
.:
f:availableStores:
f:conditions:
f:logService:
.:
f:availableStores:
f:conditions:
f:discovery:
.:
f:address:
f:port:
f:phase:
Manager: manager
Operation: Update
Subresource: status
Time: 2023-05-07T13:02:12Z
Resource Version: 72671
UID: be2355c0-0c69-4f0f-95bb-9310224200b6
Spec:
Tn:
Config:
[dn]
[dn.Ckp]
flush-interval = "60s"
global-interval = "100000s"
incremental-interval = "60s"
min-count = 100
scan-interval = "5s"
[dn.Txn]
[dn.Txn.Storage]
backend = "TAE"
log-backend = "logservice"
[log]
format = "json"
level = "error"
max-size = 512
Replicas: 1
Resources:
Service Args:
-debug-http=:6060
Shared Storage Cache:
Memory Cache Size: 0
Image Pull Policy: Always
Image Repository: matrixorigin/matrixone
Log Service:
Config:
[log]
format = "json"
level = "error"
max-size = 512
Initial Config:
TN Shards: 1
Log Shard Replicas: 3
Log Shards: 1
Pvc Retention Policy: Retain
Replicas: 3
Resources:
Service Args:
-debug-http=:6060
Shared Storage:
s3:
Endpoint: http://minio.mostorage:9000
Path: minio-mo
s3RetentionPolicy: Retain
Secret Ref:
Name: minio
Type: minio
Store Failure Timeout: 10m0s
Volume:
Size: 1Gi
Tp:
Config:
[cn]
[cn.Engine]
type = "distributed-tae"
[log]
format = "json"
level = "debug"
max-size = 512
Node Port: 31474
Replicas: 1
Resources:
Service Args:
-debug-http=:6060
Service Type: NodePort
Shared Storage Cache:
Memory Cache Size: 0
Version: nightly-144f3be4
Status:
Cn Groups:
Desired Groups: 1
Ready Groups: 1
Synced Groups: 1
Conditions:
Last Transition Time: 2023-05-07T13:02:14Z
Message: the object is synced
Reason: empty
Status: True
Type: Synced
Last Transition Time: 2023-05-07T13:02:14Z
Message:
Reason: AllSetsReady
Status: True
Type: Ready
Credential Ref:
Name: mo-credential
Tn:
Available Stores:
Last Transition: 2023-05-07T13:01:48Z
Phase: Up
Pod Name: mo-tn-0
Conditions:
Last Transition Time: 2023-05-07T13:01:48Z
Message: the object is synced
Reason: empty
Status: True
Type: Synced
Last Transition Time: 2023-05-07T13:01:48Z
Message:
Reason: empty
Status: True
Type: Ready
Log Service:
Available Stores:
Last Transition: 2023-05-07T13:01:25Z
Phase: Up
Pod Name: mo-log-0
Last Transition: 2023-05-07T13:01:25Z
Phase: Up
Pod Name: mo-log-1
Last Transition: 2023-05-07T13:01:25Z
Phase: Up
Pod Name: mo-log-2
Conditions:
Last Transition Time: 2023-05-07T13:01:25Z
Message: the object is synced
Reason: empty
Status: True
Type: Synced
Last Transition Time: 2023-05-07T13:01:25Z
Message:
Reason: empty
Status: True
Type: Ready
Discovery:
Address: mo-log-discovery.mo-hn.svc
Port: 32001
Phase: Ready
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ReconcileSuccess 29m (x2 over 13h) matrixonecluster object is synced
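Because the cluster object reports a Ready condition (see Conditions above), you can also block until the cluster becomes ready, which is convenient in deployment scripts. A minimal sketch, assuming a 5-minute timeout suits your environment:

# Wait up to 300 seconds for the MatrixOneCluster object to report Ready
kubectl wait --for=condition=Ready matrixonecluster/mo -n mo-hn --timeout=300s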
View component status
The current MatrixOne cluster includes the following components: TN, CN, and Log Service, which correspond to the custom resource types TNSet, CNSet, and LogSet, respectively. These objects are generated by the MatrixOneCluster controller.
To check whether a component is healthy, taking TN as an example, you can run the following command:
SET_TYPE="tnset"
NS="mo-hn"
kubectl get ${SET_TYPE} -n ${NS}
This will display status information for the TN component as follows:
[root@master0 ~]# SET_TYPE="tnset"
[root@master0 ~]# NS="mo-hn"
[root@master0 ~]# kubectl get ${SET_TYPE} -n${NS}
NAME   IMAGE                                      REPLICAS   AGE
mo     matrixorigin/matrixone:nightly-144f3be4   1          13h
[root@master0 ~]# SET_TYPE="cnset"
[root@master0 ~]# kubectl get ${SET_TYPE} -n${NS}
NAME    IMAGE                                      REPLICAS   AGE
mo-tp   matrixorigin/matrixone:nightly-144f3be4   1          13h
[root@master0 ~]# SET_TYPE="logset"
[root@master0 ~]# kubectl get ${SET_TYPE} -n${NS}
NAME   IMAGE                                      REPLICAS   AGE
mo     matrixorigin/matrixone:nightly-144f3be4   3          13h
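To check all three component types in one pass, you can loop over the resource types. A minimal sketch:

NS="mo-hn"
for SET_TYPE in tnset cnset logset; do
    kubectl get ${SET_TYPE} -n ${NS}
done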
View Pod Status
In addition, you can directly check the native Kubernetes objects generated in the MatrixOne cluster to confirm the cluster's health, usually by querying the Pods.
NS="mo-hn"
kubectl get pod -n ${NS}
This will display status information for the Pods.
[root@master0 ~]# NS="mo-hn"
[root@master0 ~]# kubectl get pod -n${NS}
NAME                                 READY   STATUS    RESTARTS   AGE
matrixone-operator-f8496ff5c-fp6zm   1/1     Running   0          19h
mo-tn-0                              1/1     Running   0          13h
mo-log-0                             1/1     Running   0          13h
mo-log-1                             1/1     Running   0          13h
mo-log-2                             1/1     Running   0          13h
mo-tp-cn-0                           1/1     Running   0          13h
Typically, the Running state indicates that the Pod is functioning normally. However, there are special cases where the Pod status is Running but the MatrixOne cluster is still abnormal: for example, you may be unable to connect to the MatrixOne cluster through a MySQL client. In this case, you can look further into the Pod's logs to check for unusual output.
NS="mo-hn"
POD_NAME="[the name of the pod returned above]" # For example, mo-tp-cn-0
kubectl logs ${POD_NAME} -n ${NS}
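Logs from a long-running Pod can be verbose, so it often helps to tail only the recent lines or filter for errors. A minimal sketch (the exact keywords to match depend on the log format, which is JSON in this deployment):

# Show only the last 100 lines of the Pod log
kubectl logs ${POD_NAME} -n ${NS} --tail=100
# Search the log for error entries
kubectl logs ${POD_NAME} -n ${NS} | grep -i error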
If the Pod status is not Running, for example Pending, you can check the Pod events (Events) to confirm the cause of the exception. For example, if the cluster resources cannot satisfy the resource request of mo-tp-cn-3, the Pod cannot be scheduled and remains Pending; in this case, you can resolve it by expanding node resources.
kubectl describe pod ${POD_NAME} -n ${NS}
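You can also list only the events that reference the Pod, which is often quicker than reading the full describe output. A minimal sketch:

# List the events recorded for this Pod
kubectl get events -n ${NS} --field-selector involvedObject.name=${POD_NAME}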