Monitoring Capabilities

This document describes the monitoring capabilities of KubeCube.

Overview of Capabilities

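| Monitored content | Collection source | Description |
| --- | --- | --- |
| k8s core component monitoring | The metrics endpoints exposed by each k8s service component | Monitors the operational status of the k8s api-server, controller-manager, kube-proxy, scheduler, etcd, coredns, and kubelet components. Metric names begin with the component name followed by an underscore. |
| k8s node monitoring | node-exporter | node-exporter monitors CPU, memory, network, disk, and other information for the nodes of the k8s cluster. |
| k8s container metrics | cAdvisor | The cAdvisor built into the k8s kubelet monitors the containers running on each node. |
| k8s resource monitoring | kube-state-metrics | kube-state-metrics focuses on metrics for k8s resource objects, including node, deployment, pod, and others. |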
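The naming convention for core component metrics (component name followed by an underscore) can be used to group metric names by component. The following is a minimal sketch under stated assumptions: the prefix table and the `component_of` helper are illustrative, and the actual prefixes depend on the component versions deployed in a given cluster.

```python
# Hypothetical prefix table for illustration; not an exhaustive or
# authoritative list of the prefixes KubeCube collects.
COMPONENT_PREFIXES = {
    "apiserver_": "api-server",
    "scheduler_": "scheduler",
    "etcd_": "etcd",
    "coredns_": "coredns",
    "kubelet_": "kubelet",
    "kubeproxy_": "kube-proxy",
}


def component_of(metric_name: str) -> str:
    """Return the component whose prefix matches the metric name, or 'unknown'."""
    for prefix, component in COMPONENT_PREFIXES.items():
        if metric_name.startswith(prefix):
            return component
    return "unknown"


if __name__ == "__main__":
    for name in ("apiserver_request_total", "etcd_server_has_leader", "node_load1"):
        print(name, "->", component_of(name))
```

Metrics that match no prefix, such as those from node-exporter, fall through to `unknown` in this sketch.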

Common Metrics

Node-exporter

Node-exporter focuses on node-level metrics. For more metrics, see the Node-exporter documentation.

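| Metric name | Description |
| --- | --- |
| instance:node_num_cpu:sum | Number of CPU cores on the node |
| instance:node_load1_per_cpu:ratio | CPU load ratio of the node |
| instance:node_memory_utilisation:ratio | Memory utilization of the node |
| node_cpu | Node CPU metrics |
| node_memory | Node memory metrics |
| node_disk | Node disk metrics |
| node_network | Node network metrics |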
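To make the two ratio metrics concrete, the sketch below shows the arithmetic they typically stand for. It assumes memory utilization is derived from available and total memory (as in `node_memory_MemAvailable_bytes` and `node_memory_MemTotal_bytes`) and that the load ratio is the 1-minute load divided by the CPU count. The recording rules actually deployed may be defined differently, and the function names are illustrative.

```python
def memory_utilisation(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Fraction of node memory in use, assuming utilisation = 1 - available / total."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 1.0 - mem_available_bytes / mem_total_bytes


def load1_per_cpu(load1: float, num_cpu: int) -> float:
    """1-minute load average divided by the number of CPU cores."""
    if num_cpu <= 0:
        raise ValueError("num_cpu must be positive")
    return load1 / num_cpu


if __name__ == "__main__":
    # A node with 16 GB total and 4 GB available memory is 75% utilised.
    print(memory_utilisation(4e9, 16e9))
    # A load of 2.0 on a 4-core node is a per-CPU load ratio of 0.5.
    print(load1_per_cpu(2.0, 4))
```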

cAdvisor

cAdvisor focuses on container metrics. For more metrics, see the cAdvisor documentation.

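| Metric name | Type | Description | Unit (where applicable) | Option parameter |
| --- | --- | --- | --- | --- |
| container_cpu_cfs_periods_total | Counter | Total number of CPU periods elapsed over the container's lifetime | | cpu |
| container_cpu_cfs_throttled_periods_total | Counter | Total number of throttled CPU periods over the container's lifetime | | cpu |
| container_cpu_cfs_throttled_seconds_total | Counter | Total time the container's CPU has been throttled | seconds | cpu |
| container_cpu_load_average_10s | Gauge | Average CPU load over the last 10 seconds | | cpuLoad |
| container_cpu_system_seconds_total | Counter | Cumulative CPU time consumed in kernel mode | seconds | cpu |
| container_cpu_usage_seconds_total | Counter | Cumulative CPU time consumed | seconds | cpu |
| container_cpu_user_seconds_total | Counter | Cumulative CPU time consumed in user mode | seconds | cpu |
| container_file_descriptors | Gauge | Number of open file descriptors | | process |
| container_fs_io_current | Gauge | Number of I/Os currently in progress | | diskIO |
| container_fs_limit_bytes | Gauge | Number of bytes the container's filesystem can use | bytes | disk |
| container_fs_usage_bytes | Gauge | Number of bytes the container's filesystem has used | bytes | disk |
| container_memory_cache | Gauge | Number of bytes of memory cache | bytes | memory |
| container_memory_failcnt | Counter | Number of times memory usage hit the limit | | memory |
| container_memory_mapped_file | Gauge | Size of memory-mapped files | bytes | memory |
| container_memory_max_usage_bytes | Gauge | Maximum memory usage recorded | bytes | memory |
| container_memory_usage_bytes | Gauge | Current memory usage, including all memory | bytes | memory |
| container_network_receive_bytes_total | Counter | Cumulative number of bytes received over the container's network | bytes | network |
| container_network_receive_errors_total | Counter | Cumulative number of errors encountered while receiving | | network |
| container_network_receive_packets_dropped_total | Counter | Number of packets dropped while receiving | | network |
| container_network_receive_packets_total | Counter | Cumulative number of packets received | | network |
| container_network_tcp_usage_total | Gauge | TCP connection usage statistics for the container | | tcp |
| container_network_transmit_bytes_total | Counter | Cumulative number of bytes transmitted over the container's network | bytes | network |
| container_network_transmit_errors_total | Counter | Cumulative number of errors encountered while transmitting | | network |
| container_network_transmit_packets_dropped_total | Counter | Number of packets dropped while transmitting | | network |
| container_network_transmit_packets_total | Counter | Cumulative number of packets transmitted | | network |
| container_processes | Gauge | Number of processes running in the container | | process |
| container_sockets | Gauge | Number of open sockets in the container | | process |
| container_spec_cpu_period | Gauge | CPU period of the container | | - |
| container_spec_cpu_quota | Gauge | CPU quota of the container | | - |
| container_spec_memory_limit_bytes | Gauge | Memory limit of the container | bytes | - |
| container_tasks_state | Gauge | Number of tasks in the given state (sleeping, running, stopped, uninterruptible, or ioawaiting) | | cpuLoad |
| container_threads | Gauge | Number of threads running in the container | | process |
| container_threads_max | Gauge | Maximum number of threads allowed in the container | | process |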
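Counter metrics such as `container_cpu_usage_seconds_total` only ever grow, so they are usually read as a rate of change rather than as a raw value. The sketch below illustrates the idea with a simplified per-second rate over a list of samples, plus a throttling ratio computed from the two CFS period counters. It is not the algorithm of any particular monitoring system (for example, it does no extrapolation), and the function names are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (Unix timestamp in seconds, counter value)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the samples.

    A drop in value is treated as a counter reset (for example, a container
    restart), in which case the new value counts as the increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


def throttle_ratio(throttled_periods_delta: float, periods_delta: float) -> float:
    """Share of CFS periods in which the container was throttled."""
    if periods_delta <= 0:
        return 0.0
    return throttled_periods_delta / periods_delta


if __name__ == "__main__":
    cpu_seconds = [(0.0, 10.0), (30.0, 25.0), (60.0, 40.0)]
    print(counter_rate(cpu_seconds))   # CPU cores used on average
    print(throttle_ratio(30.0, 600.0))
```

Applied to `container_cpu_usage_seconds_total`, the rate is the average number of CPU cores the container used over the window. Gauge metrics such as `container_memory_usage_bytes` need no such conversion and can be read directly.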

Kube-state-metrics

Kube-state-metrics focuses on metrics for the various k8s resource objects. For detailed metric descriptions, see the kube-state-metrics metrics documentation.

Pod Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_pod_status_phase | Gauge | `pod=<pod-name>` `namespace=<pod-namespace>` `phase=<Pending/Running/Succeeded/Failed/Unknown>` |
| kube_pod_container_info | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `image=<image-name>` `image_id=<image-id>` `container_id=<containerid>` |
| kube_pod_container_status_waiting_reason | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `reason=<ContainerCreating/CrashLoopBackOff/ErrImagePull/ImagePullBackOff/CreateContainerConfigError/InvalidImageName/CreateContainerError>` |
| kube_pod_container_status_running | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` |
| kube_pod_container_status_terminated_reason | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `reason=<OOMKilled/Error/Completed/ContainerCannotRun/DeadlineExceeded>` |
| kube_pod_container_status_ready | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` |
| kube_pod_container_status_restarts_total | Counter | `container=<container-name>` `namespace=<pod-namespace>` `pod=<pod-name>` |
| kube_pod_container_resource_requests_cpu_cores | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `node=<node-name>` |
| kube_pod_container_resource_requests | Gauge | `resource=<resource-name>` `unit=<resource-unit>` `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `node=<node-name>` |
| kube_pod_container_resource_limits | Gauge | `resource=<resource-name>` `unit=<resource-unit>` `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `node=<node-name>` |
| kube_pod_init_container_info | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `image=<image-name>` `image_id=<image-id>` `container_id=<containerid>` |
| kube_pod_init_container_status_running | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` |
| kube_pod_init_container_status_terminated_reason | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `reason=<OOMKilled/Error/Completed/ContainerCannotRun/DeadlineExceeded>` |
| kube_pod_init_container_status_ready | Gauge | `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` |
| kube_pod_init_container_resource_limits | Gauge | `resource=<resource-name>` `unit=<resource-unit>` `container=<container-name>` `pod=<pod-name>` `namespace=<pod-namespace>` `node=<node-name>` |
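The labels listed for the pod metrics are what make each sample identifiable. As an illustration, the sketch below parses samples in the Prometheus text exposition format and lists pods whose active `kube_pod_status_phase` is not `Running`. It assumes each line carries labels and a value but no trailing timestamp, and that a value of 1 marks the pod's current phase. The helper names and sample data are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches lines of the form: metric_name{label="value",...} 1
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one exposition-format line into (metric name, labels, value)."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised sample line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def pods_not_running(lines: List[str]) -> List[Tuple[str, str, str]]:
    """Return (namespace, pod, phase) for pods whose active phase is not Running."""
    result = []
    for line in lines:
        name, labels, value = parse_sample(line)
        phase = labels.get("phase", "")
        if name == "kube_pod_status_phase" and value == 1 and phase != "Running":
            result.append((labels.get("namespace", ""), labels.get("pod", ""), phase))
    return result


if __name__ == "__main__":
    samples = [
        'kube_pod_status_phase{namespace="default",pod="web-0",phase="Running"} 1',
        'kube_pod_status_phase{namespace="default",pod="web-0",phase="Pending"} 0',
        'kube_pod_status_phase{namespace="default",pod="job-1",phase="Pending"} 1',
    ]
    print(pods_not_running(samples))
```

Note that this sketch also reports pods in the `Succeeded` phase, so a real check would likely filter on the specific phases of interest.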

Deployment Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_deployment_status_replicas | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` |
| kube_deployment_status_replicas_available | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` |
| kube_deployment_status_replicas_unavailable | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` |
| kube_deployment_status_condition | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` `condition=<deployment-condition>` `status=<true/false/unknown>` |
| kube_deployment_spec_replicas | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` |
| kube_deployment_spec_paused | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` |
| kube_deployment_spec_strategy_rollingupdate_max_unavailable | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` |
| kube_deployment_spec_strategy_rollingupdate_max_surge | Gauge | `deployment=<deployment-name>` `namespace=<deployment-namespace>` |
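One common use of these metrics is to compare the desired replica count with the number currently available. The sketch below is one possible way to express that check, not a rule shipped with KubeCube. The `DeploymentSample` type and the decision to ignore paused deployments are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeploymentSample:
    """Values read for one deployment (hypothetical container type)."""
    spec_replicas: int        # from kube_deployment_spec_replicas
    replicas_available: int   # from kube_deployment_status_replicas_available
    paused: bool              # from kube_deployment_spec_paused (1 means paused)


def is_degraded(sample: DeploymentSample) -> bool:
    """True when fewer replicas are available than desired and rollout is not paused."""
    if sample.paused:
        return False
    return sample.replicas_available < sample.spec_replicas


if __name__ == "__main__":
    print(is_degraded(DeploymentSample(spec_replicas=3, replicas_available=2, paused=False)))
    print(is_degraded(DeploymentSample(spec_replicas=3, replicas_available=3, paused=False)))
```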

DaemonSet Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_daemonset_status_current_number_scheduled | Gauge | `daemonset=<daemonset-name>` `namespace=<daemonset-namespace>` |
| kube_daemonset_status_desired_number_scheduled | Gauge | `daemonset=<daemonset-name>` `namespace=<daemonset-namespace>` |
| kube_daemonset_status_number_available | Gauge | `daemonset=<daemonset-name>` `namespace=<daemonset-namespace>` |
| kube_daemonset_status_number_misscheduled | Gauge | `daemonset=<daemonset-name>` `namespace=<daemonset-namespace>` |
| kube_daemonset_status_number_ready | Gauge | `daemonset=<daemonset-name>` `namespace=<daemonset-namespace>` |
| kube_daemonset_status_number_unavailable | Gauge | `daemonset=<daemonset-name>` `namespace=<daemonset-namespace>` |
| kube_daemonset_updated_number_scheduled | Gauge | `daemonset=<daemonset-name>` `namespace=<daemonset-namespace>` |

Stateful Set Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_statefulset_status_replicas | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` |
| kube_statefulset_status_replicas_current | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` |
| kube_statefulset_status_replicas_ready | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` |
| kube_statefulset_status_replicas_updated | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` |
| kube_statefulset_replicas | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` |
| kube_statefulset_created | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` |
| kube_statefulset_status_current_revision | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` `revision=<statefulset-current-revision>` |
| kube_statefulset_status_update_revision | Gauge | `statefulset=<statefulset-name>` `namespace=<statefulset-namespace>` `revision=<statefulset-update-revision>` |

Job Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_job_spec_parallelism | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_spec_completions | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_spec_active_deadline_seconds | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_status_active | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_status_succeeded | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_status_failed | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_status_start_time | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_status_completion_time | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_complete | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_failed | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
| kube_job_created | Gauge | `job_name=<job-name>` `namespace=<job-namespace>` |
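The start and completion time metrics can be combined to derive how long a job ran. The sketch below assumes both values are Unix timestamps in seconds and that a job that has not finished has no completion sample. The function name is illustrative.

```python
from typing import Optional


def job_duration_seconds(
    start_time: Optional[float], completion_time: Optional[float]
) -> Optional[float]:
    """Run time of a job, or None if it has not started or not yet completed.

    start_time:      value of kube_job_status_start_time
    completion_time: value of kube_job_status_completion_time
    """
    if start_time is None or completion_time is None:
        return None
    return completion_time - start_time


if __name__ == "__main__":
    print(job_duration_seconds(1635724800.0, 1635725100.0))  # a 5-minute job
    print(job_duration_seconds(1635724800.0, None))          # still running
```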

CronJob Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_cronjob_next_schedule_time | Gauge | `cronjob=<cronjob-name>` `namespace=<cronjob-namespace>` |
| kube_cronjob_status_active | Gauge | `cronjob=<cronjob-name>` `namespace=<cronjob-namespace>` |
| kube_cronjob_status_last_schedule_time | Gauge | `cronjob=<cronjob-name>` `namespace=<cronjob-namespace>` |
| kube_cronjob_spec_suspend | Gauge | `cronjob=<cronjob-name>` `namespace=<cronjob-namespace>` |
| kube_cronjob_spec_starting_deadline_seconds | Gauge | `cronjob=<cronjob-name>` `namespace=<cronjob-namespace>` |

PersistentVolume Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_persistentvolume_capacity_bytes | Gauge | `persistentvolume=<pv-name>` |
| kube_persistentvolume_status_phase | Gauge | `persistentvolume=<pv-name>` `phase=<Bound/Failed/Pending/Available/Released>` |
| kube_persistentvolume_info | Gauge | `persistentvolume=<pv-name>` `storageclass=<storageclass-name>` |

PersistentVolumeClaim Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_persistentvolumeclaim_access_mode | Gauge | `access_mode=<persistentvolumeclaim-access-mode>` `namespace=<persistentvolumeclaim-namespace>` `persistentvolumeclaim=<persistentvolumeclaim-name>` |
| kube_persistentvolumeclaim_info | Gauge | `namespace=<persistentvolumeclaim-namespace>` `persistentvolumeclaim=<persistentvolumeclaim-name>` `storageclass=<persistentvolumeclaim-storageclassname>` `volumename=<volumename>` |
| kube_persistentvolumeclaim_resource_requests_storage_bytes | Gauge | `namespace=<persistentvolumeclaim-namespace>` `persistentvolumeclaim=<persistentvolumeclaim-name>` |
| kube_persistentvolumeclaim_status_phase | Gauge | `namespace=<persistentvolumeclaim-namespace>` `persistentvolumeclaim=<persistentvolumeclaim-name>` `phase=<Pending/Bound/Lost>` |

Node Metrics

| Metric name | Metric type | Labels/tags |
| --- | --- | --- |
| kube_node_info | Gauge | `node=<node-address>` `kernel_version=<kernel-version>` `os_image=<os-image-name>` `container_runtime_version=<container-runtime-and-version-combination>` `kubelet_version=<kubelet-version>` `kubeproxy_version=<kubeproxy-version>` `pod_cidr=<pod-cidr>` `provider_id=<provider-id>` |
| kube_node_spec_taint | Gauge | `node=<node-address>` `key=<taint-key>` `value=<taint-value>` `effect=<taint-effect>` |
| kube_node_status_capacity | Gauge | `node=<node-address>` `resource=<resource-name>` `unit=<resource-unit>` |
| kube_node_status_allocatable | Gauge | `node=<node-address>` `resource=<resource-name>` `unit=<resource-unit>` |
| kube_node_status_condition | Gauge | `node=<node-address>` `condition=<node-condition>` `status=<true/false/unknown>` |
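Node metrics are often combined with pod metrics that share the `node` and `resource` labels. As a rough sketch, the function below computes how much of a node's allocatable resource has been requested, assuming the allocatable value comes from `kube_node_status_allocatable` and the per-container requests from `kube_pod_container_resource_requests` for the same node, resource, and unit. The function name is illustrative.

```python
from typing import Iterable


def request_ratio(allocatable: float, container_requests: Iterable[float]) -> float:
    """Fraction of a node's allocatable resource already requested by containers."""
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    return sum(container_requests) / allocatable


if __name__ == "__main__":
    # A node with 4 allocatable CPU cores and containers requesting 2 cores in total.
    print(request_ratio(4.0, [0.5, 1.0, 0.5]))
```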

Last modified November 1, 2021: [feature] add monitoring metrics (867a6bcc)