Node metrics dashboard

Node metrics dashboard

The node metrics dashboard is a visual analytics dashboard that helps you identify potential pod scaling issues.

About the node metrics dashboard

The node metrics dashboard enables administrative and support team members to monitor metrics related to pod scaling, including scaling limits used to diagnose and troubleshoot scaling issues. Particularly, you can use the visual analytics displayed through the dashboard to monitor workload distributions across nodes. Insights gained from these analytics help you determine the health of your CRI-O and Kubelet system components as well as identify potential sources of excessive or imbalanced resource consumption and system instability.

The dashboard displays visual analytics widgets organized into the following categories:

Critical

Includes visualizations that can help you identify node issues that could result in system instability and inefficiency

Outliers

Includes histograms that visualize processes with runtime durations that fall outside of the 95th percentile

Average durations

Helps you track change in the time that system components take to process operations

Number of operations

Displays visualizations that help you identify changes in the number of operations being run, which in turn helps you determine the load balance and efficiency of your system

Accessing the node metrics dashboard

You can access the node metrics dashboard from the Administrator perspective.

Procedure

Expand the Observe menu option and select Dashboards.
Under the Dashboard filter, select Node cluster.

If no data appears in the visualizations under the Critical category, no critical anomalies were detected. The dashboard is working as intended.

Identify metrics for indicating optimal node resource usage

The node metrics dashboard is organized into four categories: Critical, Outliers, Average durations, and Number of Operations. The metrics in the Critical category help you indicate optimal node resource usage. These metrics include:

Top 3 containers with the most OOM kills in the last day
Failure rate for image pulls in the last hour
Nodes with system reserved memory utilization > 80%
Nodes with Kubelet system reserved memory utilization > 50%
Nodes with CRI-O system reserved memory utilization > 50%
Nodes with system reserved CPU utilization > 80%
Nodes with Kubelet system reserved CPU utilization > 50%
Nodes with CRI-O system reserved CPU utilization > 50%

Top 3 containers with the most OOM kills in the last day

The Top 3 containers with the most OOM kills in the last day query fetches details regarding the top three containers that have experienced the most Out-Of-Memory (OOM) kills in the previous day.

Example default query

topk(3, sum(increase(container_runtime_crio_containers_oom_count_total[1d])) by (name))

OOM kills force the system to terminate some processes due to low memory. Frequent OOM kills can hinder the functionality of the node and even the entire Kubernetes ecosystem. Containers experiencing frequent OOM kills might be consuming more memory than they should, which causes system instability.

Use this metric to identify containers that are experiencing frequent OOM kills and investigate why these containers are consuming an excessive amount of memory. Adjust the resource allocation if necessary and consider resizing the containers based on their memory usage. You can also review the metrics under the Outliers, Average durations, and Number of operations categories to gain further insights into the health and stability of your nodes.

Failure rate for image pulls in the last hour

The Failure rate for image pulls in the last hour query divides the total number of failed image pulls by the sum of successful and failed image pulls to provide a ratio of failures.

Example default query

rate(container_runtime_crio_image_pulls_failure_total[1h]) / (rate(container_runtime_crio_image_pulls_success_total[1h]) + rate(container_runtime_crio_image_pulls_failure_total[1h]))

Understanding the failure rate of image pulls is crucial for maintaining the health of the node. A high failure rate might indicate networking issues, storage problems, misconfigurations, or other issues that could disrupt pod density and the deployment of new containers.

If the outcome of this query is high, investigate possible causes such as network connections, the availability of remote repositories, node storage, and the accuracy of image references. You can also review the metrics under the Outliers, Average durations, and Number of operations categories to gain further insights.

Nodes with system reserved memory utilization > 80%

The Nodes with system reserved memory utilization > 80% query calculates the percentage of system reserved memory that is utilized for each node. The calculation divides the total resident set size (RSS) by the total memory capacity of the node subtracted from the allocatable memory. RSS is the portion of the system’s memory occupied by a process that is held in main memory (RAM). Nodes are flagged if their resulting value equals or exceeds an 80% threshold.

Example default query

sum by (node) (container_memory_rss{id="/system.slice"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 80

System reserved memory is crucial for a Kubernetes node as it is utilized to run system daemons and Kubernetes system daemons. System reserved memory utilization that exceeds 80% indicates that the system and Kubernetes daemons are consuming too much memory and can suggest node instability that could affect the performance of running pods. Excessive memory consumption can cause Out-of-Memory (OOM) killers that can terminate critical system processes to free up memory.

If a node is flagged by this metric, identify which system or Kubernetes processes are consuming excessive memory and take appropriate actions to mitigate the situation. These actions may include scaling back non-critical processes, optimizing program configurations to reduce memory usage, or upgrading node systems to hardware with greater memory capacity. You can also review the metrics under the Outliers, Average durations, and Number of operations categories to gain further insights into node performance.

Nodes with Kubelet system reserved memory utilization > 50%

The Nodes with Kubelet system reserved memory utilization > 50% query indicates nodes where the Kubelet’s system reserved memory utilization exceeds 50%. The query examines the memory that the Kubelet process itself is consuming on a node.

Example default query

sum by (node) (container_memory_rss{id="/system.slice/kubelet.service"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 50

This query helps you identify any possible memory pressure situations in your nodes that could affect the stability and efficiency of node operations. Kubelet memory utilization that consistently exceeds 50% of the system reserved memory, indicate that the system reserved settings are not configured properly and that there is a high risk of the node becoming unstable.

If this metric is highlighted, review your configuration policy and consider adjusting the system reserved settings or the resource limits settings for the Kubelet. Additionally, if your Kubelet memory utilization consistently exceeds half of your total reserved system memory, examine metrics under the Outliers, Average durations, and Number of operations categories to gain further insights for more precise diagnostics.

Nodes with CRI-O system reserved memory utilization > 50%

The Nodes with CRI-O system reserved memory utilization > 50% query calculates all nodes where the percentage of used memory reserved for the CRI-O system is greater than or equal to 50%. In this case, memory usage is defined by the resident set size (RSS), which is the portion of the CRI-O system’s memory held in RAM.

Example default query

sum by (node) (container_memory_rss{id="/system.slice/crio.service"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 50

This query helps you monitor the status of memory reserved for the CRI-O system on each node. High utilization could indicate a lack of available resources and potential performance issues. If the memory reserved for the CRI-O system exceeds the advised limit of 50%, it indicates that half of the system reserved memory is being used by CRI-O on a node.

Check memory allocation and usage and assess whether memory resources need to be shifted or increased to prevent possible node instability. You can also examine the metrics under the Outliers, Average durations, and Number of operations categories to gain further insights.

Nodes with System Reserved CPU Utilization > 80%

The Nodes with system reserved CPU utilization > 80% query identifies nodes where the system-reserved CPU utilization is more than 80%. The query focuses on the system-reserved capacity to calculate the rate of CPU usage in the last 5 minutes and compares that to the CPU resources available on the nodes. If the ratio exceeds 80%, the node’s result is displayed in the metric.

Example default query

sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 80

This query indicates a critical level of system-reserved CPU usage, which can lead to resource exhaustion. High system-reserved CPU usage can result in the inability of the system processes (including the Kubelet and CRI-O) to adequately manage resources on the node. This query can indicate excessive system processes or misconfigured CPU allocation.

Potential corrective measures include rebalancing workloads to other nodes or increasing the CPU resources allocated to the nodes. Investigate the cause of the high system CPU utilization and review the corresponding metrics in the Outliers, Average durations, and Number of operations categories for additional insights into the node’s behavior.

Nodes with Kubelet system reserved CPU utilization > 50%

The Nodes with Kubelet system reserved CPU utilization > 50% query calculates the percentage of the CPU that the Kubelet system is currently using from system reserved.

Example default query

sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice/kubelet.service"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 50

The Kubelet uses the system reserved CPU for its own operations and for running critical system services. For the node’s health, it is important to ensure that system reserve CPU usage does not exceed the 50% threshold. Exceeding this limit could indicate heavy utilization or load on the Kubelet, which affects node stability and potentially the performance of the entire Kubernetes cluster.

If any node is displayed in this metric, the Kubelet and the system overall are under heavy load. You can reduce overload on a particular node by balancing the load across other nodes in the cluster. Check other query metrics under the Outliers, Average durations, and Number of operations categories to gain further insights and take necessary corrective action.

Nodes with CRI-O system reserved CPU utilization > 50%

The Nodes with CRI-O system reserved CPU utilization > 50% query identifies nodes where the CRI-O system reserved CPU utilization has exceeded 50% in the last 5 minutes. The query monitors CPU resource consumption by CRI-O, your container runtime, on a per-node basis.

Example default query

sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice/crio.service"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 50

This query allows for quick identification of abnormal start times that could negatively impact pod performance. If this query returns a high value, your pod start times are slower than usual, which suggests potential issues with the kubelet, pod configuration, or resources.

Investigate further by checking your pod configurations and allocated resources. Make sure that they align with your system capabilities. If you still see high start times, explore metrics panels from other categories on the dashboard to determine the state of your system components.

Customizing dashboard queries

You can customize the default queries used to build the node metrics dashboard.

Procedure

Choose a metric and click Inspect to navigate into the data. This page displays the metric in detail, including an expanded visualization of the results of the query, the Prometheus query used to analyze the data, and the data subset used in the query.
Make any required changes to the query parameters.
Optional: Click Add query to run additional queries against the data.
Click Run query to rerun the query using your specified parameters.