Low latency tuning
Understanding low latency
The emergence of Edge computing in the area of Telco / 5G plays a key role in reducing latency and congestion problems and improving application performance.
Simply put, latency determines how fast data (packets) moves from the sender to the receiver and returns to the sender after processing by the receiver. Maintaining a network architecture with the lowest possible latency is key for meeting the network performance requirements of 5G. Compared to 4G technology, with an average latency of 50 ms, 5G is targeted to reach latency numbers of 1 ms or less. This reduction in latency boosts wireless throughput by a factor of 10.
Many of the deployed applications in the Telco space require low latency and cannot tolerate any packet loss. Tuning for zero packet loss helps mitigate the inherent issues that degrade network performance. For more information, see Tuning for Zero Packet Loss in OpenStack.
The Edge computing initiative also comes into play for reducing latency rates. Think of it as being on the edge of the cloud and closer to the user. This greatly reduces the distance between the user and distant data centers, resulting in reduced application response times and performance latency.
Administrators must be able to manage their many Edge sites and local services in a centralized way so that all of the deployments can run at the lowest possible management cost. They also need an easy way to deploy and configure certain nodes of their cluster for real-time low latency and high-performance purposes. Low latency nodes are useful for applications such as Cloud-native Network Functions (CNF) and Data Plane Development Kit (DPDK).
OKD currently provides mechanisms to tune software on an OKD cluster for real-time running and low latency (a reaction time of less than roughly 20 microseconds). This includes tuning the kernel and OKD set values, installing a kernel, and reconfiguring the machine. But this method requires setting up four different Operators and performing many configurations that, when done manually, are complex and prone to mistakes.
OKD uses the Node Tuning Operator to implement automatic tuning to achieve low latency performance for OKD applications. The cluster administrator uses a performance profile configuration to make these changes in an easier and more reliable way. The administrator can specify whether to update the kernel to kernel-rt, reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolate CPUs for application containers to run the workloads.
Currently, disabling CPU load balancing is not supported by cgroup v2. As a result, you might not get the desired behavior from performance profiles if you have cgroup v2 enabled. Enabling cgroup v2 is not recommended if you are using performance profiles.
OKD also supports workload hints for the Node Tuning Operator that can tune the PerformanceProfile to meet the demands of different industry environments. Workload hints are available for highPowerConsumption (very low latency at the cost of increased power consumption) and realTime (priority given to optimum latency). A combination of true/false settings for these hints can be used to deal with application-specific workload profiles and requirements.
Workload hints simplify the fine-tuning of performance to industry sector settings. Instead of a “one size fits all” approach, workload hints can cater to usage patterns such as placing priority on:
- Low latency
- Real-time capability
- Efficient use of power
In an ideal world, all of those would be prioritized: in real life, some come at the expense of others. The Node Tuning Operator is now aware of the workload expectations and better able to meet the demands of the workload. The cluster admin can now specify into which use case that workload falls. The Node Tuning Operator uses the PerformanceProfile to fine tune the performance settings for the workload.
The environment in which an application is operating influences its behavior. For a typical data center with no strict latency requirements, only minimal default tuning is needed that enables CPU partitioning for some high performance workload pods. For data centers and workloads where latency is a higher priority, measures are still taken to optimize power consumption. The most complicated cases are clusters close to latency-sensitive equipment such as manufacturing machinery and software-defined radios. This last class of deployment is often referred to as Far edge. For Far edge deployments, ultra-low latency is the ultimate priority, and is achieved at the expense of power management.
In OKD version 4.10 and previous versions, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance. Now this functionality is part of the Node Tuning Operator.
About hyperthreading for low latency and real-time applications
Hyperthreading is an Intel processor technology that allows a physical CPU processor core to function as two logical cores, executing two independent threads simultaneously. Hyperthreading allows for better system throughput for certain workload types where parallel processing is beneficial. The default OKD configuration expects hyperthreading to be enabled by default.
For telecommunications applications, it is important to design your application infrastructure to minimize latency as much as possible. Hyperthreading can slow performance times and negatively affect throughput for compute intensive workloads that require low latency. Disabling hyperthreading ensures predictable performance and can decrease processing times for these workloads.
Hyperthreading implementation and configuration differ depending on the hardware you are running OKD on. Consult the relevant host hardware tuning information for more details of the hyperthreading implementation specific to that hardware. Disabling hyperthreading can increase the cost per core of the cluster.
Provisioning real-time and low latency workloads
Many industries and organizations need extremely high performance computing and might require low and predictable latency, especially in the financial and telecommunications industries. For these industries, with their unique requirements, OKD provides the Node Tuning Operator to implement automatic tuning to achieve low latency performance and consistent response time for OKD applications.
The cluster administrator can use this performance profile configuration to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt (real-time), reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, isolate CPUs for application containers to run the workloads, and disable unused CPUs to reduce power consumption.
The usage of execution probes in conjunction with applications that require guaranteed CPUs can cause latency spikes. It is recommended to use other probes, such as a properly configured set of network probes, as an alternative.
In earlier versions of OKD, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OKD 4.11 and later, these functions are part of the Node Tuning Operator.
Known limitations for real-time
In most deployments, kernel-rt is supported only on worker nodes when you use a standard cluster with three control plane nodes and three worker nodes. There are exceptions for compact and single nodes on OKD deployments. For installations on a single node, kernel-rt is supported on the single control plane node.
To fully utilize the real-time mode, the containers must run with elevated privileges. See Set capabilities for a Container for information on granting privileges.
OKD restricts the allowed capabilities, so you might need to create a SecurityContext as well.
This procedure is fully supported with bare metal installations using Fedora CoreOS (FCOS) systems.
It is important to set the right performance expectations: the real-time kernel is not a panacea. Its objective is consistent, low-latency determinism that offers predictable response times. There is some additional kernel overhead associated with the real-time kernel, due primarily to handling hardware interrupts in separately scheduled threads. In some workloads, the increased overhead results in some degradation of overall throughput. The exact amount of degradation is very workload dependent, ranging from 0% to 30%. This is the cost of determinism.
Provisioning a worker with real-time capabilities
Optional: Add a node to the OKD cluster. See Setting BIOS parameters for system tuning.
Add the label worker-rt to the worker nodes that require the real-time capability by using the oc command.
Create a new machine config pool for real-time nodes:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-rt
  labels:
    machineconfiguration.openshift.io/role: worker-rt
spec:
  machineConfigSelector:
    matchExpressions:
      - {
          key: machineconfiguration.openshift.io/role,
          operator: In,
          values: [worker, worker-rt],
        }
  paused: false
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-rt: ""
Note that a machine config pool worker-rt is created for a group of nodes that have the label worker-rt.
Add the node to the proper machine config pool by using node role labels.
You must decide which nodes are configured with real-time workloads. You could configure all of the nodes in the cluster, or a subset of the nodes. The Node Tuning Operator expects all of the nodes to be part of a dedicated machine config pool. If you use all of the nodes, you must point the Node Tuning Operator to the worker node role label. If you use a subset, you must group the nodes into a new machine config pool.
Create the PerformanceProfile with the proper set of housekeeping cores and realTimeKernel: enabled: true.
You must set machineConfigPoolSelector in PerformanceProfile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  ...
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-rt
Verify that a matching machine config pool exists with a label:
$ oc describe mcp/worker-rt
Example output
Name: worker-rt
Namespace:
Labels: machineconfiguration.openshift.io/role=worker-rt
OKD will start configuring the nodes, which might involve multiple reboots. Wait for the nodes to settle. This can take a long time depending on the specific hardware you use, but 20 minutes per node is expected.
Verify everything is working as expected.
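The machineConfigSelector in the machine config pool above relies on a label selector expression with the In operator. The following minimal Python sketch shows how such an expression is evaluated against a set of labels, following standard Kubernetes label selector semantics. The function name and the sample label dictionaries are hypothetical and exist only for illustration; this is not code from OKD.

```python
def matches_expression(labels: dict, expression: dict) -> bool:
    """Evaluate one matchExpressions entry against a dict of labels."""
    key = expression["key"]
    operator = expression["operator"]
    values = expression.get("values", [])
    if operator == "In":
        # The key must be present and its value must be one of the listed values.
        return key in labels and labels[key] in values
    if operator == "NotIn":
        # A missing key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {operator}")


# The expression used by the worker-rt machine config pool in this section.
worker_rt_selector = {
    "key": "machineconfiguration.openshift.io/role",
    "operator": "In",
    "values": ["worker", "worker-rt"],
}

# Hypothetical label sets of three machine configs.
worker_mc = {"machineconfiguration.openshift.io/role": "worker"}
worker_rt_mc = {"machineconfiguration.openshift.io/role": "worker-rt"}
master_mc = {"machineconfiguration.openshift.io/role": "master"}

for name, labels in [("worker", worker_mc), ("worker-rt", worker_rt_mc), ("master", master_mc)]:
    print(name, matches_expression(labels, worker_rt_selector))
```

Under these assumptions, the pool picks up machine configs labeled for either the worker or the worker-rt role, which is why real-time nodes still receive the base worker configuration.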
Verifying the real-time kernel installation
Use this command to verify that the real-time kernel is installed:
$ oc get node -o wide
Note the worker with the role worker-rt that contains the string 4.18.0-305.30.1.rt7.102.el8_4.x86_64 cri-o://1.26.0-99.rhaos4.10.gitc3131de.el8:
NAME STATUS ROLES AGE VERSION INTERNAL-IP
EXTERNAL-IP OS-IMAGE KERNEL-VERSION
CONTAINER-RUNTIME
rt-worker-0.example.com Ready worker,worker-rt 5d17h v1.26.0
128.66.135.107 <none> Red Hat Enterprise Linux CoreOS 46.82.202008252340-0 (Ootpa)
4.18.0-305.30.1.rt7.102.el8_4.x86_64 cri-o://1.26.0-99.rhaos4.10.gitc3131de.el8
[...]
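If you want to automate this check, one option is to test the KERNEL-VERSION column for a real-time marker. The following Python sketch is a heuristic only: it assumes that real-time kernel releases carry an ".rt<number>." component, as the example string above does. The function name is hypothetical.

```python
import re

# Assumption: real-time kernel release strings contain ".rt<number>.",
# as in the example output of this section. This is a heuristic, not an
# official rule.
RT_MARKER = re.compile(r"\.rt\d+\.")


def looks_like_realtime_kernel(kernel_version: str) -> bool:
    """Return True if the kernel release string contains a real-time marker."""
    return RT_MARKER.search(kernel_version) is not None


print(looks_like_realtime_kernel("4.18.0-305.30.1.rt7.102.el8_4.x86_64"))
print(looks_like_realtime_kernel("4.18.0-305.30.1.el8_4.x86_64"))
```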
Creating a workload that works in real-time
Use the following procedures for preparing a workload that will use real-time capabilities.
Procedure
Create a pod with a QoS class of Guaranteed.
Optional: Disable CPU load balancing for DPDK.
Assign a proper node selector.
When writing your applications, follow the general recommendations described in Application tuning and deployment.
Creating a pod with a QoS class of Guaranteed
Keep the following in mind when you create a pod that is given a QoS class of Guaranteed:
- Every container in the pod must have a memory limit and a memory request, and they must be the same.
- Every container in the pod must have a CPU limit and a CPU request, and they must be the same.
The following example shows the configuration file for a pod that has one container. The container has a memory limit and a memory request, both equal to 200 MiB. The container has a CPU limit and a CPU request, both equal to 1 CPU.
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
  namespace: qos-example
spec:
  containers:
  - name: qos-demo-ctr
    image: <image-pull-spec>
    resources:
      limits:
        memory: "200Mi"
        cpu: "1"
      requests:
        memory: "200Mi"
        cpu: "1"
Create the pod:
$ oc apply -f qos-pod.yaml --namespace=qos-example
View detailed information about the pod:
$ oc get pod qos-demo --namespace=qos-example --output=yaml
Example output
spec:
  containers:
    ...
status:
  qosClass: Guaranteed
If a container specifies its own memory limit, but does not specify a memory request, OKD automatically assigns a memory request that matches the limit. Similarly, if a container specifies its own CPU limit, but does not specify a CPU request, OKD automatically assigns a CPU request that matches the limit.
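The rules above can be expressed as a short check. The following Python sketch is an illustration under simplifying assumptions, not the scheduler's implementation: it looks only at regular containers, compares quantities as literal values (so "1" and "1000m" would be treated as different), and applies the defaulting rule that a missing request takes the value of the limit. The function name is hypothetical.

```python
def is_guaranteed(pod_spec: dict) -> bool:
    """Return True if every container has equal CPU and memory requests and limits."""
    containers = pod_spec.get("containers", [])
    if not containers:
        return False
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        # A missing request defaults to the matching limit, as described above.
        requests = {**limits, **resources.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                return False
            if requests[resource] != limits[resource]:
                return False
    return True


# The qos-demo pod from this section, reduced to the fields that matter.
qos_demo = {
    "containers": [
        {
            "name": "qos-demo-ctr",
            "resources": {
                "limits": {"memory": "200Mi", "cpu": "1"},
                "requests": {"memory": "200Mi", "cpu": "1"},
            },
        }
    ]
}

print(is_guaranteed(qos_demo))
```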
Optional: Disabling CPU load balancing for DPDK
Functionality to disable or enable CPU load balancing is implemented on the CRI-O level. CRI-O disables or enables CPU load balancing only when the following requirements are met.
The pod must use the performance-<profile-name> runtime class. You can get the proper name by looking at the status of the performance profile, as shown here:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
...
status:
  ...
  runtimeClass: performance-manual
Currently, disabling CPU load balancing is not supported with cgroup v2.
The Node Tuning Operator is responsible for the creation of the high-performance runtime handler config snippet on relevant nodes and for creation of the high-performance runtime class in the cluster. It has the same content as the default runtime handler except that it enables the CPU load balancing configuration functionality.
To disable the CPU load balancing for the pod, the Pod specification must include the following fields:
apiVersion: v1
kind: Pod
metadata:
  ...
  annotations:
    ...
    cpu-load-balancing.crio.io: "disable"
    ...
  ...
spec:
  ...
  runtimeClassName: performance-<profile_name>
  ...
Only disable CPU load balancing when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU load balancing can affect the performance of other containers in the cluster.
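The conditions in this section can be collected into one pre-flight check. The following Python sketch is a hypothetical helper, not part of CRI-O or the Node Tuning Operator: it verifies the annotation, the runtime class name, and that every container requests whole CPUs with equal requests and limits. It cannot verify that the CPU manager static policy is enabled on the node.

```python
def uses_whole_cpus(cpu_quantity) -> bool:
    """Return True for whole-CPU quantities such as "2" or "2000m"."""
    quantity = str(cpu_quantity)
    if quantity.endswith("m"):
        millicores = int(quantity[:-1])
        return millicores > 0 and millicores % 1000 == 0
    value = float(quantity)
    return value > 0 and value.is_integer()


def may_disable_load_balancing(pod: dict, profile_name: str) -> bool:
    """Check the pod-level conditions described in this section."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    spec = pod.get("spec", {})
    if annotations.get("cpu-load-balancing.crio.io") != "disable":
        return False
    if spec.get("runtimeClassName") != f"performance-{profile_name}":
        return False
    containers = spec.get("containers", [])
    if not containers:
        return False
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        # A missing request defaults to the limit.
        requests = {**limits, **resources.get("requests", {})}
        if "cpu" not in limits or requests["cpu"] != limits["cpu"]:
            return False
        if not uses_whole_cpus(limits["cpu"]):
            return False
    return True


# Hypothetical pod that uses the "manual" profile from the status example.
dpdk_pod = {
    "metadata": {"annotations": {"cpu-load-balancing.crio.io": "disable"}},
    "spec": {
        "runtimeClassName": "performance-manual",
        "containers": [{"resources": {"limits": {"cpu": "2", "memory": "200Mi"}}}],
    },
}

print(may_disable_load_balancing(dpdk_pod, "manual"))
```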
Assigning a proper node selector
The preferred way to assign a pod to nodes is to use the same node selector the performance profile used, as shown here:
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  # ...
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""
For more information, see Placing pods on specific nodes using node selectors.
Scheduling a workload onto a worker with real-time capabilities
Use label selectors that match the nodes attached to the machine config pool that was configured for low latency by the Node Tuning Operator. For more information, see Assigning pods to nodes.
Reducing power consumption by taking CPUs offline
You can generally anticipate telecommunication workloads. When not all of the CPU resources are required, the Node Tuning Operator allows you to take unused CPUs offline to reduce power consumption by manually updating the performance profile.
To take unused CPUs offline, you must perform the following tasks:
Set the offline CPUs in the performance profile and save the contents of the YAML file:
Example performance profile with offlined CPUs
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  additionalKernelArgs:
  - nmi_watchdog=0
  - audit=0
  - mce=off
  - processor.max_cstate=1
  - intel_idle.max_cstate=0
  - idle=poll
  cpu:
    isolated: "2-23,26-47"
    reserved: "0,1,24,25"
    offlined: "48-59" # (1)
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: true
1 Optional. You can list CPUs in the offlined field to take the specified CPUs offline.
Apply the updated profile by running the following command:
$ oc apply -f my-performance-profile.yaml
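Before applying a profile, it can help to expand the CPU list strings and confirm that the sets do not share any CPU. The following Python sketch assumes that the isolated, reserved, and offlined sets are meant to be disjoint, which holds for the example values above. The helper names are hypothetical.

```python
def parse_cpu_list(spec: str) -> set:
    """Expand a CPU list such as "0,1,24,25" or "2-23,26-47" into a set of CPU IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return cpus


def find_overlaps(cpu_fields: dict) -> list:
    """Return (field_a, field_b, shared CPUs) for every pair of fields that overlap."""
    parsed = {name: parse_cpu_list(spec) for name, spec in cpu_fields.items()}
    names = sorted(parsed)
    overlaps = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            shared = parsed[first] & parsed[second]
            if shared:
                overlaps.append((first, second, shared))
    return overlaps


# The cpu section of the example profile above.
profile_cpu = {"isolated": "2-23,26-47", "reserved": "0,1,24,25", "offlined": "48-59"}

print(find_overlaps(profile_cpu))                 # no overlapping pairs expected
print(len(parse_cpu_list(profile_cpu["offlined"])))  # number of CPUs taken offline
```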
Optional: Power saving configurations
You can enable power savings for a node that has low priority workloads that are colocated with high priority workloads without impacting the latency or throughput of the high priority workloads. Power saving is possible without modifications to the workloads themselves.
The feature is supported on Intel Ice Lake and later generations of Intel CPUs. The capabilities of the processor might impact the latency and throughput of the high priority workloads.
When you configure a node with a power saving configuration, you must configure high priority workloads with performance configuration at the pod level, which means that the configuration applies to all the cores used by the pod.
By disabling P-states and C-states at the pod level, you can configure high priority workloads for best performance and lowest latency.
Annotation | Description |
---|---|
cpu-c-states.crio.io: "disable" cpu-freq-governor.crio.io: "<governor>" | Provides the best performance for a pod by disabling C-states and specifying the governor type for CPU scaling. |
Prerequisites
- You enabled C-states and OS-controlled P-states in the BIOS
Procedure
Generate a PerformanceProfile with per-pod-power-management set to true:
$ podman run --entrypoint performance-profile-creator -v \
/must-gather:/must-gather:z registry.redhat.io/openshift4/performance-addon-rhel8-operator:v4.13 \
--mcp-name=worker-cnf --reserved-cpu-count=20 --rt-kernel=true \
--split-reserved-cpus-across-numa=false --topology-manager-policy=single-numa-node \
--must-gather-dir-path /must-gather --power-consumption-mode=low-latency \
--per-pod-power-management=true > my-performance-profile.yaml
The power-consumption-mode must be default or low-latency when per-pod-power-management is set to true.
Example PerformanceProfile with perPodPowerManagement
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  [.....]
  workloadHints:
    realTime: true
    highPowerConsumption: false
    perPodPowerManagement: true
Set the default cpufreq governor as an additional kernel argument in the PerformanceProfile custom resource (CR):
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  ...
  additionalKernelArgs:
  - cpufreq.default_governor=schedutil # (1)
1 Using the schedutil governor is recommended; however, you can use other governors such as the ondemand or powersave governors.
Set the maximum CPU frequency in the TunedPerformancePatch CR:
spec:
  profile:
  - data: |
      [sysfs]
      /sys/devices/system/cpu/intel_pstate/max_perf_pct = <x>
The max_perf_pct controls the maximum frequency the cpufreq driver is allowed to set as a percentage of the maximum supported CPU frequency. This value applies to all CPUs. You can check the maximum supported frequency in /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq. As a starting point, you can use a percentage that caps all CPUs at the All Cores Turbo frequency. The All Cores Turbo frequency is the frequency that all cores will run at when the cores are all fully occupied.
Add the desired annotations to your high priority workload pods. The annotations override the default settings.
Example high priority workload annotation
apiVersion: v1
kind: Pod
metadata:
  ...
  annotations:
    ...
    cpu-c-states.crio.io: "disable"
    cpu-freq-governor.crio.io: "<governor>"
    ...
  ...
spec:
  ...
  runtimeClassName: performance-<profile_name>
  ...
Restart the pods.
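To pick a starting value for max_perf_pct in the procedure above, you can express the All Cores Turbo frequency as a percentage of the maximum supported frequency. The following Python sketch shows the arithmetic. The frequencies are made-up placeholders, not values for any specific processor; cpuinfo_max_freq reports kHz, so both inputs are assumed to be in kHz. Rounding down is a choice made here so that the cap does not exceed the target frequency.

```python
def starting_max_perf_pct(all_cores_turbo_khz: int, cpuinfo_max_freq_khz: int) -> int:
    """Return the All Cores Turbo frequency as a whole percentage of the maximum frequency."""
    if not 0 < all_cores_turbo_khz <= cpuinfo_max_freq_khz:
        raise ValueError("expected 0 < all-cores turbo <= maximum supported frequency")
    # Integer division rounds down, so the cap stays at or below the target.
    return (100 * all_cores_turbo_khz) // cpuinfo_max_freq_khz


# Hypothetical CPU: 3.0 GHz All Cores Turbo, 3.5 GHz maximum supported frequency.
print(starting_max_perf_pct(3_000_000, 3_500_000))
```

Treat the result as a first estimate and validate it against the latency and throughput of your high priority workloads.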
Additional resources
- For more information about recommended firmware configuration, see Recommended firmware configuration for vDU cluster hosts.
Managing device interrupt processing for guaranteed pod isolated CPUs
The Node Tuning Operator can manage host CPUs by dividing them into reserved CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolated CPUs for application containers to run the workloads. This allows you to set CPUs for low latency workloads as isolated.
Device interrupts are load balanced between all isolated and reserved CPUs to avoid CPUs being overloaded, with the exception of CPUs where there is a guaranteed pod running. Guaranteed pod CPUs are prevented from processing device interrupts when the relevant annotations are set for the pod.
In the performance profile, globallyDisableIrqLoadBalancing is used to manage whether device interrupts are processed or not. For certain workloads, the reserved CPUs are not always sufficient for dealing with device interrupts, and for this reason, device interrupts are not globally disabled on the isolated CPUs. By default, Node Tuning Operator does not disable device interrupts on isolated CPUs.
To achieve low latency for workloads, some (but not all) pods require the CPUs they are running on to not process device interrupts. A pod annotation, irq-load-balancing.crio.io, is used to define whether device interrupts are processed or not. When configured, CRI-O disables device interrupts only as long as the pod is running.
Disabling CPU CFS quota
To reduce CPU throttling for individual guaranteed pods, create a pod specification with the annotation cpu-quota.crio.io: "disable". This annotation disables the CPU completely fair scheduler (CFS) quota at the pod run time. The following pod specification contains this annotation:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cpu-quota.crio.io: "disable"
spec:
  runtimeClassName: performance-<profile_name>
...
Only disable CPU CFS quota when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU CFS quota can affect the performance of other containers in the cluster.
Disabling global device interrupts handling in Node Tuning Operator
To configure Node Tuning Operator to disable global device interrupts for the isolated CPU set, set the globallyDisableIrqLoadBalancing field in the performance profile to true. When true, conflicting pod annotations are ignored. When false, IRQ loads are balanced across all CPUs.
A performance profile snippet illustrates this setting:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  globallyDisableIrqLoadBalancing: true
...
Disabling interrupt processing for individual pods
To disable interrupt processing for individual pods, ensure that globallyDisableIrqLoadBalancing is set to false in the performance profile. Then, in the pod specification, set the irq-load-balancing.crio.io pod annotation to disable. The following pod specification contains this annotation:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    irq-load-balancing.crio.io: "disable"
spec:
  runtimeClassName: performance-<profile_name>
...
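The interaction between the profile-level globallyDisableIrqLoadBalancing field and the per-pod irq-load-balancing.crio.io annotation can be summarized as a small decision function. The following Python sketch is a simplified model of the behavior described in this section, not the operator's or CRI-O's code. The function name, the CPU sets, and the pod tuple layout are assumptions for illustration.

```python
def cpus_processing_irqs(reserved: set, isolated: set,
                         globally_disabled: bool, pods: list) -> set:
    """Return the CPUs that can still receive device interrupts.

    pods is a list of (annotations, assigned_cpus) tuples for guaranteed pods.
    """
    if globally_disabled:
        # Interrupts are disabled for the whole isolated set;
        # pod annotations are ignored.
        return set(reserved)
    excluded = set()
    for annotations, assigned_cpus in pods:
        if annotations.get("irq-load-balancing.crio.io") == "disable":
            excluded |= set(assigned_cpus)
    # Interrupts are balanced across all remaining reserved and isolated CPUs.
    return (set(reserved) | set(isolated)) - excluded


# Hypothetical node: CPUs 0-1 reserved, CPUs 2-5 isolated,
# one annotated guaranteed pod running on CPUs 2 and 3.
reserved_cpus = {0, 1}
isolated_cpus = {2, 3, 4, 5}
annotated_pod = ({"irq-load-balancing.crio.io": "disable"}, {2, 3})

print(sorted(cpus_processing_irqs(reserved_cpus, isolated_cpus, False, [annotated_pod])))
print(sorted(cpus_processing_irqs(reserved_cpus, isolated_cpus, True, [annotated_pod])))
```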
Upgrading the performance profile to use device interrupt processing
When you upgrade the Node Tuning Operator performance profile custom resource definition (CRD) from v1 or v1alpha1 to v2, globallyDisableIrqLoadBalancing is set to true on existing profiles.
Supported API Versions
The Node Tuning Operator supports v2, v1, and v1alpha1 for the performance profile apiVersion field. The v1 and v1alpha1 APIs are identical. The v2 API includes an optional boolean field globallyDisableIrqLoadBalancing with a default value of false.
Upgrading Node Tuning Operator API from v1alpha1 to v1
When upgrading Node Tuning Operator API version from v1alpha1 to v1, the v1alpha1 performance profiles are converted on-the-fly using a “None” Conversion strategy and served to the Node Tuning Operator with API version v1.
Upgrading Node Tuning Operator API from v1alpha1 or v1 to v2
When upgrading from an older Node Tuning Operator API version, the existing v1 and v1alpha1 performance profiles are converted using a conversion webhook that injects the globallyDisableIrqLoadBalancing field with a value of true.
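The effect of the conversion described above can be modeled in a few lines. The following Python sketch is not the conversion webhook's code; it only illustrates the documented outcome on a profile represented as a dictionary. The function name and the sample profile are hypothetical.

```python
import copy

API_GROUP = "performance.openshift.io"


def convert_profile_to_v2(profile: dict) -> dict:
    """Return a v2 copy of a performance profile, modeling the documented conversion."""
    converted = copy.deepcopy(profile)
    if converted.get("apiVersion") in (f"{API_GROUP}/v1", f"{API_GROUP}/v1alpha1"):
        converted["apiVersion"] = f"{API_GROUP}/v2"
        # Existing v1 and v1alpha1 profiles get the field injected as true.
        converted.setdefault("spec", {})["globallyDisableIrqLoadBalancing"] = True
    return converted


# Hypothetical v1 profile.
old_profile = {
    "apiVersion": f"{API_GROUP}/v1",
    "kind": "PerformanceProfile",
    "metadata": {"name": "manual"},
    "spec": {"cpu": {"isolated": "2-5", "reserved": "0-1"}},
}

new_profile = convert_profile_to_v2(old_profile)
print(new_profile["apiVersion"])
print(new_profile["spec"]["globallyDisableIrqLoadBalancing"])
```

A profile that is created directly with the v2 API and omits the field is left unchanged by this model, so the v2 default of false applies to it.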
Tuning nodes for low latency with the performance profile
The performance profile lets you control latency tuning aspects of nodes that belong to a certain machine config pool. After you specify your settings, the PerformanceProfile object is compiled into multiple objects that perform the actual node level tuning:
- A MachineConfig file that manipulates the nodes.
- A KubeletConfig file that configures the Topology Manager, the CPU Manager, and the OKD nodes.
- The Tuned profile that configures the Node Tuning Operator.
You can use a performance profile to specify whether to update the kernel to kernel-rt, to allocate huge pages, and to partition the CPUs for performing housekeeping duties or running workloads.
You can manually create the |
Sample performance profile
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "5-15" # (1)
    reserved: "0-4" # (2)
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 16
      node: 0
  realTimeKernel:
    enabled: true # (3)
  numa: # (4)
    topologyPolicy: "best-effort"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: "" # (5)
1 | Use this field to isolate specific CPUs to use with application containers for workloads. |
2 | Use this field to reserve specific CPUs to use with infra containers for housekeeping. |
3 | Use this field to install the real-time kernel on the node. Valid values are true or false. Setting the true value installs the real-time kernel. |
4 | Use this field to configure the topology manager policy. Valid values are none (default), best-effort, restricted, and single-numa-node. For more information, see Topology Manager Policies. |
5 | Use this field to specify a node selector to apply the performance profile to specific nodes. |
Additional resources
- For information on using the Performance Profile Creator (PPC) to generate a performance profile, see Creating a performance profile.
Configuring huge pages
Nodes must pre-allocate huge pages used in an OKD cluster. Use the Node Tuning Operator to allocate huge pages on a specific node.
OKD provides a method for creating and allocating huge pages. Node Tuning Operator provides an easier method for doing this using the performance profile.
For example, in the hugepages pages section of the performance profile, you can specify multiple blocks of size, count, and, optionally, node:
hugepages:
  defaultHugepagesSize: "1G"
  pages:
  - size: "1G"
    count: 4
    node: 0 # (1)
1 | node is the NUMA node in which the huge pages are allocated. If you omit node, the pages are evenly spread across all NUMA nodes. |
Wait for the relevant machine config pool status that indicates the update is finished.
These are the only configuration steps you need to do to allocate huge pages.
Verification
To verify the configuration, see the /proc/meminfo file on the node:
$ oc debug node/ip-10-0-141-105.ec2.internal
# grep -i huge /proc/meminfo
Example output
AnonHugePages: ###### ##
ShmemHugePages: 0 kB
HugePages_Total: 2
HugePages_Free: 2
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: #### ##
Hugetlb: #### ##
Use oc describe to report the new size:
$ oc describe node worker-0.ocp4poc.example.com | grep -i huge
Example output
hugepages-1g=true
hugepages-###: ###
hugepages-###: ###
Allocating multiple huge page sizes
You can request huge pages with different sizes under the same container. This allows you to define more complicated pods consisting of containers with different huge page size needs.
For example, you can define sizes 1G and 2M and the Node Tuning Operator will configure both sizes on the node, as shown here:
spec:
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 1024
      node: 0
      size: 2M
    - count: 4
      node: 1
      size: 1G
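When you mix page sizes, it is useful to know how much memory each NUMA node sets aside. The following Python sketch totals the reserved bytes per NUMA node for the pages list above. It assumes that the size suffixes are binary units (2M is 2 MiB, 1G is 1 GiB); the helper names are hypothetical.

```python
# Binary multipliers for the huge page size suffixes used in this section.
UNIT_BYTES = {"K": 2**10, "M": 2**20, "G": 2**30}


def size_to_bytes(size: str) -> int:
    """Convert a huge page size such as "2M" or "1G" to bytes."""
    return int(size[:-1]) * UNIT_BYTES[size[-1].upper()]


def hugepage_bytes_per_node(pages: list) -> dict:
    """Sum the memory reserved for huge pages on each NUMA node.

    Entries without a node are collected under the key None.
    """
    totals = {}
    for page in pages:
        node = page.get("node")
        totals[node] = totals.get(node, 0) + page["count"] * size_to_bytes(page["size"])
    return totals


# The pages list from the example above.
pages = [
    {"count": 1024, "node": 0, "size": "2M"},
    {"count": 4, "node": 1, "size": "1G"},
]

for node, total in sorted(hugepage_bytes_per_node(pages).items()):
    print(f"NUMA node {node}: {total / 2**30:.0f} GiB")
```

With these values, NUMA node 0 reserves 1024 pages of 2 MiB and NUMA node 1 reserves 4 pages of 1 GiB.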
Configuring a node for IRQ dynamic load balancing
Configure a cluster node for IRQ dynamic load balancing to control which cores can receive device interrupt requests (IRQ).
Prerequisites
- For core isolation, all server hardware components must support IRQ affinity. For more information, see the Additional resources section.
Procedure
Log in to the OKD cluster as a user with cluster-admin privileges.
Set the performance profile apiVersion to use performance.openshift.io/v2.
Remove the globallyDisableIrqLoadBalancing field or set it to false.
Set the appropriate isolated and reserved CPUs. The following snippet illustrates a profile that reserves 2 CPUs. IRQ load-balancing is enabled for pods running on the isolated CPU set:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: dynamic-irq-profile
spec:
  cpu:
    isolated: 2-5
    reserved: 0-1
...
When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
Create the pod that uses exclusive CPUs, and set irq-load-balancing.crio.io and cpu-quota.crio.io annotations to disable. For example:
apiVersion: v1
kind: Pod
metadata:
  name: dynamic-irq-pod
  annotations:
    irq-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
spec:
  containers:
  - name: dynamic-irq-pod
    image: "registry.redhat.io/openshift4/cnf-tests-rhel8:v4.4.13"
    command: ["sleep", "10h"]
    resources:
      requests:
        cpu: 2
        memory: "200M"
      limits:
        cpu: 2
        memory: "200M"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  runtimeClassName: performance-dynamic-irq-profile
...
Enter the pod runtimeClassName in the form performance-<profile_name>, where <profile_name> is the name from the PerformanceProfile YAML, in this example, performance-dynamic-irq-profile.
Set the node selector to target a cnf-worker.
Ensure the pod is running correctly. Status should be Running, and the correct cnf-worker node should be set:
$ oc get pod -o wide
Expected output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dynamic-irq-pod 1/1 Running 0 5h33m <ip-address> <node-name> <none> <none>
Get the CPUs that the pod configured for IRQ dynamic load balancing runs on:
$ oc exec -it dynamic-irq-pod -- /bin/bash -c "grep Cpus_allowed_list /proc/self/status | awk '{print $2}'"
Expected output
Cpus_allowed_list: 2-3
Ensure the node configuration is applied correctly. Open a debug shell on the node to verify the configuration.
$ oc debug node/<node-name>
Expected output
Starting pod/<node-name>-debug ...
To use host binaries, run `chroot /host`
Pod IP: <ip-address>
If you don't see a command prompt, try pressing enter.
sh-4.4#
Verify that you can use the node file system:
sh-4.4# chroot /host
Expected output
sh-4.4#
Ensure the default system CPU affinity mask does not include the dynamic-irq-pod CPUs, for example, CPUs 2 and 3.
$ cat /proc/irq/default_smp_affinity
Example output
33
Ensure the system IRQs are not configured to run on the dynamic-irq-pod CPUs:
find /proc/irq/ -name smp_affinity_list -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;
Example output
/proc/irq/0/smp_affinity_list: 0-5
/proc/irq/1/smp_affinity_list: 5
/proc/irq/2/smp_affinity_list: 0-5
/proc/irq/3/smp_affinity_list: 0-5
/proc/irq/4/smp_affinity_list: 0
/proc/irq/5/smp_affinity_list: 0-5
/proc/irq/6/smp_affinity_list: 0-5
/proc/irq/7/smp_affinity_list: 0-5
/proc/irq/8/smp_affinity_list: 4
/proc/irq/9/smp_affinity_list: 4
/proc/irq/10/smp_affinity_list: 0-5
/proc/irq/11/smp_affinity_list: 0
/proc/irq/12/smp_affinity_list: 1
/proc/irq/13/smp_affinity_list: 0-5
/proc/irq/14/smp_affinity_list: 1
/proc/irq/15/smp_affinity_list: 0
/proc/irq/24/smp_affinity_list: 1
/proc/irq/25/smp_affinity_list: 1
/proc/irq/26/smp_affinity_list: 1
/proc/irq/27/smp_affinity_list: 5
/proc/irq/28/smp_affinity_list: 1
/proc/irq/29/smp_affinity_list: 0
/proc/irq/30/smp_affinity_list: 0-5
Some IRQ controllers do not support IRQ re-balancing and will always expose all online CPUs as the IRQ mask. These IRQ controllers effectively run on CPU 0. For more information on the host configuration, SSH into the host and run the following, replacing <irq-num> with the IRQ number that you want to query:
$ cat /proc/irq/<irq-num>/effective_affinity
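The two affinity checks in this procedure can be scripted. The following Python sketch is illustrative only and is not part of OKD: it decodes a hexadecimal mask such as the default_smp_affinity value into CPU numbers, and it lists the IRQs whose smp_affinity_list includes any of the pod CPUs. The function names and the shortened sample report are assumptions made for this example.

```python
def cpus_in_mask(mask_hex: str) -> set[int]:
    """Return the CPUs whose bits are set in a hexadecimal affinity mask."""
    # Long masks can be printed as comma-separated 32-bit groups.
    mask = int(mask_hex.strip().replace(",", ""), 16)
    return {cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1}


def parse_cpu_list(text: str) -> set[int]:
    """Expand a CPU list such as "0-2,5" into a set of CPU numbers."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if part:
            first, _, last = part.partition("-")
            cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def irqs_touching(report: str, pod_cpus: set[int]) -> list[str]:
    """Return the IRQ numbers whose smp_affinity_list overlaps pod_cpus."""
    hits = []
    for line in report.strip().splitlines():
        path, _, cpu_list = line.partition(": ")
        if parse_cpu_list(cpu_list) & pod_cpus:
            hits.append(path.split("/")[3])  # /proc/irq/<n>/smp_affinity_list
    return hits


pod_cpus = {2, 3}
print(sorted(cpus_in_mask("33")))               # [0, 1, 4, 5]
print(cpus_in_mask("33").isdisjoint(pod_cpus))  # True

# Shortened sample in the format printed by the find command.
sample = """\
/proc/irq/0/smp_affinity_list: 0-5
/proc/irq/1/smp_affinity_list: 5
/proc/irq/4/smp_affinity_list: 0
/proc/irq/8/smp_affinity_list: 4
"""
print(irqs_touching(sample, pod_cpus))          # ['0']
```

An IRQ that this sketch reports is not necessarily misconfigured. IRQ 0 in the sample lists every CPU, which is the pattern produced by controllers that cannot re-balance, so its effective_affinity file is the value to inspect.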
Hardware compatibility with IRQ affinity
For core isolation, all server hardware components must support IRQ affinity. To check if the hardware components of your server support IRQ affinity, view the server’s hardware specifications or contact your hardware provider.
OKD supports the following hardware devices for dynamic load balancing:
Manufacturer | Model | Vendor ID | Device ID |
---|---|---|---|
Broadcom | BCM57414 | 14e4 | 16d7 |
Broadcom | BCM57508 | 14e4 | 1750 |
Intel | X710 | 8086 | 1572 |
Intel | XL710 | 8086 | 1583 |
Intel | XXV710 | 8086 | 158b |
Intel | E810-CQDA2 | 8086 | 1592 |
Intel | E810-2CQDA2 | 8086 | 1592 |
Intel | E810-XXVDA2 | 8086 | 159b |
Intel | E810-XXVDA4 | 8086 | 1593 |
Mellanox | MT27700 Family [ConnectX‑4] | 15b3 | 1013 |
Mellanox | MT27710 Family [ConnectX‑4 Lx] | 15b3 | 1015 |
Mellanox | MT27800 Family [ConnectX‑5] | 15b3 | 1017 |
Mellanox | MT28880 Family [ConnectX‑5 Ex] | 15b3 | 1019 |
Mellanox | MT28908 Family [ConnectX‑6] | 15b3 | 101b |
Mellanox | MT2892 Family [ConnectX‑6 Dx] | 15b3 | 101d |
Mellanox | MT2894 Family [ConnectX‑6 Lx] | 15b3 | 101f |
Mellanox | MT42822 BlueField‑2 in ConnectX‑6 NIC mode | 15b3 | a2d6 |
Pensando | DSC-25 dual-port 25G distributed services card for ionic driver | 0x1dd8 | 0x1002 |
Pensando | DSC-100 dual-port 100G distributed services card for ionic driver | 0x1dd8 | 0x1003 |
Silicom | STS Family | 8086 | 1591 |
Configuring hyperthreading for a cluster
To configure hyperthreading for an OKD cluster, set the CPU threads in the performance profile to the same cores that are configured for the reserved or isolated CPU pools.
If you configure a performance profile, and subsequently change the hyperthreading configuration for the host, ensure that you update the CPU |
Disabling a previously enabled host hyperthreading configuration can cause the CPU core IDs listed in the |
Prerequisites
Access to the cluster as a user with the cluster-admin role.
Install the OpenShift CLI (oc).
Procedure
Ascertain which threads are running on what CPUs for the host you want to configure.
You can view which threads are running on the host CPUs by logging in to the cluster and running the following command:
$ lscpu --all --extended
Example output
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 4800.0000 400.0000
1 0 0 1 1:1:1:0 yes 4800.0000 400.0000
2 0 0 2 2:2:2:0 yes 4800.0000 400.0000
3 0 0 3 3:3:3:0 yes 4800.0000 400.0000
4 0 0 0 0:0:0:0 yes 4800.0000 400.0000
5 0 0 1 1:1:1:0 yes 4800.0000 400.0000
6 0 0 2 2:2:2:0 yes 4800.0000 400.0000
7 0 0 3 3:3:3:0 yes 4800.0000 400.0000
In this example, there are eight logical CPU cores running on four physical CPU cores. CPU0 and CPU4 are running on physical Core 0, CPU1 and CPU5 are running on physical Core 1, and so on.
Alternatively, to view the threads that are set for a particular physical CPU core (cpu0 in the example below), open a command prompt and run the following:
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
Example output
0,4
Apply the isolated and reserved CPUs in the PerformanceProfile YAML. For example, you can set logical cores CPU0 and CPU4 as isolated, and logical cores CPU1 to CPU3 and CPU5 to CPU7 as reserved. When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
...
  cpu:
    isolated: 0,4
    reserved: 1-3,5-7
...
The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.
Hyperthreading is enabled by default on most Intel processors. If you enable hyperthreading, all threads processed by a particular core must be isolated or processed on the same core.
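The constraints in this section can be checked before a profile is applied. The following Python sketch is a hypothetical helper, not an OKD tool: it groups logical CPUs by physical core from lscpu --all --extended text, then reports CPUs that appear in both pools, CPUs that appear in neither pool, and cores whose sibling threads are split between the pools. Treating a split core as a problem is this example's reading of the hyperthreading note above; the function names are assumptions.

```python
from collections import defaultdict


def parse_cpu_list(text: str) -> set[int]:
    """Expand a CPU list such as "1-3,5-7" into a set of CPU numbers."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if part:
            first, _, last = part.partition("-")
            cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def cores_from_lscpu(output: str) -> dict[tuple[int, int], set[int]]:
    """Group logical CPUs by (socket, core) from `lscpu --all --extended` text."""
    lines = output.strip().splitlines()
    header = lines[0].split()
    cpu, socket, core = (header.index(name) for name in ("CPU", "SOCKET", "CORE"))
    cores = defaultdict(set)
    for line in lines[1:]:
        fields = line.split()
        cores[(int(fields[socket]), int(fields[core]))].add(int(fields[cpu]))
    return dict(cores)


def check_pools(reserved: str, isolated: str, cores) -> list[str]:
    """Return a list of problems; an empty list means the pools look consistent."""
    res, iso = parse_cpu_list(reserved), parse_cpu_list(isolated)
    all_cpus = set().union(*cores.values())
    problems = []
    if res & iso:
        problems.append(f"in both pools: {sorted(res & iso)}")
    if all_cpus - res - iso:
        problems.append(f"in neither pool: {sorted(all_cpus - res - iso)}")
    for key, threads in sorted(cores.items()):
        if threads & res and threads & iso:
            problems.append(f"core {key} is split across pools: {sorted(threads)}")
    return problems


LSCPU = """\
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 4800.0000 400.0000
1 0 0 1 1:1:1:0 yes 4800.0000 400.0000
2 0 0 2 2:2:2:0 yes 4800.0000 400.0000
3 0 0 3 3:3:3:0 yes 4800.0000 400.0000
4 0 0 0 0:0:0:0 yes 4800.0000 400.0000
5 0 0 1 1:1:1:0 yes 4800.0000 400.0000
6 0 0 2 2:2:2:0 yes 4800.0000 400.0000
7 0 0 3 3:3:3:0 yes 4800.0000 400.0000
"""
cores = cores_from_lscpu(LSCPU)
print(check_pools("1-3,5-7", "0,4", cores))  # []
```

The parser assumes that every row carries numeric SOCKET and CORE values, so rows for offline CPUs that print a dash in those columns would need to be filtered out first.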
Disabling hyperthreading for low latency applications
When configuring clusters for low latency processing, consider whether you want to disable hyperthreading before you deploy the cluster. To disable hyperthreading, do the following:
Create a performance profile that is appropriate for your hardware and topology.
Set nosmt as an additional kernel argument. The following example performance profile illustrates this setting:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  additionalKernelArgs:
    - nmi_watchdog=0
    - audit=0
    - mce=off
    - processor.max_cstate=1
    - idle=poll
    - intel_idle.max_cstate=0
    - nosmt
  cpu:
    isolated: 2-3
    reserved: 0-1
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: 2
        node: 0
        size: 1G
  nodeSelector:
    node-role.kubernetes.io/performance: ''
  realTimeKernel:
    enabled: true
When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
Understanding workload hints
The following table describes how combinations of power consumption and real-time settings affect latency.
The following workload hints can be configured manually. You can also work with workload hints using the Performance Profile Creator. For more information about the performance profile, see the “Creating a performance profile” section.
Performance Profile Creator setting | Hint | Environment | Description |
---|---|---|---|
Default | | High throughput cluster without latency requirements | Performance achieved through CPU partitioning only. |
Low-latency | | Regional datacenters | Both energy savings and low-latency are desirable: compromise between power management, latency and throughput. |
Ultra-low-latency | | Far edge clusters, latency critical workloads | Optimized for absolute minimal latency and maximum determinism at the cost of increased power consumption. |
Per-pod power management | | Critical and non-critical workloads | Allows for power management per pod. |
Additional resources
- For information on using the Performance Profile Creator (PPC) to generate a performance profile, see Creating a performance profile.
Configuring workload hints manually
Procedure
Create a PerformanceProfile appropriate for the environment’s hardware and topology as described in the table in “Understanding workload hints”. Adjust the profile to match the expected workload. In this example, we tune for the lowest possible latency.
Add the highPowerConsumption and realTime workload hints. Both are set to true here.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: workload-hints
spec:
  ...
  workloadHints:
    highPowerConsumption: true (1)
    realTime: true (2)
1 If highPowerConsumption is true, the node is tuned for very low latency at the cost of increased power consumption.
2 Disables some debugging and monitoring features that can affect system latency.
Restricting CPUs for infra and application containers
Generic housekeeping and workload tasks use CPUs in a way that may impact latency-sensitive processes. By default, the container runtime uses all online CPUs to run all containers together, which can result in context switches and spikes in latency. Partitioning the CPUs prevents noisy processes from interfering with latency-sensitive processes by separating them from each other. The following table describes how processes run on a CPU after you have tuned the node using the Node Tuning Operator:
Process type | Details |
---|---|
| Runs on any CPU except where low latency workload is running |
Infrastructure pods | Runs on any CPU except where low latency workload is running |
Interrupts | Redirects to reserved CPUs (optional in OKD 4.7 and later) |
Kernel processes | Pins to reserved CPUs |
Latency-sensitive workload pods | Pins to a specific set of exclusive CPUs from the isolated pool |
OS processes/systemd services | Pins to reserved CPUs |
The allocatable capacity of cores on a node for pods of all QoS process types, Burstable, BestEffort, or Guaranteed, is equal to the capacity of the isolated pool. The capacity of the reserved pool is removed from the node’s total core capacity for use by the cluster and operating system housekeeping duties.
Example 1
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 25 cores to QoS Guaranteed pods and 25 cores for BestEffort or Burstable pods. This matches the capacity of the isolated pool.
Example 2
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 50 cores to QoS Guaranteed pods and one core for BestEffort or Burstable pods. This exceeds the capacity of the isolated pool by one core. Pod scheduling fails because of insufficient CPU capacity.
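The arithmetic behind the two examples can be written as a one-line check. The following Python sketch is a simplified illustration, not the scheduler's actual accounting: it only compares the cores assigned to pods with the size of the isolated pool. The function name cpu_shortfall is an assumption made for this example.

```python
def cpu_shortfall(isolated_cores: int, guaranteed: int, other: int) -> int:
    """Return how many assigned cores exceed the isolated pool (0 if they fit)."""
    return max(0, guaranteed + other - isolated_cores)


# Example 1: 25 Guaranteed + 25 BestEffort or Burstable cores fit in 50 isolated cores.
print(cpu_shortfall(50, 25, 25))  # 0
# Example 2: 50 Guaranteed + 1 BestEffort or Burstable core exceed the pool by one core.
print(cpu_shortfall(50, 50, 1))   # 1
```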
The exact partitioning pattern to use depends on many factors like hardware, workload characteristics and the expected system load. Some sample use cases are as follows:
If the latency-sensitive workload uses specific hardware, such as a network interface controller (NIC), ensure that the CPUs in the isolated pool are as close as possible to this hardware. At a minimum, you should place the workload in the same Non-Uniform Memory Access (NUMA) node.
The reserved pool is used for handling all interrupts. When depending on system networking, allocate a sufficiently sized reserved pool to handle all the incoming packet interrupts. In 4.13 and later versions, workloads can optionally be labeled as sensitive.
The decision regarding which specific CPUs should be used for reserved and isolated partitions requires detailed analysis and measurements. Factors like NUMA affinity of devices and memory play a role. The selection also depends on the workload architecture and the specific use case.
The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.
To ensure that housekeeping tasks and workloads do not interfere with each other, specify two groups of CPUs in the spec section of the performance profile.
isolated - Specifies the CPUs for the application container workloads. These CPUs have the lowest latency. Processes in this group have no interruptions and can, for example, reach much higher DPDK zero packet loss bandwidth.
reserved - Specifies the CPUs for the cluster and operating system housekeeping duties. Threads in the reserved group are often busy. Do not run latency-sensitive applications in the reserved group. Latency-sensitive applications run in the isolated group.
Procedure
Create a performance profile appropriate for the environment’s hardware and topology.
Add the reserved and isolated parameters with the CPUs you want reserved and isolated for the infra and application containers:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: infra-cpus
spec:
  cpu:
    reserved: "0-4,9" (1)
    isolated: "5-8" (2)
  nodeSelector: (3)
    node-role.kubernetes.io/worker: ""
1 Specify which CPUs are for infra containers to perform cluster and operating system housekeeping duties.
2 Specify which CPUs are for application containers to run workloads.
3 Optional: Specify a node selector to apply the performance profile to specific nodes.
Reducing NIC queues using the Node Tuning Operator
The Node Tuning Operator allows you to adjust the network interface controller (NIC) queue count for each network device by configuring the performance profile. Device network queues allow the distribution of packets among different physical queues, and each queue gets a separate thread for packet processing.
In real-time or low latency systems, all the unnecessary interrupt request lines (IRQs) pinned to the isolated CPUs must be moved to reserved or housekeeping CPUs.
In deployments with applications that require system or OKD networking, or in mixed deployments with Data Plane Development Kit (DPDK) workloads, multiple queues are needed to achieve good throughput, and the number of NIC queues should be adjusted or remain unchanged. For example, to achieve low latency the number of NIC queues for DPDK based workloads should be reduced to just the number of reserved or housekeeping CPUs.
By default, too many queues are created for each CPU, and these do not fit into the interrupt tables for housekeeping CPUs when tuning for low latency. Reducing the number of queues makes proper tuning possible. A smaller number of queues means fewer interrupts, which then fit in the IRQ table.
In earlier versions of OKD, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OKD 4.11 and later, this functionality is part of the Node Tuning Operator.
Adjusting the NIC queues with the performance profile
The performance profile lets you adjust the queue count for each network device.
Supported network devices:
Non-virtual network devices
Network devices that support multiple queues (channels)
Unsupported network devices:
Pure software network interfaces
Block devices
Intel DPDK virtual functions
Prerequisites
Access to the cluster as a user with the cluster-admin role.
Install the OpenShift CLI (oc).
Procedure
Log in to the OKD cluster running the Node Tuning Operator as a user with cluster-admin privileges.
Create and apply a performance profile appropriate for your hardware and topology. For guidance on creating a profile, see the “Creating a performance profile” section.
Edit this created performance profile:
$ oc edit -f <your_profile_name>.yaml
Populate the spec field with the net object. The object list can contain two fields:
userLevelNetworking is a required field specified as a boolean flag. If userLevelNetworking is true, the queue count is set to the reserved CPU count for all supported devices. The default is false.
devices is an optional field specifying a list of devices that will have the queues set to the reserved CPU count. If the device list is empty, the configuration applies to all network devices. The configuration is as follows:
interfaceName: This field specifies the interface name, and it supports shell-style wildcards, which can be positive or negative.
Example wildcard syntax is as follows: <string> .*
Negative rules are prefixed with an exclamation mark. To apply the net queue changes to all devices other than the excluded list, use !<device>, for example, !eno1.
vendorID: The network device vendor ID represented as a 16-bit hexadecimal number with a 0x prefix.
deviceID: The network device ID (model) represented as a 16-bit hexadecimal number with a 0x prefix.
When a deviceID is specified, the vendorID must also be defined. A device that matches all of the device identifiers specified in a device entry (interfaceName, vendorID, or a pair of vendorID plus deviceID) qualifies as a network device. This network device then has its net queues count set to the reserved CPU count.
When two or more devices are specified, the net queues count is set to any net device that matches one of them.
Set the queue count to the reserved CPU count for all devices by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices matching any of the defined device identifiers by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth0"
    - interfaceName: "eth1"
    - vendorID: "0x1af4"
      deviceID: "0x1000"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices starting with the interface name eth by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth*"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices with an interface named anything other than eno1 by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "!eno1"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices that have an interface name eth0, vendorID of 0x1af4, and deviceID of 0x1000 by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth0"
      vendorID: "0x1af4"
      deviceID: "0x1000"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
Apply the updated performance profile:
$ oc apply -f <your_profile_name>.yaml
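The matching rules for the devices list described in this procedure can be sketched in a few lines. The following Python example is one reading of those rules, not the Node Tuning Operator or TuneD implementation: an entry matches when every identifier it names matches, a leading exclamation mark negates an interface pattern, and a device is selected when any entry matches or when the list is empty. The function names and the sample device dictionaries are assumptions.

```python
from fnmatch import fnmatchcase


def entry_matches(entry: dict, device: dict) -> bool:
    """True when the device satisfies every identifier present in one entry."""
    rule = entry.get("interfaceName")
    if rule is not None:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        # A negated rule matches exactly when the pattern does not.
        if fnmatchcase(device["interfaceName"], pattern) == negated:
            return False
    for key in ("vendorID", "deviceID"):
        if key in entry and entry[key].lower() != device[key].lower():
            return False
    return True


def device_selected(devices: list[dict], device: dict) -> bool:
    """An empty list selects every device; otherwise any matching entry does."""
    return not devices or any(entry_matches(entry, device) for entry in devices)


ens4 = {"interfaceName": "ens4", "vendorID": "0x1af4", "deviceID": "0x1000"}
eth0 = {"interfaceName": "eth0", "vendorID": "0x1001", "deviceID": "0x1002"}

print(device_selected([{"interfaceName": "eth*"}], eth0))   # True
print(device_selected([{"interfaceName": "eth*"}], ens4))   # False
print(device_selected([{"interfaceName": "!eno1"}], ens4))  # True
print(device_selected([{"vendorID": "0x1af4", "deviceID": "0x1000"}], ens4))  # True
```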
Verifying the queue status
In this section, a number of examples illustrate different performance profiles and how to verify the changes are applied.
Example 1
In this example, the net queue count is set to the reserved CPU count (2) for all supported devices.
The relevant section from the performance profile is:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1 #total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
# ...
Display the status of the queues associated with a device using the following command:
Run this command on the node where the performance profile was applied.
$ ethtool -l <device>
Verify the queue status before the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 2 (1)
1 The combined channel shows that the total count of reserved CPUs for all supported devices is 2. This matches what is configured in the performance profile.
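The before and after comparison in Example 1 can be automated. The following Python sketch is a hypothetical helper: it reads the text printed by ethtool -l and returns the Combined value from the Current hardware settings block so that it can be compared with the reserved CPU count. The function name is an assumption, and the sample is the example output above without the callout marker.

```python
def current_combined_channels(ethtool_output: str) -> int:
    """Return the Combined value from the 'Current hardware settings' block."""
    in_current = False
    for line in ethtool_output.splitlines():
        line = line.strip()
        if line.startswith("Current hardware settings"):
            in_current = True
        elif in_current and line.startswith("Combined:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no current Combined channel count found")


AFTER = """\
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 2
"""
reserved_cpu_count = 2  # reserved: 0-1 in the example profile
print(current_combined_channels(AFTER) == reserved_cpu_count)  # True
```

Some drivers print n/a instead of a number for channel types that they do not support. In that case int() raises ValueError, which this sketch does not handle.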
Example 2
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices with a specific vendorID.
The relevant section from the performance profile is:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1 #total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
    devices:
    - vendorID: "0x1af4"
# ...
Display the status of the queues associated with a device using the following command:
Run this command on the node where the performance profile was applied.
$ ethtool -l <device>
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 2 (1)
1 The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is 2. For example, if there is another network device ens2 with vendorID=0x1af4, it will also have total net queues of 2. This matches what is configured in the performance profile.
Example 3
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices that match any of the defined device identifiers.
The command udevadm info provides a detailed report on a device. In this example the devices are:
# udevadm info -p /sys/class/net/ens4
...
E: ID_MODEL_ID=0x1000
E: ID_VENDOR_ID=0x1af4
E: INTERFACE=ens4
...
# udevadm info -p /sys/class/net/eth0
...
E: ID_MODEL_ID=0x1002
E: ID_VENDOR_ID=0x1001
E: INTERFACE=eth0
...
Set the net queues to 2 for a device with interfaceName equal to eth0 and any devices that have a vendorID=0x1af4 with the following performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1 #total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth0"
    - vendorID: "0x1af4"
...
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 2 (1)
1 The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is set to 2. For example, if there is another network device ens2 with vendorID=0x1af4, it will also have the total net queues set to 2. Similarly, a device with interfaceName equal to eth0 will have total net queues set to 2.
Logging associated with adjusting NIC queues
Log messages detailing the assigned devices are recorded in the respective Tuned daemon logs. The following messages might be recorded to the /var/log/tuned/tuned.log file:
An INFO message is recorded detailing the successfully assigned devices:
INFO tuned.plugins.base: instance net_test (net): assigning devices ens1, ens2, ens3
A WARNING message is recorded if none of the devices can be assigned:
WARNING tuned.plugins.base: instance net_test: no matching devices available
Debugging low latency CNF tuning status
The PerformanceProfile custom resource (CR) contains status fields for reporting tuning status and debugging latency degradation issues. These fields report on conditions that describe the state of the operator’s reconciliation functionality.
A typical issue can arise when the machine config pools that are attached to the performance profile are in a degraded state, causing the PerformanceProfile status to degrade. In this case, the machine config pool issues a failure message.
The Node Tuning Operator contains the performanceProfile.status.conditions status field:
Status:
  Conditions:
    Last Heartbeat Time: 2020-06-02T10:01:24Z
    Last Transition Time: 2020-06-02T10:01:24Z
    Status: True
    Type: Available
    Last Heartbeat Time: 2020-06-02T10:01:24Z
    Last Transition Time: 2020-06-02T10:01:24Z
    Status: True
    Type: Upgradeable
    Last Heartbeat Time: 2020-06-02T10:01:24Z
    Last Transition Time: 2020-06-02T10:01:24Z
    Status: False
    Type: Progressing
    Last Heartbeat Time: 2020-06-02T10:01:24Z
    Last Transition Time: 2020-06-02T10:01:24Z
    Status: False
    Type: Degraded
The Status field contains Conditions that specify Type values that indicate the status of the performance profile:
Available
All machine configs and Tuned profiles have been created successfully and are available for the cluster components that are responsible for processing them (NTO, MCO, Kubelet).
Upgradeable
Indicates whether the resources maintained by the Operator are in a state that is safe to upgrade.
Progressing
Indicates that the deployment process from the performance profile has started.
Degraded
Indicates an error if:
Validation of the performance profile has failed.
Creation of all relevant components did not complete successfully.
Each of these types contains the following fields:
Status
The state for the specific type (true or false).
Timestamp
The transaction timestamp.
Reason string
The machine readable reason.
Message string
The human readable reason describing the state and error details, if any.
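The condition types and fields described above can be reduced to a single summary. The following Python sketch is a hypothetical helper: it assumes the conditions are available as a list of dictionaries with lowercase type, status, reason, and message keys, which is the usual shape of Kubernetes conditions in JSON output. The function name and the sample data are assumptions.

```python
def profile_health(conditions: list[dict]) -> str:
    """Summarize performance profile conditions, with Degraded taking priority."""
    by_type = {c["type"]: c for c in conditions}

    def is_true(kind: str) -> bool:
        return by_type.get(kind, {}).get("status") == "True"

    if is_true("Degraded"):
        degraded = by_type["Degraded"]
        detail = f"{degraded.get('reason', '')}: {degraded.get('message', '')}"
        return f"degraded ({detail})"
    if is_true("Progressing"):
        return "progressing"
    if is_true("Available"):
        return "available"
    return "unknown"


healthy = [
    {"type": "Available", "status": "True"},
    {"type": "Upgradeable", "status": "True"},
    {"type": "Progressing", "status": "False"},
    {"type": "Degraded", "status": "False"},
]
print(profile_health(healthy))  # available
```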
Machine config pools
A performance profile and its created products are applied to a node according to an associated machine config pool (MCP). The MCP holds valuable information about the progress of applying the machine configurations created by performance profiles that encompass kernel args, kube config, huge pages allocation, and deployment of rt-kernel. The Performance Profile controller monitors changes in the MCP and updates the performance profile status accordingly.
The only condition returned by the MCP to the performance profile status is when the MCP is Degraded, which leads to performanceProfile.status.condition.Degraded = true.
Example
The following example is for a performance profile with an associated machine config pool (worker-cnf) that was created for it:
The associated machine config pool is in a degraded state:
# oc get mcp
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-2ee57a93fa6c9181b546ca46e1571d2d True False False 3 3 3 0 2d21h
worker rendered-worker-d6b2bdc07d9f5a59a6b68950acf25e5f True False False 2 2 2 0 2d21h
worker-cnf rendered-worker-cnf-6c838641b8a08fff08dbd8b02fb63f7c False True True 2 1 1 1 2d20h
The describe section of the MCP shows the reason:
# oc describe mcp worker-cnf
Example output
Message: Node node-worker-cnf is reporting: "prepping update:
machineconfig.machineconfiguration.openshift.io \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not
found"
Reason: 1 nodes are reporting degraded status on sync
The degraded state should also appear under the performance profile status field marked as degraded = true:
# oc describe performanceprofiles performance
Example output
Message: Machine config pool worker-cnf Degraded Reason: 1 nodes are reporting degraded status on sync.
Machine config pool worker-cnf Degraded Message: Node yquinn-q8s5v-w-b-z5lqn.c.openshift-gce-devel.internal is
reporting: "prepping update: machineconfig.machineconfiguration.openshift.io
\"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not found". Reason: MCPDegraded
Status: True
Type: Degraded
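The first step of this example, spotting the degraded pool in the oc get mcp listing, can be scripted. The following Python sketch is illustrative only: it locates the NAME and DEGRADED columns from the header row and returns the pools whose DEGRADED value is True. It assumes that every column holds a value without spaces; the function name is an assumption.

```python
def degraded_pools(oc_get_mcp_output: str) -> list[str]:
    """Return the names of machine config pools whose DEGRADED column is True."""
    lines = oc_get_mcp_output.strip().splitlines()
    header = lines[0].split()
    name_col, degraded_col = header.index("NAME"), header.index("DEGRADED")
    rows = (line.split() for line in lines[1:])
    return [row[name_col] for row in rows if row[degraded_col] == "True"]


LISTING = """\
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-2ee57a93fa6c9181b546ca46e1571d2d True False False 3 3 3 0 2d21h
worker rendered-worker-d6b2bdc07d9f5a59a6b68950acf25e5f True False False 2 2 2 0 2d21h
worker-cnf rendered-worker-cnf-6c838641b8a08fff08dbd8b02fb63f7c False True True 2 1 1 1 2d20h
"""
print(degraded_pools(LISTING))  # ['worker-cnf']
```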
Collecting low latency tuning debugging data for Red Hat Support
When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.
The must-gather tool enables you to collect diagnostic information about your OKD cluster, including node tuning, NUMA topology, and other information needed to debug issues with low latency setup.
For prompt support, supply diagnostic information for both OKD and low latency tuning.
About the must-gather tool
The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues, such as:
Resource definitions
Audit logs
Service logs
You can specify one or more images when you run the command by including the --image argument. When you specify an image, the tool collects data related to that feature or product. When you run oc adm must-gather, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local. This directory is created in your current working directory.
About collecting low latency tuning data
Use the oc adm must-gather CLI command to collect information about your cluster, including the following features and objects associated with low latency tuning:
The Node Tuning Operator namespaces and child objects.
MachineConfigPool and associated MachineConfig objects.
The Node Tuning Operator and associated Tuned objects.
Linux kernel command line options.
CPU and NUMA topology.
Basic PCI device information and NUMA locality.
To collect debugging information with must-gather, you must specify the Performance Addon Operator must-gather image:
--image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.13
In earlier versions of OKD, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OKD 4.11 and later, this functionality is part of the Node Tuning Operator. However, you must still use the |
Gathering data about specific features
You can gather debugging information about specific features by using the oc adm must-gather CLI command with the --image or --image-stream argument. The must-gather tool supports multiple images, so you can gather data about more than one feature by running a single command.
To collect the default |
In earlier versions of OKD, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OKD 4.11, these functions are part of the Node Tuning Operator. However, you must still use the |
Prerequisites
Access to the cluster as a user with the cluster-admin role.
The OKD CLI (oc) installed.
Procedure
Navigate to the directory where you want to store the must-gather data.
Run the oc adm must-gather command with one or more --image or --image-stream arguments. For example, the following command gathers both the default cluster data and information specific to the Node Tuning Operator:
$ oc adm must-gather \
  --image-stream=openshift/must-gather \ (1)
  --image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.13 (2)
1 The default OKD must-gather image.
2 The must-gather image for low latency tuning diagnostics.
Create a compressed file from the must-gather directory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:
$ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/ (1)
1 Replace must-gather.local.5421342344627712289/ with the actual directory name.
Attach the compressed file to your support case on the Red Hat Customer Portal.
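If the collection is driven from a script, the command line in this procedure can be assembled from lists of image streams and images. The following Python sketch is a hypothetical wrapper that only builds and prints the argument list. It uses the --image-stream and --image arguments shown above; the function name must_gather_command is an assumption.

```python
import shlex


def must_gather_command(image_streams=(), images=()) -> list[str]:
    """Build the argument list for `oc adm must-gather`."""
    argv = ["oc", "adm", "must-gather"]
    argv += [f"--image-stream={stream}" for stream in image_streams]
    argv += [f"--image={image}" for image in images]
    return argv


cmd = must_gather_command(
    image_streams=["openshift/must-gather"],
    images=[
        "registry.redhat.io/openshift4/"
        "performance-addon-operator-must-gather-rhel8:v4.13"
    ],
)
print(shlex.join(cmd))
```

To run the command rather than print it, the list can be passed to subprocess.run(cmd, check=True) from the directory where the must-gather data should be stored.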
Additional resources
For more information about MachineConfig and KubeletConfig, see Managing nodes.
For more information about the Node Tuning Operator, see Using the Node Tuning Operator.
For more information about the PerformanceProfile, see Configuring huge pages.
For more information about consuming huge pages from your containers, see How huge pages are consumed by apps.