Low latency tuning

Understanding low latency

The emergence of Edge computing in the area of Telco / 5G plays a key role in reducing latency and congestion problems and improving application performance.

Simply put, latency determines how fast data (packets) moves from the sender to the receiver and returns to the sender after processing by the receiver. Maintaining a network architecture with the lowest possible latency is key for meeting the network performance requirements of 5G. Compared to 4G technology, with an average latency of 50 ms, 5G is targeted to reach latency numbers of 1 ms or less. This reduction in latency boosts wireless throughput by a factor of 10.

Many of the deployed applications in the Telco space require low latency and cannot tolerate any packet loss. Tuning for zero packet loss helps mitigate the inherent issues that degrade network performance. For more information, see Tuning for Zero Packet Loss in OpenStack.

The Edge computing initiative also comes into play for reducing latency rates. Think of it as being on the edge of the cloud and closer to the user. This greatly reduces the distance between the user and distant data centers, resulting in reduced application response times and performance latency.

Administrators must be able to manage their many Edge sites and local services in a centralized way so that all of the deployments can run at the lowest possible management cost. They also need an easy way to deploy and configure certain nodes of their cluster for real-time low latency and high-performance purposes. Low latency nodes are useful for applications such as Cloud-native Network Functions (CNF) and Data Plane Development Kit (DPDK).

OKD currently provides mechanisms to tune software on an OKD cluster for real-time running and low latency (a reaction time of less than approximately 20 microseconds). This includes tuning the kernel and OKD set values, installing a kernel, and reconfiguring the machine. However, this method requires setting up four different Operators and performing many configurations that, when done manually, are complex and prone to mistakes.

OKD uses the Node Tuning Operator to implement automatic tuning to achieve low latency performance for OKD applications. The cluster administrator uses a performance profile configuration, which makes these changes easier and more reliable. The administrator can specify whether to update the kernel to kernel-rt, reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolate CPUs for application containers to run the workloads.

OKD also supports workload hints for the Node Tuning Operator that can tune the PerformanceProfile to meet the demands of different industry environments. Workload hints are available for highPowerConsumption (very low latency at the cost of increased power consumption) and realtime (priority given to optimum latency). A combination of true/false settings for these hints can be used to deal with application-specific workload profiles and requirements.

Workload hints simplify the fine-tuning of performance to industry sector settings. Instead of a “one size fits all” approach, workload hints can cater to usage patterns such as placing priority on:

Low latency
Real-time capability
Efficient use of power

In an ideal world, all of those would be prioritized: in real life, some come at the expense of others. The Node Tuning Operator is now aware of the workload expectations and better able to meet the demands of the workload. The cluster admin can now specify into which use case that workload falls. The Node Tuning Operator uses the PerformanceProfile to fine tune the performance settings for the workload.

The environment in which an application is operating influences its behavior. For a typical data center with no strict latency requirements, only minimal default tuning is needed that enables CPU partitioning for some high performance workload pods. For data centers and workloads where latency is a higher priority, measures are still taken to optimize power consumption. The most complicated cases are clusters close to latency-sensitive equipment such as manufacturing machinery and software-defined radios. This last class of deployment is often referred to as Far edge. For Far edge deployments, ultra-low latency is the ultimate priority, and is achieved at the expense of power management.

In OKD version 4.10 and previous versions, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance. Now this functionality is part of the Node Tuning Operator.

About hyperthreading for low latency and real-time applications

Hyperthreading is an Intel processor technology that allows a physical CPU processor core to function as two logical cores, executing two independent threads simultaneously. Hyperthreading allows for better system throughput for certain workload types where parallel processing is beneficial. The default OKD configuration expects hyperthreading to be enabled.

For telecommunications applications, it is important to design your application infrastructure to minimize latency as much as possible. Hyperthreading can slow performance times and negatively affect throughput for compute intensive workloads that require low latency. Disabling hyperthreading ensures predictable performance and can decrease processing times for these workloads.

Hyperthreading implementation and configuration differ depending on the hardware you are running OKD on. Consult the relevant host hardware tuning information for more details of the hyperthreading implementation specific to that hardware. Disabling hyperthreading can increase the cost per core of the cluster.

Additional resources

Configuring hyperthreading for a cluster

Provisioning real-time and low latency workloads

Many industries and organizations need extremely high performance computing and might require low and predictable latency, especially in the financial and telecommunications industries. For these industries, with their unique requirements, OKD provides the Node Tuning Operator to implement automatic tuning to achieve low latency performance and consistent response time for OKD applications.

The cluster administrator can use this performance profile configuration to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt (real-time), reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, isolate CPUs for application containers to run the workloads, and disable unused CPUs to reduce power consumption.

The usage of execution probes in conjunction with applications that require guaranteed CPUs can cause latency spikes. It is recommended to use other probes, such as a properly configured set of network probes, as an alternative.

In earlier versions of OKD, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OKD applications. In OKD 4.11 and later, these functions are part of the Node Tuning Operator.

Known limitations for real-time

In most deployments, kernel-rt is supported only on worker nodes when you use a standard cluster with three control plane nodes and three worker nodes. There are exceptions for compact and single nodes on OKD deployments. For installations on a single node, kernel-rt is supported on the single control plane node.

To fully utilize the real-time mode, the containers must run with elevated privileges. See Set capabilities for a Container for information on granting privileges.

OKD restricts the allowed capabilities, so you might need to create a SecurityContext as well.

This procedure is fully supported with bare metal installations using Fedora CoreOS (FCOS) systems.

Keep the right performance expectations in mind: the real-time kernel is not a panacea. Its objective is consistent, low-latency determinism that offers predictable response times. There is some additional kernel overhead associated with the real-time kernel, due primarily to handling hardware interrupts in separately scheduled threads. In some workloads, the increased overhead results in some degradation in overall throughput. The exact amount of degradation is very workload dependent, ranging from 0% to 30%. This is the cost of determinism.

Provisioning a worker with real-time capabilities

Optional: Add a node to the OKD cluster. See Setting BIOS parameters.
Add the label worker-rt to the worker nodes that require the real-time capability by using the oc command.

Create a new machine config pool for real-time nodes:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-rt
  labels:
    machineconfiguration.openshift.io/role: worker-rt
spec:
  machineConfigSelector:
    matchExpressions:
      - {
           key: machineconfiguration.openshift.io/role,
           operator: In,
           values: [worker, worker-rt],
        }
  paused: false
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-rt: ""

Note that a machine config pool worker-rt is created for a group of nodes that have the label worker-rt.
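
To make the `machineConfigSelector` above easier to read, the following Python sketch models how an `In` match expression is evaluated against the labels of a machine config. It is a simplified illustration of label selector semantics, not the Machine Config Operator's code; the function names `matches_expression` and `pool_selects` are hypothetical, and only the `In` operator is handled.

```python
def matches_expression(labels, expr):
    """Evaluate one matchExpressions entry; only the In operator is modelled."""
    if expr["operator"] != "In":
        raise ValueError("this sketch only handles the In operator")
    # A missing label yields None, which is never in the allowed values.
    return labels.get(expr["key"]) in expr["values"]


def pool_selects(machine_config_labels, match_expressions):
    """A machine config is selected only if every expression matches."""
    return all(matches_expression(machine_config_labels, e) for e in match_expressions)


ROLE_KEY = "machineconfiguration.openshift.io/role"
worker_rt_selector = [
    {"key": ROLE_KEY, "operator": "In", "values": ["worker", "worker-rt"]}
]

for role in ("worker", "worker-rt", "master"):
    print(role, pool_selects({ROLE_KEY: role}, worker_rt_selector))
```

Under these assumptions, the worker-rt pool picks up machine configs labelled with either the worker or the worker-rt role, so real-time nodes inherit the standard worker configuration in addition to their own.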

Add the node to the proper machine config pool by using node role labels.

You must decide which nodes are configured with real-time workloads. You can configure all of the nodes in the cluster, or a subset of the nodes. The Node Tuning Operator expects all of the nodes to be part of a dedicated machine config pool. If you use all of the nodes, you must point the Node Tuning Operator to the worker node role label. If you use a subset, you must group the nodes into a new machine config pool.

Create the PerformanceProfile with the proper set of housekeeping cores and realTimeKernel: enabled: true.

You must set machineConfigPoolSelector in PerformanceProfile:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  # ...
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-rt

Verify that a matching machine config pool exists with a label:

$ oc describe mcp/worker-rt

Example output

Name:         worker-rt
Namespace:
Labels:       machineconfiguration.openshift.io/role=worker-rt

OKD will start configuring the nodes, which might involve multiple reboots. Wait for the nodes to settle. This can take a long time depending on the specific hardware you use, but 20 minutes per node is expected.
Verify everything is working as expected.

Verifying the real-time kernel installation

Use this command to verify that the real-time kernel is installed:

$ oc get node -o wide

Note the worker with the role worker-rt that contains the string 4.18.0-305.30.1.rt7.102.el8_4.x86_64 cri-o://1.24.0-99.rhaos4.10.gitc3131de.el8:

NAME                      STATUS   ROLES              AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                         CONTAINER-RUNTIME
rt-worker-0.example.com   Ready    worker,worker-rt   5d17h   v1.24.0   128.66.135.107   <none>        Red Hat Enterprise Linux CoreOS 46.82.202008252340-0 (Ootpa)   4.18.0-305.30.1.rt7.102.el8_4.x86_64   cri-o://1.24.0-99.rhaos4.10.gitc3131de.el8
[...]

Creating a workload that works in real-time

Use the following procedures for preparing a workload that will use real-time capabilities.

Procedure

Create a pod with a QoS class of Guaranteed.
Optional: Disable CPU load balancing for DPDK.
Assign a proper node selector.

When writing your applications, follow the general recommendations described in Application tuning and deployment.

Creating a pod with a QoS class of `Guaranteed`

Keep the following in mind when you create a pod that is given a QoS class of Guaranteed:

Every container in the pod must have a memory limit and a memory request, and they must be the same.
Every container in the pod must have a CPU limit and a CPU request, and they must be the same.

The following example shows the configuration file for a pod that has one container. The container has a memory limit and a memory request, both equal to 200 MiB. The container has a CPU limit and a CPU request, both equal to 1 CPU.

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
  namespace: qos-example
spec:
  containers:
  - name: qos-demo-ctr
    image: <image-pull-spec>
    resources:
      limits:
        memory: "200Mi"
        cpu: "1"
      requests:
        memory: "200Mi"
        cpu: "1"

Create the pod:

$ oc apply -f qos-pod.yaml --namespace=qos-example

View detailed information about the pod:

$ oc get pod qos-demo --namespace=qos-example --output=yaml

Example output

spec:
  containers:
    ...
status:
  qosClass: Guaranteed

If a container specifies its own memory limit, but does not specify a memory request, OKD automatically assigns a memory request that matches the limit. Similarly, if a container specifies its own CPU limit, but does not specify a CPU request, OKD automatically assigns a CPU request that matches the limit.
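
The two rules and the defaulting behavior described above can be sketched in Python as follows. This is a minimal illustration, not the kubelet's implementation: the helper names `effective_resources` and `is_guaranteed` are hypothetical, init containers are ignored, and quantities are compared as plain strings, so equivalent spellings such as "1" and "1000m" are treated as different.

```python
def effective_resources(container):
    """Return (requests, limits) after copying limits into missing requests."""
    resources = container.get("resources", {})
    limits = dict(resources.get("limits", {}))
    requests = dict(resources.get("requests", {}))
    for name, value in limits.items():
        # A limit without a matching request implies an identical request.
        requests.setdefault(name, value)
    return requests, limits


def is_guaranteed(pod):
    """True if every container has equal CPU and memory requests and limits."""
    containers = pod["spec"]["containers"]
    if not containers:
        return False
    for container in containers:
        requests, limits = effective_resources(container)
        for name in ("cpu", "memory"):
            if name not in limits or requests.get(name) != limits[name]:
                return False
    return True


qos_demo = {
    "spec": {
        "containers": [
            {
                "name": "qos-demo-ctr",
                "resources": {
                    "limits": {"memory": "200Mi", "cpu": "1"},
                    "requests": {"memory": "200Mi", "cpu": "1"},
                },
            }
        ]
    }
}
print(is_guaranteed(qos_demo))  # True
```

A container that sets only limits still qualifies in this sketch, because the missing requests default to the limits. A container that sets a CPU request lower than its CPU limit does not.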

Optional: Disabling CPU load balancing for DPDK

Functionality to disable or enable CPU load balancing is implemented on the CRI-O level. The code under the CRI-O disables or enables CPU load balancing only when the following requirements are met.

The pod must use the performance-<profile-name> runtime class. You can get the proper name by looking at the status of the performance profile, as shown here:
```
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
...
status:
  ...
  runtimeClass: performance-manual
```
The pod must have the cpu-load-balancing.crio.io: "disable" annotation.

The Node Tuning Operator is responsible for the creation of the high-performance runtime handler config snippet on relevant nodes and for creation of the high-performance runtime class in the cluster. The high-performance runtime handler has the same content as the default runtime handler, except that it enables the CPU load balancing configuration functionality.

To disable the CPU load balancing for the pod, the Pod specification must include the following fields:

apiVersion: v1
kind: Pod
metadata:
  ...
  annotations:
    ...
    cpu-load-balancing.crio.io: "disable"
    ...
  ...
spec:
  ...
  runtimeClassName: performance-<profile_name>
  ...

Only disable CPU load balancing when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU load balancing can affect the performance of other containers in the cluster.

Assigning a proper node selector

The preferred way to assign a pod to nodes is to use the same node selector the performance profile used, as shown here:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  # ...
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""

For more information, see Placing pods on specific nodes using node selectors.

Scheduling a workload onto a worker with real-time capabilities

Use label selectors that match the nodes attached to the machine config pool that was configured for low latency by the Node Tuning Operator. For more information, see Assigning pods to nodes.

Reducing power consumption by taking CPUs offline

You can generally anticipate the demands of telecommunication workloads. When not all of the CPU resources are required, the Node Tuning Operator allows you to take unused CPUs offline to reduce power consumption by manually updating the performance profile.

To take unused CPUs offline, you must perform the following tasks:

Set the offline CPUs in the performance profile and save the contents of the YAML file:

Example performance profile with offlined CPUs

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  additionalKernelArgs:
  - nmi_watchdog=0
  - audit=0
  - mce=off
  - processor.max_cstate=1
  - intel_idle.max_cstate=0
  - idle=poll
  cpu:
    isolated: "2-23,26-47"
    reserved: "0,1,24,25"
    offlined: "48-59" (1)
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: true

1	Optional. You can list CPUs in the `offlined` field to take the specified CPUs offline.

Apply the updated profile by running the following command:
```
$ oc apply -f my-performance-profile.yaml
```

Managing device interrupt processing for guaranteed pod isolated CPUs

The Node Tuning Operator can manage host CPUs by dividing them into reserved CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolated CPUs for application containers to run the workloads. This allows you to set CPUs for low latency workloads as isolated.

Device interrupts are load balanced between all isolated and reserved CPUs to avoid CPUs being overloaded, with the exception of CPUs where there is a guaranteed pod running. Guaranteed pod CPUs are prevented from processing device interrupts when the relevant annotations are set for the pod.

In the performance profile, globallyDisableIrqLoadBalancing is used to manage whether device interrupts are processed or not. For certain workloads, the reserved CPUs are not always sufficient for dealing with device interrupts, and for this reason, device interrupts are not globally disabled on the isolated CPUs. By default, Node Tuning Operator does not disable device interrupts on isolated CPUs.

To achieve low latency for workloads, some (but not all) pods require the CPUs they are running on to not process device interrupts. A pod annotation, irq-load-balancing.crio.io, is used to define whether device interrupts are processed or not. When configured, CRI-O disables device interrupts only as long as the pod is running.
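
The interaction between the profile-level flag and the pod annotation can be summarized with a small Python model. This is only a sketch of the behavior described in this section, not code from the Node Tuning Operator or CRI-O; the function name `processes_device_irqs` and its parameters are hypothetical.

```python
def processes_device_irqs(cpu, reserved, isolated, globally_disabled, irq_disabled_pod_cpus):
    """Return True if, under this simplified model, `cpu` handles device interrupts."""
    if cpu in reserved:
        # Reserved (housekeeping) CPUs always take part in interrupt handling.
        return True
    if cpu not in isolated:
        raise ValueError(f"CPU {cpu} is in neither pool")
    if globally_disabled:
        # globallyDisableIrqLoadBalancing: true covers the whole isolated set.
        return False
    # Otherwise only CPUs of annotated guaranteed pods are excluded.
    return cpu not in irq_disabled_pod_cpus


reserved, isolated = {0, 1}, {2, 3, 4, 5}
pod_cpus = {2, 3}  # CPUs of a guaranteed pod annotated irq-load-balancing.crio.io: "disable"

for flag in (False, True):
    handlers = sorted(
        c for c in reserved | isolated
        if processes_device_irqs(c, reserved, isolated, flag, pod_cpus)
    )
    print(f"globallyDisableIrqLoadBalancing={flag}: {handlers}")
```

With the flag set to false, the model leaves CPUs 0, 1, 4, and 5 available for interrupts and excludes only the annotated pod's CPUs. With the flag set to true, only the reserved CPUs 0 and 1 remain.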

Disabling CPU CFS quota

To reduce CPU throttling for individual guaranteed pods, create a pod specification with the annotation cpu-quota.crio.io: "disable". This annotation disables the CPU completely fair scheduler (CFS) quota at the pod run time. The following pod specification contains this annotation:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cpu-quota.crio.io: "disable"
spec:
  runtimeClassName: performance-<profile_name>
...

Only disable CPU CFS quota when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU CFS quota can affect the performance of other containers in the cluster.

Disabling global device interrupts handling in Node Tuning Operator

To configure Node Tuning Operator to disable global device interrupts for the isolated CPU set, set the globallyDisableIrqLoadBalancing field in the performance profile to true. When true, conflicting pod annotations are ignored. When false, IRQ loads are balanced across all CPUs.

A performance profile snippet illustrates this setting:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  globallyDisableIrqLoadBalancing: true
...

Disabling interrupt processing for individual pods

To disable interrupt processing for individual pods, ensure that globallyDisableIrqLoadBalancing is set to false in the performance profile. Then, in the pod specification, set the irq-load-balancing.crio.io pod annotation to disable. The following pod specification contains this annotation:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    irq-load-balancing.crio.io: "disable"
spec:
  runtimeClassName: performance-<profile_name>
...

Upgrading the performance profile to use device interrupt processing

When you upgrade the Node Tuning Operator performance profile custom resource definition (CRD) from v1 or v1alpha1 to v2, globallyDisableIrqLoadBalancing is set to true on existing profiles.

globallyDisableIrqLoadBalancing toggles whether IRQ load balancing will be disabled for the Isolated CPU set. When the option is set to true it disables IRQ load balancing for the Isolated CPU set. Setting the option to false allows the IRQs to be balanced across all CPUs.

Supported API Versions

The Node Tuning Operator supports v2, v1, and v1alpha1 for the performance profile apiVersion field. The v1 and v1alpha1 APIs are identical. The v2 API includes an optional boolean field globallyDisableIrqLoadBalancing with a default value of false.

Upgrading Node Tuning Operator API from v1alpha1 to v1

When upgrading Node Tuning Operator API version from v1alpha1 to v1, the v1alpha1 performance profiles are converted on-the-fly using a “None” Conversion strategy and served to the Node Tuning Operator with API version v1.

Upgrading Node Tuning Operator API from v1alpha1 or v1 to v2

When upgrading from an older Node Tuning Operator API version, the existing v1 and v1alpha1 performance profiles are converted using a conversion webhook that injects the globallyDisableIrqLoadBalancing field with a value of true.
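
The effect of the conversion on an existing profile can be illustrated with the following Python sketch. It mimics only the outcome described above (a changed apiVersion plus the injected field) and is not the actual conversion webhook; the function name `convert_to_v2` is hypothetical.

```python
import copy

V2 = "performance.openshift.io/v2"
OLDER_VERSIONS = {"performance.openshift.io/v1", "performance.openshift.io/v1alpha1"}


def convert_to_v2(profile):
    """Return a v2 copy of a v1 or v1alpha1 performance profile dictionary."""
    converted = copy.deepcopy(profile)
    if profile["apiVersion"] == V2:
        return converted  # already v2, nothing to inject
    if profile["apiVersion"] not in OLDER_VERSIONS:
        raise ValueError(f"unexpected apiVersion: {profile['apiVersion']}")
    converted["apiVersion"] = V2
    # Older profiles had no such field, so the upgrade keeps their previous
    # behavior by disabling IRQ load balancing for the isolated CPU set.
    converted.setdefault("spec", {})["globallyDisableIrqLoadBalancing"] = True
    return converted


v1_profile = {
    "apiVersion": "performance.openshift.io/v1",
    "kind": "PerformanceProfile",
    "metadata": {"name": "manual"},
    "spec": {"cpu": {"isolated": "2-5", "reserved": "0-1"}},
}
print(convert_to_v2(v1_profile)["spec"])
```

A profile created directly as v2 is left alone in this sketch, so the field keeps its default value of false unless you set it explicitly.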

Tuning nodes for low latency with the performance profile

The performance profile lets you control latency tuning aspects of nodes that belong to a certain machine config pool. After you specify your settings, the PerformanceProfile object is compiled into multiple objects that perform the actual node level tuning:

A MachineConfig file that manipulates the nodes.
A KubeletConfig file that configures the Topology Manager, the CPU Manager, and the OKD nodes.
The Tuned profile that configures the Node Tuning Operator.

You can use a performance profile to specify whether to update the kernel to kernel-rt, to allocate huge pages, and to partition the CPUs for performing housekeeping duties or running workloads.

You can manually create the PerformanceProfile object or use the Performance Profile Creator (PPC) to generate a performance profile. See the additional resources below for more information on the PPC.

Sample performance profile

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
 name: performance
spec:
 cpu:
  isolated: "5-15" (1)
  reserved: "0-4" (2)
 hugepages:
  defaultHugepagesSize: "1G"
  pages:
  - size: "1G"
    count: 16
    node: 0
 realTimeKernel:
  enabled: true  (3)
 numa:  (4)
  topologyPolicy: "best-effort"
 nodeSelector:
  node-role.kubernetes.io/worker-cnf: "" (5)

1	Use this field to isolate specific CPUs to use with application containers for workloads.
2	Use this field to reserve specific CPUs to use with infra containers for housekeeping.
3	Use this field to install the real-time kernel on the node. Valid values are `true` or `false`. Setting the `true` value installs the real-time kernel.
4	Use this field to configure the topology manager policy. Valid values are `none` (default), `best-effort`, `restricted`, and `single-numa-node`. For more information, see Topology Manager Policies.
5	Use this field to specify a node selector to apply the performance profile to specific nodes.

Additional resources

For information on using the Performance Profile Creator (PPC) to generate a performance profile, see Creating a performance profile.

Configuring huge pages

Nodes must pre-allocate huge pages used in an OKD cluster. Use the Node Tuning Operator to allocate huge pages on a specific node.

OKD provides a method for creating and allocating huge pages. Node Tuning Operator provides an easier method for doing this using the performance profile.

For example, in the hugepages pages section of the performance profile, you can specify multiple blocks of size, count, and, optionally, node:

hugepages:
   defaultHugepagesSize: "1G"
   pages:
   - size:  "1G"
     count:  4
     node:  0 (1)

1	`node` is the NUMA node in which the huge pages are allocated. If you omit `node`, the pages are evenly spread across all NUMA nodes.

Wait for the relevant machine config pool status that indicates the update is finished.

These are the only configuration steps you need to do to allocate huge pages.

Verification

To verify the configuration, see the /proc/meminfo file on the node:

$ oc debug node/ip-10-0-141-105.ec2.internal

# grep -i huge /proc/meminfo

Example output

AnonHugePages:    ###### ##
ShmemHugePages:        0 kB
HugePages_Total:       2
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       #### ##
Hugetlb:            #### ##

Use oc describe to report the new size:

$ oc describe node worker-0.ocp4poc.example.com | grep -i huge

Example output

                                   hugepages-1g=true
 hugepages-###:  ###
 hugepages-###:  ###

Allocating multiple huge page sizes

You can request huge pages with different sizes under the same container. This allows you to define more complicated pods consisting of containers with different huge page size needs.

For example, you can define sizes 1G and 2M and the Node Tuning Operator will configure both sizes on the node, as shown here:

spec:
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 1024
      node: 0
      size: 2M
    - count: 4
      node: 1
      size: 1G

Configuring a node for IRQ dynamic load balancing

To configure a cluster node to handle IRQ dynamic load balancing, do the following:

Log in to the OKD cluster as a user with cluster-admin privileges.
Set the performance profile apiVersion to use performance.openshift.io/v2.
Remove the globallyDisableIrqLoadBalancing field or set it to false.
Set the appropriate isolated and reserved CPUs. The following snippet illustrates a profile that reserves 2 CPUs. IRQ load-balancing is enabled for pods running on the isolated CPU set:
```
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: dynamic-irq-profile
spec:
  cpu:
    isolated: 2-5
    reserved: 0-1
...
```
When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.

Create the pod that uses exclusive CPUs, and set irq-load-balancing.crio.io and cpu-quota.crio.io annotations to disable. For example:

apiVersion: v1
kind: Pod
metadata:
  name: dynamic-irq-pod
  annotations:
     irq-load-balancing.crio.io: "disable"
     cpu-quota.crio.io: "disable"
spec:
  containers:
  - name: dynamic-irq-pod
    image: "registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11"
    command: ["sleep", "10h"]
    resources:
      requests:
        cpu: 2
        memory: "200M"
      limits:
        cpu: 2
        memory: "200M"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  runtimeClassName: performance-dynamic-irq-profile
...

Enter the pod runtimeClassName in the form performance-<profile_name>, where <profile_name> is the name from the PerformanceProfile YAML, in this example, performance-dynamic-irq-profile.
Set the node selector to target a worker-cnf node.

Ensure the pod is running correctly. The status should be Running, and the correct worker-cnf node should be set:

$ oc get pod -o wide

Expected output

NAME              READY   STATUS    RESTARTS   AGE     IP             NODE          NOMINATED NODE   READINESS GATES
dynamic-irq-pod   1/1     Running   0          5h33m   <ip-address>   <node-name>   <none>           <none>

Get the CPUs that the pod configured for IRQ dynamic load balancing runs on:

$ oc exec -it dynamic-irq-pod -- /bin/bash -c "grep Cpus_allowed_list /proc/self/status"

Expected output

Cpus_allowed_list:  2-3

Ensure the node configuration is applied correctly. Open a debug session on the node to verify the configuration:

$ oc debug node/<node-name>

Expected output

Starting pod/<node-name>-debug ...
To use host binaries, run `chroot /host`
Pod IP: <ip-address>
If you don't see a command prompt, try pressing enter.
sh-4.4#

Verify that you can use the node file system:
```
sh-4.4# chroot /host
```
Expected output
```
sh-4.4#
```
Ensure the default system CPU affinity mask does not include the dynamic-irq-pod CPUs, for example, CPUs 2 and 3.
```
$ cat /proc/irq/default_smp_affinity
```
Example output
```
33
```
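
The value in default_smp_affinity is a hexadecimal bitmask in which bit N stands for CPU N. The following Python sketch decodes the example value and checks it against the pod's CPUs. The function name `cpus_in_mask` is hypothetical, and the pod CPU set is taken from the earlier example output.

```python
def cpus_in_mask(mask_hex):
    """Decode a hexadecimal affinity mask (optionally comma-grouped) into CPU IDs."""
    value = int(mask_hex.strip().replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


default_affinity = cpus_in_mask("33")  # 0x33 is binary 110011
pod_cpus = {2, 3}

print(sorted(default_affinity))      # [0, 1, 4, 5]
print(default_affinity & pod_cpus)   # set()
```

An empty intersection means that the default affinity mask does not include the CPUs used by dynamic-irq-pod.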

Ensure the system IRQs are not configured to run on the dynamic-irq-pod CPUs:

$ find /proc/irq/ -name smp_affinity_list -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;

Example output

/proc/irq/0/smp_affinity_list: 0-5
/proc/irq/1/smp_affinity_list: 5
/proc/irq/2/smp_affinity_list: 0-5
/proc/irq/3/smp_affinity_list: 0-5
/proc/irq/4/smp_affinity_list: 0
/proc/irq/5/smp_affinity_list: 0-5
/proc/irq/6/smp_affinity_list: 0-5
/proc/irq/7/smp_affinity_list: 0-5
/proc/irq/8/smp_affinity_list: 4
/proc/irq/9/smp_affinity_list: 4
/proc/irq/10/smp_affinity_list: 0-5
/proc/irq/11/smp_affinity_list: 0
/proc/irq/12/smp_affinity_list: 1
/proc/irq/13/smp_affinity_list: 0-5
/proc/irq/14/smp_affinity_list: 1
/proc/irq/15/smp_affinity_list: 0
/proc/irq/24/smp_affinity_list: 1
/proc/irq/25/smp_affinity_list: 1
/proc/irq/26/smp_affinity_list: 1
/proc/irq/27/smp_affinity_list: 5
/proc/irq/28/smp_affinity_list: 1
/proc/irq/29/smp_affinity_list: 0
/proc/irq/30/smp_affinity_list: 0-5

Some IRQ controllers do not support IRQ re-balancing and will always expose all online CPUs as the IRQ mask. These IRQ controllers effectively run on CPU 0. For more information on the host configuration, SSH into the host and run the following, replacing <irq-num> with the number of the IRQ that you want to query:

$ cat /proc/irq/<irq-num>/effective_affinity

Configuring hyperthreading for a cluster

To configure hyperthreading for an OKD cluster, set the CPU threads in the performance profile to the same cores that are configured for the reserved or isolated CPU pools.

If you configure a performance profile, and subsequently change the hyperthreading configuration for the host, ensure that you update the CPU isolated and reserved fields in the PerformanceProfile YAML to match the new configuration.

Disabling a previously enabled host hyperthreading configuration can cause the CPU core IDs listed in the PerformanceProfile YAML to be incorrect. This incorrect configuration can cause the node to become unavailable because the listed CPUs can no longer be found.

Prerequisites

Access to the cluster as a user with the cluster-admin role.
Install the OpenShift CLI (oc).

Procedure

Ascertain which threads are running on what CPUs for the host you want to configure.

You can view which threads are running on the host CPUs by logging in to the cluster and running the following command:

$ lscpu --all --extended

Example output

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    4800.0000 400.0000
1   0    0      1    1:1:1:0       yes    4800.0000 400.0000
2   0    0      2    2:2:2:0       yes    4800.0000 400.0000
3   0    0      3    3:3:3:0       yes    4800.0000 400.0000
4   0    0      0    0:0:0:0       yes    4800.0000 400.0000
5   0    0      1    1:1:1:0       yes    4800.0000 400.0000
6   0    0      2    2:2:2:0       yes    4800.0000 400.0000
7   0    0      3    3:3:3:0       yes    4800.0000 400.0000

In this example, there are eight logical CPU cores running on four physical CPU cores. CPU0 and CPU4 are running on physical Core 0, CPU1 and CPU5 are running on physical Core 1, and so on.
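
On hosts with many CPUs, reading the sibling pairs off the table by eye is error prone. The following Python sketch groups the logical CPUs from the example output by socket and physical core. It assumes the column layout shown above, with CPU, SOCKET, and CORE columns; the function name `sibling_groups` is hypothetical.

```python
from collections import defaultdict

LSCPU_OUTPUT = """\
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    4800.0000 400.0000
1   0    0      1    1:1:1:0       yes    4800.0000 400.0000
2   0    0      2    2:2:2:0       yes    4800.0000 400.0000
3   0    0      3    3:3:3:0       yes    4800.0000 400.0000
4   0    0      0    0:0:0:0       yes    4800.0000 400.0000
5   0    0      1    1:1:1:0       yes    4800.0000 400.0000
6   0    0      2    2:2:2:0       yes    4800.0000 400.0000
7   0    0      3    3:3:3:0       yes    4800.0000 400.0000
"""


def sibling_groups(lscpu_text):
    """Map (socket, core) to the list of logical CPUs that run on that core."""
    lines = lscpu_text.strip().splitlines()
    header = lines[0].split()
    cpu_col, socket_col, core_col = (header.index(n) for n in ("CPU", "SOCKET", "CORE"))
    groups = defaultdict(list)
    for line in lines[1:]:
        fields = line.split()
        key = (int(fields[socket_col]), int(fields[core_col]))
        groups[key].append(int(fields[cpu_col]))
    return dict(groups)


for (socket, core), cpus in sorted(sibling_groups(LSCPU_OUTPUT).items()):
    print(f"socket {socket} core {core}: CPUs {cpus}")
```

For the example output, each physical core maps to two logical CPUs: 0 and 4, 1 and 5, 2 and 6, and 3 and 7.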

Alternatively, to view the sibling threads of a particular logical CPU (cpu0 in the example below), open a command prompt and run the following:

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

Example output

0,4

Apply the isolated and reserved CPUs in the PerformanceProfile YAML. For example, you can set logical cores CPU0 and CPU4 as isolated, and logical cores CPU1 to CPU3 and CPU5 to CPU7 as reserved. When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
```
...
  cpu:
    isolated: 0,4
    reserved: 1-3,5-7
...
```
The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.
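
The non-overlap and full-coverage rule can be checked before a profile is applied. The following is a minimal sketch, not part of any OKD tooling: `parse_cpu_list` and `check_pools` are hypothetical helper names, and the CPU list syntax is assumed to be limited to comma-separated IDs and inclusive ranges as used in the examples in this document.

```python
def parse_cpu_list(spec):
    """Expand a CPU list string such as "1-3,5-7" into a set of integers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def check_pools(isolated, reserved, online_cpus):
    """Return a list of problems; an empty list means the pools are consistent."""
    iso = parse_cpu_list(isolated)
    res = parse_cpu_list(reserved)
    online = set(online_cpus)
    problems = []
    if iso & res:
        problems.append(f"CPUs in both pools: {sorted(iso & res)}")
    if online - (iso | res):
        problems.append(f"CPUs in neither pool: {sorted(online - (iso | res))}")
    if (iso | res) - online:
        problems.append(f"CPUs not present on the node: {sorted((iso | res) - online)}")
    return problems


# The eight-CPU example from this procedure: no problems are reported.
print(check_pools("0,4", "1-3,5-7", range(8)))

# A deliberately broken layout: CPU 2 is listed twice and CPU 5 is unassigned.
print(check_pools("2-4", "0-2", range(6)))
```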

Hyperthreading is enabled by default on most Intel processors. If hyperthreading is enabled, keep all sibling threads of a physical core in the same pool: place them all in the isolated pool or all in the reserved pool.

Disabling hyperthreading for low latency applications

When configuring clusters for low latency processing, consider whether you want to disable hyperthreading before you deploy the cluster. To disable hyperthreading, do the following:

Create a performance profile that is appropriate for your hardware and topology.

Set nosmt as an additional kernel argument. The following example performance profile illustrates this setting:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  additionalKernelArgs:
    - nmi_watchdog=0
    - audit=0
    - mce=off
    - processor.max_cstate=1
    - idle=poll
    - intel_idle.max_cstate=0
    - nosmt
  cpu:
    isolated: 2-3
    reserved: 0-1
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: 2
        node: 0
        size: 1G
  nodeSelector:
    node-role.kubernetes.io/performance: ''
  realTimeKernel:
    enabled: true

When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.

Understanding workload hints

The following table describes how combinations of power consumption and real-time settings affect latency.

You can configure the following workload hints manually or by using the Performance Profile Creator. See “Creating a performance profile” for further information.

Performance Profile creator setting	Hint	Environment	Description
Default	`workloadHints:` `highPowerConsumption: false` `realtime: false`	High throughput cluster without latency requirements	Performance achieved through CPU partitioning only.
Low-latency	`workloadHints:` `highPowerConsumption: false` `realtime: true`	Regional datacenters	Both energy savings and low-latency are desirable: compromise between power management, latency and throughput.
Ultra-low-latency	`workloadHints:` `highPowerConsumption: true` `realtime: true`	Far edge clusters, latency critical workloads	Optimized for absolute minimal latency and maximum determinism at the cost of increased power consumption.
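
The table can be read as a lookup from a pair of hint values to a Performance Profile Creator setting. The following sketch encodes only the three rows listed above; the `setting_for` helper is a hypothetical name, and combinations that the table does not list are reported as `None` rather than guessed.

```python
# Rows of the workload hints table, keyed by the Performance Profile Creator setting.
WORKLOAD_HINTS = {
    "Default": {"highPowerConsumption": False, "realtime": False},
    "Low-latency": {"highPowerConsumption": False, "realtime": True},
    "Ultra-low-latency": {"highPowerConsumption": True, "realtime": True},
}


def setting_for(high_power_consumption, realtime):
    """Return the table row name for a hint combination, or None if not listed."""
    wanted = {"highPowerConsumption": high_power_consumption, "realtime": realtime}
    for name, hints in WORKLOAD_HINTS.items():
        if hints == wanted:
            return name
    return None


print(setting_for(False, True))   # the regional datacenter row
print(setting_for(True, False))   # not listed in the table
```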

Configuring workload hints manually

Procedure

Create a PerformanceProfile appropriate for the environment’s hardware and topology as described in the table in “Understanding workload hints”. Adjust the profile to match the expected workload. In this example, we tune for the lowest possible latency.
Add the highPowerConsumption and realtime workload hints. Both are set to true here.
```
    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: workload-hints
    spec:
      ...
      workloadHints:
        highPowerConsumption: true (1)
        realtime: true (2)
```
1 If highPowerConsumption is true, the node is tuned for very low latency at the cost of increased power consumption.
2 Disables some debugging and monitoring features that can affect system latency.

Additional resources

For information on using the Performance Profile Creator (PPC) to generate a performance profile, see Creating a performance profile.

Restricting CPUs for infra and application containers

Generic housekeeping and workload tasks use CPUs in a way that may impact latency-sensitive processes. By default, the container runtime uses all online CPUs to run all containers together, which can result in context switches and spikes in latency. Partitioning the CPUs prevents noisy processes from interfering with latency-sensitive processes by separating them from each other. The following table describes how processes run on a CPU after you have tuned the node using the Node Tuning Operator:

Table 1. Process’ CPU assignments
Process type	Details
`Burstable` and `BestEffort` pods	Runs on any CPU except where low latency workload is running
Infrastructure pods	Runs on any CPU except where low latency workload is running
Interrupts	Redirects to reserved CPUs (optional in OKD 4.7 and later)
Kernel processes	Pins to reserved CPUs
Latency-sensitive workload pods	Pins to a specific set of exclusive CPUs from the isolated pool
OS processes/systemd services	Pins to reserved CPUs

The allocatable capacity of cores on a node for pods of all QoS process types, Burstable, BestEffort, or Guaranteed, is equal to the capacity of the isolated pool. The capacity of the reserved pool is removed from the node’s total core capacity for use by the cluster and operating system housekeeping duties.

Example 1

A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 25 cores to QoS Guaranteed pods and 25 cores for BestEffort or Burstable pods. This matches the capacity of the isolated pool.

Example 2

A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 50 cores to QoS Guaranteed pods and one core for BestEffort or Burstable pods. This exceeds the capacity of the isolated pool by one core. Pod scheduling fails because of insufficient CPU capacity.
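
The arithmetic behind both examples can be sketched as follows. This is a simplified model that only compares whole-core requests against the size of the isolated pool; it is not how the scheduler is implemented, and the `fits_isolated_pool` name is hypothetical.

```python
def fits_isolated_pool(isolated_cores, guaranteed_cores, other_cores):
    """Compare the total requested cores with the isolated pool capacity.

    other_cores covers BestEffort and Burstable pods.
    Returns (fits, shortfall), where shortfall is 0 when the request fits.
    """
    requested = guaranteed_cores + other_cores
    shortfall = max(0, requested - isolated_cores)
    return shortfall == 0, shortfall


# Example 1: 25 + 25 cores requested against a 50-core isolated pool.
example_1 = fits_isolated_pool(50, 25, 25)

# Example 2: 50 + 1 cores requested against the same pool.
example_2 = fits_isolated_pool(50, 50, 1)

print(example_1, example_2)
```

In the second case the one-core shortfall corresponds to the scheduling failure described in Example 2.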

The exact partitioning pattern to use depends on many factors like hardware, workload characteristics and the expected system load. Some sample use cases are as follows:

If the latency-sensitive workload uses specific hardware, such as a network interface controller (NIC), ensure that the CPUs in the isolated pool are as close as possible to this hardware. At a minimum, you should place the workload in the same Non-Uniform Memory Access (NUMA) node.
The reserved pool is used for handling all interrupts. When depending on system networking, allocate a sufficiently sized reserved pool to handle all the incoming packet interrupts. In 4.11 and later versions, workloads can optionally be labeled as sensitive.

The decision regarding which specific CPUs should be used for reserved and isolated partitions requires detailed analysis and measurements. Factors like NUMA affinity of devices and memory play a role. The selection also depends on the workload architecture and the specific use case.

The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.

To ensure that housekeeping tasks and workloads do not interfere with each other, specify two groups of CPUs in the spec section of the performance profile.

isolated - Specifies the CPUs for the application container workloads. These CPUs have the lowest latency. Processes in this group have no interruptions and can, for example, reach much higher DPDK zero packet loss bandwidth.
reserved - Specifies the CPUs for the cluster and operating system housekeeping duties. Threads in the reserved group are often busy. Do not run latency-sensitive applications in the reserved group. Latency-sensitive applications run in the isolated group.

Procedure

Create a performance profile appropriate for the environment’s hardware and topology.

Add the reserved and isolated parameters with the CPUs you want reserved and isolated for the infra and application containers:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: infra-cpus
spec:
  cpu:
    reserved: "0-4,9" (1)
    isolated: "5-8" (2)
  nodeSelector: (3)
    node-role.kubernetes.io/worker: ""

1	Specify which CPUs are for infra containers to perform cluster and operating system housekeeping duties.
2	Specify which CPUs are for application containers to run workloads.
3	Optional: Specify a node selector to apply the performance profile to specific nodes.

Reducing NIC queues using the Node Tuning Operator

The Node Tuning Operator allows you to adjust the network interface controller (NIC) queue count for each network device by configuring the performance profile. Device network queues allow the distribution of packets among different physical queues, and each queue gets a separate thread for packet processing.

In real-time or low latency systems, all the unnecessary interrupt request lines (IRQs) pinned to the isolated CPUs must be moved to reserved or housekeeping CPUs.

In deployments with applications that require system or OKD networking, or in mixed deployments with Data Plane Development Kit (DPDK) workloads, multiple queues are needed to achieve good throughput, and the number of NIC queues should be adjusted or remain unchanged. For example, to achieve low latency, the number of NIC queues for DPDK-based workloads should be reduced to just the number of reserved or housekeeping CPUs.

By default, too many queues are created for each CPU, and these do not fit into the interrupt tables for housekeeping CPUs when tuning for low latency. Reducing the number of queues makes proper tuning possible. A smaller number of queues means fewer interrupts, which then fit in the IRQ table.

In earlier versions of OKD, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OKD 4.11, these functions are part of the Node Tuning Operator.

Adjusting the NIC queues with the performance profile

The performance profile lets you adjust the queue count for each network device.

Supported network devices:

Non-virtual network devices
Network devices that support multiple queues (channels)

Unsupported network devices:

Pure software network interfaces
Block devices
Intel DPDK virtual functions

Prerequisites

Access to the cluster as a user with the cluster-admin role.
Install the OpenShift CLI (oc).

Procedure

Log in to the OKD cluster running the Node Tuning Operator as a user with cluster-admin privileges.
Create and apply a performance profile appropriate for your hardware and topology. For guidance on creating a profile, see the “Creating a performance profile” section.
Edit this created performance profile:
```
$ oc edit -f <your_profile_name>.yaml
```

Populate the spec field with the net object. The object list can contain two fields:

userLevelNetworking is a required field specified as a boolean flag. If userLevelNetworking is true, the queue count is set to the reserved CPU count for all supported devices. The default is false.

devices is an optional field specifying a list of devices that will have the queues set to the reserved CPU count. If the device list is empty, the configuration applies to all network devices. The configuration is as follows:

interfaceName: This field specifies the interface name, and it supports shell-style wildcards, which can be positive or negative.
- Example wildcard syntax is as follows: <string> .*
- Negative rules are prefixed with an exclamation mark. To apply the net queue changes to all devices other than the excluded list, use !<device>, for example, !eno1.
vendorID: The network device vendor ID represented as a 16-bit hexadecimal number with a 0x prefix.

deviceID: The network device ID (model) represented as a 16-bit hexadecimal number with a 0x prefix.

When a deviceID is specified, the vendorID must also be defined. A device that matches all of the device identifiers specified in a device entry (interfaceName, vendorID, or a pair of vendorID plus deviceID) qualifies as a network device. This network device then has its net queues count set to the reserved CPU count.

When two or more devices are specified, the net queues count is set for any net device that matches one of them.
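
The matching rules above can be summarized in a short sketch. This models only the semantics described in this section (all identifiers within one entry must match, any entry may match, and an interface name prefixed with an exclamation mark is a negative rule); it is not the Node Tuning Operator implementation. The function names and the `eno1` sample device are hypothetical, and shell-style wildcards are approximated with Python's `fnmatch` module.

```python
from fnmatch import fnmatchcase


def name_matches(pattern, name):
    """Shell-style wildcard match; a leading '!' negates the rule."""
    if pattern.startswith("!"):
        return not fnmatchcase(name, pattern[1:])
    return fnmatchcase(name, pattern)


def entry_matches(entry, device):
    """A device must match every identifier given in a single entry."""
    if "deviceID" in entry and "vendorID" not in entry:
        raise ValueError("deviceID requires vendorID in the same entry")
    if "interfaceName" in entry and not name_matches(
        entry["interfaceName"], device["interfaceName"]
    ):
        return False
    for key in ("vendorID", "deviceID"):
        if key in entry and entry[key].lower() != device[key].lower():
            return False
    return True


def queues_adjusted(entries, device):
    """An empty list applies to all devices; otherwise any entry may match."""
    if not entries:
        return True
    return any(entry_matches(entry, device) for entry in entries)


ens4 = {"interfaceName": "ens4", "vendorID": "0x1af4", "deviceID": "0x1000"}
eno1 = {"interfaceName": "eno1", "vendorID": "0x8086", "deviceID": "0x1572"}

print(queues_adjusted([{"interfaceName": "!eno1"}], ens4))   # True
print(queues_adjusted([{"interfaceName": "!eno1"}], eno1))   # False
print(queues_adjusted([{"vendorID": "0x1af4", "deviceID": "0x1000"}], ens4))  # True
```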

Set the queue count to the reserved CPU count for all devices by using this example performance profile:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

Set the queue count to the reserved CPU count for all devices matching any of the defined device identifiers by using this example performance profile:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth0"
    - interfaceName: "eth1"
    - vendorID: "0x1af4"
      deviceID: "0x1000"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

Set the queue count to the reserved CPU count for all devices starting with the interface name eth by using this example performance profile:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth*"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

Set the queue count to the reserved CPU count for all devices with an interface named anything other than eno1 by using this example performance profile:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "!eno1"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

Set the queue count to the reserved CPU count for all devices that have an interface name eth0, vendorID of 0x1af4, and deviceID of 0x1000 by using this example performance profile:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,55-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth0"
      vendorID: "0x1af4"
      deviceID: "0x1000"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

Apply the updated performance profile:
```
$ oc apply -f <your_profile_name>.yaml
```

Additional resources

Creating a performance profile.

Verifying the queue status

In this section, a number of examples illustrate different performance profiles and how to verify the changes are applied.

Example 1

In this example, the net queue count is set to the reserved CPU count (2) for all supported devices.

The relevant section from the performance profile is:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1  #total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
# ...

Display the status of the queues associated with a device using the following command:

Run this command on the node where the performance profile was applied.
```
$ ethtool -l <device>
```

Verify the queue status before the profile is applied:

$ ethtool -l ens4

Example output

Channel parameters for ens4:
Pre-set maximums:
RX:         0
TX:         0
Other:      0
Combined:   4
Current hardware settings:
RX:         0
TX:         0
Other:      0
Combined:   4

Verify the queue status after the profile is applied:

$ ethtool -l ens4

Example output

Channel parameters for ens4:
Pre-set maximums:
RX:         0
TX:         0
Other:      0
Combined:   4
Current hardware settings:
RX:         0
TX:         0
Other:      0
Combined:   2 (1)

1	The combined channel shows that the total count of reserved CPUs for all supported devices is 2. This matches what is configured in the performance profile.
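
The comparison between the current combined channel count and the reserved CPU count can be automated. The following is a minimal sketch that parses captured `ethtool -l` text; the `current_combined_channels` helper is a hypothetical name, the sample text mirrors the example output for `ens4`, and the reserved CPU list is assumed to be `0-1` as in the profile for this example.

```python
ETHTOOL_OUTPUT = """\
Channel parameters for ens4:
Pre-set maximums:
RX:         0
TX:         0
Other:      0
Combined:   4
Current hardware settings:
RX:         0
TX:         0
Other:      0
Combined:   2
"""


def current_combined_channels(text):
    """Return the Combined value from the 'Current hardware settings' block.

    Returns None if the value is not a plain number (for example "n/a").
    """
    in_current = False
    for line in text.splitlines():
        if line.startswith("Current hardware settings"):
            in_current = True
        elif in_current and line.startswith("Combined:"):
            value = line.split(":", 1)[1].strip()
            return int(value) if value.isdigit() else None
    raise ValueError("no 'Current hardware settings' block found")


reserved_cpus = [0, 1]  # reserved: 0-1 in the example profile
combined = current_combined_channels(ETHTOOL_OUTPUT)
print(f"combined={combined}, reserved={len(reserved_cpus)}, "
      f"match={combined == len(reserved_cpus)}")
```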

Example 2

In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices with a specific vendorID.

The relevant section from the performance profile is:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1  #total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
    devices:
    - vendorID: "0x1af4"
# ...

Display the status of the queues associated with a device using the following command:

Run this command on the node where the performance profile was applied.
```
$ ethtool -l <device>
```

Verify the queue status after the profile is applied:

$ ethtool -l ens4

Example output

Channel parameters for ens4:
Pre-set maximums:
RX:         0
TX:         0
Other:      0
Combined:   4
Current hardware settings:
RX:         0
TX:         0
Other:      0
Combined:   2 (1)

1	The total count of reserved CPUs for all supported devices with `vendorID=0x1af4` is 2. For example, if there is another network device `ens2` with `vendorID=0x1af4` it will also have total net queues of 2. This matches what is configured in the performance profile.

Example 3

In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices that match any of the defined device identifiers.

The command udevadm info provides a detailed report on a device. In this example the devices are:

# udevadm info -p /sys/class/net/ens4
...
E: ID_MODEL_ID=0x1000
E: ID_VENDOR_ID=0x1af4
E: INTERFACE=ens4
...

# udevadm info -p /sys/class/net/eth0
...
E: ID_MODEL_ID=0x1002
E: ID_VENDOR_ID=0x1001
E: INTERFACE=eth0
...

Set the net queues to 2 for a device with interfaceName equal to eth0 and any devices that have a vendorID=0x1af4 with the following performance profile:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1  #total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
    devices:
    - interfaceName: "eth0"
    - vendorID: "0x1af4"
# ...

Verify the queue status after the profile is applied:

$ ethtool -l ens4

Example output

Channel parameters for ens4:
Pre-set maximums:
RX:         0
TX:         0
Other:      0
Combined:   4
Current hardware settings:
RX:         0
TX:         0
Other:      0
Combined:   2 (1)

1 The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is set to 2. For example, if there is another network device ens2 with vendorID=0x1af4, it will also have the total net queues set to 2. Similarly, a device with interfaceName equal to eth0 will have total net queues set to 2.

Logging associated with adjusting NIC queues

Log messages detailing the assigned devices are recorded in the respective Tuned daemon logs. The following messages might be recorded to the /var/log/tuned/tuned.log file:

An INFO message is recorded detailing the successfully assigned devices:

INFO tuned.plugins.base: instance net_test (net): assigning devices ens1, ens2, ens3

A WARNING message is recorded if none of the devices can be assigned:

WARNING  tuned.plugins.base: instance net_test: no matching devices available
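
When many nodes are involved, these two message types can be extracted from the log text with a small script. The following is a sketch that assumes the message wording shown above; the regular expressions and the `summarize_tuned_log` name are illustrative, and real log lines might carry additional prefixes such as timestamps, which the search-based matching tolerates.

```python
import re

ASSIGN_RE = re.compile(
    r"instance (?P<instance>\S+) \((?P<plugin>\w+)\): assigning devices (?P<devices>.+)$"
)
NO_MATCH_RE = re.compile(
    r"instance (?P<instance>[^\s:]+): no matching devices available"
)


def summarize_tuned_log(lines):
    """Return (assigned, unmatched): devices per instance, and instances with none."""
    assigned = {}
    unmatched = []
    for line in lines:
        found = ASSIGN_RE.search(line)
        if found:
            devices = [d.strip() for d in found.group("devices").split(",")]
            assigned[found.group("instance")] = devices
            continue
        found = NO_MATCH_RE.search(line)
        if found:
            unmatched.append(found.group("instance"))
    return assigned, unmatched


sample = [
    "INFO tuned.plugins.base: instance net_test (net): assigning devices ens1, ens2, ens3",
    "WARNING  tuned.plugins.base: instance net_test: no matching devices available",
]
assigned, unmatched = summarize_tuned_log(sample)
print(assigned)
print(unmatched)
```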

Performing end-to-end tests for platform verification

The Cloud-native Network Functions (CNF) tests image is a containerized test suite that validates features required to run CNF payloads. You can use this image to validate a CNF-enabled OpenShift cluster where all the components required for running CNF workloads are installed.

The tests run by the image are split into three different phases:

Simple cluster validation
Setup
End to end tests

The validation phase checks that all the features required to be tested are deployed correctly on the cluster.

Validations include:

Targeting a machine config pool that belongs to the machines to be tested
Enabling SCTP on the nodes
Enabling xt_u32 kernel module via machine config
Having the SR-IOV Operator installed
Having the PTP Operator installed
Enabling the contain-mount-namespace mode via machine config
Using OVN-Kubernetes as the cluster network provider

Latency tests, a part of the CNF-test container, also require the same validations. For more information about running a latency test, see the Running the latency tests section.

The tests need to perform an environment configuration every time they are executed. This involves items such as creating SR-IOV node policies, performance profiles, or PTP profiles. Allowing the tests to configure an already configured cluster might affect the functionality of the cluster. Also, changes to configuration items such as SR-IOV node policy might result in the environment being temporarily unavailable until the configuration change is processed.

Prerequisites

The test entrypoint is /usr/bin/test-run.sh. It runs both a setup test set and the real conformance test suite. The minimum requirement is to provide it with a kubeconfig file and its related $KUBECONFIG environment variable, mounted through a volume.
The tests assume that a given feature is already available on the cluster in the form of an Operator, flags enabled on the cluster, or machine configs.

Some tests require a pre-existing machine config pool to append their changes to. This must be created on the cluster before running the tests.

The default worker pool is worker-cnf and can be created with the following manifest:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-cnf
  labels:
    machineconfiguration.openshift.io/role: worker-cnf
spec:
  machineConfigSelector:
    matchExpressions:
      - {
          key: machineconfiguration.openshift.io/role,
          operator: In,
          values: [worker-cnf, worker],
        }
  paused: false
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-cnf: ""

You can use the ROLE_WORKER_CNF variable to override the worker pool name:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e ROLE_WORKER_CNF=custom-worker-pool registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh

Currently, not all tests run selectively on the nodes belonging to the pool.

Dry run

Use this command to run in dry-run mode. This is useful for checking what is in the test suite and provides output for all of the tests the image would run.

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh -ginkgo.dryRun -ginkgo.v

Disconnected mode

The CNF tests image supports running tests in a disconnected cluster, meaning a cluster that cannot reach external registries. This is done in two steps:

Performing the mirroring.
Instructing the tests to consume the images from a custom registry.

Mirroring the images to a custom registry accessible from the cluster

A mirror executable is shipped in the image to provide the input required by oc to mirror the images needed to run the tests to a local registry.

Run this command from an intermediate machine that has access both to the cluster and to registry.redhat.io over the internet:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/mirror -registry my.local.registry:5000/ |  oc image mirror -f -

Then, follow the instructions in the following section about overriding the registry used to fetch the images.

Instruct the tests to consume those images from a custom registry

This is done by setting the IMAGE_REGISTRY environment variable:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e IMAGE_REGISTRY="my.local.registry:5000/" -e CNF_TESTS_IMAGE="custom-cnf-tests-image:latest" registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh

Mirroring to the cluster internal registry

OKD provides a built-in container image registry, which runs as a standard workload on the cluster.

Procedure

Gain external access to the registry by exposing it with a route:

$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge

Fetch the registry endpoint:

REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')

Create a namespace for exposing the images:
```
$ oc create ns cnftests
```

Make that image stream available to all the namespaces used for tests. This is required to allow the tests namespaces to fetch the images from the cnftests image stream.

$ oc policy add-role-to-user system:image-puller system:serviceaccount:sctptest:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:dpdk-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:sriov-conformance-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:xt-u32-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:vrf-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:gatekeeper-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:ovs-qos-testing:default --namespace=cnftests

Retrieve the docker secret name and auth token:

SECRET=$(oc -n cnftests get secret | grep builder-docker | awk {'print $1'})
TOKEN=$(oc -n cnftests get secret $SECRET -o jsonpath="{.data['\.dockercfg']}" | base64 --decode | jq '.["image-registry.openshift-image-registry.svc:5000"].auth')

Write a dockerauth.json similar to this:

echo "{\"auths\": { \"$REGISTRY\": { \"auth\": $TOKEN } }}" > dockerauth.json
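
The structure of the resulting dockerauth.json file is easier to see when it is built as a data structure. The following sketch produces the same shape of document; the registry host name and token value are placeholders invented for illustration, and the `docker_auth_config` name is hypothetical.

```python
import json


def docker_auth_config(registry, token):
    """Build the auths document: one entry keyed by the registry host."""
    return {"auths": {registry: {"auth": token}}}


# Placeholder values; in the procedure these come from $REGISTRY and $TOKEN.
registry = "default-route-openshift-image-registry.apps.example.com"
token = "dXNlcjpwYXNzd29yZA=="

config = docker_auth_config(registry, token)
print(json.dumps(config, indent=2))
```

Note that in the shell version the token already carries its surrounding quotes, because `jq` prints the value as a JSON string; here `json.dumps` adds them.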

Do the mirroring:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/mirror -registry $REGISTRY/cnftests |  oc image mirror --insecure=true -a=$(pwd)/dockerauth.json -f -

Run the tests:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests cnf-tests-local:latest /usr/bin/test-run.sh

Mirroring a different set of images

Procedure

The mirror command tries to mirror the upstream images by default. This can be overridden by passing a file with the following format to the image:

[
    {
        "registry": "public.registry.io:5000",
        "image": "imageforcnftests:4.11"
    },
    {
        "registry": "public.registry.io:5000",
        "image": "imagefordpdk:4.11"
    }
]

Save the file locally, for example as images.json, and pass it to the mirror command. With the following command, the local path is mounted at /kubeconfig inside the container, so the file can be passed to the mirror command from there.

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/mirror --registry "my.local.registry:5000/" --images "/kubeconfig/images.json" |  oc image mirror -f -
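
Before passing a custom image list to the mirror command, it can help to check that the file has the expected shape. The following sketch assumes that each entry needs exactly the two string keys shown in the sample format, `registry` and `image`; this is inferred from the sample rather than from a published schema, and the `validate_image_list` name is hypothetical.

```python
import json

REQUIRED_KEYS = ("registry", "image")

IMAGES_JSON = """
[
    {"registry": "public.registry.io:5000", "image": "imageforcnftests:4.11"},
    {"registry": "public.registry.io:5000", "image": "imagefordpdk:4.11"}
]
"""


def validate_image_list(entries):
    """Raise ValueError unless entries is a list of {registry, image} objects."""
    if not isinstance(entries, list):
        raise ValueError("top-level value must be a list")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} must be an object")
        for key in REQUIRED_KEYS:
            if not isinstance(entry.get(key), str) or not entry[key]:
                raise ValueError(f"entry {index} needs a non-empty '{key}' string")
    return entries


images = validate_image_list(json.loads(IMAGES_JSON))
for entry in images:
    print(f"{entry['registry']}/{entry['image']}")
```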

Running in a single-node cluster

Running tests on a single-node cluster causes the following limitations to be imposed:

Longer timeouts for certain tests, including SR-IOV and SCTP tests
Tests requiring master and worker nodes are skipped

Longer timeouts concern SR-IOV and SCTP tests. Reconfiguration that requires node reboots causes a reboot of the entire environment, including the OpenShift control plane, and therefore takes longer to complete. All PTP tests requiring a master and worker node are skipped. No additional configuration is needed because the tests check for the number of nodes at startup and adjust test behavior accordingly.

PTP tests can run in Discovery mode. The tests look for a PTP master configured outside of the cluster.

For more information, see the Discovery mode section.

To enable Discovery mode, the tests must be instructed by setting the DISCOVERY_MODE environment variable as follows:

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

Required parameters

ROLE_WORKER_CNF=master - Required because master is the only machine pool to which the node will belong.
XT_U32TEST_HAS_NON_CNF_WORKERS=false - Required to instruct the xt_u32 negative test to skip because there are only nodes where the module is loaded.
SCTPTEST_HAS_NON_CNF_WORKERS=false - Required to instruct the SCTP negative test to skip because there are only nodes where the module is loaded.
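
The required parameters are passed to the container as environment variables, in the same way as the other variables in this section. The following sketch assembles the corresponding `docker run` argument list; the `docker_run_args` helper and the `/home/user/cluster` directory are hypothetical, and the image name is the one used in the Discovery mode example above.

```python
SINGLE_NODE_ENV = {
    "ROLE_WORKER_CNF": "master",
    "XT_U32TEST_HAS_NON_CNF_WORKERS": "false",
    "SCTPTEST_HAS_NON_CNF_WORKERS": "false",
}


def docker_run_args(image, kubeconfig_dir, env):
    """Build the docker run argument list with one -e flag per variable."""
    args = [
        "docker", "run",
        "-v", f"{kubeconfig_dir}:/kubeconfig:Z",
        "-e", "KUBECONFIG=/kubeconfig/kubeconfig",
    ]
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    args += [image, "/usr/bin/test-run.sh"]
    return args


command = docker_run_args(
    "registry.redhat.io/openshift-kni/cnf-tests",
    "/home/user/cluster",  # placeholder for the directory holding the kubeconfig file
    SINGLE_NODE_ENV,
)
print(" ".join(command))
```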

Impact of tests on the cluster

Depending on the feature, running the test suite could cause different impacts on the cluster. In general, only the SCTP tests do not change the cluster configuration. All of the other features have various impacts on the configuration.

SCTP

SCTP tests just run different pods on different nodes to check connectivity. The impacts on the cluster are related to running simple pods on two nodes.

XT_U32

XT_U32 tests run pods on different nodes to check iptables rules that use xt_u32. The impacts on the cluster are related to running simple pods on two nodes.

SR-IOV

SR-IOV tests require changes in the SR-IOV network configuration, where the tests create and destroy different types of configuration.

This might have an impact if existing SR-IOV network configurations are already installed on the cluster, because there may be conflicts depending on the priority of such configurations.

At the same time, the result of the tests might be affected by existing configurations.

PTP

PTP tests apply a PTP configuration to a set of nodes of the cluster. As with SR-IOV, this might conflict with any existing PTP configuration already in place, with unpredictable results.

Performance

Performance tests apply a performance profile to the cluster. The effect of this is changes in the node configuration, reserving CPUs, allocating memory huge pages, and setting the kernel packages to be realtime. If an existing profile named performance is already available on the cluster, the tests do not deploy it.

DPDK

DPDK relies on both performance and SR-IOV features, so the test suite configures both a performance profile and SR-IOV networks. The impacts are therefore the same as those described in SR-IOV testing and performance testing.

Container-mount-namespace

The validation test for container-mount-namespace mode only checks that the appropriate MachineConfig objects are present and active, and has no additional impact on the node.

Cleaning up

After running the test suite, all the dangling resources are cleaned up.

Override test image parameters

Depending on the requirements, the tests can use different images. There are two images used by the tests that can be changed using the following environment variables:

CNF_TESTS_IMAGE
DPDK_TESTS_IMAGE

For example, to change the CNF_TESTS_IMAGE with a custom registry run the following command:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e CNF_TESTS_IMAGE="custom-cnf-tests-image:latest" registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh

Ginkgo parameters

The Ginkgo BDD (Behavior-Driven Development) framework serves as the base for the test suite. This means that it accepts parameters for filtering or skipping tests.

You can use the -ginkgo.focus parameter to filter a set of tests:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh -ginkgo.focus="performance|sctp"

To run only the latency test, you must provide the -ginkgo.focus parameter and the PERF_TEST_PROFILE environment variable, which contains the name of the PerformanceProfile that needs to be tested. For example:

$ docker run --rm -v $KUBECONFIG:/kubeconfig -e KUBECONFIG=/kubeconfig -e LATENCY_TEST_RUN=true -e LATENCY_TEST_RUNTIME=600 -e OSLAT_MAXIMUM_LATENCY=20 -e PERF_TEST_PROFILE=<performance_profile_name> registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\[config\]|\[performance\]\ Latency\ Test"
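The following is a minimal sketch of how the focus expression in the command above selects specs. It assumes that Python's re module behaves like the Go regular expressions that Ginkgo uses for this simple pattern. The first spec description is taken from the example output in this document; the other two are hypothetical names that only illustrate the filtering.

```python
import re

# The value that reaches -ginkgo.focus after the shell removes the double quotes.
FOCUS = r"\[performance\]\[config\]|\[performance\]\ Latency\ Test"

specs = [
    # Taken from the example output in this document.
    "[performance] Latency Test with the hwlatdetect image should succeed",
    # Hypothetical spec names, used only to illustrate the filter.
    "[performance][config] hypothetical configuration spec",
    "[sctp] hypothetical SCTP spec",
]

pattern = re.compile(FOCUS)
# A spec is selected when the pattern matches anywhere in its description.
selected = [name for name in specs if pattern.search(name)]

for name in selected:
    print(name)
```

The escaped brackets matter: without the backslashes, [performance] would be read as a character class rather than as the literal label.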

One particular test requires both SR-IOV and SCTP. Because the focus parameter is selective, specifying only the sriov matcher is enough to trigger this test. If you run the tests against a cluster where SR-IOV is installed but SCTP is not, add the -ginkgo.skip=SCTP parameter so that the tests skip SCTP testing.

Available features

You can filter on the following features:

performance
sriov
ptp
sctp
xt_u32
dpdk
container-mount-namespace
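As an illustration, the feature names above can be combined into a -ginkgo.focus argument by joining them with the regular expression alternation character. This is a sketch only; the build_focus helper is a hypothetical name and is not part of the test suite.

```python
# Feature names as listed in this section.
AVAILABLE_FEATURES = {
    "performance", "sriov", "ptp", "sctp", "xt_u32", "dpdk",
    "container-mount-namespace",
}


def build_focus(features):
    """Return a -ginkgo.focus argument that matches any of the given features.

    Hypothetical helper: it only validates the names and joins them with "|".
    """
    unknown = [f for f in features if f not in AVAILABLE_FEATURES]
    if unknown:
        raise ValueError(f"unknown feature(s): {', '.join(unknown)}")
    return "-ginkgo.focus=" + "|".join(features)


focus_arg = build_focus(["performance", "sctp"])
print(focus_arg)
```

When the argument is typed directly in a shell, quote the expression as in the earlier example, because an unquoted | would be interpreted as a pipe.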

Discovery mode

Discovery mode allows you to validate the functionality of a cluster without altering its configuration. Existing environment configurations are used for the tests. The tests attempt to find the configuration items needed and use those items to execute the tests. If resources needed to run a specific test are not found, the test is skipped, providing an appropriate message to the user. After the tests are finished, no cleanup of the pre-configured configuration items is done, and the test environment can be immediately used for another test run.

Some configuration items are still created by the tests. These are specific items needed for a test to run; for example, an SR-IOV network. These configuration items are created in custom namespaces and are cleaned up after the tests are executed.

Discovery mode also reduces test run times. Because the configuration items already exist, no time is needed for environment configuration and stabilization.

To enable discovery mode, set the DISCOVERY_MODE environment variable as follows:

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
DISCOVERY_MODE=true registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

Required environment configuration prerequisites

SR-IOV tests

Most SR-IOV tests require the following resources:

SriovNetworkNodePolicy.
At least one node with the resource specified by SriovNetworkNodePolicy being allocatable; a resource count of at least 5 is considered sufficient.

Some tests have additional requirements:

An unused device on the node with available policy resource, with link state DOWN and not a bridge slave.
A SriovNetworkNodePolicy with an MTU value of 9000.

DPDK tests

The DPDK related tests require:

A performance profile.
An SR-IOV policy.
A node that has resources available for the SR-IOV policy and that matches the PerformanceProfile node selector.

PTP tests

A slave PtpConfig (ptp4lOpts="-s", phc2sysOpts="-a -r").
A node with a label matching the slave PtpConfig.

SCTP tests

SriovNetworkNodePolicy.
A node matching both the SriovNetworkNodePolicy and a MachineConfig that enables SCTP.

XT_U32 tests

A node with a machine config that enables XT_U32.

Performance Operator tests

Various tests have different requirements. Some of them are:

A performance profile.
A performance profile having profile.Spec.CPU.Isolated = 1.
A performance profile having profile.Spec.RealTimeKernel.Enabled == true.
A node with no huge pages usage.

Container-mount-namespace tests

A node with a machine config that enables container-mount-namespace mode.

Limiting the nodes used during tests

The nodes on which the tests are executed can be limited by specifying a NODES_SELECTOR environment variable. Any resources created by the test are then limited to the specified nodes.

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
NODES_SELECTOR=node-role.kubernetes.io/worker-cnf registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

Using a single performance profile

The resources needed by the DPDK tests are higher than those required by the performance test suite. To make the execution faster, the performance profile used by tests can be overridden using one that also serves the DPDK test suite.

To do this, a profile like the following one can be mounted inside the container, and the performance tests can be instructed to deploy it.

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "4-15"
    reserved: "0-3"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 16
      node: 0
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
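A minimal sketch of how the isolated and reserved CPU lists in the profile above expand into individual CPU IDs. The parse_cpu_list helper is a hypothetical name. The overlap check reflects the assumption that the two pools are meant to be disjoint, which the example profile satisfies.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0-3" or "0,2,4-7" into a set of CPU IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            # Ranges are inclusive on both ends.
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values copied from the example PerformanceProfile.
isolated = parse_cpu_list("4-15")
reserved = parse_cpu_list("0-3")

# Assumption: a CPU should not appear in both pools.
overlap = isolated & reserved

print(f"isolated CPUs: {len(isolated)}")
print(f"reserved CPUs: {len(reserved)}")
print(f"overlapping CPUs: {sorted(overlap)}")
```

With the example values, 12 CPUs are available to application containers and 4 CPUs are left for infra containers and housekeeping.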

To override the performance profile used, the manifest must be mounted inside the container and the tests must be instructed by setting the PERFORMANCE_PROFILE_MANIFEST_OVERRIDE parameter as follows:

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
PERFORMANCE_PROFILE_MANIFEST_OVERRIDE=/kubeconfig/manifest.yaml registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

Disabling the performance profile cleanup

When not running in discovery mode, the suite cleans up all the created artifacts and configurations. This includes the performance profile.

When the performance profile is deleted, the machine config pool is modified and the nodes are rebooted. The next iteration then creates a new profile, which causes long test cycles between runs.

To speed up this process, set CLEAN_PERFORMANCE_PROFILE="false" to instruct the tests not to clean the performance profile. In this way, the next iteration will not need to create it and wait for it to be applied.

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
CLEAN_PERFORMANCE_PROFILE="false" registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

Running the latency tests

If the kubeconfig file is in the current folder, you can run the test suite by using the following command:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
DISCOVERY_MODE=true registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
/usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"

The -v option mounts the current folder into the container, which allows the running container to use the kubeconfig file.

You must run the latency tests in Discovery mode. The latency tests can change the configuration of your cluster if you do not run in Discovery mode.

In OKD 4.11, you can also run latency tests from the CNF-test container. The latency test allows you to validate node tuning for your workload.

Three tools measure the latency of the system:

hwlatdetect
cyclictest
oslat

Each tool has a specific use. Use the tools in sequence to achieve reliable test results.

The hwlatdetect tool measures the baseline that the bare metal hardware can achieve. Before proceeding with the next latency test, ensure that the number measured by hwlatdetect meets the required threshold because you cannot fix hardware latency spikes by operating system tuning.
The cyclictest tool verifies the real-time kernel scheduler latency after hwlatdetect passes validation. The cyclictest tool schedules a repeated timer and measures the difference between the desired and the actual trigger times. The difference can uncover basic issues with the tuning caused by interrupts or process priorities. The tool must run on a real-time kernel.
The oslat tool behaves similarly to a CPU-intensive DPDK application and measures all the interruptions and disruptions to the busy loop that simulates CPU heavy data processing.

By default, the latency tests are disabled. To enable the latency test, you must add the LATENCY_TEST_RUN environment variable to the test invocation and set its value to true. For example, LATENCY_TEST_RUN=true.

The test introduces the following environment variables:

LATENCY_TEST_DELAY

The variable specifies the amount of time in seconds after which the test starts running. You can use the variable to allow the CPU manager reconcile loop to update the default CPU pool. The default value is 0.

LATENCY_TEST_CPUS

The variable specifies the number of CPUs that the pod running the latency tests uses. If you do not set the variable, the default configuration includes all isolated CPUs.

LATENCY_TEST_RUNTIME

The variable specifies the amount of time in seconds that the latency test must run. The default value is 300 seconds.

HWLATDETECT_MAXIMUM_LATENCY

The variable specifies the maximum acceptable hardware latency in microseconds for the workload and operating system. If you do not set the value of HWLATDETECT_MAXIMUM_LATENCY or MAXIMUM_LATENCY, the hwlatdetect tool itself compares the actual maximum latency against the default expected threshold (20μs). The test then fails or succeeds accordingly.

CYCLICTEST_MAXIMUM_LATENCY

The variable specifies the maximum latency in microseconds that all threads expect before waking up during the cyclictest run. If you do not set the value of CYCLICTEST_MAXIMUM_LATENCY or MAXIMUM_LATENCY, the tool skips the comparison of the expected and the actual maximum latency.

OSLAT_MAXIMUM_LATENCY

The variable specifies the maximum acceptable latency in microseconds for the oslat test results. If you do not set the value of OSLAT_MAXIMUM_LATENCY or MAXIMUM_LATENCY, the tool skips the comparison of the expected and the actual maximum latency.

MAXIMUM_LATENCY

This is a unified variable you can apply for all the available latency tools.

A variable that is specific to certain tests has precedence over the unified variable.
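The precedence rules described above can be sketched as follows. This models only the documented behavior, not the actual test suite code, and effective_threshold is a hypothetical helper name. A return value of None stands for "the comparison is skipped".

```python
# Default threshold in microseconds that this document gives for hwlatdetect.
HWLATDETECT_DEFAULT_US = 20


def effective_threshold(tool, env):
    """Return the latency threshold in microseconds for a tool.

    Returns None when no comparison would be made.
    """
    # A tool-specific variable takes precedence over the unified one.
    specific = env.get(f"{tool.upper()}_MAXIMUM_LATENCY")
    if specific is not None:
        return int(specific)
    unified = env.get("MAXIMUM_LATENCY")
    if unified is not None:
        return int(unified)
    # Only hwlatdetect falls back to a default; the other tools skip the check.
    return HWLATDETECT_DEFAULT_US if tool == "hwlatdetect" else None


env = {"MAXIMUM_LATENCY": "20", "OSLAT_MAXIMUM_LATENCY": "15"}
thresholds = {
    tool: effective_threshold(tool, env)
    for tool in ("hwlatdetect", "cyclictest", "oslat")
}
print(thresholds)
```

In this example, oslat uses its own value of 15, while hwlatdetect and cyclictest fall back to the unified value of 20.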

You can use the -ginkgo.v flag to run the tests with verbosity.

You can use the -ginkgo.focus flag to run a specific test.

Running hwlatdetect

The hwlatdetect tool is available in the rt-kernel package with a regular subscription of Red Hat Enterprise Linux 8.

Prerequisites

You installed the real-time kernel
You logged into registry.redhat.io with your Customer Portal credentials

Procedure

Run the following command:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf -e \
LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20  registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
/usr/bin/test-run.sh -ginkgo.focus="hwlatdetect"

The command runs the hwlatdetect tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs), and the command line displays SUCCESS! when this test is completed.

For valid results, the test should run for at least 12 hours.
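As a rough sketch, the 12-hour recommendation translates into a LATENCY_TEST_RUNTIME value as shown below. The environment variables, image, and focus value are copied from the command above. Assembling the command in Python is only an illustration and is not part of the test suite.

```python
import os
import shlex

HOURS = 12
# LATENCY_TEST_RUNTIME is expressed in seconds.
runtime_seconds = HOURS * 60 * 60

env = {
    "KUBECONFIG": "/kubeconfig/kubeconfig",
    "LATENCY_TEST_RUN": "true",
    "DISCOVERY_MODE": "true",
    "ROLE_WORKER_CNF": "worker-cnf",
    "LATENCY_TEST_RUNTIME": str(runtime_seconds),
    "MAXIMUM_LATENCY": "20",
}

# Mount the current folder, which is assumed to contain the kubeconfig file.
argv = ["podman", "run", "-v", f"{os.getcwd()}/:/kubeconfig:Z"]
for name, value in env.items():
    argv += ["-e", f"{name}={value}"]
argv += [
    "registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11",
    "/usr/bin/test-run.sh",
    "-ginkgo.focus=hwlatdetect",
]

# Print a shell-quoted command line that can be reviewed before running it.
print(shlex.join(argv))
```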

If the results exceed the latency threshold, the test fails and you can see the following output:

Example failure output

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf -e \
LATENCY_TEST_RUNTIME=10 -e MAXIMUM_LATENCY=1  registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
/usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="hwlatdetect" (1)
running /usr/bin/validationsuite -ginkgo.v -ginkgo.focus=hwlatdetect
I0210 17:08:38.607699       7 request.go:668] Waited for 1.047200253s due to client-side throttling, not priority and fairness, request: GET:https://api.ocp.demo.lab:6443/apis/apps.openshift.io/v1?timeout=32s
Running Suite: CNF Features e2e validation
==========================================
Random Seed: 1644512917
Will run 0 of 48 specs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
Ran 0 of 48 Specs in 0.001 seconds
SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 48 Skipped
PASS
Discovery mode enabled, skipping setup
running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=hwlatdetect
I0210 17:08:41.179269      40 request.go:668] Waited for 1.046001096s due to client-side throttling, not priority and fairness, request: GET:https://api.ocp.demo.lab:6443/apis/storage.k8s.io/v1beta1?timeout=32s
Running Suite: CNF Features e2e integration tests
=================================================
Random Seed: 1644512920
Will run 1 of 151 specs
SSSSSSS
------------------------------
[performance] Latency Test with the hwlatdetect image
  should succeed
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:221
STEP: Waiting two minutes to download the latencyTest image
STEP: Waiting another two minutes to give enough time for the cluster to move the pod to Succeeded phase
Feb 10 17:10:56.045: [INFO]: found mcd machine-config-daemon-dzpw7 for node ocp-worker-0.demo.lab
Feb 10 17:10:56.259: [INFO]: found mcd machine-config-daemon-dzpw7 for node ocp-worker-0.demo.lab
Feb 10 17:11:56.825: [ERROR]: timed out waiting for the condition
• Failure [193.903 seconds]
[performance] Latency Test
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:60
  with the hwlatdetect image
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:213
    should succeed [It]
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:221
    Log file created at: 2022/02/10 17:08:45
    Running on machine: hwlatdetect-cd8b6
    Binary: Built with gc go1.16.6 for linux/amd64
    Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
    I0210 17:08:45.716288       1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-56fabc639a679b757ebae30e5f01b2ebd38e9fde9ecae91c41be41d3e89b37f8/vmlinuz-4.18.0-305.34.2.rt7.107.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu ostree=/ostree/boot.0/rhcos/56fabc639a679b757ebae30e5f01b2ebd38e9fde9ecae91c41be41d3e89b37f8/0 root=UUID=56731f4f-f558-46a3-85d3-d1b579683385 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=3-5 tuned.non_isolcpus=ffffffc7 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,3-5 systemd.cpu_affinity=0,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
    I0210 17:08:45.716782       1 node.go:44] Environment information: kernel version 4.18.0-305.34.2.rt7.107.el8_4.x86_64
    I0210 17:08:45.716861       1 main.go:50] running the hwlatdetect command with arguments [/usr/bin/hwlatdetect --threshold 1 --hardlimit 1 --duration 10 --window 10000000us --width 950000us]
    F0210 17:08:56.815204       1 main.go:53] failed to run hwlatdetect command; out: hwlatdetect:  test duration 10 seconds
       detector: tracer
       parameters:
            Latency threshold: 1us (2)
            Sample window:     10000000us
            Sample width:      950000us
         Non-sampling period:  9050000us
            Output File:       None
    Starting test
    test finished
    Max Latency: 24us (3)
    Samples recorded: 1
    Samples exceeding threshold: 1
    ts: 1644512927.163556381, inner:20, outer:24
    ; err: exit status 1
    goroutine 1 [running]:
    k8s.io/klog.stacks(0xc000010001, 0xc00012e000, 0x25b, 0x2710)
        /remote-source/app/vendor/k8s.io/klog/klog.go:875 +0xb9
    k8s.io/klog.(*loggingT).output(0x5bed00, 0xc000000003, 0xc0000121c0, 0x53ea81, 0x7, 0x35, 0x0)
        /remote-source/app/vendor/k8s.io/klog/klog.go:829 +0x1b0
    k8s.io/klog.(*loggingT).printf(0x5bed00, 0x3, 0x5082da, 0x33, 0xc000113f58, 0x2, 0x2)
        /remote-source/app/vendor/k8s.io/klog/klog.go:707 +0x153
    k8s.io/klog.Fatalf(...)
        /remote-source/app/vendor/k8s.io/klog/klog.go:1276
    main.main()
        /remote-source/app/cnf-tests/pod-utils/hwlatdetect-runner/main.go:53 +0x897
    goroutine 6 [chan receive]:
    k8s.io/klog.(*loggingT).flushDaemon(0x5bed00)
        /remote-source/app/vendor/k8s.io/klog/klog.go:1010 +0x8b
    created by k8s.io/klog.init.0
        /remote-source/app/vendor/k8s.io/klog/klog.go:411 +0xd8
    goroutine 7 [chan receive]:
    k8s.io/klog/v2.(*loggingT).flushDaemon(0x5bede0)
        /remote-source/app/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
    created by k8s.io/klog/v2.init.0
        /remote-source/app/vendor/k8s.io/klog/v2/klog.go:420 +0xdf
    Unexpected error:
        <*errors.errorString | 0xc000418ed0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    occurred
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:433
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
JUnit report was created: /junit.xml/cnftests-junit.xml
Summarizing 1 Failure:
[Fail] [performance] Latency Test with the hwlatdetect image [It] should succeed
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:433
Ran 1 of 151 Specs in 222.254 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 150 Skipped
--- FAIL: TestTest (222.45s)
FAIL

1	The `podman` arguments you provided.
2	You can configure the latency threshold by using the `MAXIMUM_LATENCY` or the `HWLATDETECT_MAXIMUM_LATENCY` environment variables.
3	The maximum latency value measured during the test.

Capturing the results

You can capture the following types of results:

Rough results that are gathered after each run to create a history of the impact of any changes made throughout the test
The combined set of the rough tests with the best results and configuration settings

Example of good results

hwlatdetect: test duration 3600 seconds
detector: tracer
parameters:
Latency threshold: 10us
Sample window: 1000000us
Sample width: 950000us
Non-sampling period: 50000us
Output File: None
Starting test
test finished
Max Latency: Below threshold
Samples recorded: 0

The hwlatdetect tool only provides output if the sample exceeds the specified threshold.

Example of bad results

hwlatdetect: test duration 3600 seconds
detector: tracer
parameters:
Latency threshold: 10us
Sample window: 1000000us
Sample width: 950000us
Non-sampling period: 50000us
Output File: None
Starting test
ts: 1610542421.275784439, inner:78, outer:81
ts: 1610542444.330561619, inner:27, outer:28
ts: 1610542445.332549975, inner:39, outer:38
ts: 1610542541.568546097, inner:47, outer:32
ts: 1610542590.681548531, inner:13, outer:17
ts: 1610543033.818801482, inner:29, outer:30
ts: 1610543080.938801990, inner:90, outer:76
ts: 1610543129.065549639, inner:28, outer:39
ts: 1610543474.859552115, inner:28, outer:35
ts: 1610543523.973856571, inner:52, outer:49
ts: 1610543572.089799738, inner:27, outer:30
ts: 1610543573.091550771, inner:34, outer:28
ts: 1610543574.093555202, inner:116, outer:63

The output of hwlatdetect shows that multiple samples exceed the threshold.

However, the same output can indicate different results based on the following factors:

The duration of the test
The number of CPU cores
The BIOS settings

Before proceeding with the next latency test, ensure that the number measured by hwlatdetect meets the required threshold. Fixing latencies introduced by hardware might require you to contact the support of your system vendor.
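To compare runs, the sample lines in the hwlatdetect output can be reduced to a single worst-case value. The following is a minimal sketch that assumes the ts/inner/outer line format shown in the examples above. The worst_sample helper is a hypothetical name, and the sample lines are copied from this document.

```python
import re

# Matches lines such as "ts: 1610542421.275784439, inner:78, outer:81".
SAMPLE_RE = re.compile(r"ts:\s*(\d+\.\d+),\s*inner:(\d+),\s*outer:(\d+)")


def worst_sample(output):
    """Return the largest inner or outer latency found in the output.

    Returns None when the output contains no sample lines.
    """
    worst = None
    for _ts, inner, outer in SAMPLE_RE.findall(output):
        value = max(int(inner), int(outer))
        if worst is None or value > worst:
            worst = value
    return worst


bad_output = """\
ts: 1610542421.275784439, inner:78, outer:81
ts: 1610543080.938801990, inner:90, outer:76
ts: 1610543574.093555202, inner:116, outer:63
"""
good_output = "Max Latency: Below threshold\nSamples recorded: 0\n"
failure_output = "ts: 1644512927.163556381, inner:20, outer:24\n"

print(worst_sample(bad_output))
print(worst_sample(good_output))
print(worst_sample(failure_output))
```

For the sample line from the example failure output, the sketch yields 24, which agrees with the Max Latency value that the tool reported for that run.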

Running cyclictest

The cyclictest tool measures the real-time kernel scheduler latency on the specified CPUs.

Prerequisites

You logged into registry.redhat.io with your Customer Portal credentials
You installed the real-time kernel
You applied the performance profile by using the Node Tuning Operator

Procedure

To perform the cyclictest, run the following command:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf -e \
LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh -ginkgo.focus="cyclictest"

The command runs the cyclictest tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs), and the command line displays SUCCESS! when this test is completed.

For valid results, the test should run for at least 12 hours.

If the results exceed the latency threshold, the test fails and you can see the following output:

Example failure output

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
PERF_TEST_PROFILE=<performance_profile_name> -e ROLE_WORKER_CNF=worker-cnf -e \
LATENCY_TEST_RUN=true -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 -e \
LATENCY_TEST_CPUS=10 -e DISCOVERY_MODE=true \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh \
-ginkgo.v -ginkgo.focus="cyclictest" (1)
Discovery mode enabled, skipping setup
running /usr/bin//cnftests -ginkgo.v -ginkgo.focus=cyclictest
I0811 15:02:36.350033      20 request.go:668] Waited for 1.049965918s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/machineconfiguration.openshift.io/v1?timeout=32s
Running Suite: CNF Features e2e integration tests
=================================================
Random Seed: 1628694153
Will run 1 of 138 specs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[performance] Latency Test with the cyclictest image
  should succeed
  /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:200
STEP: Waiting two minutes to download the latencyTest image
STEP: Waiting another two minutes to give enough time for the cluster to move the pod to Succeeded phase
Aug 11 15:03:06.826: [INFO]: found mcd machine-config-daemon-wf4w8 for node cnfdc8.clus2.t5g.lab.eng.bos.redhat.com
• Failure [22.527 seconds]
[performance] Latency Test
/go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:84
  with the cyclictest image
  /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:188
    should succeed [It]
    /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:200
    The current latency 17 is bigger than the expected one 20 (2)
    Expected
        <bool>: false
    to be true
    /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:219
Log file created at: 2021/08/11 15:02:51
Running on machine: cyclictest-knk7d
Binary: Built with gc go1.16.6 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0811 15:02:51.092254       1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/vmlinuz-4.18.0-305.10.2.rt7.83.el8_4.x86_64 ip=dhcp random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/0 ignition.platform.id=openstack root=UUID=5a4ddf16-9372-44d9-ac4e-3ee329e16ab3 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=000000ff,ffffffff,ffffffff,fffffff1 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 default_hugepagesz=1G hugepagesz=2M hugepages=128 nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0
I0811 15:02:51.092427       1 node.go:44] Environment information: kernel version 4.18.0-305.10.2.rt7.83.el8_4.x86_64
I0811 15:02:51.092450       1 main.go:48] running the cyclictest command with arguments \
[-D 600 -95 1 -t 10 -a 2,4,6,8,10,54,56,58,60,62 -h 30 -i 1000 --quiet] (3)
I0811 15:03:06.147253       1 main.go:54] succeeded to run the cyclictest command: # /dev/cpu_dma_latency set to 0us
# Histogram
000000 000000    000000    000000    000000    000000    000000    000000    000000    000000    000000
000001 000000    005561    027778    037704    011987    000000    120755    238981    081847    300186
000002 587440    581106    564207    554323    577416    590635    474442    357940    513895    296033
000003 011751    011441    006449    006761    008409    007904    002893    002066    003349    003089
000004 000527    001079    000914    000712    001451    001120    000779    000283    000350    000251
More histogram entries ...
# Min Latencies: 00002 00001 00001 00001 00001 00002 00001 00001 00001 00001
# Avg Latencies: 00002 00002 00002 00001 00002 00002 00001 00001 00001 00001
# Max Latencies: 00018 00465 00361 00395 00208 00301 02052 00289 00327 00114 (4)
# Histogram Overflows: 00000 00220 00159 00128 00202 00017 00069 00059 00045 00120
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1: 01142 01439 05305 ... # 00190 others
# Thread 2: 20895 21351 30624 ... # 00129 others
# Thread 3: 01143 17921 18334 ... # 00098 others
# Thread 4: 30499 30622 31566 ... # 00172 others
# Thread 5: 145221 170910 171888 ...
# Thread 6: 01684 26291 30623 ... # 00039 others
# Thread 7: 28983 92112 167011 ... # 00029 others
# Thread 8: 45766 56169 56171 ... # 00015 others
# Thread 9: 02974 08094 13214 ... # 00090 others

1	The `podman` arguments you provided.
2	You can see the measured latency and the configured latency.
3	The arguments for the `cyclictest` command.
4	The maximum latencies measured on each thread.

Capturing the results

The same output can indicate different results for different workloads. For example, spikes up to 18μs are acceptable for 4G DU workloads but not for 5G DU workloads. Spikes above 20μs are not acceptable in any case.

Example of good results

running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m
# Histogram
000000 000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000
000001 000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000
000002 579506    535967    418614    573648    532870    529897    489306    558076    582350    585188    583793    223781    532480    569130    472250    576043
More histogram entries ...
# Total: 000600000 000600000 000600000 000599999 000599999 000599999 000599998 000599998 000599998 000599997 000599997 000599996 000599996 000599995 000599995 000599995
# Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Max Latencies: 00005 00005 00004 00005 00004 00004 00005 00005 00006 00005 00004 00005 00004 00004 00005 00004
# Histogram Overflows: 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1:
# Thread 2:
# Thread 3:
# Thread 4:
# Thread 5:
# Thread 6:
# Thread 7:
# Thread 8:
# Thread 9:
# Thread 10:
# Thread 11:
# Thread 12:
# Thread 13:
# Thread 14:
# Thread 15:

Example of bad results

running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m
# Histogram
000000 000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000
000001 000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000    000000
000002 564632    579686    354911    563036    492543    521983    515884    378266    592621    463547    482764    591976    590409    588145    589556    353518
More histogram entries ...
# Total: 000599999 000599999 000599999 000599997 000599997 000599998 000599998 000599997 000599997 000599996 000599995 000599996 000599995 000599995 000599995 000599993
# Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Max Latencies: 00493 00387 00271 00619 00541 00513 00009 00389 00252 00215 00539 00498 00363 00204 00068 00520
# Histogram Overflows: 00001 00001 00001 00002 00002 00001 00000 00001 00001 00001 00002 00001 00001 00001 00001 00002
# Histogram Overflow at cycle number:
# Thread 0: 155922
# Thread 1: 110064
# Thread 2: 110064
# Thread 3: 110063 155921
# Thread 4: 110063 155921
# Thread 5: 155920
# Thread 6:
# Thread 7: 110062
# Thread 8: 110062
# Thread 9: 155919
# Thread 10: 110061 155919
# Thread 11: 155918
# Thread 12: 155918
# Thread 13: 110060
# Thread 14: 110060
# Thread 15: 110059 155917
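The difference between the two cyclictest examples above can be checked mechanically by reading the Max Latencies line. This is a minimal sketch that assumes the summary format shown in the examples. The helper names are hypothetical, and the 20 μs threshold is the MAXIMUM_LATENCY value used in the commands in this section.

```python
THRESHOLD_US = 20  # same value as MAXIMUM_LATENCY in the example commands


def max_latencies(output):
    """Return the per-thread maximum latencies from a cyclictest summary."""
    for line in output.splitlines():
        if line.startswith("# Max Latencies:"):
            return [int(value) for value in line.split(":", 1)[1].split()]
    return []


def threads_over(values, threshold):
    """Return the indexes of the threads whose maximum exceeds the threshold."""
    return [index for index, value in enumerate(values) if value > threshold]


# Max Latencies lines copied from the good and bad examples above.
good = ("# Max Latencies: 00005 00005 00004 00005 00004 00004 00005 00005 "
        "00006 00005 00004 00005 00004 00004 00005 00004")
bad = ("# Max Latencies: 00493 00387 00271 00619 00541 00513 00009 00389 "
       "00252 00215 00539 00498 00363 00204 00068 00520")

good_values = max_latencies(good)
bad_values = max_latencies(bad)
over = threads_over(bad_values, THRESHOLD_US)

print(max(good_values), threads_over(good_values, THRESHOLD_US))
print(max(bad_values), over)
```

In the bad example, every thread except thread 6 exceeds the threshold, which matches the empty overflow entry for thread 6 in the output.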

Running oslat

Prerequisites

You logged into registry.redhat.io with your Customer Portal credentials
You applied the performance profile by using the Node Tuning Operator

Procedure

To perform the oslat test, run the following command:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf -e \
LATENCY_TEST_CPUS=7 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh -ginkgo.focus="oslat"

The command runs the oslat tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs), and the command line displays SUCCESS! when this test is completed.

If the results exceed the latency threshold, the test fails and you can see the following output:

Example failure output

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e \
IMAGE_REGISTRY="registry.redhat.io/openshift4/" -e CNF_TESTS_IMAGE=cnf-tests-rhel8:v4.11 -e \
PERF_TEST_PROFILE=<performance_profile_name> -e ROLE_WORKER_CNF=worker-cnf -e \
LATENCY_TEST_RUN=true -e LATENCY_TEST_RUNTIME=600 -e DISCOVERY_MODE=true -e \
MAXIMUM_LATENCY=20 -e LATENCY_TEST_CPUS=7 \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
/usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="oslat" (1)
running /usr/bin//validationsuite -ginkgo.v -ginkgo.focus=oslat
I0829 12:36:55.386776       8 request.go:668] Waited for 1.000303471s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/authentication.k8s.io/v1?timeout=32s
Running Suite: CNF Features e2e validation
==========================================
Discovery mode enabled, skipping setup
running /usr/bin//cnftests -ginkgo.v -ginkgo.focus=oslat
I0829 12:37:01.219077      20 request.go:668] Waited for 1.050010755s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
Running Suite: CNF Features e2e integration tests
=================================================
Random Seed: 1630240617
Will run 1 of 142 specs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[performance] Latency Test with the oslat image
  should succeed
  /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:134
STEP: Waiting two minutes to download the latencyTest image
STEP: Waiting another two minutes to give enough time for the cluster to move the pod to Succeeded phase
Aug 29 12:37:59.324: [INFO]: found mcd machine-config-daemon-wf4w8 for node cnfdc8.clus2.t5g.lab.eng.bos.redhat.com
• Failure [49.246 seconds]
[performance] Latency Test
/go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:59
  with the oslat image
  /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:112
    should succeed [It]
    /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:134
    The current latency 27 is bigger than the expected one 20 (2)
    Expected
        <bool>: false
    to be true
 /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:168
Log file created at: 2021/08/29 13:25:21
Running on machine: oslat-57c2g
Binary: Built with gc go1.16.6 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0829 13:25:21.569182       1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/vmlinuz-4.18.0-305.10.2.rt7.83.el8_4.x86_64 ip=dhcp random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.0/rhcos/612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/0 ignition.platform.id=openstack root=UUID=5a4ddf16-9372-44d9-ac4e-3ee329e16ab3 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=000000ff,ffffffff,ffffffff,fffffff1 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 default_hugepagesz=1G hugepagesz=2M hugepages=128 nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0
I0829 13:25:21.569345       1 node.go:44] Environment information: kernel version 4.18.0-305.10.2.rt7.83.el8_4.x86_64
I0829 13:25:21.569367       1 main.go:53] Running the oslat command with arguments \
[--duration 600 --rtprio 1 --cpu-list 4,6,52,54,56,58 --cpu-main-thread 2] (1)
I0829 13:35:22.632263       1 main.go:59] Succeeded to run the oslat command: oslat V 2.00
Total runtime:         600 seconds
Thread priority:     SCHED_FIFO:1
CPU list:         4,6,52,54,56,58
CPU for main thread:     2
Workload:         no
Workload mem:         0 (KiB)
Preheat cores:         6
Pre-heat for 1 seconds...
Test starts...
Test completed.
        Core:     4 6 52 54 56 58
    CPU Freq:     2096 2096 2096 2096 2096 2096 (Mhz)
    001 (us):     19390720316 19141129810 20265099129 20280959461 19391991159 19119877333
    002 (us):     5304 5249 5777 5947 6829 4971
    003 (us):     28 14 434 47 208 21
    004 (us):     1388 853 123568 152817 5576 0
    005 (us):     207850 223544 103827 91812 227236 231563
    006 (us):     60770 122038 277581 323120 122633 122357
    007 (us):     280023 223992 63016 25896 214194 218395
    008 (us):     40604 25152 24368 4264 24440 25115
    009 (us):     6858 3065 5815 810 3286 2116
    010 (us):     1947 936 1452 151 474 361
  ...
     Minimum:     1 1 1 1 1 1 (us)
     Average:     1.000 1.000 1.000 1.000 1.000 1.000 (us)
     Maximum:     37 38 49 28 28 19 (us) (3)
     Max-Min:     36 37 48 27 27 18 (us)
    Duration:     599.667 599.667 599.667 599.667 599.667 599.667 (sec)

1	The list of CPUs running the `oslat` command. The `LATENCY_TEST_CPUS` variable provides seven CPUs. You see only six CPUs in the list because one CPU runs the `oslat` tool itself.
2	You can see the measured latency and the configured latency.
3	The maximum latency values in microseconds that each CPU measures.
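To check a saved oslat log against the threshold without reading the table by eye, a small filter over the Maximum: row can help. The following is a minimal sketch and not part of the CNF test suite. The function names oslat_worst_max and oslat_within_threshold are hypothetical, and the sketch assumes the row layout shown in the example output above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read oslat output on stdin and print the highest
# per-core value found on the "Maximum:" row.
oslat_worst_max() {
  awk '$1 == "Maximum:" {
         worst = 0
         # Only purely numeric fields are latency values; "(us)" and
         # callout markers are ignored.
         for (i = 2; i <= NF; i++)
           if ($i ~ /^[0-9]+$/ && $i + 0 > worst) worst = $i + 0
         print worst
       }'
}

# Hypothetical helper: return 0 when the worst per-core maximum is at or
# below the threshold given in microseconds as $1.
oslat_within_threshold() {
  local threshold="$1" worst
  worst="$(oslat_worst_max)"
  [ -n "$worst" ] && [ "$worst" -le "$threshold" ]
}
```

For example, with the output above saved to a file of your choice, `oslat_within_threshold 20 < <file>` returns a non-zero status, because the worst per-core maximum in that run is 49 μs.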

Troubleshooting

The cluster must be reachable from within the container. You can verify this by running:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig \
registry.redhat.io/openshift-kni/cnf-tests oc get nodes

If this does not work, the cause could be DNS, MTU size, or firewall issues.
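When narrowing down a DNS problem, it helps to know which host name the kubeconfig file points at. The following is a minimal sketch under the assumption that the kubeconfig contains a `server:` line with an `https://host:port` URL. The function name api_server_host is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the host part of the first "server:" URL
# found in the kubeconfig file passed as $1.
api_server_host() {
  sed -n 's|^[[:space:]]*server:[[:space:]]*https\{0,1\}://\([^:/]*\).*|\1|p' "$1" |
    head -n 1
}
```

You could then pass the result to a resolver lookup of your choice, for example `getent hosts "$(api_server_host kubeconfig)"`, both on the host and inside the container, and compare the answers.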

Test reports

CNF end-to-end tests produce two outputs: a JUnit test output and a test failure report.

JUnit test output

A JUnit-compliant XML is produced by passing the --junit parameter together with the path where the report is dumped:

$ docker run -v $(pwd)/:/kubeconfig -v $(pwd)/junitdest:/path/to/junit -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh --junit /path/to/junit
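After the run, the XML files land in the mounted host directory (junitdest in the command above). The following is a minimal sketch for totaling the reported failures. It assumes that each file contains `<testsuite>` elements with a `failures="N"` attribute, which is the usual JUnit layout but is not confirmed here for these tests. The function name junit_failure_total is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: sum the failures="N" attributes of all <testsuite>
# elements in the files passed as arguments and print the total.
junit_failure_total() {
  # The space after "<testsuite" keeps a wrapping <testsuites> element
  # from being counted a second time.
  grep -ho '<testsuite [^>]*failures="[0-9]*"' "$@" |
    sed 's/.*failures="\([0-9]*\)"/\1/' |
    awk '{ total += $1 } END { print total + 0 }'
}
```

For example, `junit_failure_total junitdest/*.xml` prints a single number that a CI job can compare against zero.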

Test failure report

A report with information about the cluster state and resources for troubleshooting can be produced by passing the --report parameter with the path where the report is dumped:

$ docker run -v $(pwd)/:/kubeconfig -v $(pwd)/reportdest:/path/to/report -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh --report /path/to/report

A note on podman

When you run podman as a non-root, non-privileged user, mounting paths can fail with “permission denied” errors. To make it work, append :Z to the volume definition; for example, -v $(pwd)/:/kubeconfig:Z allows podman to do the proper SELinux relabeling.

Running on OKD 4.4

With the exception of the following, the CNF end-to-end tests are compatible with OKD 4.4:

[test_id:28466][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] Should contain configuration injected through openshift-node-performance profile
[test_id:28467][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] Should contain configuration injected through the openshift-node-performance profile

You can skip these tests by adding the -ginkgo.skip "28466|28467" parameter.
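The skip value is a regular expression in which the test IDs are alternatives separated by `|`. If the list of IDs is kept in a script, it can be joined as in the following minimal sketch. The function name ginkgo_skip_pattern is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: join all arguments with "|" so that they form one
# regular expression alternation.
ginkgo_skip_pattern() {
  local IFS='|'
  printf '%s\n' "$*"
}
```

For example, `ginkgo_skip_pattern 28466 28467` prints `28466|28467`, which can be quoted and passed as the value of -ginkgo.skip.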

Using a single performance profile

The DPDK tests require more resources than what is required by the performance test suite. To make the execution faster, you can override the performance profile used by the tests using a profile that also serves the DPDK test suite.

To do this, mount a profile like the following one inside the container and instruct the performance tests to deploy it.

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
 name: performance
spec:
 cpu:
  isolated: "5-15"
  reserved: "0-4"
 hugepages:
  defaultHugepagesSize: "1G"
  pages:
  - size: "1G"
    count: 16
    node: 0
 realTimeKernel:
  enabled: true
 numa:
  topologyPolicy: "best-effort"
 nodeSelector:
  node-role.kubernetes.io/worker-cnf: ""

When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
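Before mounting a profile, it can be worth checking that the reserved and isolated CPU lists do not share any CPU, as in the example profile where they are "0-4" and "5-15". The following is a minimal sketch that works on CPU list strings only and does not read the manifest. The function names expand_cpus and cpu_sets_overlap are hypothetical, and the sketch assumes lists made of comma-separated numbers and ranges.

```shell
#!/usr/bin/env bash

# Hypothetical helper: expand a CPU list such as "0-4,8" into one CPU
# number per line.
expand_cpus() {
  echo "$1" | tr ',' '\n' | while IFS='-' read -r first last; do
    [ -n "$first" ] || continue
    # A single CPU has no range end, so it expands to itself.
    seq "$first" "${last:-$first}"
  done
}

# Hypothetical helper: print the CPUs that appear in both lists and
# return 0 when at least one CPU is shared.
cpu_sets_overlap() {
  local common
  common="$(
    { expand_cpus "$1" | sort -un; expand_cpus "$2" | sort -un; } |
      sort -n | uniq -d
  )"
  [ -n "$common" ] && printf '%s\n' "$common"
}
```

With the values from the example profile, `cpu_sets_overlap "0-4" "5-15"` prints nothing and returns a non-zero status, which means the two lists are disjoint.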

To override the performance profile, mount the manifest inside the container and point the tests at it by setting the PERFORMANCE_PROFILE_MANIFEST_OVERRIDE environment variable:

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e PERFORMANCE_PROFILE_MANIFEST_OVERRIDE=/kubeconfig/manifest.yaml registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh

Debugging low latency CNF tuning status

The PerformanceProfile custom resource (CR) contains status fields for reporting tuning status and debugging latency degradation issues. These fields report on conditions that describe the state of the operator’s reconciliation functionality.

A typical issue can arise when the machine config pools that are attached to the performance profile are in a degraded state, which causes the PerformanceProfile status to degrade. In this case, the machine config pool issues a failure message.

The Node Tuning Operator contains the performanceProfile.spec.status.Conditions status field:

Status:
  Conditions:
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                True
    Type:                  Upgradeable
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                False
    Type:                  Degraded

The Status field contains Conditions that specify Type values that indicate the status of the performance profile:

Available

All machine configs and Tuned profiles have been created successfully and are available for the cluster components that are responsible for processing them (NTO, MCO, Kubelet).

Upgradeable

Indicates whether the resources maintained by the Operator are in a state that is safe to upgrade.

Progressing

Indicates that the deployment process from the performance profile has started.

Degraded

Indicates an error if:

Validation of the performance profile has failed.
Creation of all relevant components did not complete successfully.

Each of these types contains the following fields:

Status

The state for the specific type (true or false).

Timestamp

The transaction timestamp.

Reason string

The machine readable reason.

Message string

The human readable reason describing the state and error details, if any.
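To read one condition out of describe-style text such as the Status block above, a short filter is enough. The following is a minimal sketch that assumes each condition lists its Status: line before its Type: line, as in the example. The function name condition_status is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read describe-style text on stdin and print the
# Status value of the condition whose Type equals $1.
condition_status() {
  awk -v want="$1" '
    # Remember the most recent "Status: <value>" line. The bare "Status:"
    # section header has only one field and is skipped.
    $1 == "Status:" && NF == 2 { status = $2 }
    $1 == "Type:" && $2 == want { print status; exit }
  '
}
```

For example, `oc describe performanceprofiles performance | condition_status Degraded` prints True or False for the Degraded condition.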

Machine config pools

A performance profile and its created products are applied to a node according to an associated machine config pool (MCP). The MCP holds valuable information about the progress of applying the machine configurations created by performance profiles that encompass kernel args, kube config, huge pages allocation, and deployment of rt-kernel. The Performance Profile controller monitors changes in the MCP and updates the performance profile status accordingly.

The only condition that the MCP returns to the performance profile status is Degraded, which leads to performanceProfile.status.condition.Degraded = true.

Example

The following example is for a performance profile with an associated machine config pool (worker-cnf) that was created for it:

The associated machine config pool is in a degraded state:

# oc get mcp

Example output

NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-2ee57a93fa6c9181b546ca46e1571d2d       True      False      False      3              3                   3                     0                      2d21h
worker       rendered-worker-d6b2bdc07d9f5a59a6b68950acf25e5f       True      False      False      2              2                   2                     0                      2d21h
worker-cnf   rendered-worker-cnf-6c838641b8a08fff08dbd8b02fb63f7c   False     True       True       2              1                   1                     1                      2d20h
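On clusters with many pools, the degraded ones can be picked out of this table with a small filter. The following is a minimal sketch that assumes the column headers shown in the example output. The function name degraded_pools is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read "oc get mcp" table output on stdin and print
# the NAME of every pool whose DEGRADED column is True.
degraded_pools() {
  awk 'NR == 1 {
         # Find the DEGRADED column from the header row instead of
         # hard-coding its position.
         for (i = 1; i <= NF; i++) if ($i == "DEGRADED") col = i
         next
       }
       col && $col == "True" { print $1 }'
}
```

For example, `oc get mcp | degraded_pools` prints worker-cnf for the output above.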

The describe section of the MCP shows the reason:

# oc describe mcp worker-cnf

Example output

  Message:               Node node-worker-cnf is reporting: "prepping update:
  machineconfig.machineconfiguration.openshift.io \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not
  found"
    Reason:                1 nodes are reporting degraded status on sync

The degraded state should also appear under the performance profile status field marked as degraded = true:

# oc describe performanceprofiles performance

Example output

Message: Machine config pool worker-cnf Degraded Reason: 1 nodes are reporting degraded status on sync.
Machine config pool worker-cnf Degraded Message: Node yquinn-q8s5v-w-b-z5lqn.c.openshift-gce-devel.internal is
reporting: "prepping update: machineconfig.machineconfiguration.openshift.io
\"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not found".    Reason:  MCPDegraded
   Status:  True
   Type:    Degraded

Collecting low latency tuning debugging data for Red Hat Support

When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.

The must-gather tool enables you to collect diagnostic information about your OKD cluster, including node tuning, NUMA topology, and other information needed to debug issues with low latency setup.

For prompt support, supply diagnostic information for both OKD and low latency tuning.

About the must-gather tool

The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues, such as:

Resource definitions
Audit logs
Service logs

You can specify one or more images when you run the command by including the --image argument. When you specify an image, the tool collects data related to that feature or product. When you run oc adm must-gather, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local. This directory is created in your current working directory.

About collecting low latency tuning data

Use the oc adm must-gather CLI command to collect information about your cluster, including the following features and objects associated with low latency tuning:

The Node Tuning Operator namespaces and child objects.
MachineConfigPool and associated MachineConfig objects.
The Node Tuning Operator and associated Tuned objects.
Linux Kernel command line options.
CPU and NUMA topology.
Basic PCI device information and NUMA locality.

To collect debugging information with must-gather, you must specify the Performance Addon Operator must-gather image:

--image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.11

In earlier versions of OKD, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OKD 4.11, these functions are part of the Node Tuning Operator. However, you must still use the performance-addon-operator-must-gather image when running the must-gather command.

Gathering data about specific features

You can gather debugging information about specific features by using the oc adm must-gather CLI command with the --image or --image-stream argument. The must-gather tool supports multiple images, so you can gather data about more than one feature by running a single command.

To collect the default must-gather data in addition to specific feature data, add the --image-stream=openshift/must-gather argument.

Prerequisites

Access to the cluster as a user with the cluster-admin role.
The OKD CLI (oc) installed.

Procedure

Navigate to the directory where you want to store the must-gather data.
Run the oc adm must-gather command with one or more --image or --image-stream arguments. For example, the following command gathers both the default cluster data and information specific to the Node Tuning Operator:
```
$ oc adm must-gather \
 --image-stream=openshift/must-gather \ (1)
 --image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.11 (2)
```
1 The default OKD must-gather image.
2 The must-gather image for low latency tuning diagnostics.
Create a compressed file from the must-gather directory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:
```
 $ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/ (1)
```
1 Replace must-gather.local.5421342344627712289/ with the actual directory name.
Attach the compressed file to your support case on the Red Hat Customer Portal.
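Because the numeric suffix of the must-gather directory changes on every run, the archive step can be scripted so that the name does not have to be typed by hand. The following is a minimal sketch that assumes the directory was created in the current working directory. The function name archive_latest_must_gather is hypothetical, and the sketch uses explicit gzip compression (tar czf) rather than the suffix-based selection of tar cvaf.

```shell
#!/usr/bin/env bash

# Hypothetical helper: archive the most recently modified
# must-gather.local.* directory in the current directory and print the
# name of the archive.
archive_latest_must_gather() {
  local dir
  # The trailing slash in the pattern restricts the match to directories.
  dir="$(ls -1dt must-gather.local.*/ 2>/dev/null | head -n 1)"
  [ -n "$dir" ] || { echo "no must-gather.local.* directory found" >&2; return 1; }
  dir="${dir%/}"
  tar czf "${dir}.tar.gz" "$dir" && printf '%s\n' "${dir}.tar.gz"
}
```

Run `archive_latest_must_gather` from the directory where you ran oc adm must-gather, and attach the printed file to the support case.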

Additional resources

For more information about MachineConfig and KubeletConfig, see Managing nodes.
For more information about the Node Tuning Operator, see Using the Node Tuning Operator.
For more information about the PerformanceProfile, see Configuring huge pages.
For more information about consuming huge pages from your containers, see How huge pages are consumed by apps.

