Scheduling NUMA-aware workloads

Learn about NUMA-aware scheduling and how you can use it to deploy high-performance workloads in an OKD cluster.

The NUMA Resources Operator allows you to schedule high-performance workloads in the same NUMA zone. It deploys a node resources exporting agent that reports on available cluster node NUMA resources, and a secondary scheduler that manages the workloads.

About NUMA-aware scheduling

Non-Uniform Memory Access (NUMA) is a compute platform architecture that allows different CPUs to access different regions of memory at different speeds. NUMA resource topology refers to the locations of CPUs, memory, and PCI devices relative to each other in the compute node. Co-located resources are said to be in the same NUMA zone. For high-performance applications, the cluster needs to process pod workloads in a single NUMA zone.

NUMA architecture allows a CPU with multiple memory controllers to use any available memory across CPU complexes, regardless of where the memory is located. This allows for increased flexibility at the expense of performance. A CPU processing a workload using memory that is outside its NUMA zone is slower than a workload processed in a single NUMA zone. Also, for I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application. High-performance workloads, such as telecommunications workloads, cannot operate to specification under these conditions. NUMA-aware scheduling aligns the requested cluster compute resources (CPUs, memory, devices) in the same NUMA zone to process latency-sensitive or high-performance workloads efficiently. NUMA-aware scheduling also improves pod density per compute node for greater resource efficiency.
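The cost of crossing NUMA zones is visible in a node's distance matrix. The following sketch parses sample `numactl --hardware` output; the matrix shown is illustrative data embedded for the example, not taken from a specific node:

```shell
# Illustrative `numactl --hardware` distance matrix for a two-zone node.
# A value of 10 means local access; larger values mean slower remote access.
sample='node distances:
node   0   1
  0:  10  21
  1:  21  10'

# Extract the cost of zone 0 accessing memory in zone 1 (row "0:", third field).
remote_cost=$(printf '%s\n' "$sample" | awk '$1 == "0:" { print $3 }')
echo "$remote_cost"
```

On a real Linux node you can pipe the output of `numactl --hardware` itself through the same awk filter.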

By integrating the Node Tuning Operator’s performance profile with NUMA-aware scheduling, you can further configure CPU affinity to optimize performance for latency-sensitive workloads.

The scheduling logic of the default OKD pod scheduler considers the available resources of the entire compute node, not individual NUMA zones. If the most restrictive resource alignment is requested in the kubelet topology manager, error conditions can occur when admitting the pod to a node. Conversely, if the most restrictive resource alignment is not requested, the pod can be admitted to the node without proper resource alignment, leading to worse or unpredictable performance. For example, runaway pod creation with Topology Affinity Error statuses can occur when the pod scheduler makes suboptimal scheduling decisions for guaranteed pod workloads because it does not know whether the pod’s requested resources are available. Scheduling mismatch decisions can cause indefinite pod startup delays. Also, depending on the cluster state and resource allocation, poor pod scheduling decisions can cause extra load on the cluster because of failed startup attempts.
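For reference, the most restrictive alignment applies to pods with a Guaranteed quality of service, which requires requests to equal limits and exclusive CPUs to be whole integers. A minimal sketch of such a pod spec (the pod name and image are placeholders, not from this document):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example  # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image
    resources:
      # Requests equal limits and the CPU count is an integer, so the
      # pod is classed as Guaranteed and can receive exclusive,
      # NUMA-aligned CPUs from the static CPU manager.
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "1Gi"
```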

The NUMA Resources Operator deploys a custom NUMA resources secondary scheduler and other resources to mitigate the shortcomings of the default OKD pod scheduler. The following diagram provides a high-level overview of NUMA-aware pod scheduling.

Figure 1. NUMA-aware scheduling overview

NodeResourceTopology API

The NodeResourceTopology API describes the available NUMA zone resources in each compute node.

NUMA-aware scheduler

The NUMA-aware secondary scheduler receives information about the available NUMA zones from the NodeResourceTopology API and schedules high-performance workloads on a node where they can be optimally processed.

Node topology exporter

The node topology exporter exposes the available NUMA zone resources for each compute node to the NodeResourceTopology API. The node topology exporter daemon tracks the resource allocation from the kubelet by using the PodResources API.

PodResources API

The PodResources API is local to each node and is exposed by the kubelet. It reports the resource topology and the resources allocated to running pods.

The List endpoint of the PodResources API exposes exclusive CPUs allocated to a particular container. The API does not expose CPUs that belong to a shared pool.

The GetAllocatableResources endpoint exposes allocatable resources available on a node.

Installing the NUMA Resources Operator

The NUMA Resources Operator deploys resources that allow you to schedule NUMA-aware workloads and deployments. You can install the NUMA Resources Operator by using the OKD CLI or the web console.

Installing the NUMA Resources Operator using the CLI

As a cluster administrator, you can install the Operator using the CLI.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a namespace for the NUMA Resources Operator:

    1. Save the following YAML in the nro-namespace.yaml file:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-numaresources
    2. Create the Namespace CR by running the following command:

      $ oc create -f nro-namespace.yaml
  2. Create the Operator group for the NUMA Resources Operator:

    1. Save the following YAML in the nro-operatorgroup.yaml file:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        targetNamespaces:
        - openshift-numaresources
    2. Create the OperatorGroup CR by running the following command:

      $ oc create -f nro-operatorgroup.yaml
  3. Create the subscription for the NUMA Resources Operator:

    1. Save the following YAML in the nro-sub.yaml file:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        channel: "4.13"
        name: numaresources-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
    2. Create the Subscription CR by running the following command:

      $ oc create -f nro-sub.yaml

Verification

  1. Verify that the installation succeeded by inspecting the CSV resource in the openshift-numaresources namespace. Run the following command:

    $ oc get csv -n openshift-numaresources

    Example output

    NAME                             DISPLAY                  VERSION   REPLACES   PHASE
    numaresources-operator.v4.13.2   numaresources-operator   4.13.2               Succeeded

Installing the NUMA Resources Operator using the web console

As a cluster administrator, you can install the NUMA Resources Operator using the web console.

Procedure

  1. Install the NUMA Resources Operator using the OKD web console:

    1. In the OKD web console, click Operators → OperatorHub.

    2. Choose NUMA Resources Operator from the list of available Operators, and then click Install.

  2. Optional: Verify that the NUMA Resources Operator installed successfully:

    1. Switch to the Operators → Installed Operators page.

    2. Ensure that NUMA Resources Operator is listed in the default project with a Status of InstallSucceeded.

      During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

      If the Operator does not appear as installed, troubleshoot further:

      • Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.

      • Go to the Workloads → Pods page and check the logs for pods in the default project.

Scheduling NUMA-aware workloads

Clusters running latency-sensitive workloads typically feature performance profiles that help to minimize workload latency and optimize performance. The NUMA-aware scheduler deploys workloads based on available node NUMA resources and with respect to any performance profile settings applied to the node. The combination of NUMA-aware deployments and the performance profile of the workload ensures that workloads are scheduled in a way that maximizes performance.

Creating the NUMAResourcesOperator custom resource

After you install the NUMA Resources Operator, create the NUMAResourcesOperator custom resource (CR) that instructs the NUMA Resources Operator to install all the cluster infrastructure needed to support the NUMA-aware scheduler, including daemon sets and APIs.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator.

Procedure

  1. Create the NUMAResourcesOperator custom resource:

    1. Save the following YAML in the nrop.yaml file:

      apiVersion: nodetopology.openshift.io/v1
      kind: NUMAResourcesOperator
      metadata:
        name: numaresourcesoperator
      spec:
        nodeGroups:
        - machineConfigPoolSelector:
            matchLabels:
              pools.operator.machineconfiguration.openshift.io/worker: ""
    2. Create the NUMAResourcesOperator CR by running the following command:

      $ oc create -f nrop.yaml

Verification

  • Verify that the NUMA Resources Operator deployed successfully by running the following command:

    $ oc get numaresourcesoperators.nodetopology.openshift.io

    Example output

    NAME                    AGE
    numaresourcesoperator   10m

Deploying the NUMA-aware secondary pod scheduler

After you install the NUMA Resources Operator, do the following to deploy the NUMA-aware secondary pod scheduler:

  • Configure the performance profile.

  • Deploy the NUMA-aware secondary scheduler.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Create the required machine config pool.

  • Install the NUMA Resources Operator.

Procedure

  1. Create the PerformanceProfile custom resource (CR):

    1. Save the following YAML in the nro-perfprof.yaml file:

      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        name: perfprof-nrop
      spec:
        cpu: (1)
          isolated: "4-51,56-103"
          reserved: "0,1,2,3,52,53,54,55"
        nodeSelector:
          node-role.kubernetes.io/worker: ""
        numa:
          topologyPolicy: single-numa-node
      (1) The cpu.isolated and cpu.reserved specifications define ranges for isolated and reserved CPUs. Enter valid values for your CPU configuration. See the Additional resources section for more information about configuring a performance profile.
    2. Create the PerformanceProfile CR by running the following command:

      $ oc create -f nro-perfprof.yaml

      Example output

      performanceprofile.performance.openshift.io/perfprof-nrop created
  2. Create the NUMAResourcesScheduler custom resource that deploys the NUMA-aware custom pod scheduler:

    1. Save the following YAML in the nro-scheduler.yaml file:

      apiVersion: nodetopology.openshift.io/v1
      kind: NUMAResourcesScheduler
      metadata:
        name: numaresourcesscheduler
      spec:
        imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.13"
        cacheResyncPeriod: "5s" (1)
      (1) Enter an interval value in seconds for synchronization of the scheduler cache. A value of 5s is typical for most implementations.
      • Enable the cacheResyncPeriod specification to help the NUMA Resources Operator report more exact resource availability. It monitors pending resources on nodes and synchronizes this information in the scheduler cache at the defined interval, which helps to minimize Topology Affinity Error statuses that result from suboptimal scheduling decisions. A lower interval increases network load. The cacheResyncPeriod specification is disabled by default.

      • Setting podsFingerprinting to Enabled in the NUMAResourcesOperator CR is a requirement for using the cacheResyncPeriod specification.

    2. Create the NUMAResourcesScheduler CR by running the following command:

      $ oc create -f nro-scheduler.yaml

Verification

  1. Verify that the performance profile was applied by running the following command:

    $ oc describe performanceprofile <performance-profile-name>
  2. Verify that the required resources deployed successfully by running the following command:

    $ oc get all -n openshift-numaresources

    Example output

    NAME                                                    READY   STATUS    RESTARTS   AGE
    pod/numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          13m
    pod/numaresourcesoperator-worker-dvj4n                  2/2     Running   0          16m
    pod/numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16m
    pod/secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          16m

    NAME                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
    daemonset.apps/numaresourcesoperator-worker    2         2         2       2            2           node-role.kubernetes.io/worker=   16m

    NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/numaresources-controller-manager   1/1     1            1           13m
    deployment.apps/secondary-scheduler                1/1     1            1           16m

    NAME                                                          DESIRED   CURRENT   READY   AGE
    replicaset.apps/numaresources-controller-manager-7575848485   1         1         1       13m
    replicaset.apps/secondary-scheduler-56994cf6cf                1         1         1       16m
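The cpu.isolated and cpu.reserved cpusets in a performance profile must not overlap. The following local sanity check expands the cpuset strings from the example profile and tests them for disjointness; expand_cpuset is a helper written for this sketch, not an oc or cluster tool:

```shell
# Expand a kubelet-style cpuset string such as "0-3,52,54-55"
# into one CPU id per line.
expand_cpuset() {
  echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
    seq "$lo" "${hi:-$lo}"
  done
}

isolated="4-51,56-103"            # cpu.isolated from the example profile
reserved="0,1,2,3,52,53,54,55"    # cpu.reserved from the example profile

# Any CPU id that appears in both expansions is an overlap;
# `sort | uniq -d` prints only duplicated lines.
overlap=$(printf '%s\n%s\n' "$(expand_cpuset "$isolated")" \
                            "$(expand_cpuset "$reserved")" | sort | uniq -d)

if [ -z "$overlap" ]; then
  echo "disjoint"
else
  echo "overlap on CPUs: $overlap"
fi
```

Run this with your own values before creating the PerformanceProfile CR to catch typos in the ranges early.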

Additional resources

Scheduling workloads with the NUMA-aware scheduler

You can schedule workloads with the NUMA-aware scheduler using Deployment CRs that specify the minimum required resources to process the workload.

The following example deployment uses NUMA-aware scheduling for a sample workload.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Get the name of the NUMA-aware scheduler that is deployed in the cluster by running the following command:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'

    Example output

    topo-aware-scheduler
  2. Create a Deployment CR that uses the scheduler named topo-aware-scheduler, for example:

    1. Save the following YAML in the nro-deployment.yaml file:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: numa-deployment-1
        namespace: openshift-numaresources
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: test
        template:
          metadata:
            labels:
              app: test
          spec:
            schedulerName: topo-aware-scheduler (1)
            containers:
            - name: ctnr
              image: quay.io/openshifttest/hello-openshift:openshift
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "10"
                requests:
                  memory: "100Mi"
                  cpu: "10"
            - name: ctnr2
              image: gcr.io/google_containers/pause-amd64:3.0
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args: [ "while true; do sleep 1h; done;" ]
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "8"
                requests:
                  memory: "100Mi"
                  cpu: "8"
      (1) schedulerName must match the name of the NUMA-aware scheduler that is deployed in your cluster, for example topo-aware-scheduler.
    2. Create the Deployment CR by running the following command:

      $ oc create -f nro-deployment.yaml

Verification

  1. Verify that the deployment was successful:

    $ oc get pods -n openshift-numaresources

    Example output

    NAME                                                READY   STATUS    RESTARTS   AGE
    numa-deployment-1-56954b7b46-pfgw8                  2/2     Running   0          129m
    numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          15h
    numaresourcesoperator-worker-dvj4n                  2/2     Running   0          18h
    numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16h
    secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          18h
  2. Verify that the topo-aware-scheduler is scheduling the deployed pod by running the following command:

    $ oc describe pod numa-deployment-1-56954b7b46-pfgw8 -n openshift-numaresources

    Example output

    Events:
      Type    Reason     Age   From                  Message
      ----    ------     ----  ----                  -------
      Normal  Scheduled  130m  topo-aware-scheduler  Successfully assigned openshift-numaresources/numa-deployment-1-56954b7b46-pfgw8 to compute-0.example.com

    Deployments that request more resources than are available for scheduling fail with a MinimumReplicasUnavailable error. The deployment succeeds when the required resources become available. Pods remain in the Pending state until the required resources are available.

  3. Verify that the expected allocated resources are listed for the node.

    1. Identify the node that is running the deployment pod by running the following command, replacing <namespace> with the namespace you specified in the Deployment CR:

      $ oc get pods -n <namespace> -o wide

      Example output

      NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
      numa-deployment-1-65684f8fcc-bw4bw   0/2     Running   0          82m   10.128.2.50   worker-0   <none>           <none>
    2. Run the following command, replacing <node_name> with the name of the node that is running the deployment pod:

      $ oc describe noderesourcetopologies.topology.node.k8s.io <node_name>

      Example output

      ...
      Zones:
        Costs:
          Name:   node-0
          Value:  10
          Name:   node-1
          Value:  21
        Name:  node-0
        Resources:
          Allocatable:  39
          Available:    21 (1)
          Capacity:     40
          Name:         cpu
          Allocatable:  6442450944
          Available:    6442450944
          Capacity:     6442450944
          Name:         hugepages-1Gi
          Allocatable:  134217728
          Available:    134217728
          Capacity:     134217728
          Name:         hugepages-2Mi
          Allocatable:  262415904768
          Available:    262206189568
          Capacity:     270146007040
          Name:         memory
        Type:  Node
      (1) The Available capacity is reduced because of the resources that have been allocated to the guaranteed pod.

      Resources consumed by guaranteed pods are subtracted from the available node resources listed under noderesourcetopologies.topology.node.k8s.io.

  4. Resource allocations for pods with a Best-effort or Burstable quality of service (qosClass) are not reflected in the NUMA node resources under noderesourcetopologies.topology.node.k8s.io. If a pod’s consumed resources are not reflected in the node resource calculation, verify that the pod has a qosClass of Guaranteed and that the CPU request is an integer value, not a decimal value. You can verify that the pod has a qosClass of Guaranteed by running the following command:

    $ oc get pod <pod_name> -n <pod_namespace> -o jsonpath="{ .status.qosClass }"

    Example output

    Guaranteed
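The integer-CPU requirement above can also be checked mechanically before you deploy. The following sketch classifies a Kubernetes CPU quantity string; is_integer_cpu is a helper invented for this example, not part of oc:

```shell
# Return success only if a CPU quantity is a whole number of cores.
# Millicpu ("1500m") and decimal ("0.5") quantities keep a pod out of
# exclusive-CPU assignment even when requests equal limits.
is_integer_cpu() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty, millicpu, or decimal
    *)           return 0 ;;
  esac
}

for q in 10 8 1500m 0.5; do
  if is_integer_cpu "$q"; then
    echo "$q: ok for exclusive CPUs"
  else
    echo "$q: not an integer CPU quantity"
  fi
done
```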

Scheduling NUMA-aware workloads with manual performance settings

Clusters running latency-sensitive workloads typically feature performance profiles that help to minimize workload latency and optimize performance. However, you can schedule NUMA-aware workloads in a pristine cluster that does not feature a performance profile. The following workflow features a pristine cluster that you can manually configure for performance by using the KubeletConfig resource. This is not the typical environment for scheduling NUMA-aware workloads.

Creating the NUMAResourcesOperator custom resource with manual performance settings

After you install the NUMA Resources Operator, create the NUMAResourcesOperator custom resource (CR) that instructs the NUMA Resources Operator to install all the cluster infrastructure needed to support the NUMA-aware scheduler, including daemon sets and APIs.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator.

Procedure

  1. Optional: Create the MachineConfigPool custom resource that enables custom kubelet configurations for worker nodes:

    By default, OKD creates a MachineConfigPool resource for worker nodes in the cluster. You can create a custom MachineConfigPool resource if required.

    1. Save the following YAML in the nro-machineconfig.yaml file:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        labels:
          cnf-worker-tuning: enabled
          machineconfiguration.openshift.io/mco-built-in: ""
          pools.operator.machineconfiguration.openshift.io/worker: ""
        name: worker
      spec:
        machineConfigSelector:
          matchLabels:
            machineconfiguration.openshift.io/role: worker
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker: ""
    2. Create the MachineConfigPool CR by running the following command:

      $ oc create -f nro-machineconfig.yaml
  2. Create the NUMAResourcesOperator custom resource:

    1. Save the following YAML in the nrop.yaml file:

      apiVersion: nodetopology.openshift.io/v1
      kind: NUMAResourcesOperator
      metadata:
        name: numaresourcesoperator
      spec:
        nodeGroups:
        - machineConfigPoolSelector:
            matchLabels:
              pools.operator.machineconfiguration.openshift.io/worker: "" (1)
      (1) Should match the label applied to worker nodes in the related MachineConfigPool CR.
    2. Create the NUMAResourcesOperator CR by running the following command:

      $ oc create -f nrop.yaml

Verification

  • Verify that the NUMA Resources Operator deployed successfully by running the following command:

    $ oc get numaresourcesoperators.nodetopology.openshift.io

    Example output

    NAME                    AGE
    numaresourcesoperator   10m

Deploying the NUMA-aware secondary pod scheduler with manual performance settings

After you install the NUMA Resources Operator, do the following to deploy the NUMA-aware secondary pod scheduler:

  • Configure the pod admittance policy for the required machine profile

  • Create the required machine config pool

  • Deploy the NUMA-aware secondary scheduler

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator.

Procedure

  1. Create the KubeletConfig custom resource that configures the pod admittance policy for the machine profile:

    1. Save the following YAML in the nro-kubeletconfig.yaml file:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cnf-worker-tuning
      spec:
        machineConfigPoolSelector:
          matchLabels:
            cnf-worker-tuning: enabled
        kubeletConfig:
          cpuManagerPolicy: "static" (1)
          cpuManagerReconcilePeriod: "5s"
          reservedSystemCPUs: "0,1"
          memoryManagerPolicy: "Static" (2)
          evictionHard:
            memory.available: "100Mi"
          kubeReserved:
            memory: "512Mi"
          reservedMemory:
          - numaNode: 0
            limits:
              memory: "1124Mi"
          systemReserved:
            memory: "512Mi"
          topologyManagerPolicy: "single-numa-node" (3)
          topologyManagerScope: "pod"
      (1) For cpuManagerPolicy, static must use a lowercase s.
      (2) For memoryManagerPolicy, Static must use an uppercase S.
      (3) topologyManagerPolicy must be set to single-numa-node.
    2. Create the KubeletConfig custom resource (CR) by running the following command:

      $ oc create -f nro-kubeletconfig.yaml
  2. Create the NUMAResourcesScheduler custom resource that deploys the NUMA-aware custom pod scheduler:

    1. Save the following YAML in the nro-scheduler.yaml file:

      apiVersion: nodetopology.openshift.io/v1
      kind: NUMAResourcesScheduler
      metadata:
        name: numaresourcesscheduler
      spec:
        imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.13"
        cacheResyncPeriod: "5s" (1)
      (1) Enter an interval value in seconds for synchronization of the scheduler cache. A value of 5s is typical for most implementations.
      • Enable the cacheResyncPeriod specification to help the NUMA Resources Operator report more exact resource availability. It monitors pending resources on nodes and synchronizes this information in the scheduler cache at the defined interval, which helps to minimize Topology Affinity Error statuses that result from suboptimal scheduling decisions. A lower interval increases network load. The cacheResyncPeriod specification is disabled by default.

      • Setting podsFingerprinting to Enabled in the NUMAResourcesOperator CR is a requirement for using the cacheResyncPeriod specification.

    2. Create the NUMAResourcesScheduler CR by running the following command:

      $ oc create -f nro-scheduler.yaml

Verification

  • Verify that the required resources deployed successfully by running the following command:

    $ oc get all -n openshift-numaresources

    Example output

    NAME                                                    READY   STATUS    RESTARTS   AGE
    pod/numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          13m
    pod/numaresourcesoperator-worker-dvj4n                  2/2     Running   0          16m
    pod/numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16m
    pod/secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          16m

    NAME                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
    daemonset.apps/numaresourcesoperator-worker    2         2         2       2            2           node-role.kubernetes.io/worker=   16m

    NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/numaresources-controller-manager   1/1     1            1           13m
    deployment.apps/secondary-scheduler                1/1     1            1           16m

    NAME                                                          DESIRED   CURRENT   READY   AGE
    replicaset.apps/numaresources-controller-manager-7575848485   1         1         1       13m
    replicaset.apps/secondary-scheduler-56994cf6cf                1         1         1       16m
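In the KubeletConfig above, the reservedMemory limit for NUMA node 0 is not arbitrary: the kubelet memory manager expects the reserved total to equal the sum of kubeReserved memory, systemReserved memory, and the hard eviction threshold. A quick arithmetic check using the values from the example:

```shell
# Values (in Mi) from the example KubeletConfig
kube_reserved=512      # kubeReserved.memory
system_reserved=512    # systemReserved.memory
eviction_hard=100      # evictionHard "memory.available"

total=$((kube_reserved + system_reserved + eviction_hard))
echo "${total}Mi"   # should match reservedMemory limits.memory: "1124Mi"
```

If these values do not add up, the kubelet rejects the memory manager configuration, so recheck the sum whenever you change any of the three reservations.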

Scheduling workloads with the NUMA-aware scheduler with manual performance settings

You can schedule workloads with the NUMA-aware scheduler using Deployment CRs that specify the minimum required resources to process the workload.

The following example deployment uses NUMA-aware scheduling for a sample workload.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Get the name of the NUMA-aware scheduler that is deployed in the cluster by running the following command:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'

    Example output

    topo-aware-scheduler
  2. Create a Deployment CR that uses the scheduler named topo-aware-scheduler, for example:

    1. Save the following YAML in the nro-deployment.yaml file:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: numa-deployment-1
        namespace: <namespace> (1)
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: test
        template:
          metadata:
            labels:
              app: test
          spec:
            schedulerName: topo-aware-scheduler (2)
            containers:
            - name: ctnr
              image: quay.io/openshifttest/hello-openshift:openshift
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "10"
                requests:
                  memory: "100Mi"
                  cpu: "10"
            - name: ctnr2
              image: gcr.io/google_containers/pause-amd64:3.0
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "8"
                requests:
                  memory: "100Mi"
                  cpu: "8"
      (1) Replace with the namespace for your deployment.
      (2) schedulerName must match the name of the NUMA-aware scheduler that is deployed in your cluster, for example topo-aware-scheduler.
    2. Create the Deployment CR by running the following command:

      $ oc create -f nro-deployment.yaml

Verification

  1. Verify that the deployment was successful:

    $ oc get pods -n openshift-numaresources

    Example output

    NAME                                                READY   STATUS    RESTARTS   AGE
    numa-deployment-1-56954b7b46-pfgw8                  2/2     Running   0          129m
    numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          15h
    numaresourcesoperator-worker-dvj4n                  2/2     Running   0          18h
    numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16h
    secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          18h
  2. Verify that the topo-aware-scheduler is scheduling the deployed pod by running the following command:

    $ oc describe pod numa-deployment-1-56954b7b46-pfgw8 -n openshift-numaresources

    Example output

    Events:
      Type    Reason     Age   From                  Message
      ----    ------     ----  ----                  -------
      Normal  Scheduled  130m  topo-aware-scheduler  Successfully assigned openshift-numaresources/numa-deployment-1-56954b7b46-pfgw8 to compute-0.example.com

    Deployments that request more resources than are available for scheduling fail with a MinimumReplicasUnavailable error. The deployment succeeds when the required resources become available. Pods remain in the Pending state until the required resources are available.

  3. Verify that the expected allocated resources are listed for the node.

    1. Identify the node that is running the deployment pod by running the following command, replacing <namespace> with the namespace you specified in the Deployment CR:

      $ oc get pods -n <namespace> -o wide

      Example output

      NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
      numa-deployment-1-65684f8fcc-bw4bw   0/2     Running   0          82m   10.128.2.50   worker-0   <none>           <none>
    2. Run the following command, replacing <node_name> with the name of the node that is running the deployment pod:

      $ oc describe noderesourcetopologies.topology.node.k8s.io <node_name>

      Example output

      ...
      Zones:
        Costs:
          Name:   node-0
          Value:  10
          Name:   node-1
          Value:  21
        Name:  node-0
        Resources:
          Allocatable:  39
          Available:    21 (1)
          Capacity:     40
          Name:         cpu
          Allocatable:  6442450944
          Available:    6442450944
          Capacity:     6442450944
          Name:         hugepages-1Gi
          Allocatable:  134217728
          Available:    134217728
          Capacity:     134217728
          Name:         hugepages-2Mi
          Allocatable:  262415904768
          Available:    262206189568
          Capacity:     270146007040
          Name:         memory
        Type:  Node
      (1) The Available capacity is reduced because of the resources that have been allocated to the guaranteed pod.

      Resources consumed by guaranteed pods are subtracted from the available node resources listed under noderesourcetopologies.topology.node.k8s.io.

  4. Resource allocations for pods with a Best-effort or Burstable quality of service (qosClass) are not reflected in the NUMA node resources under noderesourcetopologies.topology.node.k8s.io. If a pod’s consumed resources are not reflected in the node resource calculation, verify that the pod has a qosClass of Guaranteed and that the CPU request is an integer value, not a decimal value. You can verify that the pod has a qosClass of Guaranteed by running the following command:

    $ oc get pod <pod_name> -n <pod_namespace> -o jsonpath="{ .status.qosClass }"

    Example output

    Guaranteed

Optional: Configuring polling operations for NUMA resources updates

The daemons controlled by the NUMA Resources Operator poll resources in their nodeGroup to retrieve updates about available NUMA resources. You can fine-tune polling operations for these daemons by configuring the spec.nodeGroups specification in the NUMAResourcesOperator custom resource (CR). This provides advanced control of polling operations. Configure these specifications to improve scheduling behavior and troubleshoot suboptimal scheduling decisions.

The configuration options are the following:

  • infoRefreshMode: Determines the trigger condition for polling the kubelet. The NUMA Resources Operator reports the resulting information to the API server.

  • infoRefreshPeriod: Determines the duration between polling updates.

  • podsFingerprinting: Determines if point-in-time information for the current set of pods running on a node is exposed in polling updates.

    podsFingerprinting is enabled by default. podsFingerprinting is a requirement for the cacheResyncPeriod specification in the NUMAResourcesScheduler CR. The cacheResyncPeriod specification helps to report more exact resource availability by monitoring pending resources on nodes.
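    For example, with podsFingerprinting left enabled, you can set cacheResyncPeriod in the NUMAResourcesScheduler CR. This is a sketch; the 5s value is illustrative, not a recommendation:

    ```yaml
    apiVersion: nodetopology.openshift.io/v1
    kind: NUMAResourcesScheduler
    metadata:
      name: numaresourcesscheduler
    spec:
      imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.13"
      cacheResyncPeriod: "5s"   # requires podsFingerprinting: Enabled in the NUMAResourcesOperator CR
    ```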

Prerequisites

  • Install the OKD CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator.

Procedure

  • Configure the spec.nodeGroups specification in your NUMAResourcesOperator CR:

    apiVersion: nodetopology.openshift.io/v1
    kind: NUMAResourcesOperator
    metadata:
      name: numaresourcesoperator
    spec:
      nodeGroups:
      - config:
          infoRefreshMode: Periodic (1)
          infoRefreshPeriod: 10s (2)
          podsFingerprinting: Enabled (3)
        name: worker
    (1) Valid values are Periodic, Events, and PeriodicAndEvents. Use Periodic to poll the kubelet at intervals that you define in infoRefreshPeriod. Use Events to poll the kubelet at every pod lifecycle event. Use PeriodicAndEvents to enable both methods.
    (2) Define the polling interval for Periodic or PeriodicAndEvents refresh modes. The field is ignored if the refresh mode is Events.
    (3) Valid values are Enabled or Disabled. Setting to Enabled is a requirement for the cacheResyncPeriod specification in the NUMAResourcesScheduler CR.

Verification

  1. After you deploy the NUMA Resources Operator, verify that the node group configurations were applied by running the following command:

    $ oc get numaresop numaresourcesoperator -o json | jq '.status'

    Example output

    ...
    "config": {
      "infoRefreshMode": "Periodic",
      "infoRefreshPeriod": "10s",
      "podsFingerprinting": "Enabled"
    },
    "name": "worker"
    ...

Troubleshooting NUMA-aware scheduling

To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.

Prerequisites

  • Install the OKD CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Verify that the noderesourcetopologies CRD is deployed in the cluster by running the following command:

    $ oc get crd | grep noderesourcetopologies

    Example output

    NAME                                          CREATED AT
    noderesourcetopologies.topology.node.k8s.io   2022-01-18T08:28:06Z
  2. Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'

    Example output

    topo-aware-scheduler
  3. Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them by running the following command:

    $ oc get noderesourcetopologies.topology.node.k8s.io

    Example output

    NAME                    AGE
    compute-0.example.com   17h
    compute-1.example.com   17h

    The number of nodes should equal the number of worker nodes that are configured by the machine config pool (mcp) worker definition.
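    This comparison can be scripted. A sketch, with the live `oc` queries shown in comments and replaced by fixture values matching the example output above (machineCount is the machine total in the worker MachineConfigPool status):

    ```shell
    #!/bin/sh
    # Compare the number of NodeResourceTopology objects with the number of
    # machines in the worker pool. On a live cluster, populate the variables
    # with the commented oc commands instead of the fixture values.
    # nrt_count=$(oc get noderesourcetopologies.topology.node.k8s.io --no-headers | wc -l)
    # worker_count=$(oc get mcp worker -o jsonpath='{.status.machineCount}')
    nrt_count=2      # compute-0 and compute-1 from the example output
    worker_count=2
    if [ "$nrt_count" -eq "$worker_count" ]; then
      echo "node counts match"
    else
      echo "mismatch: $nrt_count NodeResourceTopology objects vs $worker_count worker machines"
    fi
    ```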

  4. Verify the NUMA zone granularity for all schedulable nodes by running the following command:

    $ oc get noderesourcetopologies.topology.node.k8s.io -o yaml

    Example output

    apiVersion: v1
    items:
    - apiVersion: topology.node.k8s.io/v1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-06-16T08:55:38Z"
        generation: 63760
        name: worker-0
        resourceVersion: "8450223"
        uid: 8b77be46-08c0-4074-927b-d49361471590
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones:
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources:
        - allocatable: "38"
          available: "38"
          capacity: "40"
          name: cpu
        - allocatable: "134217728"
          available: "134217728"
          capacity: "134217728"
          name: hugepages-2Mi
        - allocatable: "262352048128"
          available: "262352048128"
          capacity: "270107316224"
          name: memory
        - allocatable: "6442450944"
          available: "6442450944"
          capacity: "6442450944"
          name: hugepages-1Gi
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "268435456"
          available: "268435456"
          capacity: "268435456"
          name: hugepages-2Mi
        - allocatable: "269231067136"
          available: "269231067136"
          capacity: "270573244416"
          name: memory
        - allocatable: "40"
          available: "40"
          capacity: "40"
          name: cpu
        - allocatable: "1073741824"
          available: "1073741824"
          capacity: "1073741824"
          name: hugepages-1Gi
        type: Node
    - apiVersion: topology.node.k8s.io/v1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-06-16T08:55:37Z"
        generation: 62061
        name: worker-1
        resourceVersion: "8450129"
        uid: e8659390-6f8d-4e67-9a51-1ea34bba1cc3
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones: (1)
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources: (2)
        - allocatable: "38"
          available: "38"
          capacity: "40"
          name: cpu
        - allocatable: "6442450944"
          available: "6442450944"
          capacity: "6442450944"
          name: hugepages-1Gi
        - allocatable: "134217728"
          available: "134217728"
          capacity: "134217728"
          name: hugepages-2Mi
        - allocatable: "262391033856"
          available: "262391033856"
          capacity: "270146301952"
          name: memory
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "40"
          available: "40"
          capacity: "40"
          name: cpu
        - allocatable: "1073741824"
          available: "1073741824"
          capacity: "1073741824"
          name: hugepages-1Gi
        - allocatable: "268435456"
          available: "268435456"
          capacity: "268435456"
          name: hugepages-2Mi
        - allocatable: "269192085504"
          available: "269192085504"
          capacity: "270534262784"
          name: memory
        type: Node
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
    (1) Each stanza under zones describes the resources for a single NUMA zone.
    (2) resources describes the current state of the NUMA zone resources. Check that resources listed under items.zones.resources.available correspond to the exclusive NUMA zone resources allocated to each guaranteed pod.
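    To pull only the per-zone availability out of this output, you can filter the JSON form with jq (assumed to be installed). The here-document below is a trimmed fixture mirroring the zones above; on a live cluster you would pipe `oc get noderesourcetopologies.topology.node.k8s.io <node_name> -o json` into the same filter:

    ```shell
    #!/bin/sh
    # Print available CPUs per NUMA zone from a NodeResourceTopology JSON dump.
    cat <<'EOF' > /tmp/nrt.json
    {"zones":[
      {"name":"node-0","resources":[{"name":"cpu","available":"38"}]},
      {"name":"node-1","resources":[{"name":"cpu","available":"40"}]}
    ]}
    EOF
    jq -r '.zones[] | "\(.name): \((.resources[] | select(.name == "cpu")).available) CPUs available"' /tmp/nrt.json
    ```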

Checking the NUMA-aware scheduler logs

Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.

To change the log level of the secondary scheduler, delete the running scheduler resource and re-deploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.

Prerequisites

  • Install the OKD CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Delete the currently running NUMAResourcesScheduler resource:

    1. Get the active NUMAResourcesScheduler by running the following command:

      $ oc get NUMAResourcesScheduler

      Example output

      NAME                     AGE
      numaresourcesscheduler   90m
    2. Delete the secondary scheduler resource by running the following command:

      $ oc delete NUMAResourcesScheduler numaresourcesscheduler

      Example output

      numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
  2. Save the following YAML in the file nro-scheduler-debug.yaml. This example changes the log level to Debug:

    apiVersion: nodetopology.openshift.io/v1
    kind: NUMAResourcesScheduler
    metadata:
      name: numaresourcesscheduler
    spec:
      imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.13"
      logLevel: Debug
  3. Create the updated Debug logging NUMAResourcesScheduler resource by running the following command:

    $ oc create -f nro-scheduler-debug.yaml

    Example output

    numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created

Verification

  1. Check that the NUMA-aware scheduler was successfully deployed:

    1. Run the following command to check that the CRD is created successfully:

      $ oc get crd | grep numaresourcesschedulers

      Example output

      NAME                                                CREATED AT
      numaresourcesschedulers.nodetopology.openshift.io   2022-02-25T11:57:03Z
    2. Check that the new custom scheduler is available by running the following command:

      $ oc get numaresourcesschedulers.nodetopology.openshift.io

      Example output

      NAME                     AGE
      numaresourcesscheduler   3h26m
  2. Check that the logs for the scheduler show the increased log level:

    1. Get the list of pods running in the openshift-numaresources namespace by running the following command:

      $ oc get pods -n openshift-numaresources

      Example output

      NAME                                               READY   STATUS    RESTARTS   AGE
      numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
      numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
      numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
      secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
    2. Get the logs for the secondary scheduler pod by running the following command:

      $ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources

      Example output

      ...
      I0223 11:04:55.614788 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
      I0223 11:04:56.609114 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
      I0223 11:05:22.626818 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
      I0223 11:05:31.610356 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
      I0223 11:05:31.713032 1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
      I0223 11:05:53.461016 1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"

Troubleshooting the resource topology exporter

Troubleshoot noderesourcetopologies objects that report unexpected results by inspecting the corresponding resource-topology-exporter logs.

It is recommended that NUMA resource topology exporter instances in the cluster are named after the nodes they refer to. For example, a worker node with the name worker should have a corresponding noderesourcetopologies object called worker.

Prerequisites

  • Install the OKD CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Get the daemonsets managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR. Run the following command:

    $ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"

    Example output

    {"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
  2. Get the label for the daemonset of interest using the value for name from the previous step:

    $ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"

    Example output

    {"name":"resource-topology"}
  3. Get the pods using the resource-topology label by running the following command:

    $ oc get pods -n openshift-numaresources -l name=resource-topology -o wide

    Example output

    NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
    numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
    numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com
  4. Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:

    $ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c

    Example output

    I0221 13:38:18.334140 1 main.go:206] using sysinfo:
    reservedCpus: 0,1
    reservedMemory:
      "0": 1178599424
    I0221 13:38:18.334370 1 main.go:67] === System information ===
    I0221 13:38:18.334381 1 sysinfo.go:231] cpus: reserved "0-1"
    I0221 13:38:18.334493 1 sysinfo.go:237] cpus: online "0-103"
    I0221 13:38:18.546750 1 main.go:72]
    cpus: allocatable "2-103"
    hugepages-1Gi:
      numa cell 0 -> 6
      numa cell 1 -> 1
    hugepages-2Mi:
      numa cell 0 -> 64
      numa cell 1 -> 128
    memory:
      numa cell 0 -> 45758Mi
      numa cell 1 -> 48372Mi

Correcting a missing resource topology exporter config map

If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances the Operator is shown as active, but the logs of the resource topology exporter (RTE) daemon set pods show that the configuration for the RTE is missing, for example:

  Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"

This log message indicates that the kubeletconfig with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap. For example, the following cluster is missing a numaresourcesoperator-worker configmap custom resource (CR):

  $ oc get configmap

Example output

  NAME                          DATA   AGE
  0e2a6bd3.openshift-kni.io     0      6d21h
  kube-root-ca.crt              1      6d21h
  openshift-service-ca.crt      1      6d21h
  topo-aware-scheduler-config   1      6d18h

In a correctly configured cluster, oc get configmap also returns a numaresourcesoperator-worker configmap CR.

Prerequisites

  • Install the OKD CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Compare the values for spec.machineConfigPoolSelector.matchLabels in kubeletconfig and metadata.labels in the MachineConfigPool (mcp) worker CR using the following commands:

    1. Check the kubeletconfig labels by running the following command:

      $ oc get kubeletconfig -o yaml

      Example output

      machineConfigPoolSelector:
        matchLabels:
          cnf-worker-tuning: enabled
    2. Check the mcp labels by running the following command:

      $ oc get mcp worker -o yaml

      Example output

      labels:
        machineconfiguration.openshift.io/mco-built-in: ""
        pools.operator.machineconfiguration.openshift.io/worker: ""

      The cnf-worker-tuning: enabled label is not present in the MachineConfigPool object.

  2. Edit the MachineConfigPool CR to include the missing label, for example:

    $ oc edit mcp worker -o yaml

    Example output

    labels:
      machineconfiguration.openshift.io/mco-built-in: ""
      pools.operator.machineconfiguration.openshift.io/worker: ""
      cnf-worker-tuning: enabled
  3. Save the label changes and wait for the cluster to apply the updated configuration.

Verification

  • Check that the missing numaresourcesoperator-worker configmap CR is applied:

    $ oc get configmap

    Example output

    NAME                           DATA   AGE
    0e2a6bd3.openshift-kni.io      0      6d21h
    kube-root-ca.crt               1      6d21h
    numaresourcesoperator-worker   1      5m
    openshift-service-ca.crt       1      6d21h
    topo-aware-scheduler-config    1      6d18h