NUMA

FEATURE STATE: KubeVirt v0.43

NUMA support in KubeVirt is currently limited to a small set of special use cases and will improve over time, together with improvements made to Kubernetes.

In general, the goal is to map the host NUMA topology as efficiently as possible to the Virtual Machine topology in order to improve performance.

The following NUMA mapping strategies can be used:

  • GuestMappingPassthrough

Preconditions

In order to use the current NUMA support, the following preconditions must be met:

  • The CPU Manager must be enabled on the target nodes, so that VMIs can be assigned dedicated CPUs.
  • Hugepages must be enabled and available on the target nodes.

GuestMappingPassthrough

GuestMappingPassthrough passes the node's NUMA topology through to the guest. The topology is based on the dedicated CPUs which the VMI is assigned by the kubelet via the CPU Manager. It can be requested by setting spec.domain.cpu.numa.guestMappingPassthrough on the VMI.

Since KubeVirt does not know upfront which exclusive CPUs the VMI will get from the kubelet, there are some limitations:

  • Guests may see different NUMA topologies when being rescheduled.
  • The resulting NUMA topology may be asymmetrical.
  • The VMI may fail to start on the node if not enough hugepages are available on the assigned NUMA nodes.

While this NUMA modelling strategy has its limitations, aligning the guest’s NUMA architecture with the node’s can be critical for high-performance applications.

An example VMI may look like this:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: numavm
spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: { }
    devices:
      disks:
      - disk:
          bus: virtio
        name: containerdisk
      - disk:
          bus: virtio
        name: cloudinitdisk
    resources:
      requests:
        memory: 64Mi
    memory:
      hugepages:
        pageSize: 2Mi
  volumes:
  - containerDisk:
      image: quay.io/kubevirt/cirros-container-disk-demo
    name: containerdisk
  - cloudInitNoCloud:
      userData: |
        #!/bin/sh
        echo 'printed from cloud-init userdata'
    name: cloudinitdisk

Running real-time workloads

Overview

It is possible to deploy Virtual Machines that run a real-time kernel and make use of libvirt's guest CPU and memory optimizations to improve overall latency. These changes mostly leverage settings that are already available in KubeVirt, as we will see shortly, but the VMI manifest now exposes two new settings that instruct KubeVirt to configure the generated libvirt XML with the recommended tuning for running real-time workloads.

To make use of these optimizations, two new settings have been added to the VMI schema:

  • spec.domain.cpu.realtime: When defined, it instructs KubeVirt to configure the Linux scheduler so that the VM's vCPU threads run with the FIFO scheduling policy (SCHED_FIFO) at priority 1. This guarantees that the vCPU threads are executed on the host with real-time priority.

  • spec.domain.cpu.realtime.mask: It defines which of the vCPUs assigned to the VM are used for real-time scheduling. If not defined, all assigned vCPUs are configured to run with the FIFO scheduling policy at the highest priority (1). Both settings are shown together in the snippet below.
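
For illustration, a VMI that restricts real-time scheduling to its first two vCPUs could combine both settings as follows. This is a minimal sketch: the mask value "0-1" is an assumption and requires the VMI to request at least two dedicated vCPUs; the rest of the manifest (hugepages, resources, volumes) would look like the full example further below.

spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}
      realtime:
        mask: "0-1"   # only vCPUs 0 and 1 are scheduled with SCHED_FIFO priority 1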

Preconditions

A prerequisite to running real-time workloads is locking resources in the cluster so that the real-time VM can use them exclusively. This translates into one or more nodes configured with a dedicated set of CPUs and with NUMA support, with enough free hugepages of 2Mi or 1Gi size (depending on the configuration in the VMI). Additionally, the node must be configured to allow the scheduler to run processes with the real-time policy.

Nodes capable of running real-time workloads

When the KubeVirt pod is deployed on a node, it checks whether the node is capable of running processes with the real-time scheduling policy and, if so, labels the node as real-time capable (kubevirt.io/realtime). If the node is not able to deliver this capability, the label is not applied. To check which nodes are able to host real-time VM workloads, run this command:

$>kubectl get nodes -l kubevirt.io/realtime
NAME         STATUS   ROLES    AGE   VERSION
worker-0-0   Ready    worker   12d   v1.20.0+df9c838

Internally, the KubeVirt pod running on each node checks whether the kernel setting kernel.sched_rt_runtime_us equals -1, which allows processes to run with the real-time scheduling policy for an unlimited amount of time.
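
To verify this setting manually, you can query it directly on the worker node (not through kubectl). The check below mirrors what KubeVirt looks for and is only an illustrative example:

$>sysctl kernel.sched_rt_runtime_us
kernel.sched_rt_runtime_us = -1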

Configuring a VM Manifest

Here is an example of a VM manifest that runs a custom Fedora container disk configured with a real-time kernel. The settings have been configured for optimal efficiency.

---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: fedora-realtime
  name: fedora-realtime
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        kubevirt.io/vm: fedora-realtime
    spec:
      domain:
        devices:
          autoattachSerialConsole: true
          autoattachMemBalloon: false
          autoattachGraphicsDevice: false
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
        machine:
          type: ""
        resources:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 1Gi
            cpu: 2
        cpu:
          model: host-passthrough
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          ioThreadsPolicy: auto
          features:
          - name: tsc-deadline
            policy: require
          numa:
            guestMappingPassthrough: {}
          realtime: {}
        memory:
          hugepages:
            pageSize: 1Gi
      terminationGracePeriodSeconds: 0
      volumes:
      - containerDisk:
          image: quay.io/kubevirt/fedora-realtime-container-disk:v20211008-22109a3
        name: containerdisk
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
            bootcmd:
              - tuned-adm profile realtime
        name: cloudinitdisk

Breaking down the tuning-related sections, we have the following configuration:

Devices:

  • Disable the guest's memory balloon capability.
  • Avoid attaching a graphics device, to reduce the number of interrupts to the kernel.

spec:
  domain:
    devices:
      autoattachSerialConsole: true
      autoattachMemBalloon: false
      autoattachGraphicsDevice: false

CPU:

  • model: host-passthrough allows the guest to see the host CPU without masking any capability.
  • dedicatedCpuPlacement: the VM needs to have dedicated CPUs assigned to it; the Kubernetes CPU Manager takes care of this aspect.
  • isolateEmulatorThread: requests an additional CPU on which to run the emulator, thus avoiding the use of CPU cycles from the workload CPUs.
  • ioThreadsPolicy: set to auto to let the dedicated IO thread run on the same CPU as the emulator thread.
  • numa: defining guestMappingPassthrough enables NUMA support for this VM.
  • realtime: instructs virt-handler to configure this VM for real-time workloads, such as configuring the vCPUs to use the FIFO scheduling policy with priority 1.

cpu:
  model: host-passthrough
  dedicatedCpuPlacement: true
  isolateEmulatorThread: true
  ioThreadsPolicy: auto
  features:
  - name: tsc-deadline
    policy: require
  numa:
    guestMappingPassthrough: {}
  realtime: {}

Memory:

  • pageSize: allocate the pod's memory in hugepages of the given size, in this case 1Gi.

memory:
  hugepages:
    pageSize: 1Gi

How to dedicate vCPUs for real-time only

It is possible to specify which vCPUs run with the real-time scheduling policy by passing a mask of the vCPUs to isolate in the realtime.mask setting.

cpu:
  numa:
    guestMappingPassthrough: {}
  realtime:
    mask: "0"

When this configuration is applied, KubeVirt will configure only the first vCPU with the real-time scheduling policy, leaving the remaining vCPUs with the default scheduling policy. Other examples of valid masks are:

  • 0-3: use vCPUs 0 to 3 for real-time scheduling, assuming that the VM has requested at least four vCPUs.
  • 0-3,^1: use vCPUs 0, 2 and 3 for real-time scheduling only, assuming that the VM has requested at least four vCPUs.
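
As an illustration of the exclusion syntax, the sketch below (assuming a VM that requests at least four dedicated vCPUs) applies real-time scheduling to vCPUs 0, 2 and 3 while leaving vCPU 1 on the default scheduler:

cpu:
  numa:
    guestMappingPassthrough: {}
  realtime:
    mask: "0-3,^1"   # vCPUs 0, 2 and 3 run SCHED_FIFO; vCPU 1 is excluded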

Additional Reading

Kubernetes provides additional NUMA components that may be relevant to your use case but are typically not enabled by default. Please consult the Kubernetes documentation for details on configuring these components.

Topology Manager

Topology Manager provides optimizations related to CPU isolation, memory and device locality. It is useful, for example, where an SR-IOV network adaptor VF allocation needs to be aligned with a NUMA node.

https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
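
As a rough illustration only (not an official recommendation; consult the page above for the authoritative options), a kubelet could be pointed at the single-numa-node policy with a KubeletConfiguration along these lines:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# The static CPU manager policy is required for dedicated CPU placement
# and needs some CPUs reserved for system use.
cpuManagerPolicy: static
reservedSystemCPUs: "0,1"
# Align CPU, device and memory allocations to a single NUMA node.
topologyManagerPolicy: single-numa-node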

Memory Manager

Memory Manager is analogous to CPU Manager. It is useful, for example, where you want to align hugepage allocations with a NUMA node. It works in conjunction with Topology Manager.

The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod, and feeds these affinity hints to the central manager (the Topology Manager). Based on both the hints and the Topology Manager policy, the pod is admitted to or rejected from the node.

https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/
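
Extending the sketch above, the Memory Manager could be enabled with the Static policy roughly as follows. This is an assumption-laden example: the reservedMemory entries must add up to the kubelet's total reserved memory (kubeReserved + systemReserved + hard eviction threshold), and the values shown are placeholders, so consult the page above before applying anything.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
reservedSystemCPUs: "0,1"
topologyManagerPolicy: single-numa-node
memoryManagerPolicy: Static
# Placeholder reservations: 512Mi + 412Mi + 100Mi = 1Gi,
# matching the reservedMemory entry for NUMA node 0 below.
kubeReserved:
  memory: 512Mi
systemReserved:
  memory: 412Mi
evictionHard:
  memory.available: 100Mi
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1Gi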