Host Devices Assignment

KubeVirt provides a mechanism for assigning host devices to a virtual machine. This mechanism is generic and allows various types of PCI devices, such as accelerators (including GPUs) or any other devices attached to a PCI bus, to be assigned. It also allows Linux mediated devices, such as pre-configured virtual GPUs, to be assigned using the same mechanism.

Host preparation for PCI Passthrough

  • Host device passthrough requires the virtualization extension and the IOMMU extension (Intel VT-d or AMD IOMMU) to be enabled in the BIOS.

  • To enable IOMMU, depending on the CPU type, a host should be booted with an additional kernel parameter, intel_iommu=on for Intel and amd_iommu=on for AMD.

Append these parameters to the end of the GRUB_CMDLINE_LINUX line in the grub configuration file.

    # vi /etc/default/grub
    ...
    GRUB_CMDLINE_LINUX="nofb splash=quiet console=tty0 ... intel_iommu=on"
    ...
    # grub2-mkconfig -o /boot/grub2/grub.cfg
    # reboot
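
After rebooting, it can be worth confirming that the kernel actually enabled the IOMMU before continuing; a minimal check (the exact messages vary by vendor and kernel version):

    # Look for "DMAR: IOMMU enabled" (Intel) or "AMD-Vi" messages (AMD)
    dmesg | grep -i -e DMAR -e IOMMU
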
  • The vfio-pci kernel module should be enabled on the host.

    # modprobe vfio-pci
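
    To have the module loaded automatically on subsequent boots, one option on systemd-based hosts is a modules-load.d drop-in; a minimal sketch (the file name is arbitrary):

    # Load vfio-pci at boot via systemd-modules-load
    echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf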

Preparation of PCI devices for passthrough

At this time, KubeVirt is only able to assign PCI devices that are using the vfio-pci driver. To prepare a specific device for device assignment, it should first be unbound from its original driver and bound to the vfio-pci driver.

  • Find the PCI address of the desired device:
    $ lspci -DD|grep NVIDIA
    0000:65:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
  • Bind that device to the vfio-pci driver:

    echo 0000:65:00.0 > /sys/bus/pci/drivers/nvidia/unbind
    echo "vfio-pci" > /sys/bus/pci/devices/0000\:65\:00.0/driver_override
    echo 0000:65:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
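
    Before exposing the device to KubeVirt, it may help to verify that the binding took effect; for example (the PCI address is the one found above, and the output shown here is only illustrative):

    $ lspci -nnk -s 0000:65:00.0
    65:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
            Kernel driver in use: vfio-pci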

Preparation of mediated devices such as vGPU

In general, configuration of mediated devices (mdevs), such as vGPUs, should be done according to the vendor's directions. KubeVirt can now facilitate the creation of mediated devices/vGPUs on the cluster nodes. This assumes that the required vendor driver is already installed on the nodes. See the Mediated devices and virtual GPUs documentation to learn more about this functionality.

Once the mdev is configured, KubeVirt will be able to discover and use it for device assignment.

Listing permitted devices

Administrators can control which host devices are exposed and permitted to be used in the cluster. Permitted host devices need to be allowlisted in the KubeVirt CR, either by their vendor:product selector for PCI devices or by their mediated device names.

    configuration:
      permittedHostDevices:
        pciHostDevices:
        - pciVendorSelector: "10DE:1EB8"
          resourceName: "nvidia.com/TU104GL_Tesla_T4"
          externalResourceProvider: true
        - pciVendorSelector: "8086:6F54"
          resourceName: "intel.com/qat"
        mediatedDevices:
        - mdevNameSelector: "GRID T4-1Q"
          resourceName: "nvidia.com/GRID_T4-1Q"
  • pciVendorSelector is a PCI vendor ID and product ID tuple in the form vendor_id:product_id. This tuple can identify specific types of devices on a host. For example, the identifier 10de:1eb8, shown above, can be found using lspci.

    $ lspci -nnv|grep -i nvidia
    65:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
  • mdevNameSelector is the name of a mediated device type that can identify specific types of mediated devices on a host.

    You can see which mediated device types a given PCI device supports by examining the contents of /sys/bus/pci/devices/DOMAIN:BUS:SLOT.FUNCTION/mdev_supported_types/TYPE/name. For example, if you have an NVIDIA T4 GPU on your system and you substitute the DOMAIN, BUS, SLOT, and FUNCTION values that are correct for your system into the above path, you will see that a TYPE of nvidia-226 contains the selector string GRID T4-2A in its name file.

    Taking GRID T4-2A and specifying it as the mdevNameSelector allows KubeVirt to find a corresponding mediated device by matching it against /sys/class/mdev_bus/DOMAIN:BUS:SLOT.FUNCTION/$mdevUUID/mdev_type/name for some values of DOMAIN:BUS:SLOT.FUNCTION and $mdevUUID.
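
    A small loop such as the one below can list every supported mdev type together with its human-readable name on a host; the paths only exist once a vendor driver that exposes mediated device types is loaded:

    # Print each supported mdev type alongside its selector name
    for f in /sys/class/mdev_bus/*/mdev_supported_types/*/name; do
        echo "$f: $(cat "$f")"
    done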

  • External providers: the externalResourceProvider field indicates that this resource is being provided by an external device plugin. In this case, KubeVirt only permits the usage of this device in the cluster but leaves the allocation and monitoring to the external device plugin.
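
One way to apply the configuration above is to edit or patch the KubeVirt custom resource; the resource and namespace names below assume a default installation. Once applied, the permitted devices should appear as allocatable resources on the nodes that expose them:

    # Add the permittedHostDevices section to the KubeVirt CR
    kubectl edit kubevirt kubevirt -n kubevirt

    # Verify that the resources are advertised on a node
    kubectl describe node <node-name> | grep -E 'nvidia.com|intel.com'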

Starting a Virtual Machine

Host devices can be assigned to virtual machines via the gpus and hostDevices fields. The deviceName field can reference both PCI and mediated device resource names.

    kind: VirtualMachineInstance
    spec:
      domain:
        devices:
          gpus:
          - deviceName: nvidia.com/TU104GL_Tesla_T4
            name: gpu1
          - deviceName: nvidia.com/GRID_T4-1Q
            name: gpu2
          hostDevices:
          - deviceName: intel.com/qat
            name: quickaccess1
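
Once such a VMI is running, the assigned devices should be visible on the guest's PCI bus. A quick way to check, assuming virtctl is available and the guest image ships lspci (the VMI name vmi-gpu is only an example):

    # Open a console to the guest and list the passed-through devices
    virtctl console vmi-gpu
    # ...then, inside the guest:
    lspci | grep -i nvidia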

NVMe PCI passthrough

To pass through an NVMe device, the procedure is very similar to the GPU case. The device needs to be listed under permittedHostDevices in the KubeVirt CR and under hostDevices in the VM declaration.

Currently, the KubeVirt device plugin doesn't allow the user to select a specific device by its address. Therefore, if multiple NVMe devices with the same vendor and product ID exist in the cluster, they could be randomly assigned to a VM. If the devices are not on the same node, a nodeSelector mitigates the issue.
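
The vendor:product pair for the NVMe controller can be found with lspci, much like in the GPU example above; the value in square brackets at the end of the matching line is what goes into pciVendorSelector:

    # NVMe controllers are listed under the "Non-Volatile memory controller" class
    lspci -nn | grep -i "non-volatile"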

Example:

Modify the permittedHostDevices configuration:

    configuration:
      permittedHostDevices:
        pciHostDevices:
        - pciVendorSelector: 8086:5845
          resourceName: devices.kubevirt.io/nvme

VMI declaration:

    kind: VirtualMachineInstance
    metadata:
      labels:
        special: vmi-nvme
      name: vmi-nvme
    spec:
      nodeSelector:
        kubernetes.io/hostname: node03 # <--
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          hostDevices: # <--
          - name: nvme # <--
            deviceName: devices.kubevirt.io/nvme # <--
        resources:
          requests:
            memory: 1024M
      terminationGracePeriodSeconds: 0
      volumes:
      - containerDisk:
          image: registry:5000/kubevirt/fedora-with-test-tooling-container-disk:devel
        name: containerdisk
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
        name: cloudinitdisk
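
After the VMI starts, the passed-through controller should show up inside the guest as an ordinary NVMe device; one way to confirm, assuming the guest image includes the usual tools:

    # From the guest console (e.g. "virtctl console vmi-nvme"):
    lsblk                          # an additional nvme disk should be listed
    lspci | grep -i "non-volatile"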