Configuring PCI passthrough
The Peripheral Component Interconnect (PCI) passthrough feature enables you to access and manage hardware devices from a virtual machine. When PCI passthrough is configured, the PCI devices function as if they were physically attached to the guest operating system.
Cluster administrators can expose and manage host devices that are permitted to be used in the cluster by using the oc
command-line interface (CLI).
About preparing a host device for PCI passthrough
To prepare a host device for PCI passthrough by using the CLI, create a MachineConfig
object and add kernel arguments to enable the Input-Output Memory Management Unit (IOMMU). Bind the PCI device to the Virtual Function I/O (VFIO) driver and then expose it in the cluster by editing the permittedHostDevices
field of the HyperConverged
custom resource (CR). The permittedHostDevices
list is empty when you first install the OKD Virtualization Operator.
To remove a PCI host device from the cluster by using the CLI, delete the PCI device information from the HyperConverged
CR.
Adding kernel arguments to enable the IOMMU driver
To enable the IOMMU (Input-Output Memory Management Unit) driver in the kernel, create the MachineConfig
object and add the kernel arguments.
Prerequisites
Administrative privilege to a working OKD cluster.
Intel or AMD CPU hardware.
Intel Virtualization Technology for Directed I/O extensions or AMD IOMMU in the BIOS (Basic Input/Output System) is enabled.
Procedure
Create a
MachineConfig
object that identifies the kernel argument. The following example shows a kernel argument for an Intel CPU.apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: worker (1)
name: 100-worker-iommu (2)
spec:
config:
ignition:
version: 3.2.0
kernelArguments:
- intel_iommu=on (3)
# ...
1 Applies the new kernel argument only to worker nodes. 2 The name
indicates the ranking of this kernel argument (100) among the machine configs and its purpose. If you have an AMD CPU, specify the kernel argument asamd_iommu=on
.3 Identifies the kernel argument as intel_iommu
for an Intel CPU.Create the new
MachineConfig
object:$ oc create -f 100-worker-kernel-arg-iommu.yaml
Verification
Verify that the new
MachineConfig
object was added.$ oc get MachineConfig
Binding PCI devices to the VFIO driver
To bind PCI devices to the VFIO (Virtual Function I/O) driver, obtain the values for vendor-ID
and device-ID
from each device and create a list with the values. Add this list to the MachineConfig
object. The MachineConfig
Operator generates the /etc/modprobe.d/vfio.conf
on the nodes with the PCI devices, and binds the PCI devices to the VFIO driver.
Prerequisites
- You added kernel arguments to enable IOMMU for the CPU.
Procedure
Run the
lspci
command to obtain thevendor-ID
and thedevice-ID
for the PCI device.$ lspci -nnv | grep -i nvidia
Example output
02:01.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:1eb8] (rev a1)
Create a Butane config file,
100-worker-vfiopci.bu
, binding the PCI device to the VFIO driver.See “Creating machine configs with Butane” for information about Butane.
Example
variant: openshift
version: 4.13.0
metadata:
name: 100-worker-vfiopci
labels:
machineconfiguration.openshift.io/role: worker (1)
storage:
files:
- path: /etc/modprobe.d/vfio.conf
mode: 0644
overwrite: true
contents:
inline: |
options vfio-pci ids=10de:1eb8 (2)
- path: /etc/modules-load.d/vfio-pci.conf (3)
mode: 0644
overwrite: true
contents:
inline: vfio-pci
1 Applies the new kernel argument only to worker nodes. 2 Specify the previously determined vendor-ID
value (10de
) and thedevice-ID
value (1eb8
) to bind a single device to the VFIO driver. You can add a list of multiple devices with their vendor and device information.3 The file that loads the vfio-pci kernel module on the worker nodes. Use Butane to generate a
MachineConfig
object file,100-worker-vfiopci.yaml
, containing the configuration to be delivered to the worker nodes:$ butane 100-worker-vfiopci.bu -o 100-worker-vfiopci.yaml
Apply the
MachineConfig
object to the worker nodes:$ oc apply -f 100-worker-vfiopci.yaml
Verify that the
MachineConfig
object was added.$ oc get MachineConfig
Example output
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master d3da910bfa9f4b599af4ed7f5ac270d55950a3a1 3.2.0 25h
00-worker d3da910bfa9f4b599af4ed7f5ac270d55950a3a1 3.2.0 25h
01-master-container-runtime d3da910bfa9f4b599af4ed7f5ac270d55950a3a1 3.2.0 25h
01-master-kubelet d3da910bfa9f4b599af4ed7f5ac270d55950a3a1 3.2.0 25h
01-worker-container-runtime d3da910bfa9f4b599af4ed7f5ac270d55950a3a1 3.2.0 25h
01-worker-kubelet d3da910bfa9f4b599af4ed7f5ac270d55950a3a1 3.2.0 25h
100-worker-iommu 3.2.0 30s
100-worker-vfiopci-configuration 3.2.0 30s
Verification
Verify that the VFIO driver is loaded.
$ lspci -nnk -d 10de:
The output confirms that the VFIO driver is being used.
Example output
04:00.0 3D controller [0302]: NVIDIA Corporation GP102GL [Tesla P40] [10de:1eb8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1eb8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
Exposing PCI host devices in the cluster using the CLI
To expose PCI host devices in the cluster, add details about the PCI devices to the spec.permittedHostDevices.pciHostDevices
array of the HyperConverged
custom resource (CR).
Procedure
Edit the
HyperConverged
CR in your default editor by running the following command:$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
Add the PCI device information to the
spec.permittedHostDevices.pciHostDevices
array. For example:Example configuration file
apiVersion: hco.kubevirt.io/v1
kind: HyperConverged
metadata:
name: kubevirt-hyperconverged
namespace: openshift-cnv
spec:
permittedHostDevices: (1)
pciHostDevices: (2)
- pciDeviceSelector: "10DE:1DB6" (3)
resourceName: "nvidia.com/GV100GL_Tesla_V100" (4)
- pciDeviceSelector: "10DE:1EB8"
resourceName: "nvidia.com/TU104GL_Tesla_T4"
- pciDeviceSelector: "8086:6F54"
resourceName: "intel.com/qat"
externalResourceProvider: true (5)
# ...
1 The host devices that are permitted to be used in the cluster. 2 The list of PCI devices available on the node. 3 The vendor-ID
and thedevice-ID
required to identify the PCI device.4 The name of a PCI host device. 5 Optional: Setting this field to true
indicates that the resource is provided by an external device plugin. OKD Virtualization allows the usage of this device in the cluster but leaves the allocation and monitoring to an external device plugin.The above example snippet shows two PCI host devices that are named
nvidia.com/GV100GL_Tesla_V100
andnvidia.com/TU104GL_Tesla_T4
added to the list of permitted host devices in theHyperConverged
CR. These devices have been tested and verified to work with OKD Virtualization.Save your changes and exit the editor.
Verification
Verify that the PCI host devices were added to the node by running the following command. The example output shows that there is one device each associated with the
nvidia.com/GV100GL_Tesla_V100
,nvidia.com/TU104GL_Tesla_T4
, andintel.com/qat
resource names.$ oc describe node <node_name>
Example output
Capacity:
cpu: 64
devices.kubevirt.io/kvm: 110
devices.kubevirt.io/tun: 110
devices.kubevirt.io/vhost-net: 110
ephemeral-storage: 915128Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131395264Ki
nvidia.com/GV100GL_Tesla_V100 1
nvidia.com/TU104GL_Tesla_T4 1
intel.com/qat: 1
pods: 250
Allocatable:
cpu: 63500m
devices.kubevirt.io/kvm: 110
devices.kubevirt.io/tun: 110
devices.kubevirt.io/vhost-net: 110
ephemeral-storage: 863623130526
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 130244288Ki
nvidia.com/GV100GL_Tesla_V100 1
nvidia.com/TU104GL_Tesla_T4 1
intel.com/qat: 1
pods: 250
Removing PCI host devices from the cluster using the CLI
To remove a PCI host device from the cluster, delete the information for that device from the HyperConverged
custom resource (CR).
Procedure
Edit the
HyperConverged
CR in your default editor by running the following command:$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
Remove the PCI device information from the
spec.permittedHostDevices.pciHostDevices
array by deleting thepciDeviceSelector
,resourceName
andexternalResourceProvider
(if applicable) fields for the appropriate device. In this example, theintel.com/qat
resource has been deleted.Example configuration file
apiVersion: hco.kubevirt.io/v1
kind: HyperConverged
metadata:
name: kubevirt-hyperconverged
namespace: openshift-cnv
spec:
permittedHostDevices:
pciHostDevices:
- pciDeviceSelector: "10DE:1DB6"
resourceName: "nvidia.com/GV100GL_Tesla_V100"
- pciDeviceSelector: "10DE:1EB8"
resourceName: "nvidia.com/TU104GL_Tesla_T4"
# ...
Save your changes and exit the editor.
Verification
Verify that the PCI host device was removed from the node by running the following command. The example output shows that there are zero devices associated with the
intel.com/qat
resource name.$ oc describe node <node_name>
Example output
Capacity:
cpu: 64
devices.kubevirt.io/kvm: 110
devices.kubevirt.io/tun: 110
devices.kubevirt.io/vhost-net: 110
ephemeral-storage: 915128Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131395264Ki
nvidia.com/GV100GL_Tesla_V100 1
nvidia.com/TU104GL_Tesla_T4 1
intel.com/qat: 0
pods: 250
Allocatable:
cpu: 63500m
devices.kubevirt.io/kvm: 110
devices.kubevirt.io/tun: 110
devices.kubevirt.io/vhost-net: 110
ephemeral-storage: 863623130526
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 130244288Ki
nvidia.com/GV100GL_Tesla_V100 1
nvidia.com/TU104GL_Tesla_T4 1
intel.com/qat: 0
pods: 250
Configuring virtual machines for PCI passthrough
After the PCI devices have been added to the cluster, you can assign them to virtual machines. The PCI devices are now available as if they are physically connected to the virtual machines.
Assigning a PCI device to a virtual machine
When a PCI device is available in a cluster, you can assign it to a virtual machine and enable PCI passthrough.
Procedure
Assign the PCI device to a virtual machine as a host device.
Example
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
domain:
devices:
hostDevices:
- deviceName: nvidia.com/TU104GL_Tesla_T4 (1)
name: hostdevices1
1 The name of the PCI device that is permitted on the cluster as a host device. The virtual machine can access this host device.
Verification
Use the following command to verify that the host device is available from the virtual machine.
$ lspci -nnk | grep NVIDIA
Example output
$ 02:01.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:1eb8] (rev a1)