About Single Root I/O Virtualization (SR-IOV) hardware networks
The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.
SR-IOV enables you to segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV device driver for the device determines how the VF is exposed in the container:
netdevice
driver: A regular kernel network device in thenetns
of the containervfio-pci
driver: A character device mounted in the container
You can use SR-IOV network devices with additional networks on your OKD cluster for application that require high bandwidth or low latency.
Components that manage SR-IOV network devices
The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. It performs the following functions:
Orchestrates discovery and management of SR-IOV network devices
Generates
NetworkAttachmentDefinition
custom resources for the SR-IOV Container Network Interface (CNI)Creates and updates the configuration of the SR-IOV network device plug-in
Creates node specific
SriovNetworkNodeState
custom resourcesUpdates the
spec.interfaces
field in eachSriovNetworkNodeState
custom resource
The Operator provisions the following components:
SR-IOV network configuration daemon
A DaemonSet that is deployed on worker nodes when the SR-IOV Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.
SR-IOV Operator webhook
A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.
SR-IOV Network resources injector
A dynamic admission controller webhook that provides functionality for patching Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs.
SR-IOV network device plug-in
A device plug-in that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plug-ins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plug-ins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.
SR-IOV CNI plug-in
A CNI plug-in that attaches VF interfaces allocated from the SR-IOV device plug-in directly into a pod.
SR-IOV InfiniBand CNI plug-in
A CNI plug-in that attaches InfiniBand (IB) VF interfaces allocated from the SR-IOV device plug-in directly into a pod.
The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the |
Supported devices
OKD supports the following network interface controllers:
Manufacturer | Model | Vendor ID | Device ID |
---|---|---|---|
Intel | X710 | 8086 | 1572 |
Intel | XXV710 | 8086 | 158b |
Mellanox | MT27700 Family [ConnectX‑4] | 15b3 | 1013 |
Mellanox | MT27710 Family [ConnectX‑4 Lx] | 15b3 | 1015 |
Mellanox | MT27800 Family [ConnectX‑5] | 15b3 | 1017 |
Mellanox | MT28908 Family [ConnectX‑6] | 15b3 | 101b |
Automated discovery of SR-IOV network devices
The SR-IOV Network Operator searches your cluster for SR-IOV capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.
The CR is assigned the same name as the worker node. The status.interfaces
list provides information about the network devices on a node.
Do not modify a |
Example SriovNetworkNodeState object
The following YAML is an example of a SriovNetworkNodeState
object created by the SR-IOV Network Operator:
An SriovNetworkNodeState object
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
name: node-25 (1)
namespace: openshift-sriov-network-operator
ownerReferences:
- apiVersion: sriovnetwork.openshift.io/v1
blockOwnerDeletion: true
controller: true
kind: SriovNetworkNodePolicy
name: default
spec:
dpConfigVersion: "39824"
status:
interfaces: (2)
- deviceID: "1017"
driver: mlx5_core
mtu: 1500
name: ens785f0
pciAddress: "0000:18:00.0"
totalvfs: 8
vendor: 15b3
- deviceID: "1017"
driver: mlx5_core
mtu: 1500
name: ens785f1
pciAddress: "0000:18:00.1"
totalvfs: 8
vendor: 15b3
- deviceID: 158b
driver: i40e
mtu: 1500
name: ens817f0
pciAddress: 0000:81:00.0
totalvfs: 64
vendor: "8086"
- deviceID: 158b
driver: i40e
mtu: 1500
name: ens817f1
pciAddress: 0000:81:00.1
totalvfs: 64
vendor: "8086"
- deviceID: 158b
driver: i40e
mtu: 1500
name: ens803f0
pciAddress: 0000:86:00.0
totalvfs: 64
vendor: "8086"
syncStatus: Succeeded
1 | The value of the name field is the same as the name of the worker node. |
2 | The interfaces stanza includes a list of all of the SR-IOV devices discovered by the Operator on the worker node. |
Example use of a virtual function in a pod
You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with SR-IOV VF attached.
This example shows a pod using a virtual function (VF) in RDMA mode:
Pod
spec that uses RDMA mode
apiVersion: v1
kind: Pod
metadata:
name: rdma-app
annotations:
k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
spec:
containers:
- name: testpmd
image: <RDMA_image>
imagePullPolicy: IfNotPresent
securityContext:
runAsUser: 0
capabilities:
add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
command: ["sleep", "infinity"]
The following example shows a pod with a VF in DPDK mode:
Pod
spec that uses DPDK mode
apiVersion: v1
kind: Pod
metadata:
name: dpdk-app
annotations:
k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
containers:
- name: testpmd
image: <DPDK_image>
securityContext:
runAsUser: 0
capabilities:
add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
volumeMounts:
- mountPath: /dev/hugepages
name: hugepage
resources:
limits:
memory: "1Gi"
cpu: "2"
hugepages-1Gi: "4Gi"
requests:
memory: "1Gi"
cpu: "2"
hugepages-1Gi: "4Gi"
command: ["sleep", "infinity"]
volumes:
- name: hugepage
emptyDir:
medium: HugePages
DPDK library for use with container applications
An optional library, app-netutil
, provides several API methods for gathering network information about a pod from within a container running within that pod.
This library is intended to assist with integrating SR-IOV virtual functions (VFs) in Data Plane Development Kit (DPDK) mode into the container. The library provides both a Golang API and a C API.
Currently there are three API methods implemented:
GetCPUInfo()
This function determines which CPUs are available to the container and returns the list to the caller.
GetHugepages()
This function determines the amount of hugepage memory requested in the Pod
spec for each container and returns the values to the caller.
Exposing hugepages via Kubernetes Downward API is an alpha feature in Kubernetes 1.20 and is not enabled in OKD. The API can be tested by enabling the feature gate, |
GetInterfaces()
This function determines the set of interfaces in the container and returns the list, along with the interface type and type specific data.
There is also a sample Docker image, dpdk-app-centos
, which can run one of the following DPDK sample applications based on an environmental variable in the pod-spec: l2fwd
, l3wd
or testpmd
. This Docker image provides an example of integrating the app-netutil
into the container image itself. The library can also integrate into an init-container
which collects the required data and passes the data to an existing DPDK workload.
Next steps
Optional: Configuring the SR-IOV Network Operator
If you use OpenShift Virtualization: Configuring an SR-IOV network device for virtual machines