Configuring an SR-IOV network device
You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster.
SR-IOV network node configuration object
You specify the SR-IOV network device configuration for a node by creating an SR-IOV network node policy. The API object for the policy is part of the sriovnetwork.openshift.io API group.
The following YAML describes an SR-IOV network node policy:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <name> (1)
  namespace: openshift-sriov-network-operator (2)
spec:
  resourceName: <sriov_resource_name> (3)
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true" (4)
  priority: <priority> (5)
  mtu: <mtu> (6)
  needVhostNet: false (7)
  numVfs: <num> (8)
  nicSelector: (9)
    vendor: "<vendor_code>" (10)
    deviceID: "<device_id>" (11)
    pfNames: ["<pf_name>", ...] (12)
    rootDevices: ["<pci_bus_id>", ...] (13)
    netFilter: "<filter_string>" (14)
  deviceType: <device_type> (15)
  isRdma: false (16)
  linkType: <link_type> (17)
  eSwitchMode: "switchdev" (18)
1 | The name for the custom resource object. |
2 | The namespace where the SR-IOV Network Operator is installed. |
3 | The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name. When specifying a name, be sure to use the accepted syntax expression |
4 | The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only. |
5 | Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99. |
6 | Optional: The maximum transmission unit (MTU) of the virtual function. The maximum MTU value can vary for different network interface controller (NIC) models. |
7 | Optional: Set needVhostNet to true to mount the /dev/vhost-net device in the pod. Use the mounted /dev/vhost-net device with Data Plane Development Kit (DPDK) to forward traffic to the kernel network stack. |
8 | The number of virtual functions (VFs) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 128. |
9 | The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally. If you specify |
10 | Optional: The vendor hexadecimal code of the SR-IOV network device. The only allowed values are 8086 and 15b3. |
11 | Optional: The device hexadecimal code of the SR-IOV network device. For example, 101b is the device ID for a Mellanox ConnectX-6 device. |
12 | Optional: An array of one or more physical function (PF) names for the device. |
13 | Optional: An array of one or more PCI bus addresses for the PF of the device. Provide the address in the following format: 0000:02:00.1. |
14 | Optional: The platform-specific network filter. The only supported platform is OpenStack. Acceptable values use the following format: openstack/NetworkID:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. Replace xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with the value from the /var/config/openstack/latest/network_data.json metadata file. |
15 | Optional: The driver type for the virtual functions. The only allowed values are netdevice and vfio-pci. The default value is netdevice. For a Mellanox NIC to work in DPDK mode on bare metal nodes, use the |
16 | Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false. |
17 | Optional: The link type for the VFs. The default value is eth for Ethernet. Change this value to ib for InfiniBand. Do not set linkType to eth for SriovNetworkNodePolicy, because this can lead to an incorrect number of available devices reported by the device plugin. |
18 | Optional: To enable hardware offloading, the eSwitchMode field must be set to "switchdev". |
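Several of the constraints in the callouts can be checked before a policy is submitted to the cluster. The following Python sketch is illustrative only: the helper name check_policy_spec and the dictionary layout are assumptions made for this example and are not part of the SR-IOV Network Operator. It covers only the rules stated in the callouts (priority range, allowed vendor codes, allowed device types, and the PCI address format).

```python
import re

# Values taken from the callouts: allowed vendor codes and driver types.
ALLOWED_VENDORS = {"8086", "15b3"}
ALLOWED_DEVICE_TYPES = {"netdevice", "vfio-pci"}
# PCI address in the documented form, for example 0000:02:00.1
PCI_ADDRESS = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def check_policy_spec(spec):
    """Return a list of problems found in a policy spec dictionary (hypothetical helper)."""
    problems = []
    priority = spec.get("priority", 99)  # the documented default is 99
    if not isinstance(priority, int) or not 0 <= priority <= 99:
        problems.append("priority must be an integer between 0 and 99")
    device_type = spec.get("deviceType", "netdevice")  # documented default
    if device_type not in ALLOWED_DEVICE_TYPES:
        problems.append("deviceType must be netdevice or vfio-pci")
    selector = spec.get("nicSelector", {})
    vendor = selector.get("vendor")
    if vendor is not None and vendor not in ALLOWED_VENDORS:
        problems.append("vendor must be 8086 or 15b3")
    for address in selector.get("rootDevices", []):
        if not PCI_ADDRESS.match(address):
            problems.append(f"rootDevices entry {address!r} is not a PCI address")
    return problems


if __name__ == "__main__":
    spec = {
        "priority": 10,
        "deviceType": "vfio-pci",
        "nicSelector": {"vendor": "15b3", "rootDevices": ["0000:19:00.0"]},
    }
    print(check_policy_spec(spec))
```

A check like this catches only syntactic mistakes. Whether a given NIC supports the requested number of VFs still depends on the hardware.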
SR-IOV network node configuration examples
The following example describes the configuration for an InfiniBand device:
Example configuration for an InfiniBand device
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ib-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: ibnic1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    vendor: "15b3"
    deviceID: "101b"
    rootDevices:
      - "0000:19:00.0"
  linkType: ib
  isRdma: true
The following example describes the configuration for an SR-IOV network device in an OpenStack virtual machine:
Example configuration for an SR-IOV device in a virtual machine
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-sriov-net-openstack-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovnic1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1 (1)
  nicSelector:
    vendor: "15b3"
    deviceID: "101b"
    netFilter: "openstack/NetworkID:ea24bd04-8674-4f69-b0ee-fa0b3bd20509" (2)
1 | The numVfs field is always set to 1 when configuring the node network policy for a virtual machine. |
2 | The netFilter field must refer to a network ID when the virtual machine is deployed on OpenStack. Valid values for netFilter are available from an SriovNetworkNodeState object. |
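Because the netFilter value has a fixed shape, a malformed value can be rejected before the policy is created. The following Python sketch assumes only the format documented above (the openstack/NetworkID: prefix followed by a UUID-style network ID). The function name parse_net_filter is hypothetical.

```python
import re

# openstack/NetworkID:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
NET_FILTER = re.compile(
    r"^openstack/NetworkID:"
    r"(?P<network_id>[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def parse_net_filter(value):
    """Return the network ID from a netFilter string, or raise ValueError."""
    match = NET_FILTER.match(value)
    if match is None:
        raise ValueError(f"not a valid OpenStack netFilter value: {value!r}")
    return match.group("network_id").lower()


if __name__ == "__main__":
    print(parse_net_filter("openstack/NetworkID:ea24bd04-8674-4f69-b0ee-fa0b3bd20509"))
```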
Virtual function (VF) partitioning for SR-IOV devices
In some cases, you might want to split virtual functions (VFs) from the same physical function (PF) into multiple resource pools. For example, you might want some of the VFs to load with the default driver and the remaining VFs to load with the vfio-pci driver. In such a deployment, you can use the pfNames selector in your SriovNetworkNodePolicy custom resource (CR) to specify a range of VFs for a pool by using the following format: <pfname>#<first_vf>-<last_vf>.
For example, the following YAML shows the selector for an interface named netpf0 with VF 2 through 7:
pfNames: ["netpf0#2-7"]
netpf0 is the PF interface name.
2 is the first VF index (0-based) that is included in the range.
7 is the last VF index (0-based) that is included in the range.
You can select VFs from the same PF by using different policy CRs if the following requirements are met:
The numVfs value must be identical for policies that select the same PF.
The VF index must be in the range of 0 to <numVfs>-1. For example, if you have a policy with numVfs set to 8, then the <first_vf> value must not be smaller than 0, and the <last_vf> must not be larger than 7.
The VF ranges in different policies must not overlap.
The <first_vf> must not be larger than the <last_vf>.
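The requirements above can be expressed as a small pre-flight check. The following Python sketch is an illustration under stated assumptions, not the Operator's own validation logic: the function names parse_pf_selector and check_partitions are hypothetical, and each policy is reduced to a (numVfs, selector) pair.

```python
import re

# <pfname>#<first_vf>-<last_vf>
SELECTOR = re.compile(r"^(?P<pf>[^#]+)#(?P<first>\d+)-(?P<last>\d+)$")


def parse_pf_selector(selector):
    """Split '<pfname>#<first_vf>-<last_vf>' into (pfname, first_vf, last_vf)."""
    match = SELECTOR.match(selector)
    if match is None:
        raise ValueError(f"not a VF range selector: {selector!r}")
    return match.group("pf"), int(match.group("first")), int(match.group("last"))


def check_partitions(policies):
    """Check (num_vfs, selector) pairs against the requirements; return problems."""
    problems = []
    seen = {}  # pfname -> (num_vfs of the first policy, list of (first, last))
    for num_vfs, selector in policies:
        pf, first, last = parse_pf_selector(selector)
        if first > last:
            problems.append(f"{selector}: first VF is larger than last VF")
        if last > num_vfs - 1:
            problems.append(f"{selector}: VF index outside 0..{num_vfs - 1}")
        known_num_vfs, ranges = seen.setdefault(pf, (num_vfs, []))
        if known_num_vfs != num_vfs:
            problems.append(f"{selector}: numVfs differs from another policy on {pf}")
        for other_first, other_last in ranges:
            # Two inclusive ranges overlap when each starts before the other ends.
            if first <= other_last and other_first <= last:
                problems.append(f"{selector}: overlaps range {other_first}-{other_last}")
        ranges.append((first, last))
    return problems


if __name__ == "__main__":
    print(check_partitions([(16, "netpf0#0-0"), (16, "netpf0#8-15")]))
```

The first index cannot be negative because the pattern accepts only digits, so the lower bound of 0 is enforced by the parser rather than by a separate comparison.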
The following example illustrates NIC partitioning for an SR-IOV device.
The policy policy-net-1 defines a resource pool net1 that contains VF 0 of PF netpf0 with the default VF driver. The policy policy-net-1-dpdk defines a resource pool net1dpdk that contains VFs 8 to 15 of PF netpf0 with the vfio-pci VF driver.
Policy policy-net-1:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#0-0"]
  deviceType: netdevice
Policy policy-net-1-dpdk:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1-dpdk
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1dpdk
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#8-15"]
  deviceType: vfio-pci
Verifying that the interface is successfully partitioned
Confirm that the interface is partitioned into virtual functions (VFs) for the SR-IOV device by running the following command:
$ ip link show <interface> (1)
1 | Replace <interface> with the interface that you specified when partitioning to VFs for the SR-IOV device, for example, ens3f1. |
Example output
5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:d1:bc:01 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 5a:e7:88:25:ea:a0 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1 link/ether 3e:1d:36:d7:3d:49 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2 link/ether ce:09:56:97:df:f9 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3 link/ether 5e:91:cf:88:d1:38 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 4 link/ether e6:06:a1:96:2f:de brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
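When many interfaces need checking, the VF lines in this output can be counted programmatically. The following Python sketch assumes the output layout shown in the example (one line per VF that begins with vf and an index). The function name list_vf_indexes and the shortened sample text are for illustration only.

```python
import re

# Matches lines such as "    vf 0 link/ether 5a:e7:88:25:ea:a0 ..."
VF_LINE = re.compile(r"^\s*vf (\d+)\b")


def list_vf_indexes(ip_link_output):
    """Return the VF indexes reported in the output of 'ip link show <interface>'."""
    indexes = []
    for line in ip_link_output.splitlines():
        match = VF_LINE.match(line)
        if match:
            indexes.append(int(match.group(1)))
    return indexes


if __name__ == "__main__":
    sample = (
        "5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP\n"
        "    link/ether 3c:fd:fe:d1:bc:01 brd ff:ff:ff:ff:ff:ff\n"
        "    vf 0 link/ether 5a:e7:88:25:ea:a0 brd ff:ff:ff:ff:ff:ff, spoof checking on\n"
        "    vf 1 link/ether 3e:1d:36:d7:3d:49 brd ff:ff:ff:ff:ff:ff, spoof checking on\n"
    )
    print(len(list_vf_indexes(sample)))  # number of VFs found in the sample
```

Comparing the count with the numVfs value of the policy gives a quick signal that the partitioning was applied.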
Configuring SR-IOV network devices
The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io CustomResourceDefinition to OKD. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).
It might take several minutes for a configuration change to apply.
Prerequisites
You installed the OpenShift CLI (oc).
You have access to the cluster as a user with the cluster-admin role.
You have installed the SR-IOV Network Operator.
You have enough available nodes in your cluster to handle the evicted workload from drained nodes.
You have not selected any control plane nodes for SR-IOV network device configuration.
Procedure
Create an SriovNetworkNodePolicy object, and then save the YAML in the <name>-sriov-node-network.yaml file. Replace <name> with the name for this configuration.
Optional: Label the SR-IOV capable cluster nodes with SriovNetworkNodePolicy.Spec.NodeSelector if they are not already labeled. For more information about labeling nodes, see “Understanding how to update labels on nodes”.
Create the SriovNetworkNodePolicy object:
$ oc create -f <name>-sriov-node-network.yaml
where <name> specifies the name for this configuration.
After applying the configuration update, all the pods in the openshift-sriov-network-operator namespace transition to the Running status.
To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
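Because a configuration change can take several minutes to apply, it can be convenient to poll the sync status instead of re-running the command by hand. The following Python sketch wraps the oc command shown above. The function names, the timeout, and the polling interval are hypothetical. The success value Succeeded is an assumption that is not stated in this document, so adjust it to the value that your cluster reports.

```python
import subprocess
import time


def get_sync_status(node_name):
    """Run the documented oc command and return the syncStatus string."""
    result = subprocess.run(
        [
            "oc", "get", "sriovnetworknodestates",
            "-n", "openshift-sriov-network-operator",
            node_name, "-o", "jsonpath={.status.syncStatus}",
        ],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_sync(node_name, fetch=get_sync_status, timeout=600, interval=10,
                  sleep=time.sleep):
    """Poll until the status equals 'Succeeded' (assumed value) or the timeout expires."""
    waited = 0
    while True:
        if fetch(node_name) == "Succeeded":
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


if __name__ == "__main__":
    # Replace <node_name> with a real node name before running.
    print(wait_for_sync("<node_name>"))
```

The fetch and sleep parameters exist so that the polling logic can be exercised without a cluster, for example by passing a function that returns canned status values.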
Troubleshooting SR-IOV configuration
The following sections address some error conditions that you might encounter after you follow the procedure to configure an SR-IOV network device.
To display the state of nodes, run the following command:
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name>
where <node_name> specifies the name of a node with an SR-IOV network device.
Error output: Cannot allocate memory
"lastSyncError": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"
When a node indicates that it cannot allocate memory, check the following items:
Confirm that global SR-IOV settings are enabled in the BIOS for the node.
Confirm that VT-d is enabled in the BIOS for the node.
Assigning an SR-IOV network to a VRF
As a cluster administrator, you can assign an SR-IOV network interface to your virtual routing and forwarding (VRF) domain by using the CNI VRF plugin.
To do this, add the VRF configuration to the optional metaPlugins parameter of the SriovNetwork resource.
Applications that use VRFs need to bind to a specific device.
Creating an additional SR-IOV network attachment with the CNI VRF plugin
The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.
Do not edit NetworkAttachmentDefinition CRs that the SR-IOV Network Operator manages.
To create an additional SR-IOV network attachment with the CNI VRF plugin, perform the following procedure.
Prerequisites
Install the OKD CLI (oc).
Log in to the OKD cluster as a user with cluster-admin privileges.
Procedure
Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR. Save the YAML as the file sriov-network-attachment.yaml.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: example-network
  namespace: additional-sriov-network-1
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "10.56.217.1"
    }
  vlan: 0
  resourceName: intelnics
  metaPlugins: |
    {
      "type": "vrf", (1)
      "vrfname": "example-vrf-name" (2)
    }
1 | type must be set to vrf. |
2 | vrfname is the name of the VRF that the interface is assigned to. If it does not exist in the pod, it is created. |
Create the SriovNetwork resource:
$ oc create -f sriov-network-attachment.yaml
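The ipam and metaPlugins fields hold JSON text inside the YAML, so a syntax slip in either string is easy to miss. Note that the (1) and (2) callout markers in the example are annotations and must not appear in the actual JSON. The following Python sketch shows one way to generate the metaPlugins string and to sanity-check the host-local address range. The helper names are hypothetical, and the checks cover only the fields that appear in the example CR.

```python
import ipaddress
import json


def build_vrf_meta_plugins(vrf_name):
    """Return the metaPlugins JSON string for the VRF CNI plugin."""
    return json.dumps({"type": "vrf", "vrfname": vrf_name}, indent=2)


def ipam_addresses_in_subnet(ipam_json):
    """Return True if rangeStart, rangeEnd, and gateway all fall inside subnet."""
    ipam = json.loads(ipam_json)  # raises ValueError on malformed JSON
    subnet = ipaddress.ip_network(ipam["subnet"])
    return all(
        ipaddress.ip_address(ipam[key]) in subnet
        for key in ("rangeStart", "rangeEnd", "gateway")
        if key in ipam
    )


if __name__ == "__main__":
    ipam = json.dumps({
        "type": "host-local",
        "subnet": "10.56.217.0/24",
        "rangeStart": "10.56.217.171",
        "rangeEnd": "10.56.217.181",
        "routes": [{"dst": "0.0.0.0/0"}],
        "gateway": "10.56.217.1",
    })
    print(build_vrf_meta_plugins("example-vrf-name"))
    print(ipam_addresses_in_subnet(ipam))
```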
Verifying that the NetworkAttachmentDefinition CR is successfully created
Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command:
$ oc get network-attachment-definitions -n <namespace> (1)
1 | Replace <namespace> with the namespace that you specified when configuring the network attachment, for example, additional-sriov-network-1. |
Example output
NAME                         AGE
additional-sriov-network-1   14m
There might be a delay before the SR-IOV Network Operator creates the CR.
Verifying that the additional SR-IOV network attachment is successful
To verify that the VRF CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:
Create an SR-IOV network that uses the VRF CNI.
Assign the network to a pod.
Verify that the pod network attachment is connected to the SR-IOV additional network. Remote shell into the pod and run the following command:
$ ip vrf show
Example output
Name Table
-----------------------
red 10
Confirm that the VRF interface is the master of the secondary interface:
$ ip link
Example output
...
5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master red state UP mode
...
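The master check can also be scripted against the output of ip link. The following Python sketch assumes the line layout shown in the example output (an index, the interface name, and a master <name> token on the same line). The function name master_of is hypothetical.

```python
import re

# "<index>: <name>[@<peer>]: <rest of line>"
LINK_LINE = re.compile(r"^\s*\d+:\s+([^:@\s]+)(?:@\S+)?:\s+(.*)$")


def master_of(ip_link_output, interface):
    """Return the master device named on the interface's line, or None."""
    for line in ip_link_output.splitlines():
        match = LINK_LINE.match(line)
        if match and match.group(1) == interface:
            master = re.search(r"\bmaster (\S+)", match.group(2))
            return master.group(1) if master else None
    return None


if __name__ == "__main__":
    sample = (
        "5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 "
        "qdisc noqueue master red state UP mode"
    )
    print(master_of(sample, "net1"))  # prints "red"
```

If the function returns the VRF name that ip vrf show reports, the secondary interface is enslaved to the VRF as expected.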