Using DPDK and RDMA
Containerized Data Plane Development Kit (DPDK) applications are supported on OKD. You can use Single Root I/O Virtualization (SR-IOV) network hardware with DPDK and with remote direct memory access (RDMA).
For information about supported devices, see Supported devices.
Using a virtual function in DPDK mode with an Intel NIC
Prerequisites
Install the OpenShift CLI (oc).
Install the SR-IOV Network Operator.
Log in as a user with cluster-admin privileges.
Procedure
Create the following SriovNetworkNodePolicy object, and then save the YAML in the intel-dpdk-node-policy.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "8086"
    deviceID: "158b"
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: vfio-pci (1)
1 Set the driver type for the virtual functions to vfio-pci.
See the Configuring SR-IOV network devices section for a detailed explanation of each option in SriovNetworkNodePolicy.
When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure beforehand that there are enough available nodes in your cluster to handle the evicted workload.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
$ oc create -f intel-dpdk-node-policy.yaml
Create the following SriovNetwork object, and then save the YAML in the intel-dpdk-network.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: intel-dpdk-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: |-
    # ... (1)
  vlan: <vlan>
  resourceName: intelnics
1 Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
See the “Configuring SR-IOV additional network” section for a detailed explanation of each option in SriovNetwork.
An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.
Create the SriovNetwork object by running the following command:
$ oc create -f intel-dpdk-network.yaml
Create the following Pod spec, and then save the YAML in the intel-dpdk-pod.yaml file.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace> (1)
  annotations:
    k8s.v1.cni.cncf.io/networks: intel-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image> (2)
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] (3)
    volumeMounts:
    - mountPath: /dev/hugepages (4)
      name: hugepage
    resources:
      limits:
        openshift.io/intelnics: "1" (5)
        memory: "1Gi"
        cpu: "4" (6)
        hugepages-1Gi: "4Gi" (7)
      requests:
        openshift.io/intelnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
1 Specify the same target_namespace where the SriovNetwork object intel-dpdk-network is created. If you would like to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
2 Specify the DPDK image that includes your application and the DPDK library used by the application.
3 Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
4 Mount a hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being HugePages.
5 Optional: Specify the number of DPDK devices allocated to the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
6 Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting the CPU Manager policy to static and creating a pod with Guaranteed QoS.
7 Specify the hugepage size, hugepages-1Gi or hugepages-2Mi, and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes. For example, adding the kernel arguments default_hugepagesz=1GB, hugepagesz=1G, and hugepages=16 results in 16 1Gi hugepages being allocated during system boot.
Create the DPDK pod by running the following command:
$ oc create -f intel-dpdk-pod.yaml
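The procedure states that all pods in the openshift-sriov-network-operator namespace change to a Running status after the node policy is applied. The following is a minimal sketch of one way to check this. The not_running_pods helper is a hypothetical name, not part of oc, and it assumes the default oc get pods column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/bash
# Hypothetical helper: print the names of pods that are not in the Running state.
# Reads the output of `oc get pods --no-headers` on stdin and assumes the
# default column order NAME READY STATUS RESTARTS AGE.
not_running_pods() {
  awk '$3 != "Running" { print $1 }'
}

# Usage against a live cluster (requires oc and cluster access):
#   oc get pods -n openshift-sriov-network-operator --no-headers | not_running_pods
```

An empty result suggests that the configuration update has finished. Any pod name that is printed is still starting or has failed.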
Using a virtual function in DPDK mode with a Mellanox NIC
You can create a network node policy and create a Data Plane Development Kit (DPDK) pod using a virtual function in DPDK mode with a Mellanox NIC.
Prerequisites
You have installed the OpenShift CLI (oc).
You have installed the Single Root I/O Virtualization (SR-IOV) Network Operator.
You have logged in as a user with cluster-admin privileges.
Procedure
Save the following SriovNetworkNodePolicy YAML configuration to an mlx-dpdk-node-policy.yaml file:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-dpdk-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "15b3"
    deviceID: "1015" (1)
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: netdevice (2)
  isRdma: true (3)
1 Specify the device hex code of the SR-IOV network device.
2 Set the driver type for the virtual functions to netdevice. A Mellanox SR-IOV Virtual Function (VF) can work in DPDK mode without using the vfio-pci device type. The VF device appears as a kernel network interface inside a container.
3 Enable Remote Direct Memory Access (RDMA) mode. This is required for Mellanox cards to work in DPDK mode.
See Configuring an SR-IOV network device for a detailed explanation of each option in the SriovNetworkNodePolicy object.
When applying the configuration specified in an SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. It might take several minutes for a configuration change to apply. Ensure beforehand that there are enough available nodes in your cluster to handle the evicted workload.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
$ oc create -f mlx-dpdk-node-policy.yaml
Save the following SriovNetwork YAML configuration to an mlx-dpdk-network.yaml file:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: mlx-dpdk-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: |- (1)
    ...
  vlan: <vlan>
  resourceName: mlxnics
1 Specify a configuration object for the IP Address Management (IPAM) Container Network Interface (CNI) plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
See Configuring an SR-IOV network device for a detailed explanation of each option in the SriovNetwork object.
The optional app-netutil library provides several API methods for gathering network information about the parent pod of a container.
Create the SriovNetwork object by running the following command:
$ oc create -f mlx-dpdk-network.yaml
Save the following Pod YAML configuration to an mlx-dpdk-pod.yaml file:
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace> (1)
  annotations:
    k8s.v1.cni.cncf.io/networks: mlx-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image> (2)
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] (3)
    volumeMounts:
    - mountPath: /dev/hugepages (4)
      name: hugepage
    resources:
      limits:
        openshift.io/mlxnics: "1" (5)
        memory: "1Gi"
        cpu: "4" (6)
        hugepages-1Gi: "4Gi" (7)
      requests:
        openshift.io/mlxnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
1 Specify the same target_namespace where the SriovNetwork object mlx-dpdk-network is created. To create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
2 Specify the DPDK image that includes your application and the DPDK library used by the application.
3 Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
4 Mount the hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being HugePages.
5 Optional: Specify the number of DPDK devices allocated for the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
6 Specify the number of CPUs. The DPDK pod usually requires that exclusive CPUs be allocated from the kubelet. To do this, set the CPU Manager policy to static and create a pod with Guaranteed Quality of Service (QoS).
7 Specify the hugepage size, hugepages-1Gi or hugepages-2Mi, and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes.
Create the DPDK pod by running the following command:
$ oc create -f mlx-dpdk-pod.yaml
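The hugepages callout notes that 1Gi hugepages require kernel arguments on the nodes. As an illustration only, the following sketch assembles such an argument string. The hugepage_kargs function is a hypothetical helper, and the values 1G and 16 are example inputs rather than recommendations. How the arguments are applied to nodes is outside the scope of this sketch.

```shell
#!/bin/bash
# Hypothetical helper: build the kernel argument string for preallocating
# hugepages at boot. $1 is the page size (for example 1G), $2 is the page count.
hugepage_kargs() {
  local size="$1" count="$2"
  printf 'default_hugepagesz=%s hugepagesz=%s hugepages=%s\n' "$size" "$size" "$count"
}

# Example: 16 pages of 1G each.
hugepage_kargs 1G 16
# prints: default_hugepagesz=1G hugepagesz=1G hugepages=16
```

With 16 pages of 1Gi preallocated on a node, a pod that requests hugepages-1Gi: "4Gi" consumes four of them.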
Using the TAP CNI to run a rootless DPDK workload with kernel access
DPDK applications can use virtio-user as an exception path to inject certain types of packets, such as log messages, into the kernel for processing. For more information about this feature, see Virtio_user as Exception Path.
In OpenShift Container Platform version 4.14 and later, you can use non-privileged pods to run DPDK applications alongside the tap CNI plugin. To enable this functionality, you need to mount the vhost-net device by setting the needVhostNet parameter to true within the SriovNetworkNodePolicy object.
Figure 1. DPDK and TAP example configuration
Prerequisites
You have installed the OpenShift CLI (oc).
You have installed the SR-IOV Network Operator.
You are logged in as a user with cluster-admin privileges.
Ensure that setsebool container_use_devices=on is set as root on all nodes.
Use the Machine Config Operator to set this SELinux boolean.
Procedure
Create a file, such as test-namespace.yaml, with content like the following example:
apiVersion: v1
kind: Namespace
metadata:
  name: test-namespace
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
    security.openshift.io/scc.podSecurityLabelSync: "false"
Create the new Namespace object by running the following command:
$ oc apply -f test-namespace.yaml
Create a file, such as sriov-node-network-policy.yaml, with content like the following example:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriovnic
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice (1)
  isRdma: true (2)
  needVhostNet: true (3)
  nicSelector:
    vendor: "15b3" (4)
    deviceID: "101b" (5)
    rootDevices: ["00:05.0"]
  numVfs: 10
  priority: 99
  resourceName: sriovnic
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
1 This indicates that the profile is tailored specifically for Mellanox Network Interface Controllers (NICs).
2 Setting isRdma to true is only required for a Mellanox NIC.
3 This mounts the /dev/net/tun and /dev/vhost-net devices into the container so the application can create a tap device and connect the tap device to the DPDK workload.
4 The vendor hexadecimal code of the SR-IOV network device. The value 15b3 is associated with a Mellanox NIC.
5 The device hexadecimal code of the SR-IOV network device.
Create the SriovNetworkNodePolicy object by running the following command:
$ oc create -f sriov-node-network-policy.yaml
Create the following SriovNetwork object, and then save the YAML in the sriov-network-attachment.yaml file:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: test-namespace
  resourceName: sriovnic
  spoofChk: "off"
  trust: "on"
See the “Configuring SR-IOV additional network” section for a detailed explanation of each option in SriovNetwork.
An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.
Create the SriovNetwork object by running the following command:
$ oc create -f sriov-network-attachment.yaml
Create a file, such as tap-example.yaml, that defines a network attachment definition, with content like the following example:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: tap-one
  namespace: test-namespace (1)
spec:
  config: '{
      "cniVersion": "0.4.0",
      "name": "tap",
      "plugins": [
        {
          "type": "tap",
          "multiQueue": true,
          "selinuxcontext": "system_u:system_r:container_t:s0"
        },
        {
          "type":"tuning",
          "capabilities":{
            "mac":true
          }
        }
      ]
    }'
1 Specify the same target_namespace where the SriovNetwork object is created.
Create the NetworkAttachmentDefinition object by running the following command:
$ oc apply -f tap-example.yaml
Create a file, such as dpdk-pod-rootless.yaml, with content like the following example:
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: test-namespace (1)
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {"name": "sriov-network", "namespace": "test-namespace"},
      {"name": "tap-one", "interface": "ext0", "namespace": "test-namespace"}]'
spec:
  nodeSelector:
    kubernetes.io/hostname: "worker-0"
  securityContext:
    fsGroup: 1001 (2)
    runAsGroup: 1001 (3)
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: testpmd
    image: <DPDK_image> (4)
    securityContext:
      capabilities:
        drop: ["ALL"] (5)
        add: (6)
          - IPC_LOCK
          - NET_RAW #for mlx only (7)
      runAsUser: 1001 (8)
      privileged: false (9)
      allowPrivilegeEscalation: true (10)
      runAsNonRoot: true (11)
    volumeMounts:
    - mountPath: /mnt/huge (12)
      name: hugepages
    resources:
      limits:
        openshift.io/sriovnic: "1" (13)
        memory: "1Gi"
        cpu: "4" (14)
        hugepages-1Gi: "4Gi" (15)
      requests:
        openshift.io/sriovnic: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  runtimeClassName: performance-cnf-performanceprofile (16)
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
1 Specify the same target_namespace in which the SriovNetwork object is created. If you want to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
2 Sets the group ownership of volume-mounted directories and files created in those volumes.
3 Specify the primary group ID used for running the container.
4 Specify the DPDK image that contains your application and the DPDK library used by the application.
5 Removing all capabilities (ALL) from the container’s securityContext means that the container has no special privileges beyond what is necessary for normal operation.
6 Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access. These capabilities must also be set in the binary file by using the setcap command.
7 A Mellanox network interface controller (NIC) requires the NET_RAW capability.
8 Specify the user ID used for running the container.
9 This setting indicates that the container or containers within the pod should not be granted privileged access to the host system.
10 This setting allows a container to escalate its privileges beyond the initial non-root privileges it might have been assigned.
11 This setting ensures that the container runs with a non-root user. This helps enforce the principle of least privilege, limiting the potential impact of compromising the container and reducing the attack surface.
12 Mount a hugepage volume to the DPDK pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being HugePages.
13 Optional: Specify the number of DPDK devices allocated for the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
14 Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting the CPU Manager policy to static and creating a pod with Guaranteed QoS.
15 Specify the hugepage size, hugepages-1Gi or hugepages-2Mi, and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes. For example, adding the kernel arguments default_hugepagesz=1GB, hugepagesz=1G, and hugepages=16 results in 16 1Gi hugepages being allocated during system boot.
16 If your performance profile is not named cnf-performanceprofile, replace that string with the correct performance profile name.
Create the DPDK pod by running the following command:
$ oc create -f dpdk-pod-rootless.yaml
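The callouts for the rootless pod state that the added capabilities must also be set on the application binary by using the setcap command. The following sketch shows one possible way to derive the setcap capability string from the names used in the Pod spec. The setcap_spec function is a hypothetical helper, and the +ep flags (effective and permitted) are an assumption about what the application needs.

```shell
#!/bin/bash
# Hypothetical helper: turn Kubernetes capability names (for example IPC_LOCK)
# into a setcap capability string such as cap_ipc_lock,cap_net_raw+ep.
setcap_spec() {
  local spec="" cap
  for cap in "$@"; do
    # setcap expects lowercase names with a cap_ prefix.
    cap="cap_$(printf '%s' "$cap" | tr '[:upper:]' '[:lower:]')"
    spec="${spec:+$spec,}$cap"
  done
  printf '%s+ep\n' "$spec"
}

setcap_spec IPC_LOCK NET_RAW
# prints: cap_ipc_lock,cap_net_raw+ep
```

When building the image as root, the result could be passed to setcap, for example setcap "$(setcap_spec IPC_LOCK NET_RAW)" <path_to_dpdk_binary>, where the binary path depends on your image.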
Overview of achieving a specific DPDK line rate
To achieve a specific Data Plane Development Kit (DPDK) line rate, deploy a Node Tuning Operator and configure Single Root I/O Virtualization (SR-IOV). You must also tune the DPDK settings for the following resources:
Isolated CPUs
Hugepages
The topology scheduler
In previous versions of OKD, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OKD applications. In OKD 4.11 and later, this functionality is part of the Node Tuning Operator.
DPDK test environment
The following diagram shows the components of a traffic-testing environment:
Traffic generator: An application that can generate high-volume packet traffic.
SR-IOV-supporting NIC: A network interface card compatible with SR-IOV. The card runs a number of virtual functions on a physical interface.
Physical Function (PF): A PCI Express (PCIe) function of a network adapter that supports the SR-IOV interface.
Virtual Function (VF): A lightweight PCIe function on a network adapter that supports SR-IOV. The VF is associated with the PCIe PF on the network adapter. The VF represents a virtualized instance of the network adapter.
Switch: A network switch. Nodes can also be connected back-to-back.
testpmd: An example application included with DPDK. The testpmd application can be used to test the DPDK in a packet-forwarding mode. The testpmd application is also an example of how to build a fully-fledged application using the DPDK Software Development Kit (SDK).
worker 0 and worker 1: OKD nodes.
Using SR-IOV and the Node Tuning Operator to achieve a DPDK line rate
You can use the Node Tuning Operator to configure isolated CPUs, hugepages, and a topology scheduler. You can then use the Node Tuning Operator with Single Root I/O Virtualization (SR-IOV) to achieve a specific Data Plane Development Kit (DPDK) line rate.
Prerequisites
You have installed the OpenShift CLI (oc).
You have installed the SR-IOV Network Operator.
You have logged in as a user with cluster-admin privileges.
You have deployed a standalone Node Tuning Operator.
In previous versions of OKD, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OKD 4.11 and later, this functionality is part of the Node Tuning Operator.
Procedure
Create a PerformanceProfile object based on the following example:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  globallyDisableIrqLoadBalancing: true
  cpu:
    isolated: 21-51,73-103 (1)
    reserved: 0-20,52-72 (2)
  hugepages:
    defaultHugepagesSize: 1G (3)
    pages:
      - count: 32
        size: 1G
  net:
    userLevelNetworking: true
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
1 If hyperthreading is enabled on the system, allocate the relevant sibling threads to the isolated and reserved CPU groups. If the system contains multiple non-uniform memory access (NUMA) nodes, allocate CPUs from both NUMA nodes to both groups. You can also use the Performance Profile Creator for this task. For more information, see Creating a performance profile.
2 You can also specify a list of devices that will have their queues set to the reserved CPU count. For more information, see Reducing NIC queues using the Node Tuning Operator.
3 Allocate the number and size of hugepages needed. You can specify the NUMA configuration for the hugepages. By default, the system allocates an even number to every NUMA node on the system. If needed, you can request the use of a realtime kernel for the nodes. See Provisioning a worker with real-time capabilities for more information.
Save the YAML file as mlx-dpdk-perfprofile-policy.yaml.
Apply the performance profile using the following command:
$ oc create -f mlx-dpdk-perfprofile-policy.yaml
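Before applying a profile, it can help to check how many CPUs each group in the cpu stanza covers. The following is a minimal sketch that counts the CPUs in a list such as the isolated and reserved values from the example profile. The count_cpus function is a hypothetical helper and assumes a comma-separated list of single CPUs and inclusive ranges.

```shell
#!/bin/bash
# Hypothetical helper: count the CPUs in a list such as "21-51,73-103".
# Each comma-separated part is either a single CPU ("5") or an inclusive
# range ("21-51").
count_cpus() {
  local total=0 part start end
  local IFS=','
  for part in $1; do
    start="${part%-*}"   # text before the dash, or the whole part if no dash
    end="${part#*-}"     # text after the dash, or the whole part if no dash
    total=$(( total + end - start + 1 ))
  done
  echo "$total"
}

count_cpus "21-51,73-103"   # isolated group from the example profile
count_cpus "0-20,52-72"     # reserved group from the example profile
```

For the example profile, the two groups contain 62 and 42 CPUs, which together cover CPUs 0 through 103.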
Example SR-IOV Network Operator for virtual functions
You can use the Single Root I/O Virtualization (SR-IOV) Network Operator to allocate and configure Virtual Functions (VFs) from SR-IOV-supporting Physical Function NICs on the nodes.
For more information on deploying the Operator, see Installing the SR-IOV Network Operator. For more information on configuring an SR-IOV network device, see Configuring an SR-IOV network device.
There are some differences between running Data Plane Development Kit (DPDK) workloads on Intel VFs and Mellanox VFs. This section provides object configuration examples for both VF types. The following is an example of a SriovNetworkNodePolicy object used to run DPDK applications on Intel NICs:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci (1)
  needVhostNet: true (2)
  nicSelector:
    pfNames: ["ens3f0"]
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 10
  priority: 99
  resourceName: dpdk_nic_1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  needVhostNet: true
  nicSelector:
    pfNames: ["ens3f1"]
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 10
  priority: 99
  resourceName: dpdk_nic_2
1 For Intel NICs, deviceType must be vfio-pci.
2 If kernel communication with DPDK workloads is required, add needVhostNet: true. This mounts the /dev/net/tun and /dev/vhost-net devices into the container so the application can create a tap device and connect the tap device to the DPDK workload.
The following is an example of a SriovNetworkNodePolicy object for Mellanox NICs:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice (1)
  isRdma: true (2)
  nicSelector:
    rootDevices:
      - "0000:5e:00.1"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  priority: 99
  resourceName: dpdk_nic_1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    rootDevices:
      - "0000:5e:00.0"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  priority: 99
  resourceName: dpdk_nic_2
1 For Mellanox devices, deviceType must be netdevice.
2 For Mellanox devices, isRdma must be true. Mellanox cards are connected to DPDK applications using Flow Bifurcation. This mechanism splits traffic between Linux user space and kernel space, and can enhance line rate processing capability.
Example SR-IOV network operator
The following is an example definition of a SriovNetwork object. In this case, Intel and Mellanox configurations are identical:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dpdk-network-1
  namespace: openshift-sriov-network-operator
spec:
  ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.1.0/24"}]],"dataDir":
    "/run/my-orchestrator/container-ipam-state-1"}' (1)
  networkNamespace: dpdk-test (2)
  spoofChk: "off"
  trust: "on"
  resourceName: dpdk_nic_1 (3)
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dpdk-network-2
  namespace: openshift-sriov-network-operator
spec:
  ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.2.0/24"}]],"dataDir":
    "/run/my-orchestrator/container-ipam-state-1"}'
  networkNamespace: dpdk-test
  spoofChk: "off"
  trust: "on"
  resourceName: dpdk_nic_2
1 You can use a different IP Address Management (IPAM) implementation, such as Whereabouts. For more information, see Dynamic IP address assignment configuration with Whereabouts.
2 You must request the networkNamespace where the network attachment definition will be created. You must create the SriovNetwork CR under the openshift-sriov-network-operator namespace.
3 The resourceName value must match that of the resourceName created under the SriovNetworkNodePolicy.
Example DPDK base workload
The following is an example of a Data Plane Development Kit (DPDK) container:
apiVersion: v1
kind: Namespace
metadata:
  name: dpdk-test
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[ (1)
      {
        "name": "dpdk-network-1",
        "namespace": "dpdk-test"
      },
      {
        "name": "dpdk-network-2",
        "namespace": "dpdk-test"
      }
    ]'
    irq-load-balancing.crio.io: "disable" (2)
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
  labels:
    app: dpdk
  name: testpmd
  namespace: dpdk-test
spec:
  runtimeClassName: performance-performance (3)
  containers:
    - command:
        - /bin/bash
        - -c
        - sleep INF
      image: registry.redhat.io/openshift4/dpdk-base-rhel8
      imagePullPolicy: Always
      name: dpdk
      resources: (4)
        limits:
          cpu: "16"
          hugepages-1Gi: 8Gi
          memory: 2Gi
        requests:
          cpu: "16"
          hugepages-1Gi: 8Gi
          memory: 2Gi
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
            - SYS_RESOURCE
            - NET_RAW
            - NET_ADMIN
        runAsUser: 0
      volumeMounts:
        - mountPath: /mnt/huge
          name: hugepages
  terminationGracePeriodSeconds: 5
  volumes:
    - emptyDir:
        medium: HugePages
      name: hugepages
1 Request the SR-IOV networks you need. Resources for the devices will be injected automatically.
2 Disable the CPU and IRQ load balancing base. See Disabling interrupt processing for individual pods for more information.
3 Set the runtimeClass to performance-performance. Do not set the runtimeClass to HostNetwork or privileged.
4 Request an equal number of resources for requests and limits to start the pod with Guaranteed Quality of Service (QoS).
Do not start the pod with |
Example testpmd script
The following is an example script for running testpmd:
#!/bin/bash
set -ex
export CPU=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)
echo ${CPU}
dpdk-testpmd -l ${CPU} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_2} -n 4 -- -i --nb-cores=15 --rxd=4096 --txd=4096 --rxq=7 --txq=7 --forward-mode=mac --eth-peer=0,50:00:00:00:00:01 --eth-peer=1,50:00:00:00:00:02
This example uses two different SriovNetwork CRs. The environment variable contains the Virtual Function (VF) PCI address that was allocated for the pod. If you use the same network in the pod definition, you must split the pciAddress. It is important to configure the correct MAC addresses of the traffic generator. This example uses custom MAC addresses.
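When the same network is requested more than once, the PCI addresses need to be split before they are passed to testpmd. The following sketch shows one way to do this, assuming the environment variable then holds a comma-separated list of PCI addresses. The pci_allow_args function is a hypothetical helper, and the addresses in the example are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: turn a comma-separated list of PCI addresses into
# repeated "-a <address>" arguments for dpdk-testpmd.
# Assumption: the PCIDEVICE_* variable lists multiple VFs separated by commas.
pci_allow_args() {
  local addr args=""
  local IFS=','
  for addr in $1; do
    args="${args:+$args }-a $addr"
  done
  printf '%s\n' "$args"
}

# Example with placeholder addresses:
pci_allow_args "0000:5e:02.0,0000:5e:02.1"
# prints: -a 0000:5e:02.0 -a 0000:5e:02.1
```

In the testpmd script, the result could replace the individual -a options, for example dpdk-testpmd -l ${CPU} $(pci_allow_args "${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1}") followed by the remaining options.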
Using a virtual function in RDMA mode with a Mellanox NIC
RDMA over Converged Ethernet (RoCE) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
RDMA over Converged Ethernet (RoCE) is the only supported mode when using RDMA on OKD.
Prerequisites
Install the OpenShift CLI (oc).
Install the SR-IOV Network Operator.
Log in as a user with cluster-admin privileges.
Procedure
Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-rdma-node-policy.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-rdma-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "15b3"
    deviceID: "1015" (1)
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: netdevice (2)
  isRdma: true (3)
1 Specify the device hex code of the SR-IOV network device.
2 Set the driver type for the virtual functions to netdevice.
3 Enable RDMA mode.
See the Configuring SR-IOV network devices section for a detailed explanation of each option in SriovNetworkNodePolicy.
When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure beforehand that there are enough available nodes in your cluster to handle the evicted workload.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
$ oc create -f mlx-rdma-node-policy.yaml
Create the following
SriovNetwork
object, and then save the YAML in themlx-rdma-network.yaml
file.apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: mlx-rdma-network
namespace: openshift-sriov-network-operator
spec:
networkNamespace: <target_namespace>
ipam: |- (1)
# ...
vlan: <vlan>
resourceName: mlxnics
1 Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition. See the “Configuring SR-IOV additional network” section for a detailed explanation on each option in
SriovNetwork
.An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.
Create the SriovNetwork object by running the following command:
$ oc create -f mlx-rdma-network.yaml
Create the following Pod spec, and then save the YAML in the mlx-rdma-pod.yaml file.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  namespace: <target_namespace> (1)
  annotations:
    k8s.v1.cni.cncf.io/networks: mlx-rdma-network
spec:
  containers:
  - name: testpmd
    image: <RDMA_image> (2)
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] (3)
    volumeMounts:
    - mountPath: /dev/hugepages (4)
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "4" (5)
        hugepages-1Gi: "4Gi" (6)
      requests:
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
1 Specify the same target_namespace where the SriovNetwork object mlx-rdma-network is created. To create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
2 Specify the RDMA image that includes your application and the RDMA library used by the application.
3 Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
4 Mount the hugepage volume to the RDMA pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being HugePages.
5 Specify the number of CPUs. The RDMA pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting the CPU Manager policy to static and creating the pod with Guaranteed QoS.
6 Specify the hugepage size, hugepages-1Gi or hugepages-2Mi, and the quantity of hugepages that will be allocated to the RDMA pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes.
Create the RDMA pod by running the following command:
$ oc create -f mlx-rdma-pod.yaml
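Callout 5 depends on the CPU Manager static policy. One way to enable it, sketched here under assumptions, is a KubeletConfig object that targets the machine config pool containing the SR-IOV nodes. The object name cpumanager-enabled and the custom-kubelet label are hypothetical; the target pool must carry a matching label for the selector to take effect.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled                 # hypothetical object name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled   # hypothetical label on the target pool
  kubeletConfig:
    # The static policy gives exclusive CPUs to Guaranteed QoS pods
    # that request whole (integer) CPU counts.
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
```

The pod in this procedure already sets equal requests and limits with an integer CPU count, which is what places it in the Guaranteed QoS class.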
A test pod template for clusters that use OVS-DPDK on OpenStack
The following testpmd pod demonstrates container creation with huge pages, reserved CPUs, and the SR-IOV port.
An example testpmd pod
apiVersion: v1
kind: Pod
metadata:
  name: testpmd-dpdk
  namespace: mynamespace
  annotations:
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
# ...
spec:
  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/dpdk1: 1 (1)
      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
        openshift.io/dpdk1: 1
    volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
        readOnly: False
  runtimeClassName: performance-cnf-performanceprofile (2)
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages
1 The name dpdk1 in this example is the resource name from a user-created SriovNetworkNodePolicy object. You can substitute the name of a resource that you create.
2 If your performance profile is not named cnf-performanceprofile, replace that string with the correct performance profile name.
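The runtimeClassName in callout 2 follows the pattern performance-<profile_name>, because the Node Tuning Operator creates a runtime class for each performance profile. The following sketch shows a profile that would yield performance-cnf-performanceprofile and provide the 1Gi huge pages that the pod requests. The CPU ranges, page count, and worker-cnf node label are assumptions for illustration only.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: cnf-performanceprofile   # runtime class becomes performance-cnf-performanceprofile
spec:
  cpu:
    isolated: "2-7"              # assumed CPUs available to latency-sensitive pods
    reserved: "0-1"              # assumed CPUs kept for housekeeping
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 4                   # assumed number of 1G pages per node
      size: 1G
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""   # assumed node role label
```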
A test pod template for clusters that use OVS hardware offloading on OpenStack
The following testpmd pod demonstrates Open vSwitch (OVS) hardware offloading on OpenStack.
An example testpmd pod
apiVersion: v1
kind: Pod
metadata:
  name: testpmd-sriov
  namespace: mynamespace
  annotations:
    k8s.v1.cni.cncf.io/networks: hwoffload1
spec:
  runtimeClassName: performance-cnf-performanceprofile (1)
  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
    volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
        readOnly: False
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages
1 If your performance profile is not named cnf-performanceprofile, replace that string with the correct performance profile name.
Additional resources
Dynamic IP address assignment configuration with Whereabouts
The app-netutil library provides several API methods for gathering network information about a container’s parent pod.