# GPU Support

You can use GPU Operator to install NVIDIA device drivers and tools on your cluster.

## Creating a cluster with GPU nodes

Due to the cost of GPU instances, you want to minimize the number of Pods running on them. Therefore, start by provisioning a regular cluster following the getting started documentation.
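
As a minimal sketch, a regular cluster on AWS might be created with something like the following; the cluster name and state store are placeholders, and the getting started documentation covers the prerequisites in full:

```shell
kops create cluster \
  --name=<cluster name> \
  --state=s3://<state store> \
  --zones=eu-central-1a \
  --yes
```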

Once the cluster is running, add an instance group with GPUs:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```

Note the taint used above. It prevents Pods from being scheduled on the GPU nodes unless they explicitly tolerate it. The GPU Operator resources tolerate this taint by default. Also note the node label we set; it is used to ensure the GPU Operator resources run on GPU nodes.
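
Assuming you saved the manifest above as `gpu-nodes.yaml` (the filename is arbitrary), you can create the instance group and apply the change with:

```shell
kops create -f gpu-nodes.yaml
kops update cluster <cluster name> --yes
```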

## Install GPU Operator

GPU Operator is installed using Helm. See the general install instructions for GPU Operator.

In order to match the kOps environment, create a `values.yaml` file with the following content:

```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
```
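
With `values.yaml` in place, a typical install following NVIDIA's chart repository looks like the sketch below; the release name `gpu-operator` is arbitrary, and the GPU Operator install instructions remain the authoritative reference for chart details:

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -f values.yaml
```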

Once you have installed the Helm chart, you should see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.
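
For example:

```shell
kubectl get pods -n gpu-operator-resources
```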

You should now be able to schedule GPU workloads by adding the following properties to the Pod spec:

```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
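
Note that for a container to actually be allocated a GPU, it must also request the `nvidia.com/gpu` resource exposed by the device plugin. A minimal sketch of a complete Pod; the sample image name and tag are illustrative, so substitute your own GPU workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    # Illustrative CUDA sample image; replace with your own workload.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1  # request one GPU from the device plugin
```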