Enabling GPU and TPU

Enable GPU and TPU for Kubeflow Pipelines on Google Kubernetes Engine (GKE)

Out of date

This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1.

This page describes how to enable GPU or TPU for a pipeline on GKE by using the Kubeflow Pipelines DSL.

Prerequisites

To enable GPU and TPU on your Kubeflow cluster, follow the instructions on how to customize the GKE cluster for Kubeflow. Do this before you set up the cluster.

Configure ContainerOp to consume GPUs

After you enable GPU support, the Kubeflow setup script installs a default GPU node pool of type nvidia-tesla-k80 with autoscaling enabled. The following code requests 2 GPUs for a ContainerOp.

    import kfp.dsl as dsl
    gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)

The code above compiles to the following Kubernetes Pod spec:

    container:
      ...
      resources:
        limits:
          nvidia.com/gpu: "2"
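As a rough illustration of the mapping above, the following plain-Python sketch builds the same fragment as a dictionary. It does not call the Kubeflow Pipelines SDK; `gpu_limit_fragment` is a hypothetical helper, introduced here only to show that the GPU count ends up as a string-valued `nvidia.com/gpu` limit.

```python
def gpu_limit_fragment(gpu_count: int) -> dict:
    """Hypothetical helper: mirror the Pod spec fragment shown above."""
    return {
        "container": {
            "resources": {
                # The count is written as a quoted string, as in the spec above.
                "limits": {"nvidia.com/gpu": str(gpu_count)},
            }
        }
    }


fragment = gpu_limit_fragment(2)
print(fragment["container"]["resources"]["limits"])
```

A helper like this can serve as a reference value when comparing against the fragment that the compiler actually emits.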

If the cluster has multiple node pools with different GPU types, you can select the GPU type by adding a node selector constraint:

    import kfp.dsl as dsl
    gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)
    gpu_op.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p4')

The code above compiles to the following Kubernetes Pod spec:

    container:
      ...
      resources:
        limits:
          nvidia.com/gpu: "2"
    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-tesla-p4
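To sanity-check a compiled spec, you could read the GPU count and accelerator type back out of it. The sketch below assumes the fragment has already been loaded into a Python dictionary with the layout shown above; `describe_gpu_request` is a hypothetical helper, not part of the Kubeflow Pipelines SDK.

```python
GKE_ACCELERATOR_LABEL = "cloud.google.com/gke-accelerator"


def describe_gpu_request(pod_spec: dict) -> tuple:
    """Hypothetical helper: return (gpu_count, gpu_type) from a Pod spec fragment."""
    limits = pod_spec.get("container", {}).get("resources", {}).get("limits", {})
    # A missing limit means no GPUs were requested.
    gpu_count = int(limits.get("nvidia.com/gpu", 0))
    # gpu_type is None when no node selector pins the accelerator type.
    gpu_type = pod_spec.get("nodeSelector", {}).get(GKE_ACCELERATOR_LABEL)
    return gpu_count, gpu_type


compiled = {
    "container": {"resources": {"limits": {"nvidia.com/gpu": "2"}}},
    "nodeSelector": {GKE_ACCELERATOR_LABEL: "nvidia-tesla-p4"},
}
print(describe_gpu_request(compiled))  # (2, 'nvidia-tesla-p4')
```

Without a node selector, the second element is `None`, which corresponds to letting the scheduler pick any node pool that satisfies the GPU limit.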

Check the GKE GPU guide to learn more about GPU settings.

Configure ContainerOp to consume TPUs

Use the following code to configure ContainerOp to consume TPUs on GKE:

    import kfp.dsl as dsl
    import kfp.gcp as gcp
    tpu_op = dsl.ContainerOp(name='tpu-op', ...).apply(
        gcp.use_tpu(tpu_cores=8, tpu_resource='v2', tf_version='1.12'))

The code above requests 8 v2 TPU cores and sets the TensorFlow version to 1.12. It compiles to the following Kubernetes Pod spec:

    container:
      ...
      resources:
        limits:
          cloud-tpus.google.com/v2: "8"
    metadata:
      annotations:
        tf-version.cloud-tpus.google.com: "1.12"
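The three arguments map onto the spec in a regular way: the TPU resource name becomes part of the limit key, the core count becomes its value, and the TensorFlow version becomes an annotation. The plain-Python sketch below reproduces that mapping as a dictionary. `tpu_fragment` is a hypothetical helper written for illustration and does not reflect how `gcp.use_tpu` is implemented internally.

```python
def tpu_fragment(tpu_cores: int, tpu_resource: str, tf_version: str) -> dict:
    """Hypothetical helper: mirror the TPU Pod spec fragment shown above."""
    return {
        "container": {
            "resources": {
                # The TPU resource name (for example 'v2') is part of the limit key.
                "limits": {f"cloud-tpus.google.com/{tpu_resource}": str(tpu_cores)},
            }
        },
        "metadata": {
            # The TensorFlow version is carried as a Pod annotation.
            "annotations": {"tf-version.cloud-tpus.google.com": tf_version},
        },
    }


spec = tpu_fragment(tpu_cores=8, tpu_resource="v2", tf_version="1.12")
print(spec["container"]["resources"]["limits"])
print(spec["metadata"]["annotations"])
```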

See the GKE TPU guide to learn more about TPU settings.

Last modified 20.04.2021: Apply Docs Restructure to `v1.2-branch` = update `v1.2-branch` to current `master` v2 (#2612) (4e2602bd)