Using Preemptible VMs and GPUs on GCP

Configuring preemptible VMs and GPUs for Kubeflow Pipelines on GCP

Out of date

This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1.

This document describes how to configure preemptible virtual machines (preemptible VMs) and GPUs on preemptible VM instances (preemptible GPUs) for your workflows running on Kubeflow Pipelines on Google Cloud Platform (GCP).

Introduction

Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. The pricing of preemptible VMs is lower than that of standard Compute Engine VMs.

GPUs attached to preemptible instances (preemptible GPUs) work like normal GPUs but persist only for the life of the instance.

Using preemptible VMs and GPUs can reduce costs on GCP. In addition, your Google Kubernetes Engine (GKE) cluster can autoscale based on current workloads.

This guide assumes that you have already deployed Kubeflow Pipelines. If not, follow the guide to deploying Kubeflow on GCP.

Using preemptible VMs with Kubeflow Pipelines

In summary, the steps to schedule a pipeline to run on preemptible VMs are as follows:

  1. Create a node pool in your cluster that contains preemptible VMs.
  2. Configure your pipelines to run on the preemptible VMs.

The following sections contain more detail about the above steps.

1. Create a node pool with preemptible VMs

Use the gcloud command to create a node pool. The following example includes placeholders to illustrate the important configurations:

    gcloud container node-pools create PREEMPTIBLE_CPU_POOL \
      --cluster=CLUSTER_NAME \
      --enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
      --preemptible \
      --node-taints=preemptible=true:NoSchedule \
      --service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com

Where:

  • PREEMPTIBLE_CPU_POOL is the name of the node pool.
  • CLUSTER_NAME is the name of the GKE cluster.
  • MAX_NODES and MIN_NODES are the maximum and minimum number of nodes for the GKE autoscaling functionality.
  • DEPLOYMENT_NAME is the name of your Kubeflow deployment. If you used the CLI to deploy Kubeflow, this name is the value of the ${KF_NAME} environment variable. If you used the deployment UI, this name is the value you specified as the deployment name.
  • PROJECT_NAME is the name of your GCP project.

Below is an example of the command:

    gcloud container node-pools create preemptible-cpu-pool \
      --cluster=user-4-18 \
      --enable-autoscaling --max-nodes=4 --min-nodes=0 \
      --preemptible \
      --node-taints=preemptible=true:NoSchedule \
      --service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com
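If you create node pools from a script, the command above can be assembled programmatically. The following is a minimal sketch that uses only the Python standard library. The helper name `node_pool_create_command` is hypothetical and is not part of gcloud or Kubeflow; it only builds the argument list and does not run anything.

```python
import shlex


def node_pool_create_command(pool_name, cluster, min_nodes, max_nodes,
                             service_account,
                             taint="preemptible=true:NoSchedule"):
    """Return the gcloud argument list for a preemptible, autoscaled node pool.

    Hypothetical helper for illustration only. The flags are the ones
    shown in the command above.
    """
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("require 0 <= min_nodes <= max_nodes")
    return [
        "gcloud", "container", "node-pools", "create", pool_name,
        f"--cluster={cluster}",
        "--enable-autoscaling",
        f"--max-nodes={max_nodes}",
        f"--min-nodes={min_nodes}",
        "--preemptible",
        f"--node-taints={taint}",
        f"--service-account={service_account}",
    ]


command = node_pool_create_command(
    pool_name="preemptible-cpu-pool",
    cluster="user-4-18",
    min_nodes=0,
    max_nodes=4,
    service_account="user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com",
)

# Print a shell-safe version of the command for review before running it.
print(shlex.join(command))
```

To run the command from Python, you could pass the list to `subprocess.run(command, check=True)`. Passing a list instead of a single string avoids shell quoting problems.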

2. Schedule your pipeline to run on the preemptible VMs

After configuring a node pool with preemptible VMs, you must configure your pipelines to run on the preemptible VMs.

In the DSL code for your pipeline, add the following to the ContainerOp instance:

    .apply(gcp.use_preemptible_nodepool())

The above function works regardless of which method you used to generate the ContainerOp.

Note:

  • Call .set_retry(NUM_RETRY) on your ContainerOp to retry the task after it is preempted, where NUM_RETRY is the maximum number of retries.
  • If you modified the node taint when creating the node pool, pass the matching toleration to the use_preemptible_nodepool() function.
  • use_preemptible_nodepool() also accepts a hard_constraint parameter. When hard_constraint is True, the system schedules the task only on preemptible VMs. When hard_constraint is False, the system tries to schedule the task on preemptible VMs and falls back to normal VMs if no preemptible VMs are available or the preemptible VMs are busy.
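To make the toleration and hard_constraint notes above more concrete, the following sketch builds the kind of Kubernetes scheduling fields that such a configuration corresponds to, as plain dictionaries. It is an illustration under stated assumptions, not the kfp implementation: the function name `preemptible_scheduling` is hypothetical, the node label `cloud.google.com/gke-preemptible` is assumed to be the label that GKE sets on preemptible nodes, and the weight of 50 is an arbitrary choice.

```python
import json

# Assumption: GKE labels preemptible nodes with this key set to "true".
PREEMPTIBLE_LABEL = "cloud.google.com/gke-preemptible"


def preemptible_scheduling(taint_key="preemptible", taint_value="true",
                           effect="NoSchedule", hard_constraint=False):
    """Return pod-spec style fields for scheduling on preemptible nodes.

    Hypothetical helper for illustration only.
    """
    # The toleration must match the taint set with --node-taints.
    toleration = {
        "key": taint_key,
        "operator": "Equal",
        "value": taint_value,
        "effect": effect,
    }
    term = {
        "matchExpressions": [
            {"key": PREEMPTIBLE_LABEL, "operator": "In", "values": ["true"]}
        ]
    }
    if hard_constraint:
        # Strict: the pod can only run on matching nodes.
        node_affinity = {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [term]
            }
        }
    else:
        # Best effort: the scheduler prefers matching nodes but may use others.
        node_affinity = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 50, "preference": term}
            ]
        }
    return {"tolerations": [toleration],
            "affinity": {"nodeAffinity": node_affinity}}


spec_soft = preemptible_scheduling()
spec_hard = preemptible_scheduling(hard_constraint=True)
print(json.dumps(spec_hard, indent=2))
```

The toleration only allows the pod onto tainted nodes. The node affinity is what steers the pod toward them, which is why the two settings appear together.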

For example:

    import kfp.dsl as dsl
    import kfp.gcp as gcp


    class FlipCoinOp(dsl.ContainerOp):
        """Flip a coin and output heads or tails randomly."""

        def __init__(self):
            super(FlipCoinOp, self).__init__(
                name='Flip',
                image='python:alpine3.6',
                command=['sh', '-c'],
                arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                           'else \'tails\'; print(result)" | tee /tmp/output'],
                file_outputs={'output': '/tmp/output'})


    @dsl.pipeline(
        name='pipeline flip coin',
        description='Flips a coin on a preemptible node pool.'
    )
    def flipcoin():
        # Schedule the step on the preemptible node pool.
        flip = FlipCoinOp().apply(gcp.use_preemptible_nodepool())


    if __name__ == '__main__':
        # Compile the pipeline into a package that you can upload.
        import kfp.compiler as compiler
        compiler.Compiler().compile(flipcoin, __file__ + '.zip')
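The retry behavior mentioned in the notes above can be pictured with a small simulation. This is a sketch only: it does not use kfp, the names `Preempted`, `run_with_retries`, and `make_flaky_task` are hypothetical, and it assumes that a retry limit of N allows up to N + 1 attempts in total.

```python
class Preempted(Exception):
    """Stands in for a task failure caused by a node being preempted."""


def run_with_retries(task, num_retries):
    """Run task, retrying up to num_retries times after a preemption.

    Returns a (result, attempts) tuple. Re-raises Preempted once the
    retry budget is used up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Preempted:
            if attempts > num_retries:
                raise


def make_flaky_task(preemptions):
    """Return a task that is preempted a fixed number of times, then succeeds."""
    state = {"left": preemptions}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise Preempted("node was preempted")
        return "heads"

    return task


result, attempts = run_with_retries(make_flaky_task(2), num_retries=3)
print(result, attempts)  # heads 3
```

A task that is preempted more often than the retry limit allows still fails, so choose the limit with the expected preemption rate of your node pool in mind.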

Using preemptible GPUs with Kubeflow Pipelines

In summary, the steps to schedule a pipeline to run with preemptible GPUs are as follows:

  1. Make sure you have enough GPU quota.
  2. Create a node pool in your GKE cluster that contains preemptible VMs with preemptible GPUs.
  3. Configure your pipelines to run on the preemptible VMs with preemptible GPUs.

The following sections contain more detail about the above steps.

1. Make sure you have enough GPU quota

Add GPU quota to your GCP project. The GCP documentation lists the availability of GPUs across regions. To check the available quota for resources in your project, go to the Quotas page in the GCP Console.

2. Create a node pool of preemptible VMs with preemptible GPUs

Use the gcloud command to create a node pool. The following example includes placeholders to illustrate the important configurations:

    gcloud container node-pools create PREEMPTIBLE_GPU_POOL \
      --cluster=CLUSTER_NAME \
      --enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
      --preemptible \
      --node-taints=preemptible=true:NoSchedule \
      --service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com \
      --accelerator=type=GPU_TYPE,count=GPU_COUNT

Where:

  • PREEMPTIBLE_GPU_POOL is the name of the node pool.
  • CLUSTER_NAME is the name of the GKE cluster.
  • MAX_NODES and MIN_NODES are the maximum and minimum number of nodes for the GKE autoscaling functionality.
  • DEPLOYMENT_NAME is the name of your Kubeflow deployment. If you used the CLI to deploy Kubeflow, this name is the value of the ${KF_NAME} environment variable. If you used the deployment UI, this name is the value you specified as the deployment name.
  • PROJECT_NAME is the name of your GCP project.
  • GPU_TYPE is the type of GPU.
  • GPU_COUNT is the number of GPUs.

Below is an example of the command:

    gcloud container node-pools create preemptible-gpu-pool \
      --cluster=user-4-18 \
      --enable-autoscaling --max-nodes=4 --min-nodes=0 \
      --preemptible \
      --node-taints=preemptible=true:NoSchedule \
      --service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com \
      --accelerator=type=nvidia-tesla-t4,count=2
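When choosing MAX_NODES and GPU_COUNT, it can help to check how many GPUs the pool could request at full scale and compare that number with your quota. The following sketch does that arithmetic for the example above. The helper names are hypothetical, and the sketch assumes that every node in the pool carries GPU_COUNT GPUs.

```python
def accelerator_flag(gpu_type, gpu_count):
    """Build the --accelerator flag value used in the command above."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return f"--accelerator=type={gpu_type},count={gpu_count}"


def max_gpus_needed(max_nodes, gpus_per_node):
    """Return the number of GPUs the pool uses when fully scaled out."""
    return max_nodes * gpus_per_node


flag = accelerator_flag("nvidia-tesla-t4", 2)
peak = max_gpus_needed(max_nodes=4, gpus_per_node=2)

print(flag)  # --accelerator=type=nvidia-tesla-t4,count=2
print(peak)  # 8
```

In this example the autoscaler could ask for up to 8 GPUs, so a quota below 8 would keep the pool from reaching its maximum size.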

3. Schedule your pipeline to run on the preemptible VMs with preemptible GPUs

In the DSL code for your pipeline, add the following to the ContainerOp instance:

    .apply(gcp.use_preemptible_nodepool())

The above function works regardless of which method you used to generate the ContainerOp.

Note:

  • Call .set_gpu_limit(NUM_GPUS, GPU_VENDOR) on your ContainerOp to specify the GPU limit (for example, 1) and vendor (for example, 'nvidia').
  • Call .set_retry(NUM_RETRY) on your ContainerOp to retry the task after it is preempted, where NUM_RETRY is the maximum number of retries.
  • If you modified the node taint when creating the node pool, pass the matching toleration to the use_preemptible_nodepool() function.
  • use_preemptible_nodepool() also accepts a hard_constraint parameter. When hard_constraint is True, the system schedules the task only on preemptible VMs. When hard_constraint is False, the system tries to schedule the task on preemptible VMs and falls back to normal VMs if no preemptible VMs are available or the preemptible VMs are busy.
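As a rough picture of what the GPU limit in the first note amounts to, the following sketch builds a container resources fragment in the style Kubernetes uses for GPUs. It is an illustration, not the kfp implementation: the function name `gpu_resource_limits` is hypothetical, and the sketch assumes that the limit is expressed as the extended resource `<vendor>.com/gpu`.

```python
def gpu_resource_limits(num_gpus, vendor="nvidia"):
    """Return a container resources fragment that limits GPUs.

    Hypothetical helper for illustration only. Assumes the vendor maps
    to an extended resource named "<vendor>.com/gpu".
    """
    if vendor not in ("nvidia", "amd"):
        raise ValueError("vendor must be 'nvidia' or 'amd'")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {"limits": {f"{vendor}.com/gpu": str(num_gpus)}}


resources = gpu_resource_limits(1, "nvidia")
print(resources)  # {'limits': {'nvidia.com/gpu': '1'}}
```

The limit on a single step should not exceed the GPU count per node that you set with --accelerator, otherwise no node in the pool can host the step.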

For example:

    import kfp.dsl as dsl
    import kfp.gcp as gcp


    class FlipCoinOp(dsl.ContainerOp):
        """Flip a coin and output heads or tails randomly."""

        def __init__(self):
            super(FlipCoinOp, self).__init__(
                name='Flip',
                image='python:alpine3.6',
                command=['sh', '-c'],
                arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                           'else \'tails\'; print(result)" | tee /tmp/output'],
                file_outputs={'output': '/tmp/output'})


    @dsl.pipeline(
        name='pipeline flip coin',
        description='Flips a coin on a preemptible GPU node pool.'
    )
    def flipcoin():
        # Request one NVIDIA GPU and schedule the step on the preemptible node pool.
        flip = FlipCoinOp().set_gpu_limit(1, 'nvidia').apply(gcp.use_preemptible_nodepool())


    if __name__ == '__main__':
        # Compile the pipeline into a package that you can upload.
        import kfp.compiler as compiler
        compiler.Compiler().compile(flipcoin, __file__ + '.zip')

Comparison with Cloud AI Platform Training service

Cloud AI Platform Training is a GCP machine learning (ML) training service that supports distributed training and hyperparameter tuning, and requires no complex GKE configuration. Cloud AI Platform Training charges the Compute Engine costs only for the runtime of the job.

The table below compares Cloud AI Platform Training with Kubeflow Pipelines running preemptible VMs or GPUs:

|               | Cloud AI Platform Training | Kubeflow Pipelines with preemption |
|---------------|----------------------------|------------------------------------|
| Configuration | No GKE configuration | Requires GKE configuration |
| Cost          | Compute Engine costs for the job lifetime | Lower price with preemptible VMs/GPUs/TPUs |
| Accelerator   | Supports various VM types, GPUs, and CPUs | Supports various VM types, GPUs, and CPUs |
| Scalability   | Automates resource provisioning and supports distributed training | Requires manual configuration such as GKE autoscaler and distributed training workflow |
| Features      | Out-of-box support for hyperparameter tuning | Do-it-yourself hyperparameter tuning with Katib |

Last modified 20.04.2021: Apply Docs Restructure to `v1.2-branch` = update `v1.2-branch` to current `master` v2 (#2612) (4e2602bd)