Using Preemptible VMs and GPUs on GCP
Configuring preemptible VMs and GPUs for Kubeflow Pipelines on GCP
This document describes how to configure preemptible virtual machines (preemptible VMs) and GPUs on preemptible VM instances (preemptible GPUs) for your workflows running on Kubeflow Pipelines on Google Cloud Platform (GCP).
Introduction
Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. The pricing of preemptible VMs is lower than that of standard Compute Engine VMs.
GPUs attached to preemptible instances (preemptible GPUs) work like normal GPUs but persist only for the life of the instance.
Using preemptible VMs and GPUs can reduce costs on GCP. In addition to using preemptible VMs, your Google Kubernetes Engine (GKE) cluster can autoscale based on current workloads.
This guide assumes that you have already deployed Kubeflow Pipelines. If not, follow the guide to deploying Kubeflow on GCP.
Using preemptible VMs with Kubeflow Pipelines
In summary, the steps to schedule a pipeline to run on preemptible VMs are as follows:
- Create a node pool in your cluster that contains preemptible VMs.
- Configure your pipelines to run on the preemptible VMs.
The following sections contain more detail about the above steps.
1. Create a node pool with preemptible VMs
Use the gcloud command to create a node pool. The following example includes placeholders to illustrate the important configurations:
gcloud container node-pools create PREEMPTIBLE_CPU_POOL \
--cluster=CLUSTER_NAME \
--enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com
Where:
- PREEMPTIBLE_CPU_POOL is the name of the node pool.
- CLUSTER_NAME is the name of the GKE cluster.
- MAX_NODES and MIN_NODES are the maximum and minimum number of nodes for the GKE autoscaling functionality.
- DEPLOYMENT_NAME is the name of your Kubeflow deployment. If you used the CLI to deploy Kubeflow, this name is the value of the ${KF_NAME} environment variable. If you used the deployment UI, this name is the value you specified as the deployment name.
- PROJECT_NAME is the name of your GCP project.
Below is an example of the command:
gcloud container node-pools create preemptible-cpu-pool \
--cluster=user-4-18 \
--enable-autoscaling --max-nodes=4 --min-nodes=0 \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com
2. Schedule your pipeline to run on the preemptible VMs
After configuring a node pool with preemptible VMs, you must configure your pipelines to run on the preemptible VMs.
In the DSL code for your pipeline, add the following to the ContainerOp instance:
.apply(gcp.use_preemptible_nodepool())
The above function works for both methods of generating the ContainerOp:
- The ContainerOp generated from kfp.components.func_to_container_op.
- The ContainerOp generated from the task factory function, which is loaded by components.load_component_from_url.
Note:
- Call .set_retry(#NUM_RETRY) on your ContainerOp to retry the task after the task is preempted.
- If you modified the node taint when creating the node pool, pass the same node toleration to the use_preemptible_nodepool() function.
- use_preemptible_nodepool() also accepts a parameter hard_constraint. When hard_constraint is True, the system schedules the task only on preemptible VMs. When hard_constraint is False, the system tries to schedule the task on preemptible VMs. If it cannot find preemptible VMs, or the preemptible VMs are busy, the system schedules the task on normal VMs.
For example:
import kfp.dsl as dsl
import kfp.gcp as gcp


class FlipCoinOp(dsl.ContainerOp):
    """Flip a coin and output heads or tails randomly."""

    def __init__(self):
        super(FlipCoinOp, self).__init__(
            name='Flip',
            image='python:alpine3.6',
            command=['sh', '-c'],
            arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                       'else \'tails\'; print(result)" | tee /tmp/output'],
            file_outputs={'output': '/tmp/output'})


@dsl.pipeline(
    name='pipeline flip coin',
    description='shows how to use dsl.Condition.'
)
def flipcoin():
    # Schedule the task on the preemptible node pool.
    flip = FlipCoinOp().apply(gcp.use_preemptible_nodepool())


if __name__ == '__main__':
    import kfp.compiler as compiler
    # Compile the pipeline into a package next to this file.
    compiler.Compiler().compile(flipcoin, __file__ + '.zip')
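The difference between the two hard_constraint settings can be pictured in terms of plain Kubernetes scheduling fields. The following is a minimal sketch using only the Python standard library, not the SDK's actual implementation. It assumes that preemptible GKE nodes carry the node label cloud.google.com/gke-preemptible=true and that the node pool uses the taint preemptible=true:NoSchedule from step 1. The function name preemptible_scheduling and the weight of 50 are illustrative choices.

```python
from typing import Any, Dict

# Assumption: GKE labels preemptible nodes with this key set to "true".
PREEMPTIBLE_LABEL = "cloud.google.com/gke-preemptible"


def preemptible_scheduling(hard_constraint: bool = False) -> Dict[str, Any]:
    """Build pod-spec fields that steer a task toward preemptible nodes.

    Hypothetical helper for illustration; it returns plain dictionaries
    shaped like the Kubernetes pod spec fields of the same names.
    """
    # Toleration that matches the taint preemptible=true:NoSchedule.
    toleration = {
        "key": "preemptible",
        "operator": "Equal",
        "value": "true",
        "effect": "NoSchedule",
    }
    # Selector term that matches nodes labelled as preemptible.
    term = {
        "matchExpressions": [
            {"key": PREEMPTIBLE_LABEL, "operator": "In", "values": ["true"]}
        ]
    }
    if hard_constraint:
        # Required affinity: the pod stays pending until a preemptible
        # node is available.
        node_affinity = {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [term]
            }
        }
    else:
        # Preferred affinity: the scheduler favours preemptible nodes but
        # may fall back to normal nodes. The weight (1-100) is illustrative.
        node_affinity = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 50, "preference": term}
            ]
        }
    return {
        "tolerations": [toleration],
        "affinity": {"nodeAffinity": node_affinity},
    }
```

In both cases the toleration is the same. Only the node affinity changes from a preference to a requirement, which is why a task with a hard constraint waits for a preemptible node instead of falling back to a normal one.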
Using preemptible GPUs with Kubeflow Pipelines
This guide assumes that you have already deployed Kubeflow Pipelines. In summary, the steps to schedule a pipeline to run with preemptible GPUs are as follows:
- Make sure you have enough GPU quota.
- Create a node pool in your GKE cluster that contains preemptible VMs with preemptible GPUs.
- Configure your pipelines to run on the preemptible VMs with preemptible GPUs.
The following sections contain more detail about the above steps.
1. Make sure you have enough GPU quota
Add GPU quota to your GCP project. The GCP documentation lists the availability of GPUs across regions. To check the available quota for resources in your project, go to the Quotas page in the GCP Console.
2. Create a node pool of preemptible VMs with preemptible GPUs
Use the gcloud command to create a node pool. The following example includes placeholders to illustrate the important configurations:
gcloud container node-pools create PREEMPTIBLE_GPU_POOL \
--cluster=CLUSTER_NAME \
--enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com \
--accelerator=type=GPU_TYPE,count=GPU_COUNT
Where:
- PREEMPTIBLE_GPU_POOL is the name of the node pool.
- CLUSTER_NAME is the name of the GKE cluster.
- MAX_NODES and MIN_NODES are the maximum and minimum number of nodes for the GKE autoscaling functionality.
- DEPLOYMENT_NAME is the name of your Kubeflow deployment. If you used the CLI to deploy Kubeflow, this name is the value of the ${KF_NAME} environment variable. If you used the deployment UI, this name is the value you specified as the deployment name.
- PROJECT_NAME is the name of your GCP project.
- GPU_TYPE is the type of GPU.
- GPU_COUNT is the number of GPUs.
Below is an example of the command:
gcloud container node-pools create preemptible-gpu-pool \
--cluster=user-4-18 \
--enable-autoscaling --max-nodes=4 --min-nodes=0 \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com \
--accelerator=type=nvidia-tesla-t4,count=2
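The CPU and GPU node-pool commands in this guide share every flag except --accelerator. If you create pools from a script, a small helper can assemble the argument list for either case. The sketch below is illustrative only: the function name node_pool_command and its default values are assumptions, while the flags themselves are exactly those shown in the commands above. It builds the command without running it.

```python
import shlex
from typing import List, Optional


def node_pool_command(
    pool_name: str,
    cluster: str,
    deployment: str,
    project: str,
    min_nodes: int = 0,
    max_nodes: int = 4,
    gpu_type: Optional[str] = None,
    gpu_count: int = 1,
    taint: str = "preemptible=true:NoSchedule",
) -> List[str]:
    """Return the gcloud argument list for a preemptible node pool.

    Hypothetical helper; it only formats the flags used in this guide.
    """
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    args = [
        "gcloud", "container", "node-pools", "create", pool_name,
        f"--cluster={cluster}",
        "--enable-autoscaling",
        f"--max-nodes={max_nodes}",
        f"--min-nodes={min_nodes}",
        "--preemptible",
        f"--node-taints={taint}",
        f"--service-account={deployment}-vm@{project}.iam.gserviceaccount.com",
    ]
    if gpu_type is not None:
        # Only GPU pools carry the accelerator flag.
        args.append(f"--accelerator=type={gpu_type},count={gpu_count}")
    return args


if __name__ == "__main__":
    cmd = node_pool_command(
        "preemptible-gpu-pool", "user-4-18", "user-4-18",
        "ml-pipeline-project", gpu_type="nvidia-tesla-t4", gpu_count=2,
    )
    # Print a shell-safe rendering for review before running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list instead of a single string lets you pass the result to subprocess.run without shell quoting concerns, once you have reviewed the printed command.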
3. Schedule your pipeline to run on the preemptible VMs with preemptible GPUs
In the DSL code for your pipeline, add the following to the ContainerOp instance:
.apply(gcp.use_preemptible_nodepool())
The above function works for both methods of generating the ContainerOp:
- The ContainerOp generated from kfp.components.func_to_container_op.
- The ContainerOp generated from the task factory function, which is loaded by components.load_component_from_url.
Note:
- Call .set_gpu_limit(#NUM_GPUs, GPU_VENDOR) on your ContainerOp to specify the GPU limit (for example, 1) and vendor (for example, 'nvidia').
- Call .set_retry(#NUM_RETRY) on your ContainerOp to retry the task after the task is preempted.
- If you modified the node taint when creating the node pool, pass the same node toleration to the use_preemptible_nodepool() function.
- use_preemptible_nodepool() also accepts a parameter hard_constraint. When hard_constraint is True, the system schedules the task only on preemptible VMs. When hard_constraint is False, the system tries to schedule the task on preemptible VMs. If it cannot find preemptible VMs, or the preemptible VMs are busy, the system schedules the task on normal VMs.
For example:
import kfp.dsl as dsl
import kfp.gcp as gcp


class FlipCoinOp(dsl.ContainerOp):
    """Flip a coin and output heads or tails randomly."""

    def __init__(self):
        super(FlipCoinOp, self).__init__(
            name='Flip',
            image='python:alpine3.6',
            command=['sh', '-c'],
            arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                       'else \'tails\'; print(result)" | tee /tmp/output'],
            file_outputs={'output': '/tmp/output'})


@dsl.pipeline(
    name='pipeline flip coin',
    description='shows how to use dsl.Condition.'
)
def flipcoin():
    # Request one NVIDIA GPU and schedule the task on the preemptible node pool.
    flip = FlipCoinOp().set_gpu_limit(1, 'nvidia').apply(gcp.use_preemptible_nodepool())


if __name__ == '__main__':
    import kfp.compiler as compiler
    # Compile the pipeline into a package next to this file.
    compiler.Compiler().compile(flipcoin, __file__ + '.zip')
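The note about modified node taints means that the toleration must mirror whatever you passed to --node-taints. As a rough illustration, the sketch below converts a taint string in the key=value:Effect form used in this guide into the matching toleration fields. The helper name toleration_for_taint and the example taint gpu-pool=spot:NoSchedule are hypothetical, and the result is a plain dictionary, not an SDK object.

```python
from typing import Dict


def toleration_for_taint(taint: str) -> Dict[str, str]:
    """Convert a taint string 'key=value:Effect' into toleration fields.

    Hypothetical helper for illustration; it performs only basic
    validation of the string format.
    """
    # Split off the effect first, then the key and value.
    key_value, colon, effect = taint.rpartition(":")
    key, equals, value = key_value.partition("=")
    if not colon or not equals or not key or not effect:
        raise ValueError(f"expected 'key=value:Effect', got {taint!r}")
    # An Equal toleration matches a taint with the same key, value, and effect.
    return {"key": key, "operator": "Equal", "value": value, "effect": effect}


if __name__ == "__main__":
    # The taint used throughout this guide.
    print(toleration_for_taint("preemptible=true:NoSchedule"))
    # A made-up custom taint, to show that the toleration follows the taint.
    print(toleration_for_taint("gpu-pool=spot:NoSchedule"))
```

With the SDK, these four fields would correspond to the toleration object that you pass to use_preemptible_nodepool(). Check the function signature in your installed SDK version before relying on this mapping.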
Comparison with Cloud AI Platform Training service
Cloud AI Platform Training is a GCP machine learning (ML) training service that supports distributed training and hyperparameter tuning, and requires no complex GKE configuration. Cloud AI Platform Training charges the Compute Engine costs only for the runtime of the job.
The table below compares Cloud AI Platform Training with Kubeflow Pipelines running preemptible VMs or GPUs:
Aspect | Cloud AI Platform Training | Kubeflow Pipelines with preemption |
---|---|---|
Configuration | No GKE configuration | Requires GKE configuration |
Cost | Compute Engine costs for the job lifetime | Lower price with preemptible VMs/GPUs/TPUs |
Accelerator | Supports various VM types, GPUs, and CPUs | Supports various VM types, GPUs, and CPUs |
Scalability | Automates resource provisioning and supports distributed training | Requires manual configuration such as GKE autoscaler and distributed training workflow |
Features | Out-of-box support for hyperparameter tuning | Do-it-yourself hyperparameter tuning with Katib |
Next steps
- Explore further options for customizing Kubeflow on GCP.
- See how to build pipelines with the SDK.
Last modified 07.01.2020: Moved GCP-specific docs to GCP pipelines section. (#1498) (ab7ae7f9)