Customizing Kubeflow on GKE
Tailoring a GKE deployment of Kubeflow
Out of date
This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1.
This guide describes how to customize your deployment of Kubeflow on Google Kubernetes Engine (GKE) on Google Cloud.
Customizing Kubeflow before deployment
The Kubeflow deployment process is divided into two steps, build and apply, so that you can modify your configuration before deploying your Kubeflow cluster.
Follow the guide to deploying Kubeflow on Google Cloud. When you reach the setup and deploy step, skip the kfctl apply command and run the kfctl build command instead, as described in that step. You can then edit the configuration files before deploying Kubeflow.
Customizing an existing deployment
You can also customize an existing Kubeflow deployment. In that case, this guide assumes that you have already followed the guide to deploying Kubeflow on Google Cloud and have deployed Kubeflow to a GKE cluster.
Before you start
This guide assumes the following settings:
- The ${KF_DIR} environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example, /opt/my-kubeflow/.
  export KF_DIR=<path to your Kubeflow application directory>
- The ${CONFIG_FILE} environment variable contains the path to your Kubeflow configuration file.
  export CONFIG_FILE=${KF_DIR}/kfctl_gcp_iap.v1.0.2.yaml
- The ${KF_NAME} environment variable contains the name of your Kubeflow deployment. You can find the name in your ${CONFIG_FILE} configuration file, as the value for the metadata.name key.
  export KF_NAME=<the name of your Kubeflow deployment>
- The ${PROJECT} environment variable contains the ID of your Google Cloud project. You can find the project ID in your ${CONFIG_FILE} configuration file, as the value for the project key.
  export PROJECT=<your Google Cloud project ID>
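If you prefer not to copy KF_NAME and PROJECT by hand, you can script the two lookups described above. The following is a minimal sketch. It assumes the usual two-space indentation of a kfctl KfDef file, with a top-level metadata.name and a project: key in the GCP plugin spec. The function names are hypothetical helpers, not part of kfctl. For anything beyond a quick lookup, a real YAML parser is more robust.

```shell
# Hypothetical helper: print the top-level metadata.name of a KfDef file.
# Line-based scan that assumes two-space YAML indentation; not a full YAML parser.
kf_name_from_config() {
  awk '
    /^[^ #-]/ { in_meta = ($0 ~ /^metadata:/) }  # entering or leaving a top-level key
    in_meta && /^  name:/ { print $2; exit }     # direct child "name:" of metadata
  ' "$1"
}

# Hypothetical helper: print the first "project:" value found in the file.
project_from_config() {
  awk '$1 == "project:" { print $2; exit }' "$1"
}

# Only derive the variables when the configuration file actually exists.
if [ -f "${CONFIG_FILE:-}" ]; then
  export KF_NAME="$(kf_name_from_config "${CONFIG_FILE}")"
  export PROJECT="$(project_from_config "${CONFIG_FILE}")"
fi
```

A nested plugin metadata.name (for example, name: gcp under the plugin entry) is indented more deeply, so the top-level name is the one returned.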
For further background about the above settings, see the guide to deploying Kubeflow with the CLI.
Customizing Google Cloud resources
To customize Google Cloud resources, such as your Kubernetes Engine cluster, you can modify the Deployment Manager configuration settings in ${KF_DIR}/gcp_config.
After modifying your existing configuration, run the following command to apply the changes:
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_FILE}
Alternatively, you can use Deployment Manager directly:
cd ${KF_DIR}/gcp_config
gcloud deployment-manager --project=${PROJECT} deployments update ${KF_NAME} --config=cluster-kubeflow.yaml
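Before running the update above, you may want to see what Deployment Manager would change. The gcloud deployment-manager deployments update command accepts a --preview flag that stages the change without applying it. Running update again without --config commits the previewed change, and deployments cancel-preview discards it. The sketch below wraps those three commands in hypothetical helper functions, which are not part of kfctl or gcloud. As an assumed convention, the functions only print the commands when DRY_RUN=1.

```shell
# Hypothetical helper: print a command instead of running it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Stage the gcp_config change as a Deployment Manager preview; nothing is applied yet.
preview_gcp_config() {
  ( cd "${KF_DIR}/gcp_config" &&
    run gcloud deployment-manager --project="${PROJECT}" deployments update "${KF_NAME}" \
      --config=cluster-kubeflow.yaml --preview )
}

# Commit the staged preview: update again, this time without --config.
commit_gcp_config_preview() {
  run gcloud deployment-manager --project="${PROJECT}" deployments update "${KF_NAME}"
}

# Discard the staged preview instead of applying it.
cancel_gcp_config_preview() {
  run gcloud deployment-manager --project="${PROJECT}" deployments cancel-preview "${KF_NAME}"
}
```

A typical sequence is to run preview_gcp_config, review the listed changes, and then run either commit_gcp_config_preview or cancel_gcp_config_preview.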
Some changes (such as the VM service account for Kubernetes Engine) can only be set at creation time; in this case you need to tear down your deployment before recreating it:
cd ${KF_DIR}
kfctl delete -f ${CONFIG_FILE}
kfctl apply -V -f ${CONFIG_FILE}
Customizing Kubernetes resources
You can use kustomize to customize Kubeflow. Make sure that you have the minimum required version of kustomize: 2.0.3 or later. For more information about kustomize in Kubeflow, see how Kubeflow uses kustomize.
To customize the Kubernetes resources running within the cluster, you can modify the kustomize manifests in ${KF_DIR}/kustomize.
For example, to modify settings for the Jupyter web app:
Open ${KF_DIR}/kustomize/jupyter-web-app.yaml in a text editor, then find and replace the parameter values:
apiVersion: v1
data:
  ROK_SECRET_NAME: secret-rok-{username}
  UI: default
  clusterDomain: cluster.local
  policy: Always
  prefix: jupyter
kind: ConfigMap
metadata:
  labels:
    app: jupyter-web-app
    kustomize.component: jupyter-web-app
  name: jupyter-web-app-parameters
  namespace: kubeflow
Redeploy Kubeflow using kfctl:
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_FILE}
Or use kubectl directly:
cd ${KF_DIR}/kustomize
kubectl apply -f jupyter-web-app.yaml
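Before you apply an edited manifest, kubectl diff shows what would change in the cluster. It exits with status 0 when there are no differences and 1 when there are, so the sketch below treats only higher exit codes as errors. The function name and the DRY_RUN=1 print-only mode are hypothetical conveniences, not part of Kubeflow.

```shell
# Hypothetical helper: print a command instead of running it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Diff an edited manifest against the live cluster, then apply it.
apply_kustomize_manifest() {
  manifest="${1:-jupyter-web-app.yaml}"
  (
    cd "${KF_DIR}/kustomize" || exit 1
    run kubectl diff -f "${manifest}"
    # kubectl diff: 0 = no changes, 1 = changes found, >1 = error.
    [ "$?" -le 1 ] || exit 1
    run kubectl apply -f "${manifest}"
  )
}
```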
Common customizations
Add users to Kubeflow
You must grant each user the minimal permission scope that allows them to connect to the Kubernetes cluster.
For Google Cloud, you should grant the following Cloud Identity and Access Management (IAM) roles.
In the following commands, replace [PROJECT] with your Google Cloud project ID and replace [EMAIL] with the user’s email address:
To access the Kubernetes cluster, the user needs the Kubernetes Engine Cluster Viewer role:
gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/container.clusterViewer
To access the Kubeflow UI through IAP, the user needs the IAP-secured Web App User role:
gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/iap.httpsResourceAccessor
Note: you need to grant the user the IAP-secured Web App User role even if the user is already an owner or editor of the project. The IAP-secured Web App User role is not implied by the Project Owner or Project Editor roles.
To be able to run gcloud container clusters get-credentials and see logs in Cloud Logging (formerly Stackdriver), the user needs viewer access on the project:
gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/viewer
Alternatively, you can also grant these roles on the IAM page in the Cloud Console. Make sure you are in the same project as your Kubeflow deployment.
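To onboard several users at once, you can grant the three role bindings above in a loop. This is a minimal sketch. The name grant_kubeflow_access is a hypothetical helper, and DRY_RUN=1 is an assumed convention that prints each gcloud command instead of running it.

```shell
# Hypothetical helper: print a command instead of running it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Grant the Kubeflow user roles listed above to one or more users.
# Usage: grant_kubeflow_access PROJECT EMAIL [EMAIL...]
grant_kubeflow_access() {
  project="$1"
  shift
  for email in "$@"; do
    for role in roles/container.clusterViewer roles/iap.httpsResourceAccessor roles/viewer; do
      run gcloud projects add-iam-policy-binding "${project}" \
        --member="user:${email}" --role="${role}" || return 1
    done
  done
}
```

For example, DRY_RUN=1 grant_kubeflow_access my-project alice@example.com bob@example.com prints six commands, one for each combination of user and role.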
Add GPU nodes to your cluster
To add GPU accelerators to your Kubeflow cluster, you have the following options:
- Pick a Google Cloud zone that provides NVIDIA Tesla K80 accelerators (nvidia-tesla-k80).
- Or disable node-autoprovisioning in your Kubeflow cluster.
- Or change your node-autoprovisioning configuration.
To see which accelerators are available in each zone, run the following command:
gcloud compute accelerator-types list
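If you already know which accelerator you need, you can narrow the listing to the zones that offer it. The sketch below uses the standard gcloud --filter and --format flags. The helper name is hypothetical, and DRY_RUN=1 only prints the command.

```shell
# Hypothetical helper: print a command instead of running it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# List the zones (short names) that offer a given accelerator type.
zones_with_accelerator() {
  run gcloud compute accelerator-types list \
    --filter="name=${1:-nvidia-tesla-k80}" \
    --format="value(zone.basename())"
}
```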
To disable node-autoprovisioning, run kfctl build as described above. Then edit ${KF_DIR}/gcp_config/cluster-kubeflow.yaml and set enabled to false:
...
gpu-type: nvidia-tesla-k80
autoprovisioning-config:
  enabled: false
...
You must also set gpu-pool-initialNodeCount.
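Editing these values by hand works fine. If you script your setup, a small line-based helper can set a scalar property in cluster-kubeflow.yaml. This is a hypothetical sketch, not part of kfctl. It rewrites every line whose key matches and keeps the original indentation, so check that the key is unique in the file before you use it. A generic key such as enabled may not be unique.

```shell
# Hypothetical helper: set "KEY: VALUE" on every line whose first field is "KEY:".
# Line-based edit, not a YAML parser; indentation is preserved.
set_dm_property() {
  file="$1"; key="$2"; value="$3"
  tmp="$(mktemp)" || return 1
  awk -v key="${key}:" -v value="${value}" '
    $1 == key { sub(/[^ ].*/, key " " value) }   # replace from the first non-space character
    { print }
  ' "${file}" > "${tmp}" && cat "${tmp}" > "${file}"
  rm -f "${tmp}"
}
```

For example: set_dm_property "${KF_DIR}/gcp_config/cluster-kubeflow.yaml" gpu-pool-initialNodeCount 1.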
Add a GPU node pool to an existing Kubeflow cluster
You can add a GPU node pool to your Kubeflow cluster using the following command (replace us-central1-a with the zone of your cluster):
export GPU_POOL_NAME=<name of the new gpu pool>
gcloud container node-pools create ${GPU_POOL_NAME} \
--accelerator type=nvidia-tesla-k80,count=1 \
--zone us-central1-a --cluster ${KF_NAME} \
--num-nodes=1 --machine-type=n1-standard-4 --min-nodes=0 --max-nodes=5 --enable-autoscaling
After adding GPU nodes to your cluster, you need to install NVIDIA’s device drivers to the nodes. Google provides a DaemonSet that automatically installs the drivers for you.
To deploy the installation DaemonSet, run the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
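To check that the new nodes joined the cluster and that the drivers were installed, you can query three things: the node pool label that GKE sets on each node (cloud.google.com/gke-nodepool), the driver installer pods, and the nvidia.com/gpu allocatable resource. The k8s-app=nvidia-driver-installer label is assumed from the upstream DaemonSet manifest. The helper name and the DRY_RUN=1 print-only mode are hypothetical.

```shell
# Hypothetical helper: print a command instead of running it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Inspect a GPU node pool after the driver DaemonSet has been deployed.
check_gpu_nodes() {
  pool="${1:-${GPU_POOL_NAME}}"
  # GKE labels every node with the name of its node pool.
  run kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}"
  # Pods created by the driver installer DaemonSet (label assumed from the upstream manifest).
  run kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer
  # GPUs each node advertises to the scheduler.
  run kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

The GPU column typically shows <none> for a node until the driver installer has finished on it.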
Add Cloud TPUs to your cluster
Set enable_tpu: true in ${KF_DIR}/gcp_config/cluster-kubeflow.yaml.
Specify a minimum CPU
Certain instruction sets and hardware features are only available on specific CPUs. To ensure that your cluster uses the appropriate hardware, you need to set a minimum CPU platform.
In brief, inside gcp_config/cluster.jinja, change the minCpuPlatform property for the CPU node pool, for example from Intel Broadwell to Intel Skylake. You must set the minimum CPU platform when the cluster or node is created; it cannot be applied to an existing cluster or node.
More detailed instructions follow.
Choose a zone you want to deploy in that has your required CPU. Zones are listed in the Regions and Zones documentation.
Deploy Kubeflow normally as specified in the “Deploy using CLI” documentation, but stop at section “Set up and deploy Kubeflow”. Instead, navigate to section “Alternatively, set up your configuration for later deployment”. Then follow the steps until you are instructed to edit configuration files.
Navigate to the gcp_config directory and open the cluster.jinja file. Change the cluster property minCpuPlatform, for example from Intel Broadwell to Intel Skylake. Note: there are two minCpuPlatform properties in the file, and one of them is for GPU node pools. Not all CPU/GPU combinations are compatible, so leave the GPU minCpuPlatform property untouched.
Follow the remaining steps of “Alternatively, set up your configuration for later deployment”.
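Two quick checks can help with these steps. First, gcloud compute zones describe reports the CPU platforms a zone offers in its availableCpuPlatforms field, which helps when choosing a zone. Second, grep shows both minCpuPlatform lines in cluster.jinja, so you can confirm that you edited the CPU pool rather than the GPU pool. Both helper names are hypothetical, and DRY_RUN=1 only prints the gcloud command.

```shell
# Hypothetical helper: print a command instead of running it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Show the CPU platforms available in a zone.
zone_cpu_platforms() {
  run gcloud compute zones describe "$1" --format="value(availableCpuPlatforms)"
}

# Print every minCpuPlatform line, with line numbers, from cluster.jinja.
min_cpu_platform_lines() {
  grep -n 'minCpuPlatform' "${1:-${KF_DIR}/gcp_config/cluster.jinja}"
}
```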
Add VMs with more CPUs or RAM
- Change the machineType.
- There are two node pools defined in the Google Cloud Deployment Manager configuration:
  - one for CPU-only machines, in cluster.jinja.
  - one for GPU machines, in cluster.jinja.
- When making changes to the node pools, you also need to bump the pool-version in cluster-kubeflow.yaml before you update the deployment.
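You can also script the pool-version bump. The sketch below assumes the value is a prefix followed by a number; for example, v1 becomes v2. That format is an assumption, and the helper leaves any value it does not recognize unchanged. The helper name is hypothetical.

```shell
# Hypothetical helper: increment the trailing number of the pool-version value(s).
# Assumes values like "v1"; values without a trailing number are left unchanged.
bump_pool_version() {
  file="${1:-${KF_DIR}/gcp_config/cluster-kubeflow.yaml}"
  tmp="$(mktemp)" || return 1
  awk '
    $1 == "pool-version:" && match($2, /[0-9]+$/) {
      prefix = substr($2, 1, RSTART - 1)
      n = substr($2, RSTART) + 1
      sub(/[^ ].*/, "pool-version: " prefix n)   # keep the original indentation
    }
    { print }
  ' "${file}" > "${tmp}" && cat "${tmp}" > "${file}"
  rm -f "${tmp}"
}
```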
More customizations
Refer to the navigation panel on the left of these docs for more customizations, including using your own domain, setting up Cloud Filestore, and more.
Last modified 20.04.2021: Apply Docs Restructure to `v1.2-branch` = update `v1.2-branch` to current `master` v2 (#2612) (4e2602bd)