Troubleshooting

Finding and fixing problems in your Kubeflow deployment

Out of date

This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1.

This page presents some hints for troubleshooting specific problems that you may encounter.

macOS: kfctl cannot be opened because the developer cannot be verified

The kfctl binary is currently not signed. That is, kfctl is not registered with Apple. When you run kfctl from the command line on the latest versions of macOS, you may see a message like this:

“kfctl” cannot be opened because the developer cannot be verified. macOS cannot verify that this app is free from malware.

To run kfctl, go to the kfctl binary file in Finder, right-click, then select Open. Then click Open again to confirm that you want to open the app.

For more information, see the macOS user guide.

TensorFlow and AVX

You may encounter a TensorFlow-related Python installation failure or pod launch failure that ends in a SIGILL (illegal instruction, core dumped). Kubeflow uses the pre-built binaries from the TensorFlow project which, beginning with version 1.6, are compiled to use the AVX CPU instruction set extensions. Some CPUs, particularly older ones, do not support AVX. Check the host environment of your node to determine whether it has this support.

Linux:

  grep -ci avx /proc/cpuinfo
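The count printed by the command above includes any line that mentions AVX, so it does not distinguish AVX from AVX2. As a minimal sketch, the helper below (has_cpu_flag is a hypothetical name, not part of Kubeflow) checks for one exact flag on the first "flags" line of an x86 Linux /proc/cpuinfo:

```shell
# has_cpu_flag FLAG [CPUINFO_FILE]
# Hypothetical helper: succeeds if FLAG appears as a whole word on the
# first "flags" line of the given file (default: /proc/cpuinfo).
has_cpu_flag() {
  flag="$1"
  cpuinfo="${2:-/proc/cpuinfo}"
  # Split the flags line into one token per line, then match exactly.
  grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -qx "$flag"
}

# Report support for the two flags discussed on this page.
for f in avx avx2; do
  if has_cpu_flag "$f"; then
    echo "$f: supported"
  else
    echo "$f: not supported"
  fi
done
```

Run this on the node itself rather than on your workstation, since the result describes the CPU of the machine where it runs.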

AVX2

Some components, such as TF Serving, require AVX2 for better performance. To ensure that the nodes support AVX2, the deployment config sets the minCpuPlatform argument.

On GCP, the deployment fails in zones (for example, us-central1-a) that do not explicitly offer Intel Haswell, even when newer platforms are available there. In that case, choose another zone or region, or change the config to a platform newer than Haswell.

RBAC clusters

If you are running on a Kubernetes cluster with RBAC enabled, you may get an error like the following when deploying Kubeflow:

  ERROR Error updating roles kubeflow-test-infra.jupyter-role: roles.rbac.authorization.k8s.io "jupyter-role" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["*"], APIGroups:["*"], Verbs:["*"]}] user=&{your-user@acme.com [system:authenticated] map[]} ownerrules=[PolicyRule{Resources:["selfsubjectaccessreviews"], APIGroups:["authorization.k8s.io"], Verbs:["create"]} PolicyRule{NonResourceURLs:["/api" "/api/*" "/apis" "/apis/*" "/healthz" "/swagger-2.0.0.pb-v1" "/swagger.json" "/swaggerapi" "/swaggerapi/*" "/version"], Verbs:["get"]}] ruleResolutionErrors=[]

This error indicates that you do not have sufficient permissions. In many cases you can resolve it by creating an appropriate ClusterRoleBinding, as follows, and then redeploying Kubeflow:

  kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=your-user@acme.com
  • Replace your-user@acme.com with the user listed in the error message.
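To avoid copying the user name by hand, you can extract it from the error text. The sketch below assumes the error has the same shape as the example above, with the user in a "user=&{NAME ..." field. The function name rbac_user_from_error and the file deploy.log are hypothetical.

```shell
# rbac_user_from_error: read an error message on stdin and print the
# user name found in its "user=&{NAME ..." field (hypothetical helper).
rbac_user_from_error() {
  sed -n 's/.*user=&{\([^ }]*\).*/\1/p'
}

# Example, assuming the deployment output was saved to deploy.log:
#   user=$(rbac_user_from_error < deploy.log)
#   kubectl create clusterrolebinding default-admin \
#     --clusterrole=cluster-admin --user="$user"
```

If the function prints nothing, the message did not match the expected shape, and you should read the user name from the error directly.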

If you’re using GKE, you may want to refer to GKE’s RBAC docs to understand how RBAC interacts with IAM on GCP.

Problems spawning Jupyter pods

This section has been moved to Jupyter Notebooks Troubleshooting Guide.

Pods stuck in Pending state

Three pods have Persistent Volume Claims (PVCs) and get stuck in the Pending state if their PVCs cannot be bound: minio, mysql, and katib-mysql. Check the status of the PVCs:

  kubectl -n ${NAMESPACE} get pvc
  • Look for the status of “Bound”.
  • PVCs in “Pending” state have not been bound to a persistent volume.
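When the namespace holds many PVCs, it can help to list only the unbound ones. The following is a sketch that assumes the default column layout of kubectl get pvc, where the second column is the status. The function name pending_pvcs is hypothetical.

```shell
# pending_pvcs NAMESPACE
# Hypothetical helper: print the names of PVCs whose STATUS column is
# "Pending". Relies on the default table output of "kubectl get pvc".
pending_pvcs() {
  kubectl -n "$1" get pvc --no-headers | awk '$2 == "Pending" { print $1 }'
}

# Example: show the events for each pending claim.
#   for pvc in $(pending_pvcs "$NAMESPACE"); do
#     kubectl -n "$NAMESPACE" describe pvc "$pvc"
#   done
```

The events at the end of the describe output usually state why the claim could not be bound, for example that no storage class is set and no matching persistent volume exists.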

If you have not configured dynamic provisioning for your cluster, including a default storage class, then you must create a persistent volume for each of the PVCs.

You can use the example below to create local persistent volumes:

  sudo mkdir /mnt/pv{1..3}
  kubectl create -f - <<EOF
  kind: PersistentVolume
  apiVersion: v1
  metadata:
    name: pv-volume1
  spec:
    storageClassName:
    capacity:
      storage: 10Gi
    accessModes:
      - ReadWriteOnce
    hostPath:
      path: "/mnt/pv1"
  ---
  kind: PersistentVolume
  apiVersion: v1
  metadata:
    name: pv-volume2
  spec:
    storageClassName:
    capacity:
      storage: 20Gi
    accessModes:
      - ReadWriteOnce
    hostPath:
      path: "/mnt/pv2"
  ---
  kind: PersistentVolume
  apiVersion: v1
  metadata:
    name: pv-volume3
  spec:
    storageClassName:
    capacity:
      storage: 20Gi
    accessModes:
      - ReadWriteOnce
    hostPath:
      path: "/mnt/pv3"
  EOF
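The three manifests differ only in name, size, and path, so they can also be generated. This is a sketch rather than the documented procedure. The function names pv_manifest and all_pv_manifests are hypothetical, and the names, sizes, and paths are copied from the example above.

```shell
# pv_manifest NAME SIZE PATH
# Hypothetical helper: print one hostPath PersistentVolume manifest.
pv_manifest() {
  cat <<EOF
kind: PersistentVolume
apiVersion: v1
metadata:
  name: $1
spec:
  storageClassName:
  capacity:
    storage: $2
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "$3"
EOF
}

# Print the three volumes from the example above as one YAML stream,
# each document preceded by a "---" separator.
all_pv_manifests() {
  i=1
  for size in 10Gi 20Gi 20Gi; do
    echo '---'
    pv_manifest "pv-volume$i" "$size" "/mnt/pv$i"
    i=$((i + 1))
  done
}

# Example: create the volumes, then confirm that they exist.
#   all_pv_manifests | kubectl create -f -
#   kubectl get pv
```

The host directories under /mnt must still be created on the node first.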

Once the persistent volumes exist, the PVCs can be bound and the three pending pods can start. You can also create the PVs before running any of the kfctl commands.

OpenShift

If you are deploying Kubeflow in an OpenShift environment, which encapsulates Kubernetes, you need to adjust the security context constraints for the ambassador and jupyter-hub service accounts so that the pods can run:

  oc adm policy add-scc-to-user anyuid -z ambassador
  oc adm policy add-scc-to-user anyuid -z jupyter-hub

Once the anyuid policy has been set, you must delete the failing pods and allow them to be recreated in the project deployment.
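One way to find the pods to delete is to list every pod that is neither running nor completed. This is a rough heuristic, sketched under the assumption that oc get pods prints its default columns with the status in the third column. The function name failing_pods is hypothetical.

```shell
# failing_pods PROJECT
# Hypothetical helper: print the names of pods whose STATUS column is
# neither "Running" nor "Completed".
failing_pods() {
  oc -n "$1" get pods --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example: review the list first, then delete the pods so that their
# deployments recreate them.
#   failing_pods "$NAMESPACE"
#   for pod in $(failing_pods "$NAMESPACE"); do
#     oc -n "$NAMESPACE" delete pod "$pod"
#   done
```

The list also includes pods that are pending for unrelated reasons, so check it before deleting anything.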

You will also need to adjust the privileges of the tf-job-operator service account for TFJobs to run. Do this in the project where you are running TFJobs:

  oc adm policy add-role-to-user cluster-admin -z tf-job-operator

403 API rate limit exceeded error

Because kubectl uses GitHub to pull Kubeflow, it quickly exhausts the API call quota for anonymous requests unless you supply a GitHub API token. To fix this issue, create a GitHub API token using this guide, then assign the token to the GITHUB_TOKEN environment variable:

  export GITHUB_TOKEN="<your-token>"
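Before deploying, a script can confirm that the variable is actually set in the current shell. The check below is a sketch; the function name require_github_token is hypothetical. The commented curl call queries GitHub's rate_limit endpoint to show the remaining quota, which should be noticeably higher for authenticated requests than for anonymous ones.

```shell
# require_github_token
# Hypothetical helper: fail with a message when GITHUB_TOKEN is unset
# or empty, so the problem surfaces before the quota is exhausted.
require_github_token() {
  if [ -z "${GITHUB_TOKEN:-}" ]; then
    echo "GITHUB_TOKEN is not set; GitHub requests will be anonymous" >&2
    return 1
  fi
}

# Optional quota check (needs curl and network access):
#   require_github_token &&
#     curl -s -H "Authorization: token $GITHUB_TOKEN" \
#       https://api.github.com/rate_limit
```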

Next steps

Visit the Kubeflow support page to find resources and community forums where you can ask for help.

Last modified 03.08.2020: Added outdated banner to non-index docs unchanged in last 30d (#2072) (e56f3650)