IBM Cloud Private for Kubeflow

Get Kubeflow running on IBM Cloud Private

This guide is a quick start to deploying Kubeflow on IBM Cloud Private 3.1.0 or later. IBM Cloud Private is an enterprise platform as a service (PaaS) layer for developing and managing on-premises, containerized applications. It is an integrated environment for managing containers that includes the container orchestrator Kubernetes, a private image registry, a management console, and monitoring frameworks.

Prerequisites

  • Get the system requirements from IBM Knowledge Center for IBM Cloud Private.

  • Set up an NFS server and export one or more paths for persistent volumes (see the sketch below).
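
A minimal sketch of an NFS server setup on a Debian/Ubuntu host; the package name, export path, and export options are examples only and may differ in your environment:

  # Install the NFS server, create an export directory, and export it.
  apt-get install -y nfs-kernel-server
  mkdir -p /export/kubeflow
  echo "/export/kubeflow *(rw,sync,no_root_squash,no_subtree_check)" >> /etc/exports
  exportfs -a
  systemctl restart nfs-kernel-server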

Installing IBM Cloud Private

Follow the installation steps in IBM Knowledge Center to install IBM Cloud Private 3.1.0 or later with master, proxy, worker, and optional management and vulnerability advisory nodes in your cluster in standard or high availability configurations.

Please note IBM Cloud Private 3.2.0 is a prerequisite for installing Kubeflow v0.7.

This guide uses IBM Cloud Private 3.1.0 as an example. After installation, you can check the IBM Cloud Private cluster nodes:

  # kubectl get node
  NAME         STATUS   ROLES               AGE   VERSION
  10.43.0.38   Ready    management          11d   v1.11.1+icp-ee
  10.43.0.39   Ready    master,etcd,proxy   11d   v1.11.1+icp-ee
  10.43.0.40   Ready    va                  11d   v1.11.1+icp-ee
  10.43.0.44   Ready    worker              11d   v1.11.1+icp-ee
  10.43.0.46   Ready    worker              11d   v1.11.1+icp-ee
  10.43.0.49   Ready    worker              11d   v1.11.1+icp-ee

Creating image policy and persistent volume

Follow these steps to create an image policy for your Kubernetes namespace and a persistent volume (PV) for your Kubeflow components:

  • Create the Kubernetes namespace.

  export K8S_NAMESPACE=kubeflow
  kubectl create namespace $K8S_NAMESPACE

  • K8S_NAMESPACE is the name of the namespace that Kubeflow will be installed in. By default it should be “kubeflow”.
  • Create an image policy for the namespace.

The image policy definition file (image-policy.yaml) is as follows:

  apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
  kind: ImagePolicy
  metadata:
    name: image-policy
  spec:
    repositories:
    - name: docker.io/*
      policy: null
    - name: k8s.gcr.io/*
      policy: null
    - name: gcr.io/*
      policy: null
    - name: ibmcom/*
      policy: null
    - name: quay.io/*
      policy: null

Create the ImagePolicy in the specified namespace:

  kubectl create -n $K8S_NAMESPACE -f image-policy.yaml
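
To confirm that the policy was created in the namespace, an optional quick check:

  kubectl -n $K8S_NAMESPACE get ImagePolicy
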
  • Create persistent volumes (PVs) for the Kubeflow components.

Some Kubeflow components, such as MinIO, MySQL, and Katib, need PVs to store data, so the PVs must be created in advance. The PV definition file (pv.yaml) is as follows:

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: kubeflow-pv1
  spec:
    capacity:
      storage: 20Gi
    accessModes:
    - ReadWriteOnce
    persistentVolumeReclaimPolicy: Retain
    nfs:
      path: ${NFS_SHARED_DIR}/pv1
      server: ${NFS_SERVER_IP}
  ---
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: kubeflow-pv2
  spec:
    capacity:
      storage: 20Gi
    accessModes:
    - ReadWriteOnce
    persistentVolumeReclaimPolicy: Retain
    nfs:
      path: ${NFS_SHARED_DIR}/pv2
      server: ${NFS_SERVER_IP}
  ---
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: kubeflow-pv3
  spec:
    capacity:
      storage: 20Gi
    accessModes:
    - ReadWriteOnce
    persistentVolumeReclaimPolicy: Retain
    nfs:
      path: ${NFS_SHARED_DIR}/pv3
      server: ${NFS_SERVER_IP}
  • NFS_SERVER_IP is the NFS server IP address. This can be the management node IP, but the management node must then support NFS mounting.
  • NFS_SHARED_DIR is the NFS shared path that can be mounted by the other nodes in the IBM Cloud Private cluster. Ensure that the sub-folders (pv1, pv2, pv3) used in the definition above exist; a sketch for creating them follows the command below. Create the PVs by running the following command:

  kubectl create -f pv.yaml
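
Before creating the PVs, make sure the exported sub-directories exist on the NFS server, and afterwards confirm that the PVs are available. A minimal sketch, assuming the export root is ${NFS_SHARED_DIR}:

  # Run on the NFS server: create the sub-folders referenced in pv.yaml.
  mkdir -p ${NFS_SHARED_DIR}/pv1 ${NFS_SHARED_DIR}/pv2 ${NFS_SHARED_DIR}/pv3
  # After creating the PVs, they should be listed with STATUS "Available".
  kubectl get pv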

Installing Kubeflow

Follow these steps to deploy Kubeflow:

  • Download the kfctl v0.7.1 release from the Kubeflow releases page and unpack it:

  tar -xvf kfctl_v0.7.1_<platform>.tar.gz
  • Run the following commands to set up and deploy Kubeflow. The code below includes an optional command to add the kfctl binary to your path. If you don’t add the binary to your path, you must use the full path to the kfctl binary each time you run it.
  # The following command is optional, to make the kfctl binary easier to use.
  export PATH=$PATH:<path to kfctl in your Kubeflow installation>
  # Set KF_NAME to the name of your Kubeflow deployment. This also becomes the
  # name of the directory containing your configuration.
  # For example, your deployment name can be 'my-kubeflow' or 'kf-test'.
  export KF_NAME=<your choice of name for the Kubeflow deployment>
  # Set the path to the base directory where you want to store one or more
  # Kubeflow deployments. For example, /opt/.
  # Then set the Kubeflow application directory for this deployment.
  export BASE_DIR=<path to a base directory>
  export KF_DIR=${BASE_DIR}/${KF_NAME}
  # Installs Istio by default. Comment out Istio components in the config file to skip Istio installation.
  export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml"
  mkdir ${KF_DIR}
  cd ${KF_DIR}
  kfctl apply -V -f ${CONFIG_URI}
  • ${KF_NAME} - the name of your Kubeflow deployment. This value also becomes the name of the directory where your Kubeflow configurations are stored. If you want a custom deployment name, specify that name here, for example, my-kubeflow or kf-test. The value of this variable cannot be greater than 25 characters. It must contain just the deployment name, not the full path to the directory.
  • Check the resources deployed in namespace kubeflow:

  kubectl -n kubeflow get all
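
The components can take several minutes to start. A minimal sketch to wait until the pods in the kubeflow namespace are ready (the timeout value is an example; completed job pods never reach the Ready condition, so a plain "kubectl -n kubeflow get pods" check also works):

  kubectl -n kubeflow wait --for=condition=Ready pod --all --timeout=600s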

Access Kubeflow dashboard

From Kubeflow v0.6, the Kubeflow Dashboard can be accessed via the istio-ingressgateway service. If a load balancer is not available in your environment, NodePort or port forwarding can be used to access the Kubeflow Dashboard. Refer to the Ingress Gateway guide.
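
For example, to use port forwarding, a minimal sketch that assumes the default istio-ingressgateway service in the istio-system namespace:

  # Forward local port 8080 to the Istio ingress gateway, then open
  # http://localhost:8080/ in a browser.
  kubectl -n istio-system port-forward svc/istio-ingressgateway 8080:80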

For Kubeflow versions lower than v0.6, access the Kubeflow Dashboard via the Ambassador service. Change the Ambassador service type to NodePort, then access the Kubeflow dashboard through Ambassador.

  kubectl -n kubeflow patch service ambassador -p '{"spec":{"type": "NodePort"}}'
  AMBASSADOR_PORT=$(kubectl -n kubeflow get service ambassador -ojsonpath='{.spec.ports[?(@.name=="ambassador")].nodePort}')

Then you can access the Kubeflow dashboard through the NodePort:

  http://${MANAGEMENT_IP}:$AMBASSADOR_PORT/

  • MANAGEMENT_IP is the management node IP.
  • AMBASSADOR_PORT is the Ambassador NodePort obtained above.

Delete Kubeflow

Set the ${CONFIG_FILE} environment variable to the path of your Kubeflow configuration file:

  export CONFIG_FILE=${KF_DIR}/kfctl_k8s_istio.0.7.1.yaml

Run the following commands to delete your deployment and reclaim all resources:

  cd ${KF_DIR}
  # If you want to delete all the resources, including storage.
  kfctl delete -f ${CONFIG_FILE} --delete_storage
  # If you want to preserve storage, which contains metadata and information
  # from Kubeflow Pipelines.
  kfctl delete -f ${CONFIG_FILE}
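
Afterwards, you can verify that the Kubeflow resources have been removed (the namespace may take a short while to terminate):

  kubectl get ns kubeflow
  kubectl get pv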

Troubleshooting

Insufficient Pods

A default installation of IBM Cloud Private configures 10 allocatable pods per core, so if you have 8 cores there will be 80 allocatable pods. However, a default IBM Cloud Private installation itself requires ~40 pods and Kubeflow requires ~45 pods, so you may hit an insufficient-pods issue.

To increase the number of pods, edit /etc/cfc/kubelet/kubelet-service-config and increase podsPerCore from 10 to 20 (or 30). Save your change and restart the kubelet:

  systemctl restart kubelet
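
If the setting appears in that file as a plain podsPerCore entry, the edit and a follow-up check can be scripted. This is only a sketch; verify the exact key and current value in your installation first:

  # Assumes the file contains a line like "podsPerCore: 10".
  sed -i 's/podsPerCore: 10/podsPerCore: 20/' /etc/cfc/kubelet/kubelet-service-config
  systemctl restart kubelet
  # Confirm the new pod capacity reported by the node.
  kubectl describe node <node-name> | grep -i pods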

Kubeflow Central Dashboard: connection reset error or slow response

This may happen when there is not enough memory (for example, if you only have 16 GB of memory). You may consider turning off some non-essential services in IBM Cloud Private:

  kubectl -n kube-system patch ds logging-elk-filebeat-ds --patch '{ "spec": { "template": { "spec": { "nodeSelector": { "switch": "down" } } } } }'
  kubectl -n kube-system patch ds nvidia-device-plugin --patch '{ "spec": { "template": { "spec": { "nodeSelector": { "switch": "down" } } } } }'
  kubectl -n kube-system patch ds audit-logging-fluentd-ds --patch '{ "spec": { "template": { "spec": { "nodeSelector": { "switch": "down" } } } } }'
  kubectl -n kube-system scale deploy logging-elk-client --replicas=0
  kubectl -n kube-system scale deploy logging-elk-kibana --replicas=0
  kubectl -n kube-system scale deploy logging-elk-logstash --replicas=0
  kubectl -n kube-system scale deploy logging-elk-master --replicas=0
  kubectl -n kube-system scale deploy secret-watcher --replicas=0
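
To restore these services later, remove the node selector and scale the deployments back up. This is only a sketch; repeat the patch for each DaemonSet you changed, and use your cluster's original replica counts:

  kubectl -n kube-system patch ds logging-elk-filebeat-ds --type json -p '[{"op": "remove", "path": "/spec/template/spec/nodeSelector/switch"}]'
  kubectl -n kube-system scale deploy logging-elk-client --replicas=1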

Deployed Service’s EXTERNAL-IP field stuck in PENDING state

If you have deployed a service of the LoadBalancer type but its EXTERNAL-IP field is stuck in the <pending> state, this is because IBM Cloud Private doesn’t provide built-in support for the LoadBalancer service type. To enable LoadBalancer services on IBM Cloud Private, please see the options described here.
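
As a temporary workaround, you can expose the service as a NodePort instead, following the same pattern used for Ambassador above (substitute your own namespace and service name):

  kubectl -n kubeflow patch service <service-name> -p '{"spec":{"type": "NodePort"}}'
  kubectl -n kubeflow get service <service-name>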

CustomResourceDefinition.apiextensions.k8s.io “profiles.kubeflow.org” is invalid during v0.7 installation

If you install Kubeflow v0.7 in IBM Cloud Private 3.1.0, you may encounter the following error:

  failed to apply: (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout904105401": CustomResourceDefinition.apiextensions.k8s.io "profiles.kubeflow.org" is invalid: spec.validation.openAPIV3Schema: Invalid value:...
  must only have "properties", "required" or "description" at the root if the status subresource is enabled

The default Kubernetes version in IBM Cloud Private 3.1.0 is 1.11, which is incompatible with Kubeflow v0.7. Please upgrade to an IBM Cloud Private release whose Kubernetes version is compatible with your Kubeflow version, as described in Minimum system requirements.
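
You can check the Kubernetes version of your cluster before installing, for example:

  kubectl version --short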

v0.7 Kubeflow pods stuck in CreateContainerConfigError state

If you install Kubeflow v0.7 in IBM Cloud Private 3.2.0, you may find pods stuck in the CreateContainerConfigError state, and describing the pod shows the following error:

  Error: container has runAsNonRoot and image will run as root

If you install IBM Cloud Private version 3.2.0 or later as a new installation, the default pod security policy is the ibm-restricted-psp policy, which is applied to all existing and newly created namespaces. You can bypass this by deploying Kubeflow with a non-root user ID. You can also use the IBM Cloud Private CLI cm commands to change the default setting. See Pod isolation for other options.
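
To see which pod security policy was applied to a failing pod, a quick check (the pod name is a placeholder):

  kubectl -n kubeflow get pod <pod-name> -o jsonpath='{.metadata.annotations.kubernetes\.io/psp}'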