Hyperparameter Tuning (Katib)
Using Katib to tune your model’s hyperparameters on Kubernetes
The Katib project is inspired by Google Vizier. Katib is a scalable and flexible hyperparameter tuning framework and is tightly integrated with Kubernetes. It does not depend on any specific deep learning framework (such as TensorFlow, MXNet, or PyTorch).
Installing Katib
To run Katib jobs, you must install the required packages as shown in this section. You can do so by following the Kubeflow deployment guide, or by installing Katib directly from its repository:
git clone https://github.com/kubeflow/katib
./katib/scripts/v1alpha2/deploy.sh
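After the deploy script finishes, you can check that the Katib components have started. This is a quick sanity check, assuming Katib was installed into the kubeflow namespace (adjust the namespace if you deployed it elsewhere):
kubectl -n kubeflow get pods | grep katib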
Persistent Volumes
If you want to use Katib outside Google Kubernetes Engine (GKE) and you don’t have a StorageClass for dynamic volume provisioning in your cluster, you must create a persistent volume (PV) to bind your persistent volume claim (PVC).
This is the YAML file for a PV:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib
After deploying the Katib package, run the following command to create the PV:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml
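You can then confirm that Katib's database claim is bound to the new volume. This assumes the claim created by the Katib deployment is named katib-mysql, matching the PV above:
kubectl -n kubeflow get pvc katib-mysql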
Running examples
After deploying everything, you can run some examples.
Example using random algorithm
You can create an Experiment for Katib by defining an Experiment config file. See the random algorithm example.
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/random-example.yaml
Running this command launches an Experiment. It runs a series of training jobs to train models using different hyperparameters and saves the results.
The configurations for the experiment (hyperparameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on) are defined in random-example.yaml.
In this demo, hyperparameters are embedded as args. You can embed hyperparameters in another way (for example, as environment variables) by using the template defined in TrialTemplate.GoTemplate.RawTemplate, which is written in Go template format.
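For reference, a raw trial template looks roughly like the following sketch. It is not the exact template used by the random example (the image and training command here are placeholders); it only illustrates how Katib substitutes the trial name, namespace, and hyperparameters into a Kubernetes Job spec:
apiVersion: batch/v1
kind: Job
metadata:
  name: {{.Trial}}
  namespace: {{.NameSpace}}
spec:
  template:
    spec:
      containers:
      - name: {{.Trial}}
        # Placeholder image and command; replace with your training container.
        image: <your-training-image>
        command:
        - "python"
        - "train.py"
        # Each suggested hyperparameter is rendered as an extra argument.
        {{- with .HyperParameters}}
        {{- range .}}
        - "{{.Name}}={{.Value}}"
        {{- end}}
        {{- end}}
      restartPolicy: Never
Because the hyperparameters are appended as command-line arguments, the values suggested by Katib reach the training code directly.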
This demo randomly generates 3 hyperparameters:
- Learning rate (--lr) - type: double
- Number of NN layers (--num-layers) - type: int
- Optimizer (--optimizer) - type: categorical
Check the experiment status:
$ kubectl -n kubeflow describe experiment random-example
Name:         random-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         Experiment
Metadata:
  Creation Timestamp:  2019-01-18T16:30:46Z
  Finalizers:
    clean-data-in-db
  Generation:        5
  Resource Version:  1777650
  Self Link:         /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/experiments/random-example
  UID:               687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
Spec:
  Algorithm:
    Algorithm Name:  random
    Algorithm Settings:
  Max Failed Trial Count:  3
  Max Trial Count:         100
  Objective:
    Additional Metric Names:
      accuracy
    Goal:                   0.99
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parallel Trial Count:  10
  Parameters:
    Feasible Space:
      Max:           0.03
      Min:           0.01
    Name:            --lr
    Parameter Type:  double
    Feasible Space:
      Max:           5
      Min:           2
    Name:            --num-layers
    Parameter Type:  int
    Feasible Space:
      List:
        sgd
        adam
        ftrl
    Name:            --optimizer
    Parameter Type:  categorical
  Trial Template:
    Go Template:
      Template Spec:
        Config Map Name:       trial-template
        Config Map Namespace:  kubeflow
        Template Path:         mnist-trial-template
Status:
  Completion Time:  2019-06-20T00:12:07Z
  Conditions:
    Last Transition Time:  2019-06-19T23:20:56Z
    Last Update Time:      2019-06-19T23:20:56Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2019-06-20T00:12:07Z
    Last Update Time:      2019-06-20T00:12:07Z
    Message:               Experiment is running
    Reason:                ExperimentRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2019-06-20T00:12:07Z
    Last Update Time:      2019-06-20T00:12:07Z
    Message:               Experiment has succeeded because max trial count has reached
    Reason:                ExperimentSucceeded
    Status:                True
    Type:                  Succeeded
  Current Optimal Trial:
    Observation:
      Metrics:
        Name:   Validation-accuracy
        Value:  0.982483983039856
    Parameter Assignments:
      Name:   --lr
      Value:  0.026666666666666665
      Name:   --num-layers
      Value:  2
      Name:   --optimizer
      Value:  sgd
  Start Time:        2019-06-19T23:20:55Z
  Trials:            100
  Trials Succeeded:  100
Events:  <none>
The demo starts an experiment and runs a series of trial jobs with different parameters. When the experiment's status condition changes to Succeeded (as shown in the Status output above), the experiment is finished.
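You can also list the individual trial resources that the experiment created (this lists every trial in the namespace, not just those of random-example):
kubectl -n kubeflow get trials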
TensorFlow operator example
To run the TensorFlow operator example, you must first create a volume for the TensorFlow event files.
If you are using GKE and the default StorageClass, you must create this PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
If you are not using GKE and you don’t have a StorageClass for dynamic volume provisioning in your cluster, you must create both a PVC and a PV:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml
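Either way, check that the tfevent-volume claim is bound before you start the experiment:
kubectl -n kubeflow get pvc tfevent-volume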
Now you can run the TensorFlow operator example:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfjob-example.yaml
You can check the status of the experiment:
kubectl -n kubeflow describe experiment tfjob-example
PyTorch example
This is an example for the PyTorch operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/pytorchjob-example.yaml
You can check the status of the experiment:
kubectl -n kubeflow describe experiment pytorchjob-example
Monitoring results
You can monitor your results in the Katib UI. If you installed Kubeflow using the deployment guide, you can access the Katib UI at:
https://<your kubeflow endpoint>/katib/
For example, if you deployed Kubeflow on GKE, your endpoint would be
https://<deployment_name>.endpoints.<project>.cloud.goog/
Otherwise, you can set up port forwarding for the Katib UI service:
kubectl port-forward svc/katib-ui -n kubeflow 8080:80
Now you can access the Katib UI at this URL: http://localhost:8080/katib/
Cleanup
Delete the installed components:
./katib/scripts/v1alpha2/undeploy.sh
If you created a PV for Katib, delete it:
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml
If you created a PV and a PVC for the TensorFlow operator, delete them:
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml
Metrics collector
Katib has a metrics collector to take metrics from each trial. Katib collects metrics from the stdout of each trial. Metrics should be printed in the following format: {metrics name}={value}. For example, when your objective value name is loss and the metrics are recall and precision, your training container should print output like this:
epoch 1:
loss=0.3
recall=0.5
precision=0.4
epoch 2:
loss=0.2
recall=0.55
precision=0.5
Katib periodically launches CronJobs to collect metrics from pods.
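If you want to see these collectors, list the CronJobs in the namespace where your trials run (shown here for the kubeflow namespace):
kubectl -n kubeflow get cronjobs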