Running an experiment
How to configure and run a hyperparameter tuning or neural architecture search experiment in Katib
This page describes in detail how to configure and run a Katib experiment. The experiment can perform hyperparameter tuning or a neural architecture search (NAS) (alpha), depending on the configuration settings.
For an overview of the concepts involved, read the introduction to Katib.
Packaging your training code in a container image
Katib and Kubeflow are Kubernetes-based systems. To use Katib, you must package your training code in a Docker container image and make the image available in a registry. See the Docker documentation and the Kubernetes documentation.
Configuring the experiment
To create a hyperparameter tuning or NAS experiment in Katib, you define the experiment in a YAML configuration file. The YAML file defines the range of potential values (the search space) for the parameters that you want to optimize, the objective metric to use when determining optimal values, the search algorithm to use during optimization, and other configurations.
See the YAML file for the random algorithm example.
The list below describes the fields in the YAML file for an experiment. The Katib UI offers the corresponding fields. You can choose to configure and run the experiment from the UI or from the command line.
Configuration spec
These are the fields in the experiment configuration spec:
- parameters: The range of the hyperparameters or other parameters that you want to tune for your ML model. The parameters define the search space, also known as the feasible set or the solution space. In this section of the spec, you define the name and the distribution (discrete or continuous) of every hyperparameter that you need to search. For example, you may provide a minimum and maximum value or a list of allowed values for each hyperparameter. Katib generates hyperparameter combinations in the range based on the hyperparameter tuning algorithm that you specify. See the ParameterSpec type.
- objective: The metric that you want to optimize. The objective metric is also called the target variable. A common metric is the model’s accuracy in the validation pass of the training job (validation-accuracy). You also specify whether you want Katib to maximize or minimize the metric. Katib uses the objectiveMetricName and additionalMetricNames to monitor how the hyperparameters work with the model. Katib records the value of the best objectiveMetricName metric (maximized or minimized based on type) and the corresponding hyperparameter set in Experiment.status. If the objectiveMetricName metric for a set of hyperparameters reaches the goal, Katib stops trying more hyperparameter combinations. See the ObjectiveSpec type.
- algorithm: The search algorithm that you want Katib to use to find the best hyperparameters or neural architecture configuration. Examples include random search, grid search, Bayesian optimization, and more. See the search algorithm details below.
- trialTemplate: The template that defines the trial. You must package your ML training code into a Docker image, as described above. You must configure the model’s hyperparameters either as command-line arguments or as environment variables, so that Katib can automatically set the values in each trial. You can use one of the following job types to train your model:
  - Kubernetes Job (does not support distributed execution).
  - Kubeflow TFJob (supports distributed execution).
  - Kubeflow PyTorchJob (supports distributed execution).
  See the TrialTemplate type. The template uses the Go template format. You can define the job in raw string format or you can use a ConfigMap.
- parallelTrialCount: The maximum number of hyperparameter sets that Katib should train in parallel.
- maxTrialCount: The maximum number of trials to run. This is equivalent to the number of hyperparameter sets that Katib should generate to test the model.
- maxFailedTrialCount: The maximum number of failed trials before Katib should stop the experiment. This is equivalent to the number of failed hyperparameter sets that Katib should test. If the number of failed trials exceeds maxFailedTrialCount, Katib stops the experiment with a status of Failed.
- metricsCollectorSpec: A specification of how to collect the metrics from each trial, such as the accuracy and loss metrics. See the details of the metrics collector below.
- nasConfig: The configuration for a neural architecture search (NAS). Note: NAS is currently in alpha with limited support. You can specify the configurations of the neural network design that you want to optimize, including the number of layers in the network, the types of operations, and more. See the NasConfig type. As an example, see the YAML file for the nasjob-example-RL-gpu. The example aims to show all the possible operations. Due to the large search space, the example is not likely to generate a good result.
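On the training-code side, the trialTemplate requirement that hyperparameters arrive as command-line arguments might look like the following. This is a minimal sketch, not code from the Katib examples: the flag names (--lr, --num-layers, --optimizer), the placeholder train function, and the made-up accuracy formula are all assumptions for illustration.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters that a trial passes as command-line flags."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    # Illustrative flag names; they must match the parameter names in your experiment.
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--num-layers", type=int, default=2)
    parser.add_argument("--optimizer", choices=["sgd", "adam", "ftrl"], default="sgd")
    return parser.parse_args(argv)


def train(args):
    """Placeholder for real training; returns a made-up validation accuracy."""
    # A real script would build and fit a model here.
    return round(min(0.99, 0.5 + args.lr * args.num_layers), 4)


def main(argv=None):
    args = parse_hyperparameters(argv)
    accuracy = train(args)
    # Print the metric as name=value so that a metrics collector can pick it up.
    print(f"validation-accuracy={accuracy}")
    return accuracy


# Simulate the arguments that one trial might pass to the container.
main(["--lr", "0.02", "--num-layers", "3", "--optimizer", "adam"])
```

In a container image, the final call would normally read the real command line (main() with no arguments) instead of a hard-coded list.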
Background information about Katib’s Experiment type: In Kubernetes terminology, Katib’s Experiment type is a custom resource (CR). The YAML file that you create for your experiment is the CR specification.
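To make the shape of that CR specification concrete, the sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. The field names and values are recalled from the v1alpha3 random algorithm example and should be treated as assumptions: check them against the ParameterSpec, ObjectiveSpec, and AlgorithmSpec types for your Katib version. The trialTemplate is omitted for brevity, so this manifest is not complete enough to submit as-is.

```python
import json

# Hypothetical experiment spec modeled on the random algorithm example.
# Field names are assumptions; verify them against the Katib API types.
experiment = {
    "apiVersion": "kubeflow.org/v1alpha3",
    "kind": "Experiment",
    "metadata": {"namespace": "kubeflow", "name": "random-example"},
    "spec": {
        "objective": {
            "type": "maximize",
            "goal": 0.99,
            "objectiveMetricName": "Validation-accuracy",
            "additionalMetricNames": ["accuracy"],
        },
        "algorithm": {"algorithmName": "random"},
        "parallelTrialCount": 3,
        "maxTrialCount": 12,
        "maxFailedTrialCount": 3,
        "parameters": [
            # A continuous parameter is described by a minimum and a maximum.
            {"name": "--lr", "parameterType": "double",
             "feasibleSpace": {"min": "0.01", "max": "0.03"}},
            {"name": "--num-layers", "parameterType": "int",
             "feasibleSpace": {"min": "2", "max": "5"}},
            # A discrete parameter is described by a list of allowed values.
            {"name": "--optimizer", "parameterType": "categorical",
             "feasibleSpace": {"list": ["sgd", "adam", "ftrl"]}},
        ],
        # trialTemplate intentionally omitted in this sketch.
    },
}

manifest = json.dumps(experiment, indent=2)
print(manifest)
```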
Search algorithms in detail
Katib currently supports several search algorithms. See the AlgorithmSpec type.
Here’s a list of the search algorithms available in Katib. The links lead to descriptions on this page:
- Grid search
- Random search
- Bayesian optimization
- HYPERBAND
- Hyperopt TPE
- NAS based on reinforcement learning
More algorithms are under development. You can add an algorithm to Katib yourself. See the guide to adding a new algorithm and the developer guide.
Grid search
The algorithm name in Katib is grid.
Grid sampling is useful when all variables are discrete (as opposed to continuous) and the number of possibilities is low. A grid search performs an exhaustive combinatorial search over all possibilities, making the search process extremely long even for medium-sized problems.
Katib uses the Chocolate optimization framework for its grid search.
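The combinatorial growth described above is easy to see with a small sketch. The code below enumerates a grid over a hypothetical discrete search space using the Python standard library. It illustrates the idea only and does not reflect how Chocolate or Katib implement grid search. The parameter names and values are assumptions.

```python
from itertools import product

# Hypothetical discrete search space: parameter name -> allowed values.
search_space = {
    "--num-layers": [2, 3, 4, 5],
    "--optimizer": ["sgd", "adam", "ftrl"],
    "--batch-size": [32, 64, 128],
}


def grid(space):
    """Yield every combination of values, one dictionary per trial."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(grid(search_space))
# The grid size is the product of the list lengths: 4 * 3 * 3.
print(len(combinations))
```

Adding one more parameter with ten allowed values would multiply the number of trials by ten, which is why grid search suits only small discrete spaces.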
Random search
The algorithm name in Katib is random.
Random sampling is an alternative to grid search, useful when the number of discrete variables to optimize is large and the time required for each evaluation is long. When all parameters are discrete, random search performs sampling without replacement. Random search is therefore the best algorithm to use when combinatorial exploration is not possible. If the number of continuous variables is high, you should use quasi random sampling instead.
Katib uses the hyperopt optimization framework for its random search.
Katib supports the following algorithm settings:
Setting name | Description | Example |
---|---|---|
random_state | [int]: Set random_state to something other than None for reproducible results. | 10 |
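As a rough illustration of sampling without replacement and of the effect of random_state, the sketch below draws distinct points from a small discrete space using the Python standard library. It is not how hyperopt or Katib implement random search: it enumerates the whole space for clarity, which is only practical when the space is small. The parameter names and values are assumptions.

```python
import random
from itertools import product

# Hypothetical discrete search space with 4 * 3 = 12 points.
search_space = {
    "--num-layers": [2, 3, 4, 5],
    "--optimizer": ["sgd", "adam", "ftrl"],
}


def random_search(space, n_trials, random_state=None):
    """Draw up to n_trials distinct points; a fixed random_state makes the draw repeatable."""
    rng = random.Random(random_state)
    names = list(space)
    all_points = [dict(zip(names, values))
                  for values in product(*(space[name] for name in names))]
    # Sampling without replacement cannot return more points than exist.
    count = min(n_trials, len(all_points))
    return rng.sample(all_points, count)


trials = random_search(search_space, n_trials=5, random_state=10)
for trial in trials:
    print(trial)
```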
Bayesian optimization
The algorithm name in Katib is skopt-bayesian-optimization.
The Bayesian optimization method uses Gaussian process regression to model the search space. This technique calculates an estimate of the loss function and the uncertainty of that estimate at every point in the search space. The method is suitable when the number of dimensions in the search space is low. Since the method models both the expected loss and the uncertainty, the search algorithm converges in a few steps, making it a good choice when the time to complete the evaluation of a parameter configuration is long.
Katib uses the Scikit-Optimize library for its Bayesian search. Scikit-Optimize is also known as skopt.
Katib supports the following algorithm settings:
Setting name | Description | Example |
---|---|---|
base_estimator | [“GP”, “RF”, “ET”, “GBRT” or sklearn regressor, default=“GP”]: Should inherit from sklearn.base.RegressorMixin. The predict method should have an optional return_std argument, which returns std(Y \| x) along with E[Y \| x]. If base_estimator is one of [“GP”, “RF”, “ET”, “GBRT”], the system uses a default surrogate model of the corresponding type. See more information in the skopt documentation. | GP |
n_initial_points | [int, default=10]: Number of evaluations of func with initialization points before approximating it with base_estimator. Points provided as x0 count as initialization points. If len(x0) < n_initial_points, the system samples additional points at random. See more information in the skopt documentation. | 10 |
acq_func | [string, default=“gp_hedge”]: The function to minimize over the posterior distribution. See more information in the skopt documentation. | gp_hedge |
acq_optimizer | [string, “sampling” or “lbfgs”, default=“auto”]: The method to minimize the acquisition function. The system updates the fit model with the optimal value obtained by optimizing acq_func with acq_optimizer. See more information in the skopt documentation. | auto |
random_state | [int]: Set random_state to something other than None for reproducible results. | 10 |
HYPERBAND
The algorithm name in Katib is hyperband.
Katib supports the HYPERBAND optimization framework. Instead of using Bayesian optimization to select configurations, HYPERBAND focuses on early stopping as a strategy for optimizing resource allocation and thus for maximizing the number of configurations that it can evaluate. HYPERBAND also focuses on the speed of the search.
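The early-stopping idea can be sketched with successive halving, the subroutine that HYPERBAND repeats across several brackets with different starting budgets. The code below shows a single bracket only and is not Katib’s implementation. The evaluate function, the toy loss, and the candidate learning rates are assumptions for illustration.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations each round while growing the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Evaluate every surviving configuration with the current budget; lower loss is better.
        ranked = sorted(survivors, key=lambda config: evaluate(config, budget))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]  # Stop the remaining configurations early.
        budget *= eta
    return survivors[0]


def toy_loss(config, budget):
    """Hypothetical loss: best at lr=0.1, and lower with a larger training budget."""
    return abs(config["lr"] - 0.1) + 1.0 / budget


candidates = [{"lr": lr} for lr in (0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0)]
best = successive_halving(candidates, toy_loss)
print(best)
```

With nine candidates and eta=3, the first round keeps three configurations and the second round keeps one, so most of the budget goes to the configurations that look promising early.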
Hyperopt TPE
The algorithm name in Katib is tpe.
Katib uses the Tree of Parzen Estimators (TPE) algorithm in hyperopt. This method provides a forward and reverse gradient-based search.
NAS using reinforcement learning
Alpha version
Neural architecture search is currently in alpha with limited support. The Kubeflow team is interested in any feedback you may have, in particular with regard to usability of the feature. You can log issues and comments in the Katib issue tracker.
The algorithm name in Katib is nasrl.
For more information, see:
- Information in the Katib repository on NAS with reinforcement learning.
- The description of the nasConfig field in the configuration file earlier on this page.
Metrics collector
In the metricsCollectorSpec section of the YAML configuration file, you can define how Katib should collect the metrics from each trial, such as the accuracy and loss metrics.
Your training code can record the metrics into stdout or into arbitrary output files. Katib collects the metrics using a sidecar container. A sidecar is a utility container that supports the main container in the Kubernetes Pod.
To define the metrics collector for your experiment:
- Specify the collector type in the collector field. Katib’s metrics collector supports the following collector types:
  - StdOut: Katib collects the metrics from the operating system’s default output location (standard output).
  - File: Katib collects the metrics from an arbitrary file, which you specify in the source field.
  - TensorFlowEvent: Katib collects the metrics from a directory path containing a tf.Event. You should specify the path in the source field.
  - Custom: Specify this value if you need to use a custom way to collect metrics. You must define your custom metrics collector container in the collector.customCollector field.
  - None: Specify this value if you don’t need to use Katib’s metrics collector. For example, your training code may handle the persistent storage of its own metrics.
- Specify the metrics output location in the source field. See the MetricsCollectorSpec type for default values.
- Write code in your training container to print metrics in the format specified in the metricsCollectorSpec.source.filter.metricsFormat field. The default format is ([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?). Each element is a regular expression with two subexpressions. The first matched expression is taken as the metric name. The second matched expression is taken as the metric value.
For example, using the default metrics format, if the name of your objective metric is loss and the metrics are recall and precision, your training code should print the following output:
epoch 1:
loss=0.3
recall=0.5
precision=0.4
epoch 2:
loss=0.2
recall=0.55
precision=0.5
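To check that your output matches the metrics format before you run an experiment, you can apply the regular expression yourself. The sketch below assumes that the default format is ([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?) and parses the sample output above with Python’s re module. The parse_metrics helper is hypothetical and is not part of Katib, whose metrics collector may behave differently in detail.

```python
import re

# Assumed default metricsFormat; check the MetricsCollectorSpec type for your Katib version.
METRICS_FORMAT = re.compile(r"([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)")


def parse_metrics(log_text, metric_names):
    """Return {metric name: [values in order of appearance]} for the requested names."""
    found = {name: [] for name in metric_names}
    # Each match has four groups: name, full value, integer part, fractional part.
    for name, value, _integer_part, _fraction_part in METRICS_FORMAT.findall(log_text):
        if name in found:
            found[name].append(float(value))
    return found


log_text = """epoch 1:
loss=0.3
recall=0.5
precision=0.4
epoch 2:
loss=0.2
recall=0.55
precision=0.5
"""

metrics = parse_metrics(log_text, ["loss", "recall", "precision"])
print(metrics)
```

Lines such as "epoch 1:" contain no name=value pair, so the expression skips them.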
Running the experiment
You can run a Katib experiment from the command line or from the Katib UI.
Running the experiment from the command line
You can use kubectl to launch an experiment from the command line:
kubectl apply -f <your-path/your-experiment-config.yaml>
For example, run the following command to launch an experiment using the random algorithm example:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/random-example.yaml
Check the experiment status:
kubectl -n kubeflow describe experiment <your-experiment-name>
For example, to check the status of the random algorithm example:
kubectl -n kubeflow describe experiment random-example
Running the experiment from the Katib UI
Instead of using the command line, you can submit an experiment from the Katib UI. The following steps assume you want to run a hyperparameter tuning experiment. If you want to run a neural architecture search, access the NAS section of the UI (instead of the HP section) and then follow a similar sequence of steps.
To run a hyperparameter tuning experiment from the Katib UI:
- Follow the getting-started guide to access the Katib UI.
- Click Hyperparameter Tuning on the Katib home page.
- Open the Katib menu panel on the left, then open the HP section and click Submit.
- Click on the right-hand panel to close the menu panel. You should see tabs offering you the following options:
  - YAML file: Choose this option to supply an entire YAML file containing the configuration for the experiment.
  - Parameters: Choose this option to enter the configuration values into a form.
View the results of the experiment in the Katib UI:
- Open the Katib menu panel on the left, then open the HP section and click Monitor.
- Click on the right-hand panel to close the menu panel. You should see the list of experiments.
- Click the name of your experiment. For example, click random-example.
- You should see a graph showing the level of accuracy for various combinations of the hyperparameter values. For example, the graph can show learning rate, number of layers, and optimizer.
- Below the graph is a list of trials that ran within the experiment. Click a trial name to see the trial data.
Next steps
See how to run the random algorithm and other Katib examples in the getting-started guide.
For an overview of the concepts involved in hyperparameter tuning and neural architecture search, read the introduction to Katib.
Last modified 14.02.2020: Fix Katib documentation links (#1690) (c3ef19cc)