Overview of Trial Templates
How to specify trial template parameters and support a custom resource (CRD) in Katib
This guide describes how to configure trial template parameters and use a custom Kubernetes CRD in Katib. You will learn how to change the trial template specification, how to use Kubernetes ConfigMaps to store templates, and how to modify the Katib controller to support your Kubernetes CRD in Katib experiments.
Katib has these CRD examples in upstream:
To use your own Kubernetes resource, follow the steps below.
For the details on how to configure and run your experiment, follow the running an experiment guide.
Use trial template to submit experiment
To run a Katib experiment, you have to specify a trial template for the worker job where the actual training runs. Learn more about Katib concepts in the overview guide.
Configure trial template specification
The trial template specification is located under .spec.trialTemplate of your experiment. For the API overview, refer to the TrialTemplate type.
To define the experiment’s trial, specify these parameters in .spec.trialTemplate:
trialParameters - list of the parameters that are used in the trial template during experiment execution.
  Note: Your trial template must contain each parameter from trialParameters. You can set these parameters in any field of your template, except .metadata.name and .metadata.namespace. Check below how you can use trial metadata parameters in your template. For example, your training container can receive hyperparameters as command-line arguments or as environment variables.
  Your experiment’s suggestion produces trialParameters before running the trial. Each trialParameter has this structure:
    name - the parameter name that is replaced in your template.
    description (optional) - the description of the parameter.
    reference - the parameter name that the experiment’s suggestion returns. Usually, for hyperparameter tuning, parameter references are equal to the experiment search space. For example, in the grid example, the search space has three parameters (lr, num-layers and optimizer) and trialParameters contains each of these parameters in reference.
You have to define your experiment’s trial template in one of the trialSpec or configMap sources. Note: Your template must omit .metadata.name and .metadata.namespace.
  To set the parameters from trialParameters, use the expression ${trialParameters.<parameter-name>} in your template. Katib automatically replaces it with the appropriate values from the experiment’s suggestion.
  For example, in --lr=${trialParameters.learningRate} the expression is replaced with the value of the learningRate parameter.
  trialSpec - the experiment’s trial template in unstructured format. The template should be valid YAML. Check the grid example.
  configMap - Kubernetes ConfigMap specification where the experiment’s trial template is located. This ConfigMap must have the label app: katib-trial-templates and contain key-value pairs, where key: <template-name>, value: <template-yaml>. Check the example of the ConfigMap with trial templates.
    The configMap specification should have:
      configMapName - the ConfigMap name with the trial templates.
      configMapNamespace - the ConfigMap namespace with the trial templates.
      templatePath - the ConfigMap’s data path to the template.
    Check the example with ConfigMap source for the trial template.
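To make the relationship between trialParameters and trialSpec concrete, here is a minimal sketch of an inline trial template. It assumes the experiment’s search space defines lr, num-layers and optimizer, as in the grid example mentioned above. The container image, the script path and the command-line flags are hypothetical placeholders, not part of Katib.

```yaml
# Fragment of an Experiment, located under .spec of the experiment.
trialTemplate:
  primaryContainerName: training-container
  trialParameters:
    - name: learningRate            # name used inside the template below
      description: Learning rate for the training model
      reference: lr                 # name returned by the experiment's suggestion
    - name: numberLayers
      description: Number of training model layers
      reference: num-layers
    - name: optimizer
      description: Training model optimizer
      reference: optimizer
  trialSpec:
    # The template omits metadata.name and metadata.namespace.
    apiVersion: batch/v1
    kind: Job
    spec:
      template:
        spec:
          containers:
            - name: training-container
              image: docker.io/example/mnist-trainer:latest   # hypothetical image
              command:
                - "python3"
                - "/opt/train.py"                             # hypothetical script
                - "--lr=${trialParameters.learningRate}"
                - "--num-layers=${trialParameters.numberLayers}"
                - "--optimizer=${trialParameters.optimizer}"
          restartPolicy: Never
```

Every entry in trialParameters appears at least once in the template, and each reference matches a name from the assumed search space.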
The .spec.trialTemplate parameters below control trial behavior. If a parameter has a default value, it can be omitted from the experiment YAML.
retain - indicates that the trial’s resources are not cleaned up after the trial is complete. Check the example with the retain: true parameter.
  The default value is false
primaryPodLabels - the labels of the trial worker’s Pod or Pods. The Katib metrics collector is injected into these Pods. Note: If primaryPodLabels is omitted, the metrics collector wraps all of the worker’s Pods. Learn more about the Katib metrics collector in the running an experiment guide. Check the example with primaryPodLabels.
  The default value for Kubeflow TFJob and PyTorchJob is job-role: master
  The primaryPodLabels default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source, you have to set primaryPodLabels manually.
primaryContainerName - the name of the training container where the actual model training runs. The Katib metrics collector wraps this container to collect the required metrics for a single experiment optimization step.
successCondition - the trial worker’s object status in which the trial’s job has succeeded. This condition must be in GJSON format. Check the example with successCondition.
  The default value for Kubernetes Job is status.conditions.#(type=="Complete")#|#(status=="True")#
  The default value for Kubeflow TFJob and PyTorchJob is status.conditions.#(type=="Succeeded")#|#(status=="True")#
  The successCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source, you have to set successCondition manually.
failureCondition - the trial worker’s object status in which the trial’s job has failed. This condition must be in GJSON format. Check the example with failureCondition.
  The default value for Kubernetes Job is status.conditions.#(type=="Failed")#|#(status=="True")#
  The default value for Kubeflow TFJob and PyTorchJob is status.conditions.#(type=="Failed")#|#(status=="True")#
  The failureCondition default value works only if you specify your template in .spec.trialTemplate.trialSpec. For the configMap template source, you have to set failureCondition manually.
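Because the defaults above apply only to the trialSpec source, a template loaded from a ConfigMap needs these fields set explicitly. The sketch below shows one way this could look for a Kubernetes Job. The ConfigMap name, the data key, the image and the script path are hypothetical; the label and the condition values are the ones given in this guide.

```yaml
# ConfigMap that stores the trial template (name and key are hypothetical).
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-trial-templates
  namespace: kubeflow
  labels:
    app: katib-trial-templates      # label required for trial template ConfigMaps
data:
  jobTemplate.yaml: |-
    apiVersion: batch/v1
    kind: Job
    spec:
      template:
        spec:
          containers:
            - name: training-container
              image: docker.io/example/mnist-trainer:latest
              command:
                - "python3"
                - "/opt/train.py"
                - "--lr=${trialParameters.learningRate}"
          restartPolicy: Never
---
# Fragment of the Experiment, located under .spec, that references the ConfigMap.
trialTemplate:
  retain: true                      # keep the trial's resources after completion
  primaryContainerName: training-container
  # Set manually because the template source is configMap.
  successCondition: 'status.conditions.#(type=="Complete")#|#(status=="True")#'
  failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
  trialParameters:
    - name: learningRate
      description: Learning rate for the training model
      reference: lr
  configMap:
    configMapName: my-trial-templates
    configMapNamespace: kubeflow
    templatePath: jobTemplate.yaml
```

primaryPodLabels is left out here on purpose, so the metrics collector wraps all of the worker’s Pods, as noted above. The conditions are quoted only to keep the # characters unambiguous in YAML.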
Use trial metadata in template
You can’t specify .metadata.name and .metadata.namespace in your trial template, but you can get this data during the experiment run. This is useful, for example, if you want to append the trial’s name to your model storage.
To do this, point .trialParameters[x].reference to the appropriate metadata parameter and use .trialParameters[x].name in your trial template.
The table below shows the connection between the .trialParameters[x].reference value and trial metadata.
Reference | Trial metadata |
---|---|
${trialSpec.Name} | Trial name |
${trialSpec.Namespace} | Trial namespace |
${trialSpec.Kind} | Kubernetes resource kind for the trial’s worker |
${trialSpec.APIVersion} | Kubernetes resource APIVersion for the trial’s worker |
${trialSpec.Labels[custom-key]} | Trial’s worker label with custom-key key |
${trialSpec.Annotations[custom-key]} | Trial’s worker annotation with custom-key key |
Check the example of using trial metadata.
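As a rough illustration of the table above, the fragment below passes the trial name and namespace to the training container. It is a sketch under assumptions: the image, the script path, the --model-dir flag and the environment variable name are hypothetical, while the reference values come from the table.

```yaml
# Fragment of an Experiment, located under .spec of the experiment.
trialTemplate:
  primaryContainerName: training-container
  trialParameters:
    - name: learningRate
      description: Learning rate for the training model
      reference: lr                          # assumed search space parameter
    - name: trialName
      description: Name of the current trial
      reference: "${trialSpec.Name}"         # trial metadata, see the table above
    - name: trialNamespace
      description: Namespace of the current trial
      reference: "${trialSpec.Namespace}"
  trialSpec:
    apiVersion: batch/v1
    kind: Job
    spec:
      template:
        spec:
          containers:
            - name: training-container
              image: docker.io/example/mnist-trainer:latest   # hypothetical image
              command:
                - "python3"
                - "/opt/train.py"                             # hypothetical script
                - "--lr=${trialParameters.learningRate}"
                # Hypothetical flag: one model directory per trial.
                - "--model-dir=/models/${trialParameters.trialName}"
              env:
                - name: TRIAL_NAMESPACE                       # hypothetical variable
                  value: "${trialParameters.trialNamespace}"
          restartPolicy: Never
```

The template refers to the metadata only through the names declared in trialParameters, never through ${trialSpec...} directly.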
Use custom Kubernetes resource as a trial template
By default, you can define your trial worker as a Kubernetes Job, Kubeflow TFJob, Kubeflow PyTorchJob, Kubeflow MPIJob or Tekton Pipeline.
Note: To use Tekton Pipeline, you need to modify the Tekton installation to change the nop image. Follow the Tekton integration guide to learn more.
It is possible to use your own Kubernetes CRD or another Kubernetes resource (e.g. Kubernetes Deployment) as a trial worker without modifying the Katib controller source code or building a new image. As long as your CRD creates Kubernetes Pods, allows the sidecar container to be injected into these Pods, and reports succeeded and failed statuses, you can use it in Katib.
To do that, you need to modify the Katib components before installing Katib on your Kubernetes cluster. Accordingly, you have to know your CRD’s API group and version and the CRD object’s kind. You also need to know which resources your custom object creates. Check the Kubernetes guide to learn more about CRDs.
Follow these two simple steps to integrate your custom CRD in Katib:
Modify the Katib controller ClusterRole’s rules with a new rule that gives Katib access to all resources created by the trial. To learn more about ClusterRole, check the Kubernetes guide.
  In the case of Tekton Pipeline, the trial creates a Tekton PipelineRun, and the Tekton PipelineRun then creates a Tekton TaskRun. Therefore, the Katib controller ClusterRole should have access to pipelineruns and taskruns:
    - apiGroups:
        - tekton.dev
      resources:
        - pipelineruns
        - taskruns
      verbs:
        - "*"
Modify the Katib controller Deployment’s args with the new flag: --trial-resources=<object-kind>.<object-API-version>.<object-API-group>
  For example, to support Tekton Pipeline:
    - "--trial-resources=PipelineRun.v1beta1.tekton.dev"
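The same two changes can be sketched for a custom resource of your own. The example below assumes a hypothetical CRD with kind MyJob, API version v1, API group example.com and plural resource name myjobs; none of these exist in Katib, so substitute the values for your CRD. It also assumes the custom object creates only Pods, which the controller can already access.

```yaml
# Fragment 1: extra rule for the Katib controller ClusterRole.
rules:
  - apiGroups:
      - example.com          # hypothetical API group
    resources:
      - myjobs               # hypothetical plural name of the MyJob resource
    verbs:
      - "*"
---
# Fragment 2: extra argument for the Katib controller Deployment container.
args:
  # Format: <object-kind>.<object-API-version>.<object-API-group>
  - "--trial-resources=MyJob.v1.example.com"
```

If your custom object creates further custom resources, as the Tekton PipelineRun does with TaskRuns, list those resources in the rule as well.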
After these changes, deploy Katib as described in the getting started guide and wait until the katib-controller Pod is created. You can check the Katib controller logs to verify your resource integration:
kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow
Expected output for the Tekton Pipeline:
{"level":"info","ts":1604325430.9762623,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: tekton.dev/v1beta1, Kind=PipelineRun"}
{"level":"info","ts":1604325430.9763885,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
If you ran the above steps successfully, you should be able to use your custom object YAML in the experiment’s trial template source spec.
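For orientation, a trial template that uses a custom object might look like the sketch below. Everything about the custom resource is an assumption: the kind MyJob, the API version example.com/v1, its spec layout and its status condition types are hypothetical and must be replaced with those of your CRD. The defaults listed earlier in this guide cover only Kubernetes Job, Kubeflow TFJob and PyTorchJob, so the sketch sets the conditions explicitly.

```yaml
# Fragment of an Experiment, located under .spec of the experiment.
trialTemplate:
  primaryContainerName: training-container
  # GJSON conditions matching the hypothetical MyJob status.
  successCondition: 'status.conditions.#(type=="Succeeded")#|#(status=="True")#'
  failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
  trialParameters:
    - name: learningRate
      description: Learning rate for the training model
      reference: lr                          # assumed search space parameter
  trialSpec:
    apiVersion: example.com/v1               # hypothetical group and version
    kind: MyJob                              # hypothetical kind
    spec:
      # Hypothetical spec layout: a Pod template, similar to a Kubernetes Job.
      template:
        spec:
          containers:
            - name: training-container
              image: docker.io/example/mnist-trainer:latest   # hypothetical image
              command:
                - "python3"
                - "/opt/train.py"                             # hypothetical script
                - "--lr=${trialParameters.learningRate}"
```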
We appreciate your feedback on using various CRDs in Katib. It would be great if you let us know about your experiments. The developer guide is a good starting point to learn how to contribute to the project.
Next steps
Learn how to configure and run your Katib experiments.
Check the Katib Configuration (Katib config).
How to set up environment variables for each Katib component.
Last modified 13.03.2021: Modify links for the Katib manifests (#2540) (f7c82c16)