Build Components and Pipelines
Building your own component and adding it to a pipeline
This page describes how to create a component for Kubeflow Pipelines and how to combine components into a pipeline. For an easier start, experiment with the Kubeflow Pipelines samples.
Overview of pipelines and components
A pipeline is a description of a machine learning (ML) workflow, including all of the components of the workflow and how they work together. The pipeline includes the definition of the inputs (parameters) required to run the pipeline and the inputs and outputs of each component.
A pipeline component is an implementation of a pipeline task. A component represents a step in the workflow. Each component takes one or more inputs and may produce one or more outputs. A component consists of an interface (inputs/outputs), the implementation (a Docker container image and command-line arguments) and metadata (name, description).
For more information, see the conceptual guides to pipelines and components.
Before you start
Set up your environment:
- Install Docker.
- Install the Kubeflow Pipelines SDK.
The examples on this page come from the XGBoost Spark pipeline sample in the Kubeflow Pipelines sample repository.
Create a container image for each component
This section assumes that you have already created a program to perform the task required in a particular step of your ML workflow. For example, if the task is to train an ML model, then you must have a program that does the training, such as the program that trains an XGBoost model.
Create a Docker container image that packages your program. See the Docker file for the example XGBoost model training program mentioned above. You can also examine the generic build_image.sh script in the Kubeflow Pipelines repository of reusable components.
Your component can create outputs that the downstream components can use as inputs. Each output must be a string and the container image must write each output to a separate local text file. For example, if a training component needs to output the path of the trained model, the component writes the path into a local file, such as /output.txt. In the Python code that defines your pipeline (see below), you can specify how to map the content of local files to component outputs.
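As a minimal sketch of the output convention described above, the program inside the container might finish by writing its single string output to a local text file. The helper name write_output and the model path are hypothetical, and a temporary directory stands in for /output.txt so that the sketch runs outside a container.

```python
import os
import tempfile


def write_output(value, path):
    """Write one string output to its own local text file."""
    # Hypothetical helper (not part of the Kubeflow Pipelines SDK).
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        f.write(str(value))
    return path


# Inside the container this would be '/output.txt'; a temporary
# directory is used here so the sketch runs anywhere.
output_file = os.path.join(tempfile.mkdtemp(), "output.txt")
model_path = "gs://my-bucket/model"  # hypothetical path of the trained model
write_output(model_path, output_file)

# Read the file back to confirm what a downstream step would receive.
with open(output_file) as f:
    written = f.read()
```

Writing each output to its own file keeps the mapping from files to named outputs unambiguous.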
Create a Python function to wrap your component
Define a Python function to describe the interactions with the Docker container image that contains your pipeline component. For example, the following Python function describes a component that trains an XGBoost model:
import kfp.dsl as dsl


def dataproc_train_op(
    project,
    region,
    cluster_name,
    train_data,
    eval_data,
    target,
    analysis,
    workers,
    rounds,
    output,
    is_classification=True
):
    # Pick the training configuration for classification or regression.
    if is_classification:
        config = 'gs://ml-pipeline-playground/trainconfcla.json'
    else:
        config = 'gs://ml-pipeline-playground/trainconfreg.json'

    return dsl.ContainerOp(
        name='Dataproc - Train XGBoost model',
        image='gcr.io/ml-pipeline/ml-pipeline-dataproc-train:ac833a084b32324b56ca56e9109e05cde02816a4',
        # Command-line arguments passed to the program in the container.
        arguments=[
            '--project', project,
            '--region', region,
            '--cluster', cluster_name,
            '--train', train_data,
            '--eval', eval_data,
            '--analysis', analysis,
            '--target', target,
            '--package', 'gs://ml-pipeline-playground/xgboost4j-example-0.8-SNAPSHOT-jar-with-dependencies.jar',
            '--workers', workers,
            '--rounds', rounds,
            '--conf', config,
            '--output', output,
        ],
        # Map the output label to the local file that the container writes.
        file_outputs={
            'output': '/output.txt',
        }
    )
The function must return a dsl.ContainerOp. The example above comes from the XGBoost Spark pipeline sample.
Note:
- Each component must inherit from dsl.ContainerOp.
- Values in the arguments list that's used by the dsl.ContainerOp constructor above must be either Python scalar types (such as str and int) or dsl.PipelineParam types. Each dsl.PipelineParam represents a parameter whose value is usually only known at run time. The value is either provided by the user at pipeline run time or received as an output from an upstream component.
- Although the value of each dsl.PipelineParam is only available at run time, you can still use the parameters inline in the arguments by using %s variable substitution. At run time the argument contains the value of the parameter.
- file_outputs is a mapping between labels and local file paths. In the above example, the file /output.txt contains the string output of the component. To reference the output in code, use its label (the label in the example above is 'output'):

  op = dataproc_train_op(...)
  op.outputs['label']

  If there is only one output then you can also use op.output.
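The %s substitution mentioned above can be illustrated without the SDK. In this sketch, FakeParam is a hypothetical stand-in for dsl.PipelineParam, and the placeholder text it prints is an assumption made for illustration, not the SDK's actual serialization. The idea is that formatting a string with a parameter embeds a placeholder, and the real value replaces it only at run time.

```python
class FakeParam:
    """Hypothetical stand-in for dsl.PipelineParam (illustration only)."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Assumed placeholder format; the real SDK defines its own.
        return "{{param:%s}}" % self.name


def resolve(argument, values):
    """Replace each placeholder in the argument with its run-time value."""
    for name, value in values.items():
        argument = argument.replace("{{param:%s}}" % name, str(value))
    return argument


output = FakeParam("output")

# At pipeline-definition time the argument holds only a placeholder.
model_dir = "%s/model" % output

# At run time the placeholder is replaced by the actual value.
resolved = resolve(model_dir, {"output": "gs://my-bucket/run1"})
```

Because the placeholder survives ordinary string formatting, a parameter can appear inside a longer argument such as a path.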
Define your pipeline as a Python function
You must describe each pipeline as a Python function. For example:
@dsl.pipeline(
    name='XGBoost Trainer',
    description='A trainer that does end-to-end distributed training for XGBoost models.'
)
def xgb_train_pipeline(
    output,
    project,
    region='us-central1',
    train_data='gs://ml-pipeline-playground/sfpd/train.csv',
    eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
    schema='gs://ml-pipeline-playground/sfpd/schema.json',
    target='resolution',
    rounds=200,
    workers=2,
    true_label='ACTION',
):
    ...  # Pipeline steps omitted here; see the full sample.
Note:
- @dsl.pipeline is a required decorator that includes the name and description properties.
- Input arguments show up as pipeline parameters on the Kubeflow Pipelines UI. As a Python rule, positional arguments appear first, followed by keyword arguments.
- Each function argument is of type dsl.PipelineParam. The default values should all be of that type. The default values show up in the Kubeflow Pipelines UI but the user can override them.
See the full code in the XGBoost Spark pipeline sample.
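To check which parameters a pipeline function declares, and which of them carry defaults, you can inspect its signature with the standard library. The sketch below uses a shortened, undecorated copy of the function above with an empty body, an assumption made so that it runs without the Kubeflow Pipelines SDK. The helper describe_parameters is hypothetical.

```python
import inspect


def xgb_train_pipeline(
    output,
    project,
    region='us-central1',
    rounds=200,
    workers=2,
):
    """Shortened copy of the pipeline function; body omitted."""


def describe_parameters(func):
    """Return (name, default) pairs; required parameters get default None."""
    result = []
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            default = None  # No default: the user must supply a value.
        else:
            default = param.default
        result.append((name, default))
    return result


params = describe_parameters(xgb_train_pipeline)
for name, default in params:
    print(name, "(required)" if default is None else "default=%r" % (default,))
```

Parameters without a default (output and project here) are the ones a user has to fill in before starting a run.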
Compile the pipeline
After defining the pipeline in Python as described above, you must compile the pipeline to an intermediate representation before you can submit it to the Kubeflow Pipelines service. The intermediate representation is a workflow specification in the form of a YAML file compressed into a .tar.gz file.
Use the dsl-compile command to compile your pipeline:
dsl-compile --py [path/to/python/file] --output [path/to/output/tar.gz]
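After compiling, you may want to confirm what the package contains. The following sketch lists the members of a .tar.gz archive with the standard tarfile module. Because no compiled pipeline is available here, it first builds a stand-in archive. The member name pipeline.yaml and its contents are assumptions for illustration, and list_package_members is a hypothetical helper.

```python
import io
import os
import tarfile
import tempfile


def list_package_members(package_path):
    """Return the names of the files inside a .tar.gz package."""
    with tarfile.open(package_path, "r:gz") as tar:
        return tar.getnames()


# Build a stand-in package (assumed layout: a single YAML workflow file).
package_path = os.path.join(tempfile.mkdtemp(), "pipeline.tar.gz")
content = b"apiVersion: argoproj.io/v1alpha1\nkind: Workflow\n"
with tarfile.open(package_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="pipeline.yaml")
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))

members = list_package_members(package_path)
print(members)
```

To inspect a real package, pass the path that you gave to the --output option instead of the stand-in archive.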
Deploy the pipeline
Upload the generated .tar.gz file through the Kubeflow Pipelines UI. See the guide to getting started with the UI.
Next steps
- Build a reusable component for sharing in multiple pipelines.
- Learn more about the Kubeflow Pipelines domain-specific language (DSL), a set of Python libraries that you can use to specify ML pipelines.
- See how to export metrics from your pipeline.
- Visualize the output of your component by adding metadata for an output viewer.
- For quick iteration, build lightweight components directly from Python functions.
Last modified 09.10.2019: updating broken pipelines references (#1199) (e2040afb)