Building Components
A tutorial on how to create components and use them in a pipeline
A pipeline component is a self-contained set of code that performs one step in your ML workflow. This document describes the concepts required to build components, and demonstrates how to get started building components.
Before you begin
Run the following command to install the Kubeflow Pipelines SDK.
$ pip3 install kfp --upgrade
For more information about the Kubeflow Pipelines SDK, see the SDK reference guide.
Understanding pipeline components
Pipeline components are self-contained sets of code that perform one step in your ML workflow, such as preprocessing data or training a model. To create a component, you must build the component’s implementation and define the component specification.
Your component’s implementation includes the component’s executable code and the Docker container image that the code runs in. Learn more about designing a pipeline component.
Once you have built your component’s implementation, you can define your component’s interface as a component specification. A component specification defines:
- The component’s inputs and outputs.
- The container image that your component’s code runs in, the command to use to run your component’s code, and the command-line arguments to pass to your component’s code.
- The component’s metadata, such as the name and description.
Learn more about creating a component specification.
If your component’s code is implemented as a Python function, use the Kubeflow Pipelines SDK to package your function as a component. Learn more about building Python function-based components.
Designing a pipeline component
When Kubeflow Pipelines executes a component, a container image is started in a Kubernetes Pod and your component’s inputs are passed in as command-line arguments. You can pass small inputs, such as strings and numbers, by value. Larger inputs, such as CSV data, must be passed as paths to files. When your component has finished, the component’s outputs are returned as files.
When you design your component’s code, consider the following:
- Which inputs can be passed to your component by value? Examples of inputs that you can pass by value include numbers, booleans, and short strings. Any value that you could reasonably pass as a command-line argument can be passed to your component by value. All other inputs are passed to your component by a reference to the input’s path.
- To return an output from your component, the output’s data must be stored as a file. When you define your component, you let Kubeflow Pipelines know what outputs your component produces. When your pipeline runs, Kubeflow Pipelines passes the paths that you use to store your component’s outputs as inputs to your component.
- Outputs are typically written to a single file. In some cases, you may need to return a directory of files as an output. In this case, create a directory at the output path and write the output files to that location. In both cases, it may be necessary to create parent directories if they do not exist.
- Your component’s goal may be to create a dataset in an external service, such as a BigQuery table. In this case, it may make sense for the component to output an identifier for the produced data, such as a table name, instead of the data itself. We recommend that you limit this pattern to cases where the data must be put into an external system instead of keeping it inside the Kubeflow Pipelines system.
- Since your inputs and output paths are passed in as command-line arguments, your component’s code must be able to read inputs from the command line. If your component is built with Python, libraries such as argparse and absl.flags make it easier to read your component’s inputs.
- Your component’s code can be implemented in any language, so long as it can run in a container image.
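For instance, a component that returns a directory of files can prepare its output location like this (a minimal sketch using only the Python standard library; the paths, file names, and the write_shards helper are illustrative):

```python
from pathlib import Path

def write_shards(output_dir: str, shards: list) -> None:
    # For a directory output, create the directory at the output path
    # (including any missing parent directories) and write files into it.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, data in enumerate(shards):
        (out / f'part-{i:05d}.txt').write_text(data)

write_shards('/tmp/outputs/output1/data', ['alpha\n', 'beta\n'])
```

The same mkdir call handles the single-file case too: create the parent directory of the output path before opening the file for writing.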
The following is an example program written in Python 3. The program reads a given number of lines from an input file and writes those lines to an output file. It accepts three command-line arguments:
- The path to the input file.
- The number of lines to read.
- The path to the output file.
#!/usr/bin/env python3
import argparse
from pathlib import Path

# Function doing the actual work (outputs the first N lines from a text file)
def do_work(input1_file, output1_file, param1):
    for x, line in enumerate(input1_file):
        if x >= param1:
            break
        _ = output1_file.write(line)

# Defining and parsing the command-line arguments
parser = argparse.ArgumentParser(description='My program description')
# Paths must be passed in, not hardcoded
parser.add_argument('--input1-path', type=str,
                    help='Path of the local file containing the Input 1 data.')
parser.add_argument('--output1-path', type=str,
                    help='Path of the local file where the Output 1 data should be written.')
parser.add_argument('--param1', type=int, default=100,
                    help='The number of lines to read from the input and write to the output.')
args = parser.parse_args()

# Creating the directory where the output file is created (the directory
# may or may not exist).
Path(args.output1_path).parent.mkdir(parents=True, exist_ok=True)

with open(args.input1_path, 'r') as input1_file:
    with open(args.output1_path, 'w') as output1_file:
        do_work(input1_file, output1_file, args.param1)
If this program is saved as program.py, the command-line invocation of this program is:
python3 program.py --input1-path <path-to-the-input-file> \
    --param1 <number-of-lines-to-read> \
    --output1-path <path-to-write-the-output-to>
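As a quick local sanity check (no container required), the program's core behavior of copying the first N lines is equivalent to the following standard-library snippet; the paths below are illustrative:

```python
from itertools import islice
from pathlib import Path

# Create a small input file, then copy its first two lines to the output
# file, mirroring what program.py does when invoked with --param1 2.
work = Path('/tmp/kfp_component_demo')
work.mkdir(parents=True, exist_ok=True)
(work / 'input.txt').write_text('one\ntwo\nthree\nfour\n')

with open(work / 'input.txt') as src, open(work / 'output.txt', 'w') as dst:
    dst.writelines(islice(src, 2))

print((work / 'output.txt').read_text())  # one\ntwo\n
```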
Containerize your component’s code
For Kubeflow Pipelines to run your component, your component must be packaged as a Docker container image and published to a container registry that your Kubernetes cluster can access. The steps to create a container image are not specific to Kubeflow Pipelines. To make things easier for you, this section provides some guidelines on standard container creation.
Create a Dockerfile for your container. A Dockerfile specifies:
- The base container image. For example, the operating system that your code runs on.
- Any dependencies that need to be installed for your code to run.
- Files to copy into the container, such as the runnable code for this component.
The following is an example Dockerfile.
FROM python:3.7
RUN python3 -m pip install keras
COPY ./src /pipelines/component/src
In this example:

- The base container image is python:3.7.
- The keras Python package is installed in the container image.
- Files in your ./src directory are copied into /pipelines/component/src in the container image.
Create a script named build_image.sh that uses Docker to build your container image and push your container image to a container registry. Your Kubernetes cluster must be able to access your container registry to run your component. Examples of container registries include Google Container Registry and Docker Hub.

The following example builds a container image, pushes it to a container registry, and outputs the strict image name. It is a best practice to use the strict image name in your component specification to ensure that you are using the expected version of a container image in each component execution.
#!/bin/bash -e
image_name=gcr.io/my-org/my-image
image_tag=latest
full_image_name=${image_name}:${image_tag}
cd "$(dirname "$0")"
docker build -t "${full_image_name}" .
docker push "$full_image_name"
# Output the strict image name, which contains the sha256 image digest
docker inspect --format="{{index .RepoDigests 0}}" "${full_image_name}"
In the preceding example:
- The image_name variable specifies the full name of your container image in the container registry.
- The image_tag variable specifies that this image should be tagged as latest.
Save this file and run the following to make this script executable.
chmod +x build_image.sh
Run your build_image.sh script to build your container image and push it to a container registry.

Use docker run to test your container image locally. If necessary, revise your application and Dockerfile until your application works as expected in the container.
Creating a component specification
To create a component from your containerized program, you must create a component specification that defines the component’s interface and implementation. The following sections provide an overview of how to create a component specification by demonstrating how to define the component’s implementation, interface, and metadata.
To learn more about defining a component specification, see the component specification reference guide.
Define your component’s implementation
The following example creates a component specification YAML and defines the component’s implementation.
Create a file named component.yaml and open it in a text editor.

Create your component’s implementation section and specify the strict name of your container image. The strict image name is provided when you run your build_image.sh script.

implementation:
  container:
    # The strict name of a container image that you've pushed to a container registry.
    image: gcr.io/my-org/my-image@sha256:a172..752f

Define a command for your component’s implementation. This field specifies the command-line arguments that are used to run your program in the container.

implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    # command is a list of strings (command-line arguments).
    # The YAML language has two syntaxes for lists and you can use either of them.
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3,
      # Path of the program inside the container
      /pipelines/component/src/program.py,
      --input1-path,
      {inputPath: Input 1},
      --param1,
      {inputValue: Parameter 1},
      --output1-path,
      {outputPath: Output 1},
    ]
The command is formatted as a list of strings. Each string in the command is a command-line argument or a placeholder. At runtime, placeholders are replaced with an input or output. In the preceding example, two inputs and one output path are passed into a Python script at /pipelines/component/src/program.py.

There are three types of input/output placeholders:

- {inputValue: <input-name>}: This placeholder is replaced with the value of the specified input. This is useful for small pieces of input data, such as numbers or short strings.
- {inputPath: <input-name>}: This placeholder is replaced with the path to this input as a file. Your component can read the contents of that input at that path during the pipeline run.
- {outputPath: <output-name>}: This placeholder is replaced with the path where your program writes this output’s data. This lets the Kubeflow Pipelines system read the contents of the file and store it as the value of the specified output.

The <input-name> name must match the name of an input in the inputs section of your component specification. The <output-name> name must match the name of an output in the outputs section of your component specification.
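Conceptually, placeholder resolution works like the following substitution over the command list (an illustration of the idea only, not the actual Kubeflow Pipelines implementation; the resolve_command helper and the concrete paths are made up for this sketch):

```python
def resolve_command(command, input_values, input_paths, output_paths):
    # Replace each placeholder with the corresponding value or path;
    # plain strings pass through unchanged.
    resolved = []
    for arg in command:
        if isinstance(arg, dict):
            if 'inputValue' in arg:
                resolved.append(str(input_values[arg['inputValue']]))
            elif 'inputPath' in arg:
                resolved.append(input_paths[arg['inputPath']])
            elif 'outputPath' in arg:
                resolved.append(output_paths[arg['outputPath']])
        else:
            resolved.append(arg)
    return resolved

argv = resolve_command(
    ['python3', '/pipelines/component/src/program.py',
     '--input1-path', {'inputPath': 'Input 1'},
     '--param1', {'inputValue': 'Parameter 1'},
     '--output1-path', {'outputPath': 'Output 1'}],
    input_values={'Parameter 1': 100},
    input_paths={'Input 1': '/tmp/inputs/input1/data'},
    output_paths={'Output 1': '/tmp/outputs/output1/data'},
)
print(argv)
```

After resolution, every element of argv is a plain string ready to be passed to the container's entry point as a command-line argument.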
Define your component’s interface
The following examples demonstrate how to specify your component’s interface.
To define an input in your component.yaml, add an item to the inputs list with the following attributes:

- name: Human-readable name of this input. Each input’s name must be unique.
- description: (Optional.) Human-readable description of the input.
- default: (Optional.) Specifies the default value for this input.
- type: (Optional.) Specifies the input’s type. Learn more about the types defined in the Kubeflow Pipelines SDK and how type checking works in pipelines and components.
- optional: Specifies if this input is optional. The value of this attribute is of type Bool, and defaults to False.
In this example, the Python program has two inputs:

- Input 1 contains String data.
- Parameter 1 contains an Integer.
inputs:
- {name: Input 1, type: String, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}
Note: Input 1 and Parameter 1 do not specify any details about how they are stored or how much data they contain. Consider using naming conventions to indicate if inputs are expected to be small enough to pass by value.

After your component finishes its task, the component’s outputs are passed to your pipeline as paths. At runtime, Kubeflow Pipelines creates a path for each of your component’s outputs. These paths are passed as inputs to your component’s implementation.
To define an output in your component specification YAML, add an item to the outputs list with the following attributes:

- name: Human-readable name of this output. Each output’s name must be unique.
- description: (Optional.) Human-readable description of the output.
- type: (Optional.) Specifies the output’s type. Learn more about the types defined in the Kubeflow Pipelines SDK and how type checking works in pipelines and components.
In this example, the Python program returns one output. The output is named Output 1 and it contains String data.

outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
Note: Consider using naming conventions to indicate if this output is expected to be small enough to pass by value. You should limit the amount of data that is passed by value to 200 KB per pipeline run.
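A hypothetical helper for checking this guideline in your own code might look like the following (the 200 KB threshold comes from the note above; the function name is illustrative):

```python
MAX_BY_VALUE_BYTES = 200 * 1024  # guideline from the note above: 200 KB

def small_enough_to_pass_by_value(value: str) -> bool:
    # Compare the UTF-8 encoded size of the value against the guideline.
    return len(value.encode('utf-8')) <= MAX_BY_VALUE_BYTES

print(small_enough_to_pass_by_value('a short string'))  # True
```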
After you define your component’s interface, the component.yaml should look something like the following:

inputs:
- {name: Input 1, type: String, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}
outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    # command is a list of strings (command-line arguments).
    # The YAML language has two syntaxes for lists and you can use either of them.
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3,
      # Path of the program inside the container
      /pipelines/component/src/program.py,
      --input1-path,
      {inputPath: Input 1},
      --param1,
      {inputValue: Parameter 1},
      --output1-path,
      {outputPath: Output 1},
    ]
Specify your component’s metadata
To define your component’s metadata, add the name and description fields to your component.yaml:
name: Get Lines
description: Gets the specified number of lines from the input file.
inputs:
- {name: Input 1, type: String, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}
outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    # command is a list of strings (command-line arguments).
    # The YAML language has two syntaxes for lists and you can use either of them.
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3,
      # Path of the program inside the container
      /pipelines/component/src/program.py,
      --input1-path,
      {inputPath: Input 1},
      --param1,
      {inputValue: Parameter 1},
      --output1-path,
      {outputPath: Output 1},
    ]
Using your component in a pipeline
You can use the Kubeflow Pipelines SDK to load your component using methods such as the following:
- kfp.components.load_component_from_file: Use this method to load your component from a component.yaml path.
- kfp.components.load_component_from_url: Use this method to load a component.yaml from a URL.
- kfp.components.load_component_from_text: Use this method to load your component specification YAML from a string. This method is useful for rapidly iterating on your component specification.
These functions create a factory function that you can use to create ContainerOp instances to use as steps in your pipeline. This factory function’s input arguments include your component’s inputs and the paths to your component’s outputs. The function signature may be modified in the following ways to ensure that it is valid and Pythonic:

- Inputs with default values come after the inputs without default values and the outputs.
- Input and output names are converted to Pythonic names (spaces and symbols are replaced with underscores and letters are converted to lowercase). For example, an input named Input 1 is converted to input_1.
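The name conversion can be sketched as follows (an illustration of the rule described above, not the SDK's exact implementation):

```python
import re

def to_pythonic_name(name: str) -> str:
    # Lowercase the name, replace runs of spaces and symbols with a single
    # underscore, and trim any leading or trailing underscores.
    return re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')

print(to_pythonic_name('Input 1'))   # input_1
print(to_pythonic_name('Output 1'))  # output_1
```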
The following example demonstrates how to load the text of your component specification and run it in a single-step pipeline. Before you run this example, replace the component specification text with the specification you defined in the previous sections.
import kfp
import kfp.components as comp

create_step_get_lines = comp.load_component_from_text("""
name: Get Lines
description: Gets the specified number of lines from the input file.

inputs:
- {name: Input 1, type: String, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'}

outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}

implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    # command is a list of strings (command-line arguments).
    # The YAML language has two syntaxes for lists and you can use either of them.
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3,
      # Path of the program inside the container
      /pipelines/component/src/program.py,
      --input1-path,
      {inputPath: Input 1},
      --param1,
      {inputValue: Parameter 1},
      --output1-path,
      {outputPath: Output 1},
    ]""")

# create_step_get_lines is a "factory function" that accepts the arguments
# for the component's inputs and output paths and returns a pipeline step
# (ContainerOp instance).
#
# To inspect the create_step_get_lines function in Jupyter Notebook, enter
# "create_step_get_lines(" in a cell and press Shift+Tab.
# You can also get help by entering `help(create_step_get_lines)`,
# `create_step_get_lines?`, or `create_step_get_lines??`.

# Define your pipeline
def my_pipeline():
    get_lines_step = create_step_get_lines(
        # Input name "Input 1" is converted to the pythonic parameter name "input_1"
        input_1='one\ntwo\nthree\nfour\nfive\nsix\nseven\neight\nnine\nten',
        parameter_1='5',
    )

# If you run this example from a Jupyter notebook running on Kubeflow,
# you can exclude the host parameter:
# client = kfp.Client()
client = kfp.Client(host='<your-kubeflow-pipelines-host-name>')

# Compile, upload, and submit this pipeline for execution.
client.create_run_from_pipeline_func(my_pipeline, arguments={})
Organizing the component files
This section provides a recommended way to organize a component’s files. There is no requirement that you must organize the files in this way. However, using the standard organization makes it possible to reuse the same scripts for testing, image building, and component versioning.
components/<component group>/<component name>/
    src/*            # Component source code files
    tests/*          # Unit tests
    run_tests.sh     # Small script that runs the tests
    README.md        # Documentation. If multiple files are needed, move to docs/.
    Dockerfile       # Dockerfile to build the component container image
    build_image.sh   # Small script that runs docker build and docker push
    component.yaml   # Component definition in YAML format
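A scaffold for this layout can be created with a few shell commands (the component group and name below are placeholders):

```shell
set -e
base="components/sample_group/get_lines"
mkdir -p "$base/src" "$base/tests"
touch "$base/run_tests.sh" "$base/README.md" "$base/Dockerfile" \
      "$base/build_image.sh" "$base/component.yaml"
ls "$base"
```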
See this sample component for a real-life component example.
Next steps
- Consolidate what you’ve learned by reading the best practices for designing and writing components.
- For quick iteration, build lightweight Python function-based components directly from Python functions.
- See how to export metrics from your pipeline.
- Visualize the output of your component by adding metadata for an output viewer.
- Explore the reusable components and other shared resources.
Last modified 28.04.2021: KFP - Updates guide to building components (#2573) (78f52781)