Build Reusable Components

A detailed tutorial on creating components that you can use in various pipelines

This page describes how to author a reusable component that you can load and run in Kubeflow Pipelines. A reusable component is a pre-implemented, standalone component that is easy to add as a step in any pipeline.

If you’re new to pipelines, see the conceptual guides to pipelines and components.

Summary

Below is a summary of the steps involved in creating and using a component:

  • Write the program that contains your component’s logic. The program must use specific methods to pass data to and from the component.
  • Containerize the program.
  • Write a component specification in YAML format that describes the component for the Kubeflow Pipelines system.
  • Use the Kubeflow Pipelines SDK to load and run the component in your pipeline.

The rest of this page gives some explanation about input and output data, followed by detailed descriptions of the above steps.

Passing the data to and from the containerized program

When planning to write a component, you need to think about how the component communicates with upstream and downstream components. That is, how it consumes input data and produces output data.

Summary

For small pieces of data (smaller than 512 kibibyte (KiB)):

  • Inputs: Read the value from a command-line argument.
  • Outputs: Write the value to a local file, using a path provided as a command-line argument.

For bigger pieces of data (larger than 512 KiB) or for a storage-specific component:

  • Inputs: Read the data URI from a file provided as a command-line argument. Then read the data from that URI.
  • Outputs: Upload the data to the URI provided as a command-line argument. Then write that URI to a local file, using a path provided as a command-line argument.
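
As a minimal sketch of the small-data pattern above (the argument names and the doubling step are purely illustrative; a complete worked example appears later in this page):

    # small_data_sketch.py - illustrative only.
    # The input value arrives as a command-line argument; the output value is
    # written to a local file whose path is also provided as an argument.
    import argparse
    from pathlib import Path

    parser = argparse.ArgumentParser()
    parser.add_argument('--param', type=int, help='Small input, passed by value.')
    parser.add_argument('--output-path', type=str, help='Local file to write the small output value to.')
    args = parser.parse_args()

    result = args.param * 2  # Stand-in for the component's real work

    # The parent directory may not exist yet, so create it before writing.
    Path(args.output_path).parent.mkdir(parents=True, exist_ok=True)
    Path(args.output_path).write_text(str(result))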

More about input data

There are several ways to make input data available to a program running inside a container:

  • Small pieces of data (smaller than 512 kibibyte (KiB)): Pass the data content as a command-line argument:

        program.py --param 100

  • Bigger data (larger than 512 KiB): Kubeflow Pipelines doesn’t provide a way of transferring larger pieces of data to the container running the program. Instead, the program (or the wrapper script) should receive data URIs instead of the data itself and then access the data from the URIs. For example:

        program.py --train-uri https://server.edu/datasets/1/train.tsv \
                   --eval-uri https://server.edu/datasets/1/eval.tsv
        program.py --train-gcs-uri gs://bucket/datasets/1/train.tsv
        program.py --big-query-table my_table
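
As a rough illustration of the URI-based approach, the program can download the referenced data itself before processing it. The sketch below uses urllib purely as an example; tf.gfile or a storage-specific client library would work just as well, and the local file name is arbitrary.

    # uri_input_sketch.py - illustrative only.
    import argparse
    import urllib.request

    parser = argparse.ArgumentParser()
    parser.add_argument('--train-uri', type=str, help='URI of the training data.')
    args = parser.parse_args()

    # Fetch the data referenced by the URI to a local file, then process it locally.
    local_path, _ = urllib.request.urlretrieve(args.train_uri, 'train.tsv')
    with open(local_path) as train_file:
        print('Downloaded {} lines of training data'.format(len(train_file.readlines())))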

More about output data

The program must write the output data to some location and inform the system about that location so that the system can pass the data between steps. You should provide the paths to your output data as command-line arguments. That is, you should not hardcode the paths.

You can choose a suitable storage solution for your output data. Options include the following:

  • Google Cloud Storage is the recommended default storage solution for writing output.
  • For structured data you can use BigQuery. You must provide the specific URI/path or table name to which to write the results.

The program should do the following:

  • Upload the data to your chosen storage system.
  • Pass out a URI pointing to the data, by writing that URI to a file and instructing the system to pick it up and treat it as the value of a particular component output.

Note that the example below accepts both a URI for uploading the data into, and a file path to write that URI to.

    program.py --out-model-uri gs://bucket/163/output_model \
               --out-model-uri-file /outputs/output_model_uri/data

Why should the program output the URI it has just received as an input argument? The reason is that the URIs specified in the pipeline are usually not the real URIs, but rather URI templates containing UIDs. The system resolves the URIs at runtime when the containerized program starts. Only the containerized program sees the fully-resolved URI.

Below is an example of such a URI:

    gs://my-bucket/{{workflow.uid}}/{{pod.id}}/data

In cases where the program cannot control the URI/ID of the created object (for example, where the URI is generated by the outside system), the program should just accept the file path to write the resulting URI/ID:

    program.py --out-model-uri-file /outputs/output_model_uri/data
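
For instance, a program whose output ID is assigned by an external service might look roughly like the sketch below. The train_model_in_external_service helper is hypothetical; only the final step of writing the resulting ID to the provided file path is the point here.

    # external_id_sketch.py - illustrative only.
    import argparse
    from pathlib import Path

    def train_model_in_external_service():
        """Hypothetical call to a service that assigns the model ID itself."""
        return 'model-12345'

    parser = argparse.ArgumentParser()
    parser.add_argument('--out-model-id-file', type=str,
                        help='Local file to write the generated model ID to.')
    args = parser.parse_args()

    model_id = train_model_in_external_service()

    # Write the system-generated ID to the provided path so that Kubeflow Pipelines
    # can pick it up and expose it as the component's output.
    Path(args.out_model_id_file).parent.mkdir(parents=True, exist_ok=True)
    Path(args.out_model_id_file).write_text(model_id)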

Future-proofing your code

The following guidelines help you avoid the need to modify the program code in the near future or have different versions for different storage systems. If the program has access to the TensorFlow package, you can use tf.gfile to read and write files. The tf.gfile module supports both local and Cloud Storage paths.

If you cannot use tf.gfile, a solution is to read inputs from and write outputs to local files, then add a storage-specific wrapper that downloads and uploads the data from/to a specific storage solution such as Cloud Storage or Amazon S3. For example, create a wrapper script that uses the gsutil cp command to download the input data before running the main program and to upload the output data after the program finishes.
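
Below is a rough sketch of such a wrapper. It is written in Python (rather than shell) to stay consistent with the other examples in this guide; the local file names, the main_program.py entry point and its flags are assumptions, and gsutil is assumed to be installed in the container.

    # wrapper_sketch.py - illustrative storage-specific wrapper around a local-files-only program.
    import argparse
    import subprocess

    parser = argparse.ArgumentParser()
    parser.add_argument('--input1-uri', type=str, help='Cloud Storage URI of the input data.')
    parser.add_argument('--output1-uri', type=str, help='Cloud Storage URI to upload the output data to.')
    args = parser.parse_args()

    # Download the input data before running the main program.
    subprocess.run(['gsutil', 'cp', args.input1_uri, '/tmp/input1.data'], check=True)

    # Run the main program against local files only.
    subprocess.run(['python3', 'main_program.py',
                    '--input1-path', '/tmp/input1.data',
                    '--output1-path', '/tmp/output1.data'], check=True)

    # Upload the output data after the program finishes.
    subprocess.run(['gsutil', 'cp', '/tmp/output1.data', args.output1_uri], check=True)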

Writing the program code

This section describes an example program that has two inputs (for small and large pieces of data) and one output. The programming language in this example is Python 3.

program.py

    #!/usr/bin/env python3
    import argparse
    import os
    from pathlib import Path

    from tensorflow import gfile  # Supports both local paths and Cloud Storage (GCS) or S3

    # Function doing the actual work
    def do_work(input1_file, output1_file, param1):
        for x in range(param1):
            line = input1_file.readline()  # readline() returns an empty string at EOF
            if not line:
                break
            _ = output1_file.write(line)

    # Defining and parsing the command-line arguments
    parser = argparse.ArgumentParser(description='My program description')
    parser.add_argument('--input1-path', type=str,
                        help='Path of the local file or GCS blob containing the Input 1 data.')
    parser.add_argument('--param1', type=int, default=100, help='Parameter 1.')
    parser.add_argument('--output1-path', type=str,
                        help='Path of the local file or GCS blob where the Output 1 data should be written.')
    parser.add_argument('--output1-path-file', type=str,
                        help='Path of the local file where the Output 1 URI data should be written.')
    args = parser.parse_args()

    gfile.MakeDirs(os.path.dirname(args.output1_path))

    # Opening the input/output files and performing the actual work
    with gfile.Open(args.input1_path, 'r') as input1_file, gfile.Open(args.output1_path, 'w') as output1_file:
        do_work(input1_file, output1_file, args.param1)

    # Writing args.output1_path to a file so that it will be passed to downstream tasks
    Path(args.output1_path_file).parent.mkdir(parents=True, exist_ok=True)
    Path(args.output1_path_file).write_text(args.output1_path)

The command line invocation of this program is:

    python3 program.py --input1-path <URI to Input 1 data> \
        --param1 <value of Param1 input> --output1-path <URI for Output 1 data> \
        --output1-path-file <local file path for the Output 1 URI>

You need to pass the URI for the Output 1 data forward so that the downstream steps can access the URI. The program writes the URI to a local file and tells the system to grab it and expose it as an output. You should avoid hard-coding any paths, so the program receives the local file path through the --output1-path-file command-line argument.

Writing a Dockerfile to containerize your application

You need a Docker container image that packages your program.

The instructions on creating container images are not specific to Kubeflow Pipelines. To make things easier for you, this section provides some guidelines on standard container creation. You can use any procedure of your choice to create the Docker containers.

The container image built from your Dockerfile must contain all program code, including the wrapper, and the dependencies (operating system packages, Python packages, and so on).

Ensure you have write access to a container registry where you can push the container image. Examples include Google Container Registry and Docker Hub.

Think of a name for your container image. This guide uses the name `gcr.io/my-org/my-image`.

Example Dockerfile

    ARG BASE_IMAGE_TAG=1.12.0-py3
    FROM tensorflow/tensorflow:$BASE_IMAGE_TAG
    RUN python3 -m pip install keras
    COPY ./src /pipelines/component/src

Create a build_image.sh script (see the example below) that builds the container image based on the Dockerfile and pushes it to your chosen container repository. Run the script to build and push the image.

Best practice: After pushing the image, get the strict image name with the digest, and use the strict image name for reproducibility.

Example build_image.sh:

    #!/bin/bash -e
    image_name=gcr.io/my-org/my-image # Specify the image name here
    image_tag=latest
    full_image_name=${image_name}:${image_tag}
    base_image_tag=1.12.0-py3

    cd "$(dirname "$0")"
    docker build --build-arg BASE_IMAGE_TAG=${base_image_tag} -t "${full_image_name}" .
    docker push "$full_image_name"

    # Output the strict image name (which contains the sha256 image digest)
    docker inspect --format="{{index .RepoDigests 0}}" "${full_image_name}"

Make your script executable:

    chmod +x build_image.sh

Writing your component definition file

You need a component specification in YAML format that describes the component for the Kubeflow Pipelines system.

For the complete definition of a Kubeflow Pipelines component, see the component specification. However, for this tutorial you don’t need to know the full schema of the component specification. The tutorial provides enough information about the parts of the specification that are relevant here.

Start writing the component definition (component.yaml) by specifying your container image in the component’s implementation section:

    implementation:
      container:
        image: gcr.io/my-org/my-image@sha256:a172..752f # Name of a container image that you've pushed to a container repo.

Complete the component’s implementation section based on your (wrapper) program:

    implementation:
      container:
        image: gcr.io/my-org/my-image@sha256:a172..752f
        # command is a list of strings (command-line arguments).
        # The YAML language has two syntaxes for lists and you can use either of them.
        # Here we use the "flow syntax" - comma-separated strings inside square brackets.
        command: [
          python3, /pipelines/component/src/program.py, # Path of the program inside the container
          --input1-path, <URI to Input 1 data>,
          --param1, <value of Param1 input>,
          --output1-path, <URI template for Output 1 data>,
          --output1-path-file, <local file path for the Output 1 URI>,
        ]

The command section still contains some dummy placeholders (in angle brackets). Let’s replace them with real placeholders. A placeholder represents a command-line argument that is replaced with some value or path before the program is executed. In component.yaml, you specify the placeholders using YAML’s mapping syntax to distinguish them from the verbatim strings. There are three placeholders available:

  • {inputValue: Some input name}: This placeholder is replaced by the value of the argument to the specified input. This is useful for small pieces of input data.
  • {inputPath: Some input name}: This placeholder is replaced by the auto-generated local path where the system places the input data, so that the program can read the input data from a local file.
  • {outputPath: Some output name}: This placeholder is replaced by the auto-generated local path where the program should write its output data. This instructs the system to read the content of the file and store it as the value of the specified output.

As well as putting real placeholders in the command line, you need to add corresponding input and output specifications to the inputs and outputs sections. The input/output specification contains the input name, type, description, and default value. Only the name is required. The input and output names are free-form strings, but be careful with the YAML syntax and use quotes if necessary. The input/output names do not need to be the same as the command-line flags, which are usually quite short.

Replace the placeholders as follows:

  • Replace <URI to Input 1 data> with {inputValue: Input 1 URI} and add Input 1 URI to the inputs section. URIs are small, so we’re passing them in as command-line arguments.
  • Replace <value of Param1 input> with {inputValue: Parameter 1} and add Parameter 1 to the inputs section. Integers are small, so we’re passing them in as command-line arguments.
  • Replace <URI template for Output 1 data> with {inputValue: Output 1 URI template} and add Output 1 URI template to the inputs section. This may look confusing: you’re adding an output URI into the inputs section. The reason is that currently you must manually pass in the URI, so it is an input, not an output.
  • Replace <local file path for the Output 1 URI> with {outputPath: Output 1 URI} and add Output 1 URI to the outputs section. Again, this may look confusing: you now have both an input and an output called Output 1 URI. (Note that you can use different names.) The reason is that the URI is passed through: it’s passed to the task as an input and is then output from the task, so that downstream tasks have access to it.

After replacing the placeholders and adding inputs/outputs, your component.yaml looks like this:

    inputs: #List of input specs. Each input spec is a map.
    - {name: Input 1 URI}
    - {name: Parameter 1}
    - {name: Output 1 URI template}
    outputs:
    - {name: Output 1 URI}
    implementation:
      container:
        image: gcr.io/my-org/my-image@sha256:a172..752f
        command: [
          python3, /pipelines/component/src/program.py,
          --input1-path,
          {inputValue: Input 1 URI},            # Refers to the "Input 1 URI" input
          --param1,
          {inputValue: Parameter 1},            # Refers to the "Parameter 1" input
          --output1-path,
          {inputValue: Output 1 URI template},  # Refers to the "Output 1 URI template" input
          --output1-path-file,
          {outputPath: Output 1 URI},           # Refers to the "Output 1 URI" output
        ]

The above component specification is sufficient, but you should add more metadata to make it more useful. The example below includes the following additions:

  • Component name and description.
  • For each input and output: description, default value, and type.

Final version of component.yaml:

    name: Do dummy work
    description: Performs some dummy work.
    inputs:
    - {name: Input 1 URI, type: GCSPath, description: 'GCS path to Input 1'}
    - {name: Parameter 1, type: Integer, default: '100', description: 'Parameter 1 description'} # The default values must be specified as YAML strings.
    - {name: Output 1 URI template, type: GCSPath, description: 'GCS path template for Output 1'}
    outputs:
    - {name: Output 1 URI, type: GCSPath, description: 'GCS path for Output 1'}
    implementation:
      container:
        image: gcr.io/my-org/my-image@sha256:a172..752f
        command: [
          python3, /pipelines/component/src/program.py,
          --input1-path,       {inputValue: Input 1 URI},
          --param1,            {inputValue: Parameter 1},
          --output1-path,      {inputValue: Output 1 URI template},
          --output1-path-file, {outputPath: Output 1 URI},
        ]

Build your component into a pipeline with the Kubeflow Pipelines SDK

Here is a sample pipeline that shows how to load a component and use it to compose a pipeline:

    import os
    import kfp

    # Load the component by calling load_component_from_file or load_component_from_url.
    # To load the component, the pipeline author only needs to have access to the component.yaml file.
    # The Kubernetes cluster executing the pipeline needs access to the container image specified in the component.
    component_root = '.'  # Directory that contains component.yaml
    dummy_op = kfp.components.load_component_from_file(os.path.join(component_root, 'component.yaml'))
    # dummy_op = kfp.components.load_component_from_url('http://....../component.yaml')

    # dummy_op is now a "factory function" that accepts the arguments for the component's inputs
    # and produces a task object (e.g. ContainerOp instance).
    # Inspect the dummy_op function in Jupyter Notebook by typing "dummy_op(" and pressing Shift+Tab.
    # You can also get help by writing help(dummy_op) or dummy_op? or dummy_op??

    # The signature of the dummy_op function corresponds to the inputs section of the component.
    # Some tweaks are performed to make the signature valid and pythonic:
    # 1) All inputs with default values will come after the inputs without default values.
    # 2) The input names are converted to pythonic names (spaces and symbols replaced
    #    with underscores and letters lowercased).

    # Define a pipeline and create a task from a component:
    @kfp.dsl.pipeline(name='My pipeline', description='')
    def my_pipeline():
        dummy1_task = dummy_op(
            # Input name "Input 1 URI" is converted to pythonic parameter name "input_1_uri".
            input_1_uri='gs://my-bucket/datasets/train.tsv',
            parameter_1='100',
            # You must use Argo placeholders ("{{workflow.uid}}" and "{{pod.name}}")
            # to guarantee that the outputs from different pipeline runs and tasks write
            # to unique locations and do not overwrite each other.
            output_1_uri='gs://my-bucket/{{workflow.uid}}/{{pod.name}}/output_1/data',
        ).apply(kfp.gcp.use_gcp_secret('user-gcp-sa'))
        # To access GCS, you must configure the container to have access to a
        # GCS secret that grants required access to the bucket.

        # The outputs of the dummy1_task can be referenced using the
        # dummy1_task.outputs dictionary.
        # ! The output names are converted to lowercased dashed names.

        # Pass the outputs of the dummy1_task to some other component:
        dummy2_task = dummy_op(
            input_1_uri=dummy1_task.outputs['output-1-uri'],
            parameter_1='200',
            output_1_uri='gs://my-bucket/{{workflow.uid}}/{{pod.name}}/output_1/data',
        ).apply(kfp.gcp.use_gcp_secret('user-gcp-sa'))
        # To access GCS, you must configure the container to have access to a
        # GCS secret that grants required access to the bucket.

    # This pipeline can be compiled, uploaded and submitted for execution.

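As the final comment in the sample notes, the pipeline can then be compiled and submitted. A minimal sketch, assuming the KFP SDK v1 client API and a placeholder endpoint URL:

    import kfp
    import kfp.compiler

    # Compile the pipeline function into a package that can be uploaded
    # through the Kubeflow Pipelines UI:
    kfp.compiler.Compiler().compile(my_pipeline, 'my_pipeline.zip')

    # Or submit it directly for execution (the host URL below is a placeholder):
    client = kfp.Client(host='http://localhost:8080')
    client.create_run_from_pipeline_func(my_pipeline, arguments={})
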
Organizing the component files

This section provides a recommended way to organize the component files. There is no requirement that you must organize the files in this way. However, using the standard organization makes it possible to reuse the same scripts for testing, image building, and component versioning. See this sample component for a real-life component example.

    components/<component group>/<component name>/
        src/*            # Component source code files
        tests/*          # Unit tests
        run_tests.sh     # Small script that runs the tests
        README.md        # Documentation. Move to docs/ if multiple files are needed.
        Dockerfile       # Dockerfile to build the component container image
        build_image.sh   # Small script that runs docker build and docker push
        component.yaml   # Component definition in YAML format

Next steps