TensorFlow Training (TFJob)

Using TFJob to train a model with TensorFlow

This page describes TFJob for training a machine learning model with TensorFlow.

What is TFJob?

TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes. The Kubeflow implementation of TFJob is in tf-operator.

A TFJob is a resource with a YAML representation like the one below (edit to use the container image and command for your own training code):

  apiVersion: kubeflow.org/v1
  kind: TFJob
  metadata:
    generateName: tfjob
    namespace: kubeflow
  spec:
    tfReplicaSpecs:
      PS:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/your-project/your-image
                command:
                  - python
                  - -m
                  - trainer.task
                  - --batch_size=32
                  - --training_steps=1000
      Worker:
        replicas: 3
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/your-project/your-image
                command:
                  - python
                  - -m
                  - trainer.task
                  - --batch_size=32
                  - --training_steps=1000

If you want to give your TFJob pods access to credentials secrets, such as the GCP credentials automatically created when you do a GKE-based Kubeflow installation, you can mount and use a secret like this:

  apiVersion: kubeflow.org/v1
  kind: TFJob
  metadata:
    generateName: tfjob
    namespace: kubeflow
  spec:
    tfReplicaSpecs:
      PS:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/your-project/your-image
                command:
                  - python
                  - -m
                  - trainer.task
                  - --batch_size=32
                  - --training_steps=1000
                env:
                  - name: GOOGLE_APPLICATION_CREDENTIALS
                    value: "/etc/secrets/user-gcp-sa.json"
                volumeMounts:
                  - name: sa
                    mountPath: "/etc/secrets"
                    readOnly: true
            volumes:
              - name: sa
                secret:
                  secretName: user-gcp-sa
      Worker:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/your-project/your-image
                command:
                  - python
                  - -m
                  - trainer.task
                  - --batch_size=32
                  - --training_steps=1000
                env:
                  - name: GOOGLE_APPLICATION_CREDENTIALS
                    value: "/etc/secrets/user-gcp-sa.json"
                volumeMounts:
                  - name: sa
                    mountPath: "/etc/secrets"
                    readOnly: true
            volumes:
              - name: sa
                secret:
                  secretName: user-gcp-sa

If you are not familiar with Kubernetes resources, please refer to the page Understanding Kubernetes Objects.

What makes TFJob different from built-in controllers is that the TFJob spec is designed to manage distributed TensorFlow training jobs.

A distributed TensorFlow job typically contains 0 or more of the following processes:

  • Chief The chief is responsible for orchestrating training and performing tasks like checkpointing the model.
  • Ps The ps are parameter servers; these servers provide a distributed data store for the model parameters.
  • Worker The workers do the actual work of training the model. In some cases, worker 0 might also act as the chief.
  • Evaluator The evaluators can be used to compute evaluation metrics as the model is trained.

The field tfReplicaSpecs in the TFJob spec contains a map from the type of replica (as listed above) to the TFReplicaSpec for that replica. TFReplicaSpec consists of 3 fields:

  • replicas The number of replicas of this type to spawn for this TFJob.
  • template A PodTemplateSpec that describes the pod to create for each replica.

    • The pod must include a container named tensorflow.
  • restartPolicy Determines whether pods will be restarted when they exit. The allowed values are as follows (a spec sketch showing these policies follows the list):

    • Always means the pod will always be restarted. This policy is good for parameter servers since they never exit and should always be restarted in the event of failure.
    • OnFailure means the pod will be restarted if the pod exits due to failure.

      • A non-zero exit code indicates a failure.
      • An exit code of 0 indicates success and the pod will not be restarted.
      • This policy is good for chief and workers.
    • ExitCode means the restart behavior is dependent on the exit code of the tensorflow container as follows:

      • Exit code 0 indicates the process completed successfully and will not be restarted.

      • The following exit codes indicate a permanent error and the container will not be restarted:

        • 1: general errors
        • 2: misuse of shell builtins
        • 126: command invoked cannot execute
        • 127: command not found
        • 128: invalid argument to exit
        • 139: container terminated by SIGSEGV (invalid memory reference)
      • The following exit codes indicate a retryable error and the container will be restarted:

        • 130: container terminated by SIGINT (keyboard Control-C)
        • 137: container received a SIGKILL
        • 143: container received a SIGTERM
      • Exit code 138 corresponds to SIGUSR1 and is reserved for user-specified retryable errors.

      • Other exit codes are undefined and there is no guarantee about the behavior.

For background information on exit codes, see the GNU guide to termination signals and the Linux Documentation Project.

    • Never means pods that terminate will never be restarted. This policy should rarely be used because Kubernetes will terminate pods for any number of reasons (e.g. node becomes unhealthy) and this will prevent the job from recovering.
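
For instance, here is a hedged sketch of replica specs that apply these policies, with Always for the parameter servers as recommended above and ExitCode for the workers so that only retryable exit codes trigger a restart (the container image is a placeholder):

  spec:
    tfReplicaSpecs:
      PS:
        replicas: 1
        restartPolicy: Always      # parameter servers never exit on their own, so always restart them
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/your-project/your-image
      Worker:
        replicas: 3
        restartPolicy: ExitCode    # restart only on retryable exit codes (e.g. 130, 137, 138, 143)
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/your-project/your-image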

Quick start

Submitting a TensorFlow training job

Note: Before submitting a training job, you should have deployed Kubeflow to your cluster. Doing so ensures that the TFJob custom resource is available when you submit the training job.
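
One way to confirm that the TFJob custom resource is available is to check for its CustomResourceDefinition; assuming a standard installation, the CRD is named tfjobs.kubeflow.org:

  kubectl get crd tfjobs.kubeflow.org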

Running the MNIST example

Kubeflow ships with an example suitable for running a simple MNIST model.

  git clone https://github.com/kubeflow/tf-operator
  cd tf-operator/examples/v1/mnist_with_summaries
  # Deploy the event volume
  kubectl apply -f tfevent-volume
  # Submit the TFJob
  kubectl apply -f tf_job_mnist.yaml

Monitor the job (see the detailed guide below):

  kubectl -n kubeflow get tfjob mnist -o yaml

Delete it:

  kubectl -n kubeflow delete tfjob mnist

Customizing the TFJob

Typically you can change the following values in the TFJob YAML file (a combined sketch follows this list):

  • Change the image to point to the docker image containing your code
  • Change the number and types of replicas
  • Change the resources (requests and limits) assigned to each container
  • Set any environment variables

    • For example, you might need to configure various environment variables to talk to datastores like GCS or S3
  • Attach persistent volumes (PVs) if you want to use them for storage.
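
As an illustration, here is a hedged sketch of a customized Worker spec; the image, environment variable, and resource values below are placeholders rather than values from the example above:

  Worker:
    replicas: 2
    restartPolicy: OnFailure
    template:
      spec:
        containers:
          - name: tensorflow
            image: gcr.io/your-project/your-image    # your own training image
            command:
              - python
              - -m
              - trainer.task
            env:
              - name: S3_ENDPOINT                    # example variable for an S3-compatible datastore
                value: "s3.amazonaws.com"
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
              limits:
                cpu: "2"
                memory: 4Gi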

Using GPUs

To use GPUs, your cluster must be configured to use GPUs.

To attach GPUs, specify the GPU resource on the container in the replicas that should contain the GPUs; for example:

  1. apiVersion: "kubeflow.org/v1"
  2. kind: "TFJob"
  3. metadata:
  4. name: "tf-smoke-gpu"
  5. spec:
  6. tfReplicaSpecs:
  7. PS:
  8. replicas: 1
  9. template:
  10. metadata:
  11. creationTimestamp: null
  12. spec:
  13. containers:
  14. - args:
  15. - python
  16. - tf_cnn_benchmarks.py
  17. - --batch_size=32
  18. - --model=resnet50
  19. - --variable_update=parameter_server
  20. - --flush_stdout=true
  21. - --num_gpus=1
  22. - --local_parameter_device=cpu
  23. - --device=cpu
  24. - --data_format=NHWC
  25. image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
  26. name: tensorflow
  27. ports:
  28. - containerPort: 2222
  29. name: tfjob-port
  30. resources:
  31. limits:
  32. cpu: '1'
  33. workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
  34. restartPolicy: OnFailure
  35. Worker:
  36. replicas: 1
  37. template:
  38. metadata:
  39. creationTimestamp: null
  40. spec:
  41. containers:
  42. - args:
  43. - python
  44. - tf_cnn_benchmarks.py
  45. - --batch_size=32
  46. - --model=resnet50
  47. - --variable_update=parameter_server
  48. - --flush_stdout=true
  49. - --num_gpus=1
  50. - --local_parameter_device=cpu
  51. - --device=gpu
  52. - --data_format=NHWC
  53. image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
  54. name: tensorflow
  55. ports:
  56. - containerPort: 2222
  57. name: tfjob-port
  58. resources:
  59. limits:
  60. nvidia.com/gpu: 1
  61. workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
  62. restartPolicy: OnFailure

Follow TensorFlow’s instructions for using GPUs.

Monitoring your job

To get the status of your job:

  kubectl get -o yaml tfjobs ${JOB}

Here is sample output for an example job:

  apiVersion: kubeflow.org/v1
  kind: TFJob
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"mnist","namespace":"kubeflow"},"spec":{"cleanPodPolicy":"None","tfReplicaSpecs":{"Worker":{"replicas":1,"restartPolicy":"Never","template":{"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir=/train","--learning_rate=0.01","--batch_size=150"],"image":"gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0","name":"tensorflow","volumeMounts":[{"mountPath":"/train","name":"training"}]}],"volumes":[{"name":"training","persistentVolumeClaim":{"claimName":"tfevent-volume"}}]}}}}}}
    creationTimestamp: "2019-07-16T02:44:38Z"
    generation: 1
    name: mnist
    namespace: kubeflow
    resourceVersion: "10429537"
    selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/tfjobs/mnist
    uid: a77b9fb4-a773-11e9-91fe-42010a960094
  spec:
    cleanPodPolicy: None
    tfReplicaSpecs:
      Worker:
        replicas: 1
        restartPolicy: Never
        template:
          spec:
            containers:
              - command:
                  - python
                  - /var/tf_mnist/mnist_with_summaries.py
                  - --log_dir=/train
                  - --learning_rate=0.01
                  - --batch_size=150
                image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                name: tensorflow
                volumeMounts:
                  - mountPath: /train
                    name: training
            volumes:
              - name: training
                persistentVolumeClaim:
                  claimName: tfevent-volume
  status:
    completionTime: "2019-07-16T02:45:23Z"
    conditions:
      - lastTransitionTime: "2019-07-16T02:44:38Z"
        lastUpdateTime: "2019-07-16T02:44:38Z"
        message: TFJob mnist is created.
        reason: TFJobCreated
        status: "True"
        type: Created
      - lastTransitionTime: "2019-07-16T02:45:20Z"
        lastUpdateTime: "2019-07-16T02:45:20Z"
        message: TFJob mnist is running.
        reason: TFJobRunning
        status: "True"
        type: Running
    replicaStatuses:
      Worker:
        running: 1
    startTime: "2019-07-16T02:44:38Z"

Conditions

A TFJob has a TFJobStatus, which has an array of TFJobConditions through which the TFJob has or has not passed. Each element of the TFJobCondition array has six possible fields:

  • The lastUpdateTime field provides the last time this condition was updated.
  • The lastTransitionTime field provides the last time the condition transitioned from one status to another.
  • The message field is a human readable message indicating details about the transition.
  • The reason field is a unique, one-word, CamelCase reason for the condition’s last transition.
  • The status field is a string with possible values “True”, “False”, and “Unknown”.
  • The type field is a string with the following possible values:
    • TFJobCreated means the tfjob has been accepted by the system, but one or more of the pods/services has not been started.
    • TFJobRunning means all sub-resources (e.g. services/pods) of this TFJob have been successfully scheduled and launched and the job is running.
    • TFJobRestarting means one or more sub-resources (e.g. services/pods) of this TFJob had a problem and is being restarted.
    • TFJobSucceeded means the job completed successfully.
    • TFJobFailed means the job has failed.

Success or failure of a job is determined as follows (a command-line check is sketched after this list):

  • If a job has a chief, success or failure is determined by the status of the chief.
  • If a job has no chief, success or failure is determined by the workers.
  • In both cases the TFJob succeeds if the process being monitored exits with exit code 0.
  • In the case of a non-zero exit code, the behavior is determined by the restartPolicy for the replica.
  • If the restartPolicy allows for restarts, then the process will just be restarted and the TFJob will continue to execute.
    • For the restartPolicy ExitCode the behavior is exit code dependent.
    • If the restartPolicy doesn’t allow restarts, a non-zero exit code is considered a permanent failure and the job is marked failed.
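
A quick way to see where a job stands from the command line is to print the condition types recorded in its status; a sketch using the mnist example above:

  kubectl -n kubeflow get tfjob mnist -o jsonpath='{.status.conditions[*].type}'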

tfReplicaStatuses

tfReplicaStatuses provides a map indicating the number of pods for each replica in a given state. There are three possible states (a query sketch follows the list):

  • Active is the number of currently running pods.
  • Succeeded is the number of pods that completed successfully.
  • Failed is the number of pods that completed with an error.
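
Similarly, a sketch of pulling the Worker counts out of the status for the mnist example (the field appears as replicaStatuses in the status output shown earlier):

  kubectl -n kubeflow get tfjob mnist -o jsonpath='{.status.replicaStatuses.Worker}'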

Events

During execution, TFJob will emit events to indicate what’s happening, such as the creation/deletion of pods and services. Kubernetes doesn’t retain events older than 1 hour by default. To see recent events for a job, run:

  kubectl describe tfjobs ${JOB}

which will produce output like

  Name:         mnist
  Namespace:    kubeflow
  Labels:       <none>
  Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                  {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"mnist","namespace":"kubeflow"},"spec":{"cleanPodPolicy...
  API Version:  kubeflow.org/v1
  Kind:         TFJob
  Metadata:
    Creation Timestamp:  2019-07-16T02:44:38Z
    Generation:          1
    Resource Version:    10429537
    Self Link:           /apis/kubeflow.org/v1/namespaces/kubeflow/tfjobs/mnist
    UID:                 a77b9fb4-a773-11e9-91fe-42010a960094
  Spec:
    Clean Pod Policy:  None
    Tf Replica Specs:
      Worker:
        Replicas:        1
        Restart Policy:  Never
        Template:
          Spec:
            Containers:
              Command:
                python
                /var/tf_mnist/mnist_with_summaries.py
                --log_dir=/train
                --learning_rate=0.01
                --batch_size=150
              Image:  gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              Name:   tensorflow
              Volume Mounts:
                Mount Path:  /train
                Name:        training
            Volumes:
              Name:  training
              Persistent Volume Claim:
                Claim Name:  tfevent-volume
  Status:
    Completion Time:  2019-07-16T02:45:23Z
    Conditions:
      Last Transition Time:  2019-07-16T02:44:38Z
      Last Update Time:      2019-07-16T02:44:38Z
      Message:               TFJob mnist is created.
      Reason:                TFJobCreated
      Status:                True
      Type:                  Created
      Last Transition Time:  2019-07-16T02:45:20Z
      Last Update Time:      2019-07-16T02:45:20Z
      Message:               TFJob mnist is running.
      Reason:                TFJobRunning
      Status:                True
      Type:                  Running
    Replica Statuses:
      Worker:
        Running:  1
    Start Time:   2019-07-16T02:44:38Z
  Events:
    Type    Reason                   Age   From         Message
    ----    ------                   ----  ----         -------
    Normal  SuccessfulCreatePod      8m6s  tf-operator  Created pod: mnist-worker-0
    Normal  SuccessfulCreateService  8m6s  tf-operator  Created service: mnist-worker-0

Here the events indicate that the pods and services were successfully created.

TensorFlow Logs

Logging follows standard K8s logging practices.

You can use kubectl to get standard output/error for any pods that haven’t been deleted.

First find the pod created by the job controller for the replica of interest. Pods will be named:

  ${JOBNAME}-${REPLICA-TYPE}-${INDEX}

Once you’ve identified your pod, you can get the logs using kubectl.

  kubectl logs ${PODNAME}
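
For example, for the mnist job shown earlier, the logs of worker 0 can be fetched with:

  kubectl -n kubeflow logs mnist-worker-0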

The CleanPodPolicy in the TFJob spec controls deletion of pods when a job terminates. The policy can be one of the following values (a placement sketch follows this list):

  • The Running policy means that only pods still running when a job completes (e.g. parameter servers) will be deleted immediately; completed pods will not be deleted so that the logs will be preserved. This is the default value.
  • The All policy means all pods, even completed pods, will be deleted immediately when the job finishes.
  • The None policy means that no pods will be deleted when the job completes.
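
The policy is set at the top level of the TFJob spec, next to tfReplicaSpecs, as in the mnist status output above; a minimal sketch:

  spec:
    cleanPodPolicy: None     # keep all pods after the job completes so their logs remain available
    tfReplicaSpecs:
      Worker:
        replicas: 1
        # ... replica spec as in the earlier examples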

If your cluster takes advantage of Kubernetes cluster logging, then your logs may also be shipped to an appropriate data store for further analysis.

Stackdriver on GKE

See the guide to logging and monitoring for instructions on getting logs using Stackdriver.

As described in the guide to logging and monitoring, it’s possible to fetch the logs for a particular replica based on pod labels.

Using the Stackdriver UI, you can use a query like:

  resource.type="k8s_container"
  resource.labels.cluster_name="${CLUSTER}"
  metadata.userLabels.tf_job_name="${JOB_NAME}"
  metadata.userLabels.tf-replica-type="${TYPE}"
  metadata.userLabels.tf-replica-index="${INDEX}"

Alternatively, using gcloud:

  QUERY="resource.type=\"k8s_container\" "
  QUERY="${QUERY} resource.labels.cluster_name=\"${CLUSTER}\" "
  QUERY="${QUERY} metadata.userLabels.tf_job_name=\"${JOB_NAME}\" "
  QUERY="${QUERY} metadata.userLabels.tf-replica-type=\"${TYPE}\" "
  QUERY="${QUERY} metadata.userLabels.tf-replica-index=\"${INDEX}\" "
  gcloud --project=${PROJECT} logging read \
    --freshness=24h \
    --order asc ${QUERY}

Troubleshooting

Here are some steps to follow to troubleshoot your job:

  • Is a status present for your job? Run the command
    kubectl -n ${NAMESPACE} get tfjobs -o yaml ${JOB_NAME}
  • If the resulting output doesn’t include a status for your job, then this typically indicates the job spec is invalid.

  • If the TFJob spec is invalid, there should be a log message in the tf-operator logs:

    kubectl -n ${KUBEFLOW_NAMESPACE} logs `kubectl get pods --selector=name=tf-job-operator -o jsonpath='{.items[0].metadata.name}'`
    • KUBEFLOW_NAMESPACE is the namespace in which you deployed the TFJob operator.
  • Check the events for your job to see if the pods were created

  • There are a number of ways to get the events; if your job is less than 1 hour old then you can do:

    kubectl -n ${NAMESPACE} describe tfjobs ${JOB_NAME}
  • The bottom of the output should include a list of events emitted by the job; e.g.
    Events:
      Type     Reason                          Age                From         Message
      ----     ------                          ----               ----         -------
      Warning  SettedPodTemplateRestartPolicy  19s (x2 over 19s)  tf-operator  Restart policy in pod template will be overwritten by restart policy in replica spec
      Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-worker-0
      Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-worker-0
      Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-ps-0
      Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-ps-0
  • Kubernetes only preserves events for 1 hour (see kubernetes/kubernetes#52521)

    • Depending on your cluster setup, events might be persisted to external storage and accessible for longer periods.
    • On GKE, events are persisted in Stackdriver and can be accessed using the instructions in the previous section.
  • If the pods and services aren’t being created, this suggests the TFJob isn’t being processed; common causes are:

    • The TFJob spec is invalid (see above)
    • The TFJob operator isn’t running
  • Check the events for the pods to ensure they are scheduled.
  • There are a number of ways to get the events; if your pod is less than 1 hour old then you can do:

    kubectl -n ${NAMESPACE} describe pods ${POD_NAME}
  • The bottom of the output should contain events like the following:
    Events:
      Type    Reason                 Age   From                                                  Message
      ----    ------                 ----  ----                                                  -------
      Normal  Scheduled              18s   default-scheduler                                     Successfully assigned tfjob2-ps-0 to gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt
      Normal  SuccessfulMountVolume  17s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  MountVolume.SetUp succeeded for volume "default-token-h8rnv"
      Normal  Pulled                 17s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  Container image "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3" already present on machine
      Normal  Created                17s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  Created container
      Normal  Started                16s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  Started container
  • Some common problems that can prevent a container from starting are:
    • Insufficient resources to schedule the pod
    • The pod tries to mount a volume (or secret) that doesn’t exist or is unavailable
    • The Docker image doesn’t exist or can’t be accessed (e.g. due to permission issues)
  • If the containers start, check the logs of the containers following the instructions in the previous section.

More information