PyTorch Training

Instructions for using PyTorch

This guide walks you through using PyTorch with Kubeflow.

Installing PyTorch Operator

If you haven’t already done so, please follow the Getting Started Guide to deploy Kubeflow.

An alpha version of PyTorch support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow between 0.2.0 and 0.3.5 to use this version.

More recently, a beta version of PyTorch support was introduced with Kubeflow 0.4.0. You must be using Kubeflow 0.4.0 or newer to use this version.

Verify that PyTorch support is included in your Kubeflow deployment

Check that the PyTorch custom resource is installed:

  kubectl get crd

The output should include pytorchjobs.kubeflow.org:

  NAME                       AGE
  ...
  pytorchjobs.kubeflow.org   4d
  ...
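
In a script, the same check can be automated. The following is a minimal sketch, not part of Kubeflow: `has_pytorchjob_crd` is a hypothetical helper name, and it assumes the CRD name appears at the start of a line of the `kubectl get crd` output, as in the sample above.

```shell
# Hypothetical helper: reads `kubectl get crd` output on stdin and
# returns 0 if a line starts with the PyTorchJob CRD name.
has_pytorchjob_crd() {
  grep -Eq '^pytorchjobs\.kubeflow\.org([[:space:]]|$)'
}

# Usage (assumes kubectl is configured for your cluster):
#   kubectl get crd | has_pytorchjob_crd && echo "PyTorchJob CRD is installed"
```

Querying the resource by name with `kubectl get crd pytorchjobs.kubeflow.org` is a more direct alternative; it exits with a non-zero status when the CRD is missing.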

If it is not included, you can add it as follows:

  export KF_DIR=<your Kubeflow installation directory>
  cd ${KF_DIR}/kustomize
  kubectl apply -f pytorch-job-crds.yaml
  kubectl apply -f pytorch-operator.yaml

Creating a PyTorch Job

You can create a PyTorch job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.

  cat pytorch_job_mnist.yaml
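
For orientation, a PyTorchJob config has roughly the shape sketched below. This is an approximation assembled from the spec fields in the sample output under Monitoring a PyTorch Job, not a copy of the example manifest, which may contain additional fields. The file name `pytorch_job_sketch.yaml` is hypothetical and chosen so the real manifest is not overwritten.

```shell
# Write a minimal PyTorchJob manifest to a scratch file.
# Field values are taken from the sample job output in this guide;
# adjust the image and replica counts to your requirements.
cat > pytorch_job_sketch.yaml <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-tcp-dist-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
EOF
```

The `Master` and `Worker` entries under `pytorchReplicaSpecs` each carry an ordinary pod template, so changing the number of workers is a matter of editing `replicas`.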

Deploy the PyTorchJob resource to start training:

  kubectl create -f pytorch_job_mnist.yaml

You should now be able to see the created pods matching the specified number of replicas.

  kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist

Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. Inspect the logs of the master pod to follow training progress:

  PODNAME=$(kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist,pytorch-replica-type=master,pytorch-replica-index=0 -o name)
  kubectl logs -f ${PODNAME}

Monitoring a PyTorch Job

  kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist

See the status section to monitor the job status. Here is sample output for a job that has completed successfully:

  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    clusterName: ""
    creationTimestamp: 2018-12-16T21:39:09Z
    generation: 1
    name: pytorch-tcp-dist-mnist
    namespace: default
    resourceVersion: "15532"
    selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
    uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
  spec:
    cleanPodPolicy: None
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: OnFailure
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              name: pytorch
              ports:
              - containerPort: 23456
                name: pytorchjob-port
              resources: {}
      Worker:
        replicas: 3
        restartPolicy: OnFailure
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              name: pytorch
              ports:
              - containerPort: 23456
                name: pytorchjob-port
              resources: {}
  status:
    completionTime: 2018-12-16T21:43:27Z
    conditions:
    - lastTransitionTime: 2018-12-16T21:39:09Z
      lastUpdateTime: 2018-12-16T21:39:09Z
      message: PyTorchJob pytorch-tcp-dist-mnist is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2018-12-16T21:39:09Z
      lastUpdateTime: 2018-12-16T21:40:45Z
      message: PyTorchJob pytorch-tcp-dist-mnist is running.
      reason: PyTorchJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2018-12-16T21:39:09Z
      lastUpdateTime: 2018-12-16T21:43:27Z
      message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
      reason: PyTorchJobSucceeded
      status: "True"
      type: Succeeded
    replicaStatuses:
      Master: {}
      Worker: {}
    startTime: 2018-12-16T21:40:45Z
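
Rather than reading the full YAML, a script can pull out just the Succeeded condition shown in the status section. The following is a minimal sketch under stated assumptions: `succeeded_status` and `wait_for_success` are hypothetical helper names, not Kubeflow or kubectl commands, and the polling defaults (60 attempts, 10 seconds apart) are arbitrary.

```shell
# Hypothetical helpers (not part of Kubeflow or kubectl).

# Print the status of the job's Succeeded condition ("True" or "False");
# prints nothing if that condition has not been recorded yet.
succeeded_status() {
  kubectl get pytorchjob "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'
}

# Poll until the job reports Succeeded=True.
# $1 = job name, $2 = max attempts (default 60),
# $3 = seconds between attempts (default 10). Returns 1 on timeout.
wait_for_success() {
  attempts=${2:-60}
  interval=${3:-10}
  while [ "$attempts" -gt 0 ]; do
    if [ "$(succeeded_status "$1")" = "True" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}

# Usage (assumes kubectl points at the cluster running the job):
#   wait_for_success pytorch-tcp-dist-mnist && echo "job succeeded"
```

This sketch only looks at the Succeeded condition, so a job that fails is reported as a timeout rather than detected early.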