MXNet Training
Instructions for using MXNet
This guide walks you through using MXNet with Kubeflow.
Installing MXNet Operator
If you haven’t already done so, please follow the Getting Started Guide to deploy Kubeflow.
MXNet support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.
Verify that MXNet support is included in your Kubeflow deployment
Check that the MXNet custom resource is installed:
kubectl get crd
The output should include mxjobs.kubeflow.org:
NAME AGE
...
mxjobs.kubeflow.org 4d
...
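If you want to script this check rather than read the listing by eye, a small filter over the `kubectl get crd` output is one option. The sketch below assumes only that the CRD name appears in the first column of the listing, as in the output above; the function name `has_mxjob_crd` is a hypothetical helper, not part of kubectl or Kubeflow.

```shell
# has_mxjob_crd: hypothetical helper. Reads a `kubectl get crd` listing on
# stdin and succeeds (exit status 0) only if some line has
# mxjobs.kubeflow.org in its first column.
has_mxjob_crd() {
  awk '$1 == "mxjobs.kubeflow.org" { found = 1 } END { exit !found }'
}
```

A possible use is `kubectl get crd | has_mxjob_crd && echo "MXJob CRD is installed"`.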
If it is not included, you can add it as follows:
git clone https://github.com/kubeflow/manifests
cd manifests/mxnet-job/mxnet-operator
kubectl kustomize base | kubectl apply -f -
Alternatively, you can deploy the operator with default settings without using kustomize by running the following from the repo:
git clone https://github.com/kubeflow/mxnet-operator.git
cd mxnet-operator
kubectl create -f manifests/crd-v1beta1.yaml
kubectl create -f manifests/rbac.yaml
kubectl create -f manifests/deployment.yaml
Creating an MXNet training job
You create a training job by defining an MXJob with MXTrain mode and then creating it with:
kubectl create -f examples/v1beta1/train/mx_job_dist_gpu.yaml
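For orientation, the sketch below writes out what an MXJob in MXTrain mode might look like. It is not a copy of examples/v1beta1/train/mx_job_dist_gpu.yaml: the field values are taken from the spec in the sample job output in the monitoring section of this guide, with fields that appear to be populated by the server (timestamps, uid, empty resources) left out. The helper name `write_mxjob_manifest` and the output path are hypothetical.

```shell
# write_mxjob_manifest: hypothetical helper that writes a minimal MXJob
# manifest to the path given as $1. Values mirror the sample job output in
# this guide; treat them as an illustration, not the shipped example file.
write_mxjob_manifest() {
  cat > "$1" <<'EOF'
apiVersion: kubeflow.org/v1beta1
kind: MXJob
metadata:
  name: mxnet-job
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: mxnet
            image: mxjob/mxnet:gpu
            ports:
            - containerPort: 9091
              name: mxjob-port
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: mxnet
            image: mxjob/mxnet:gpu
            ports:
            - containerPort: 9091
              name: mxjob-port
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: mxnet
            image: mxjob/mxnet:gpu
            command: ["python"]
            args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "10"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: 1
EOF
}
```

You could then run `write_mxjob_manifest mx_job.yaml` followed by `kubectl create -f mx_job.yaml`.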
Creating a TVM tuning job (AutoTVM)
TVM is an end-to-end deep learning compiler stack, and you can easily run AutoTVM with mxnet-operator. You create an auto-tuning job by defining an MXJob with MXTune mode and then creating it with:
kubectl create -f examples/v1beta1/tune/mx_job_tune_gpu.yaml
Before you use the auto-tuning example, some preparatory work must be finished in advance. To let TVM tune your network, create a Docker image that includes the TVM module. Next, write an auto-tuning script that specifies which network to tune and sets the auto-tuning parameters. For more details, see https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py. Finally, write a startup script that launches the auto-tuning program. mxnet-operator sets all the parameters as environment variables, and the startup script needs to read these variables and pass them to the auto-tuning script. An example is provided under examples/v1beta1/tune/; in that example, the tuning result is saved in a log file such as resnet-18.log. Refer to it for details.
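The role of the startup script can be illustrated with a minimal sketch: read parameters from the environment and turn them into a command line for the auto-tuning script. Every name here is an assumption made for illustration: the variables `TUNE_NETWORK`, `TUNE_TRIALS` and `TUNE_LOG_FILE`, the script name `auto-tuning.py`, and its flags are all hypothetical. The variables that mxnet-operator actually sets are defined by the example under examples/v1beta1/tune/, which remains the reference.

```shell
# build_tune_command: hypothetical helper. The environment variable names,
# the script name auto-tuning.py, and its flags are placeholders, not the
# names used by mxnet-operator.
# Prints the command line a startup script would run, falling back to
# defaults for any variable that is unset or empty.
build_tune_command() {
  network="${TUNE_NETWORK:-resnet-18}"
  trials="${TUNE_TRIALS:-100}"
  log_file="${TUNE_LOG_FILE:-${network}.log}"
  printf 'python auto-tuning.py --network %s --n-trial %s --log-file %s\n' \
    "$network" "$trials" "$log_file"
}
```

Printing the command instead of running it keeps the sketch easy to inspect; a real startup script would execute the resulting command as its last step.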
Monitoring an MXNet Job
To get the status of your job:
kubectl get -o yaml mxjobs ${JOB}
Here is sample output for an example job:
apiVersion: kubeflow.org/v1beta1
kind: MXJob
metadata:
  creationTimestamp: 2019-03-19T09:24:27Z
  generation: 1
  name: mxnet-job
  namespace: default
  resourceVersion: "3681685"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/mxjobs/mxnet-job
  uid: cb11013b-4a28-11e9-b7f4-704d7bb59f71
spec:
  cleanPodPolicy: All
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "10"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            command:
            - python
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: "1"
status:
  completionTime: 2019-03-19T09:25:11Z
  conditions:
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:24:27Z
    message: MXJob mxnet-job is created.
    reason: MXJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:24:29Z
    message: MXJob mxnet-job is running.
    reason: MXJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:25:11Z
    message: MXJob mxnet-job is successfully completed.
    reason: MXJobSucceeded
    status: "True"
    type: Succeeded
  mxReplicaStatuses:
    Scheduler: {}
    Server: {}
    Worker: {}
  startTime: 2019-03-19T09:24:29Z
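The conditions list is the part of this output that tells you where the job stands: each entry has a type (Created, Running, Succeeded) and a status of "True" or "False". As a rough sketch, the filter below reduces the YAML to the condition types that are currently true. It assumes the layout shown above, where `status:` precedes `type:` inside each condition and the top-level `status:` key starts in the first column. The function name `true_conditions` is hypothetical, and line-based parsing of YAML is a convenience here, not a robust parser.

```shell
# true_conditions: hypothetical helper. Reads `kubectl get -o yaml mxjobs`
# output for one job on stdin and prints, one per line, the type of every
# condition whose status is "True".
true_conditions() {
  awk '
    /^status: *$/   { in_status = 1; next }            # top-level status block begins
    !in_status      { next }                           # skip metadata and spec
    $1 == "status:" { ok = ($2 == "\"True\""); next }  # remember this condition status
    $1 == "type:"   { if (ok) print $2; ok = 0 }       # emit the type if it was true
  '
}
```

For the sample job above, `kubectl get -o yaml mxjobs ${JOB} | true_conditions` would list Created and Succeeded, and piping the result through `tail -n 1` leaves the most recent true condition.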