TensorFlow on Volcano

Last updated on Jul 31, 2021

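Introduction to TensorFlow

TensorFlow is a symbolic math system based on dataflow programming. It is widely used to implement machine learning algorithms, and its predecessor is DistBelief, Google's neural network library.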

TensorFlow on Volcano

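PS-worker model: the parameter server (PS) handles model-related work, while the worker servers handle training-related work such as inference and gradient computation [1].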

Figure 1: ps-worker
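The division of labor in the PS-worker model can be sketched in a few lines of plain Python. This is a toy illustration only, not how TensorFlow implements it: the `ParameterServer` class, the `worker_gradient` function, and the data are all hypothetical. The parameter server owns the single model weight and applies updates, while each worker computes a gradient on its own shard of the data.

```python
from typing import List, Tuple


class ParameterServer:
    """Holds the model parameter and applies gradient updates (hypothetical)."""

    def __init__(self, w: float, lr: float) -> None:
        self.w = w
        self.lr = lr

    def pull(self) -> float:
        # Workers fetch the current parameter before computing gradients.
        return self.w

    def push(self, grads: List[float]) -> None:
        # Average the gradients reported by all workers, then take one step.
        self.w -= self.lr * sum(grads) / len(grads)


def worker_gradient(w: float, shard: List[Tuple[float, float]]) -> float:
    """Gradient of the mean squared error of y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


# Toy data drawn from y = 3x, split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
ps = ParameterServer(w=0.0, lr=0.01)

for _ in range(200):
    w = ps.pull()
    grads = [worker_gradient(w, shard) for shard in shards]
    ps.push(grads)

print(round(ps.pull(), 3))  # the weight converges towards 3.0
```

In a real deployment the pull and push steps are network calls between pods, which is why every task needs to know the addresses of all the others.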

Running TensorFlow on Kubernetes has a number of problems:

  • Resource isolation
  • No GPU scheduling and no gang scheduler
  • Leftover processes
  • Training logs are inconvenient to save
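To see why the missing gang scheduler matters for distributed training, consider two jobs that each need three pods on a cluster with four free slots. The sketch below is a simplified model with hypothetical function names, not Volcano's or Kubernetes' actual scheduling algorithm. It compares placing pods one at a time with placing a job only when all of its pods fit.

```python
from typing import Dict


def schedule_pod_by_pod(jobs: Dict[str, int], slots: int) -> Dict[str, int]:
    """Hand out free slots one pod at a time, round-robin across jobs."""
    placed = {name: 0 for name in jobs}
    while slots > 0:
        progressed = False
        for name, needed in jobs.items():
            if slots > 0 and placed[name] < needed:
                placed[name] += 1
                slots -= 1
                progressed = True
        if not progressed:
            break  # every job is fully placed
    return placed


def schedule_gang(jobs: Dict[str, int], slots: int) -> Dict[str, int]:
    """Place a job only if all of its pods fit at once; otherwise place none."""
    placed = {name: 0 for name in jobs}
    for name, needed in jobs.items():
        if needed <= slots:
            placed[name] = needed
            slots -= needed
    return placed


jobs = {"job-a": 3, "job-b": 3}
print(schedule_pod_by_pod(jobs, 4))  # {'job-a': 2, 'job-b': 2}
print(schedule_gang(jobs, 4))        # {'job-a': 3, 'job-b': 0}
```

With pod-by-pod placement both jobs hold two slots and neither can start training, so the resources are wasted. With all-or-nothing placement one job runs to completion and the other waits in the queue.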

Create tftest.yaml

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: tensorflow-dist-mnist
    spec:
      # 1 ps + 2 workers: the job starts only when all 3 pods can be scheduled.
      minAvailable: 3
      schedulerName: volcano
      plugins:
        env: []
        svc: []
      policies:
        - event: PodEvicted
          action: RestartJob
      tasks:
        - replicas: 1
          name: ps
          template:
            spec:
              containers:
                - command:
                    - sh
                    - "-c"
                    - |
                      PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                      python /var/tf_dist_mnist/dist_mnist.py
                  image: volcanosh/dist-mnist-tf-example:0.0.1
                  name: tensorflow
                  ports:
                    - containerPort: 2222
                      name: tfjob-port
                  resources:
                    requests:
                      cpu: "200m"
              restartPolicy: Never
        - replicas: 2
          name: worker
          policies:
            # The whole job completes once the worker task has completed.
            - event: TaskCompleted
              action: CompleteJob
          template:
            spec:
              containers:
                - command:
                    - sh
                    - "-c"
                    - |
                      PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                      export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                      python /var/tf_dist_mnist/dist_mnist.py
                  image: volcanosh/dist-mnist-tf-example:0.0.1
                  name: tensorflow
                  ports:
                    - containerPort: 2222
                      name: tfjob-port
                  resources:
                    requests:
                      cpu: "200m"
              restartPolicy: Never
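The shell pipeline in each container turns the host lists under /etc/volcano into the TF_CONFIG JSON that the training script reads. As a rough illustration of the resulting structure, the Python sketch below builds the same JSON. The `build_tf_config` helper and the host names are hypothetical and are not part of the example image; in the job itself the hosts come from /etc/volcano/ps.host and /etc/volcano/worker.host, and the index comes from VK_TASK_INDEX.

```python
import json
from typing import List


def build_tf_config(ps_hosts: List[str], worker_hosts: List[str],
                    task_type: str, task_index: int, port: int = 2222) -> str:
    """Return a TF_CONFIG JSON string for one task (hypothetical helper)."""
    config = {
        "cluster": {
            # Append the port to every host, as the sed commands do.
            "ps": [f"{host}:{port}" for host in ps_hosts],
            "worker": [f"{host}:{port}" for host in worker_hosts],
        },
        # "type" must match the task the pod belongs to: "ps" or "worker".
        "task": {"type": task_type, "index": task_index},
        "environment": "cloud",
    }
    return json.dumps(config)


# Assumed host names for illustration only.
ps_hosts = ["ps-0.example"]
worker_hosts = ["worker-0.example", "worker-1.example"]

# TF_CONFIG as the second worker (index 1) would see it.
print(build_tf_config(ps_hosts, worker_hosts, "worker", 1))
```

Every pod sees the same "cluster" section. Only the "task" section differs, which is how each process learns its own role and index.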

Deploy tftest.yaml

    kubectl apply -f tftest.yaml

Check the job status

    kubectl get pod