PodGroup

Introduction

A PodGroup is a group of strongly associated pods that are scheduled together. It is mainly used in batch scheduling, for example, for the ps and worker tasks of a TensorFlow job. PodGroup is defined as a Custom Resource Definition (CRD).

Example

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    metadata:
      creationTimestamp: "2020-08-11T12:28:55Z"
      generation: 5
      name: test
      namespace: default
      ownerReferences:
      - apiVersion: batch.volcano.sh/v1alpha1
        blockOwnerDeletion: true
        controller: true
        kind: Job
        name: test
        uid: 028ecfe8-0ff9-477d-836c-ac5676491a38
      resourceVersion: "109074"
      selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/job-1
      uid: eb2508f5-3349-439c-b94d-4ac23afd71ff
    spec:
      minMember: 1
      minResources:
        cpu: "3"
        memory: "2048Mi"
      priorityClassName: high-priority
      queue: default
    status:
      conditions:
      - lastTransitionTime: "2020-08-11T12:28:57Z"
        message: '1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable.'
        reason: NotEnoughResources
        status: "True"
        transitionID: 77d5be3f-6169-4f86-8e65-0bdc621ce983
        type: Unschedulable
      - lastTransitionTime: "2020-08-11T12:29:02Z"
        reason: tasks in gang are ready to be scheduled
        status: "True"
        transitionID: 54514401-5c90-4b11-840d-90c1cda93096
        type: Scheduled
      phase: Running
      running: 1

Key Fields

minMember

minMember indicates the minimum number of pods or tasks that must run under the PodGroup. If cluster resources cannot accommodate this minimum number of pods or tasks, no pod or task in the PodGroup will be scheduled.
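
As a minimal sketch of this all-or-nothing behavior, the PodGroup below gangs one ps pod with two worker pods, so minMember is 3. The name tf-training is hypothetical; the remaining fields mirror the example earlier on this page.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-training      # hypothetical name, for illustration only
  namespace: default
spec:
  # 1 ps + 2 workers: unless all 3 pods can be placed, none is scheduled
  minMember: 3
  queue: default
```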

queue

queue indicates the queue to which the PodGroup belongs. The queue must be in the Open state.
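
The sketch below pairs a PodGroup with a dedicated queue. The queue name training, its weight, and the PodGroup name are assumptions made for illustration. The manifest does not show the queue's state, so it is worth checking that the queue reports Open before relying on it.

```yaml
# Hypothetical queue; Queue objects are cluster-scoped, so no namespace is set
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 1              # illustrative relative share of cluster resources
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-training      # hypothetical name
  namespace: default
spec:
  minMember: 3
  queue: training        # must match the name of an existing queue in the Open state
```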

priorityClassName

priorityClassName represents the priority of the PodGroup and is used by the scheduler to sort all the PodGroups in the queue during scheduling. Note that system-node-critical and system-cluster-critical are reserved values that denote the highest priorities. If priorityClassName is not specified, the default priority is used.
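
A priority class has to exist before a PodGroup can reference it. The sketch below defines a standard Kubernetes PriorityClass named high-priority (the name used in the example earlier on this page) and refers to it from a PodGroup. The numeric value, the description, and the PodGroup name are assumptions chosen for illustration.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000           # illustrative; a larger value means a higher priority
globalDefault: false
description: "Illustrative priority class for urgent batch workloads."
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-training      # hypothetical name
  namespace: default
spec:
  minMember: 3
  queue: default
  priorityClassName: high-priority   # refers to the PriorityClass defined above
```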

minResources

minResources indicates the minimum resources for running the PodGroup. If available resources in the cluster cannot satisfy the requirement, no pod or task in the PodGroup will be scheduled.
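
One way to arrive at a minResources value is to sum the requests of the pods that must run together. The sketch below assumes three pods that each request 1 CPU and 1Gi of memory, giving 3 CPUs and 3Gi in total. The pod sizes and the PodGroup name are hypothetical.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-training      # hypothetical name
  namespace: default
spec:
  minMember: 3
  # 3 pods x (1 CPU, 1Gi memory) = 3 CPUs and 3Gi in total
  minResources:
    cpu: "3"
    memory: "3Gi"
  queue: default
```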

phase

phase indicates the current status of the PodGroup.

conditions

conditions represents the status log of the PodGroup, including the key events that occurred in the lifecycle of the PodGroup.

running

running indicates the number of running pods or tasks in the PodGroup.

succeeded

succeeded indicates the number of pods or tasks in the PodGroup that have completed successfully.

failed

failed indicates the number of failed pods or tasks in the PodGroup.

Status

pending

pending indicates that the PodGroup has been accepted by Volcano but its resource requirements have not been satisfied yet. Once they are satisfied, the status changes to running.

running

running indicates that there are at least minMember pods or tasks running under the PodGroup.

unknown

unknown indicates that, among the minMember pods or tasks, some are running while others have not been scheduled. This may be caused by a lack of resources. The scheduler will wait until ControllerManager starts these pods or tasks again.

inqueue

inqueue indicates that the PodGroup has passed validation and is waiting to be bound to a node. It is a transient state between pending and running.
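
As a rough illustration, the status of a PodGroup that is still pending might look like the fragment below. The condition fields and values are copied from the example earlier on this page. Pairing them with the Pending phase is an assumption about how such an object would appear, and the capitalized phase value follows the `phase: Running` form used in that example.

```yaml
status:
  phase: Pending                     # resource requirements not yet satisfied
  conditions:
  - type: Unschedulable
    status: "True"
    reason: NotEnoughResources
    lastTransitionTime: "2020-08-11T12:28:57Z"
```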

Usage

minMember

In some scenarios, such as machine learning training, a job does not need all of its tasks to complete. Instead, the job is considered achieved once a specified number of tasks have completed. The minMember field is suited to such cases.
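
A sketch of such a job is shown below: a VolcanoJob with one ps pod and four worker pods that can start once any three of the five pods fit. It assumes that Volcano derives the PodGroup's minMember from the job's minAvailable field. The job name, container image, and resource requests are placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training                  # hypothetical name
  namespace: default
spec:
  schedulerName: volcano
  queue: default
  minAvailable: 3                    # assumed to become minMember of the job's PodGroup
  tasks:
  - name: ps
    replicas: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: tensorflow
          image: example.com/tf-train:latest   # placeholder image
          resources:
            requests:
              cpu: "1"
  - name: worker
    replicas: 4                      # 5 pods in total, but only 3 are required to start
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: tensorflow
          image: example.com/tf-train:latest   # placeholder image
          resources:
            requests:
              cpu: "1"
```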

priorityClassName

priorityClassName is used in preemptive priority scheduling.

minResources

In some scenarios such as big data analytics, a job can run only when available resources meet the minimum requirement. minResources is suitable for such scenarios.

Note

Automatic Creation

If no PodGroup is specified when a VolcanoJob is created, Volcano will create a PodGroup with the same name as the VolcanoJob.