Resource Reservation
Summary
A scheduling mechanism and its API is provided to reserve node resources for pods may not be created yet.
Motivation
Pods are fundamental units for allocating node resources in Kubernetes, which bind resource requirements with business logic. The scheduler is not able to reserve node resources for specific pods or workloads. We may try using a fake pod to prepare resources by the preemption mechanism. However, fake pods can be preempted by any scheduled pods with higher priorities, which make resources get scrambled unexpectedly.
In Koordinator, a resource reservation mechanism is proposed to enhance scheduling and especially benefits scenarios below:
- Preemption: Existing preemption does not guarantee that only preempting pods can allocate preempted resources. With a reservation, the scheduler should be able to “lock” resources preventing from allocation of other pods with the same or higher priority.
- De-scheduling: For the descheduler, it is better to ensure sufficient resources with the reservation before pods get rescheduled. Otherwise, rescheduled pods may not be runnable anymore and make the belonging application disrupted.
- Horizontal scaling: Using reservation to achieve more deterministic horizontal scaling. e.g. Submit a reservation and make sure it is available before scaling up replicas.
- Resource Pre-allocation: Sometimes we want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable. Reservation can help with this and it should make no physical cost.
Goals
- Define the basic API of resource reservation for Motivations<1,2,3>, extensible for supporting Motivation<4> in the future.
- Provide a scheduler plugin that implements above reservation mechanism.
Non-Goals/Future Work
- Detailed design of reservative preemption/descheduler/horizontal scaler/pre-allocation.
- Modify kubelet admission control for reservation objects.
Proposal
User Stories
Story 1
As a Kubernetes developer, I want to enhance the current preemption mechanism since preempted resources may be allocated by pods other than the preemptor. The scheduler can create a reservation for the preempting pods, so the ownership of preempted resources can be guaranteed, making the preemption more reliable.
Story 2
As a cluster administrator, I want to use descheduler to migrate pods that are placed abnormally to somewhere they could “live better” and fulfill orchestration requirements of the app. e.g. Move pods on a over-utilized node to idler nodes and bind CPUs of same NUMA node. Reservations can be created before rescheduling pods, helping ensure there are sufficient resources and well placement.
Story 3
As an application administrator, I want to make the horizontal scaling of my app more deterministic by submitting reservations before a scale-up. Besides, I can also reserve resources after a scale-down for future demands. It is useful especially when we want a guaranteed scale-up of applications for the coming business peak.
Story 4
As a cluster administrator, I want to pre-allocate node resources for future usage no matter whether they are available now or not. I want to allocate the future free resources but do not disrupt the running of scheduled pods. Reservation can be made to pre-allocate resources since it makes no physical cost to the node. It may be in a Waiting
state. When there is enough space for the reservation, it will become Available
for the owner pods’ scheduling.
API
In this section, a Custom Resource Definition (CRD) named Reservation
is proposed to allow the scheduler to reserve node resources for specific pods.
// Reservation objects are non-namespaced.
// It can reserve resources for pods of any namespace. Any affinity/anti-affinity of reservation scheduling can be
// specified in the pod template.
type Reservation struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec ReservationSpec `json:"spec,omitempty"`
Status ReservationStatus `json:"status,omitempty"`
}
type ReservationSpec struct {
// Template defines the scheduling requirements (resources, affinities, images, ...) processed by the scheduler just
// like a normal pod.
// If the `template.spec.nodeName` is specified, the scheduler will not choose another node but reserve resources on
// the specified node.
Template *corev1.PodTemplateSpec `json:"template,omitempty"`
// Specify the owners who can allocate the reserved resources.
// Multiple owner selectors and ORed.
Owners []ReservationOwner `json:"owners,omitempty"`
// By default, the resources requirements of reservation (specified in `template.spec`) is filtered by whether the
// node has sufficient free resources (i.e. ReservationRequest < NodeFree).
// When `preAllocation` is set, the scheduler will skip this validation and allow overcommitment. The scheduled
// reservation would be waiting to be available until free resources are sufficient.
// NOTE: Not supported in v0.6.
PreAllocation bool `json:"preAllocation,omitempty"`
// Time-to-Live period for the reservation.
// `expires` and `ttl` are mutually exclusive. If both `ttl` and `expires` are not specified, a very
// long TTL will be picked as default. Set 0 to disable the expiration.
TTL *metav1.Duration `json:"ttl,omitempty"`
// Expired timestamp when the reservation expires.
// `expires` and `ttl` are mutually exclusive. Defaults to being set dynamically at runtime based on the `ttl`.
Expires *metav1.Time `json:"expires,omitempty"`
}
type ReservationStatus struct {
// The `phase` indicates whether is reservation is waiting for process (`Pending`), available to allocate
// (`Available`) or timeout/expired to get cleanup (Failed).
Phase ReservationPhase `json:"phase,omitempty"`
// The `conditions` indicate the messages of reason why the reservation is still pending.
Conditions []ReservationCondition `json:"conditions,omitempty"`
// Current resource owners which allocated the reservation resources.
CurrentOwners []corev1.ObjectReference `json:"currentOwners,omitempty"`
// Name of node the reservation is scheduled on.
NodeName string `json:"nodeName,omitempty"`
// Resource reserved and allocatable for owners.
Allocatable corev1.ResourceList `json:"allocatable,omitempty"`
// Resource allocated by current owners.
Allocated corev1.ResourceList `json:"allocated,omitempty"`
}
type ReservationOwner struct {
// Multiple field selectors are ANDed.
Object *corev1.ObjectReference `json:"object,omitempty"`
Controller *ReservationControllerReference `json:"controller,omitempty"`
LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
}
type ReservationControllerReference struct {
// Extend with a `namespace` field for reference different namespaces.
metav1.OwnerReference `json:",inline"`
Namespace string `json:"namespace,omitempty"`
}
type ReservationPhase string
const (
// ReservationPending indicates the Reservation has not been processed by the scheduler or is unschedulable for
// some reasons (e.g. the resource requirements cannot get satisfied).
ReservationPending ReservationPhase = "Pending"
// ReservationAvailable indicates the Reservation is both scheduled and available for allocation.
ReservationAvailable ReservationPhase = "Available"
// ReservationWaiting indicates the Reservation is scheduled, but the resources to reserve are not ready for
// allocation (e.g. in pre-allocation for running pods).
ReservationWaiting ReservationPhase = "Waiting"
// ReservationFailed indicates the Reservation is failed to reserve resources, due to expiration or marked as
// unavailable, which the object is not available to allocate and will get cleaned in the future.
ReservationFailed ReservationPhase = "Failed"
)
type ReservationCondition struct {
Type ReservationConditionType `json:"type,omitempty"`
Status ConditionStatus `json:"status,omitempty"`
Reason string `json:"reason,omitempty"`
Message string `json:"message,omitempty"`
LastProbeTime metav1.Time `json:"lastProbeTime,omitempty"`
LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
}
type ReservationConditionType string
const (
ReservationConditionScheduled ReservationConditionType = "Scheduled"
ReservationConditionReady ReservationConditionType = "Ready"
)
type ConditionStatus string
const (
ConditionStatusTrue ConditionStatus = "True"
ConditionStatusFalse ConditionStatus = "False"
ConditionStatusUnknown ConditionStatus = "Unknown"
)
const (
ReasonReservationScheduled = "Scheduled"
ReasonReservationUnschedulable = "Unschedulable"
ReasonReservationAvailable = "Available"
ReasonReservationExpired = "Expired"
)
Implementation Details
Reservation Plugin
Schedule Reservations
A Reservation
object has its scheduling requirements like a pod. Ideally, A Reservation
object should get processed directly by the scheduler like a pod. However, it can require a series of modifications on scheduling framework, losing the compatibility with standard kube-scheduler, kubelet, autoscaler, etc. In the reservation plugin, we fake one reservation pod for one Reservation
inside the scheduler to fulfill general scheduling plugins (noderesources, nodeaffinity, tainttolerations, …). The scheduling framework can handle Reservation
objects by processing fake pods in both scheduling cycle and binding cycle.
A fake pod inside the scheduler can construct the same affinity/anti-affinity constraints as owner pods, which may change the reservation result. To handle this problem, koord-scheduler extends the framework to skip check of pod affinity for existing reservations in the Filter
phase.
A reservation specified PreAllocation
intends to pre-allocate resources on nodes. The scheduler will skip its filtering of node resources in the scheduling cycle. However, the scheduled reservation will be Waiting
to be Available
until there are enough resources to fulfill its requests.
If all nodes are unscheduled for the reservation, the scheduler keeps its status as Pending
and sets Conditions
with the failure message.
Once the scheduling decision has been made, the corresponding Reservation
object is updated with a new status indicating whether the reservation succeeded or not. The fake pod does not expose to other components, and the kubelet without modification does not perceive a Reservation
assigned. Fortunately, a Reservation
does not need to be executable on the node, so existing containers can keep running as usual without additional admissions.
If a reservation has set the nodeName
(inside the template
field), the scheduler is responsible for checking if the node can fulfill the reservation since kubelet does not do admissions for the reservation.
Allocate Reserved Resources
Let’s call the reservation is allocatable for a pod if:
- The reservation is available.
- The pod matches the reservation owner spec.
- There are sufficient free resources in the reservation to fulfill the pod.
When the reservation plugin is enabled, the scheduler checks for every scheduling pod if there are allocatable reservations on a node. With a Score
plugin implemented, the scheduler prefers pods to schedule on nodes which have more allocatable reserved resources.
When a pod is scheduled on a node with allocatable reservations, it allocates resources belonging to one of reservations. To pick one of reservations, we choose the one which can get most reserved resources allocated (i.e. MostAllocated). And the scheduler also annotates the pod with the reservation info.
Expiration and Cleanup
When a reservation has been created for a long time exceeding the TTL
or Expires
, the scheduler updates its status as Expired
. For expired reservations, the scheduler will cleanup them with a custom garbage collection period.
When a node is deleted, the available and waiting reservations on the node should be marked as Failed
since they are not allocatable any more.
Use Cases
To generally reserve node resources, submit a Reservation
and set the pod template in the field spec.template
. Then the koord-scheduler will update this Reservation
with the scheduling result and the resources will get reserved.
To be more specific,
spec.template
specifies the fundamental resource requirements of a reservation. The scheduler will schedule the fake pod based on the template.spec.owners
specifies which kinds of pods can use the reservation.spec.ttl
andexpires
specifies the expiration for the reservation.spec.preAllocation
indicates whether the scheduler should filter with its resource requirements. Otherwise, the pre-allocation of node resources is allowed, and the reservation will become available until there are sufficient resources.status.phase
is marked asPending
when the Reservation is created. And it is marked asAvailable
when the Reservation is successfully scheduled.status.conditions
shows why the reservation is unscheduled or failed.- When a Reservation is
Available
on the node, only specified pods can allocate the reserved resources.
Usage in Preemption
The Priority Preemption happens in the PostFilter phase trying to make preemptive pods schedulable by evicting low-priority pods. When a pod succeeds the preemption, the pod status
will be patched with a nominated node where the scheduler do the eviction. However, the preemptor’s nominated node is not always the same as the scheduled node, since the scheduler does not reserve resources for the preemptor. To ensure the preemptive resources are for the preemptor, firstly the scheduler can create a reservation that both sets owners
with the preemptor pod and relevant affinity rules for reserving resources of the preempts. Then the scheduler evict pods, and the reservation will become Available
once the resources are released. Finally, the preemptor pods can get scheduled on the nodes with preemptive resource reserved.
Usage in Descheduling
Before a pod is rescheduled, the descheduler can create a reservation that sets template
and owners
for the candidate. When the reservation becomes Available
, the descheduler can assign the pod to allocate the reserved resources. This solves the problem in which the rescheduled pod has stopped at the old node but cannot run on the new node. Moreover, the descheduler can migrate resources between pods by setting the preAllocation
field.
Usage in Pre-allocation
Reservations with preAllocation
specified allow users to pre-allocate the node resources from running pods. The status.phase
of the reservation is set as Waiting
until the resources are released, indicating that its availability is conditional. Once the referenced pods have terminated, the phase
is Available
for owners, and the pre-allocation succeeds.
Risks and Mitigations
Kubelet without any modification possibly ignore Reservation
objects in predicate admission, which increases the chance of unexpected overcommitment at nodes. Reservation
does not require any physical resources to be executable, so the overcommitment is mainly a problem only when pods get scheduled with Reservation
and start to run, which is somewhat easier to mitigate since Kubelet do admit these pods. To further descrease the possibility of unexpected overcommitment or pods admit failures, we could use resource estimation for in-flight pods, balance pods to the nodes with less reserved resources, etc.
Unsolved Problems
As stated above, Reservation
can generate the same pod affinity/anti-affinity rules as the owner pods. The problem gets resolved in the koord-scheduler by extending scheduling framework, but it still limits the standard kube-scheduler.
Alternatives
Use a pause
pod with a low priority to reserve resources
Reserving resources with pause pods with very low assigned priority does work when the preemption can be precisely enabled for specific pods. In the example of cluster autoscaler, pause
pods are helpful when we need to overprovision resources to prevent idle nodes from scaling down by CA. However, a pause
pod has no reservation guarantee except priority
. As declared above, many scenarios require reservations to rely on other pod characteristics (e.g. names, namespaces, labels, priorityClass), where pause
pods cannot meet the demands.