Evicting pods using the descheduler
While the scheduler is used to determine the most suitable node to host a new pod, the descheduler can be used to evict a running pod so that the pod can be rescheduled onto a more suitable node.
The descheduler is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
About the descheduler
You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes.
You can benefit from descheduling running pods in situations such as the following:
Nodes are underutilized or overutilized.
Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
Node failure requires pods to be moved.
New nodes are added to clusters.
Pods have been restarted too many times.
The descheduler does not schedule replacements for evicted pods. The scheduler automatically performs this task for the evicted pods.
When the descheduler decides to evict pods from a node, it employs the following general mechanism:
Critical pods with priorityClassName set to system-cluster-critical or system-node-critical are never evicted.
Static, mirrored, or stand-alone pods that are not part of a replication controller, replica set, deployment, or job are never evicted because these pods will not be recreated.
Pods associated with daemon sets are never evicted.
Pods with local storage are never evicted.
Best effort pods are evicted before burstable and guaranteed pods.
All types of pods with the descheduler.alpha.kubernetes.io/evict annotation are evicted. This annotation is used to override checks that prevent eviction, and the user can select which pod is evicted. Users should know how and if the pod will be recreated.
Pods subject to a pod disruption budget (PDB) are not evicted if the eviction would violate that PDB. Pods are evicted by using the eviction subresource so that PDBs are honored.
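The rules above can be read as a single eligibility check followed by an ordering step. The following Python sketch is illustrative only: the Pod class, its field names, and both functions are hypothetical and are not the descheduler's actual implementation. It assumes that the evict annotation takes precedence over every other check, as the text suggests, and it does not model PDB handling, which is left to the eviction subresource.

```python
from dataclasses import dataclass, field

CRITICAL_CLASSES = {"system-cluster-critical", "system-node-critical"}
EVICT_ANNOTATION = "descheduler.alpha.kubernetes.io/evict"
# Owner kinds that recreate a pod after it is evicted.
RECREATING_OWNERS = {"ReplicationController", "ReplicaSet", "Deployment", "Job"}
# Best effort pods are evicted before burstable and guaranteed pods.
QOS_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod for this sketch."""
    name: str
    qos_class: str = "BestEffort"
    priority_class_name: str = ""
    owner_kinds: frozenset = frozenset()
    has_local_storage: bool = False
    annotations: dict = field(default_factory=dict)


def is_evictable(pod):
    # Assumption: the annotation overrides all checks that prevent eviction.
    if EVICT_ANNOTATION in pod.annotations:
        return True
    if pod.priority_class_name in CRITICAL_CLASSES:
        return False
    if "DaemonSet" in pod.owner_kinds:
        return False
    # Static, mirrored, or stand-alone pods have no owner that recreates them.
    if not (pod.owner_kinds & RECREATING_OWNERS):
        return False
    if pod.has_local_storage:
        return False
    return True


def eviction_order(pods):
    """Return evictable pods, best effort first."""
    candidates = [p for p in pods if is_evictable(p)]
    return sorted(candidates, key=lambda p: QOS_ORDER[p.qos_class])
```

For example, a stand-alone pod with no owner is skipped, while the same pod carrying the evict annotation is returned as a candidate.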
Descheduler strategies
The following descheduler strategies are available:
Low node utilization
The LowNodeUtilization strategy finds nodes that are underutilized and evicts pods, if possible, from other nodes in the hope that the recreated pods are scheduled on these underutilized nodes.
The underutilization of nodes is determined by several configurable threshold parameters: CPU, memory, and number of pods. If a node’s usage is below the configured thresholds for all parameters (CPU, memory, and number of pods), then the node is considered to be underutilized.
You can also set a target threshold for CPU, memory, and number of pods. If a node’s usage is above the configured target thresholds for any of the parameters, then the node’s pods might be considered for eviction.
Additionally, you can use the NumberOfNodes parameter to set the strategy to activate only when the number of underutilized nodes is above the configured value. This can be helpful in large clusters where a few nodes might be underutilized frequently or for a short period of time.
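The threshold rules can be sketched in a few lines of Python. This is a minimal illustration rather than the descheduler's own code: the function names and the dictionary layout are hypothetical, and usage values are assumed to be percentages of node capacity. A node is underutilized only when all resources are below their thresholds, and it is a source of evictions when any resource is above its target threshold.

```python
def is_underutilized(usage, thresholds):
    """True only if usage is below the threshold for every resource."""
    return all(usage[r] < thresholds[r] for r in thresholds)


def is_overutilized(usage, target_thresholds):
    """True if usage is above the target threshold for any resource."""
    return any(usage[r] > target_thresholds[r] for r in target_thresholds)


def strategy_is_active(node_usage, thresholds, number_of_nodes=0):
    """The strategy runs only when more than number_of_nodes are underutilized."""
    underutilized = [
        name for name, usage in node_usage.items()
        if is_underutilized(usage, thresholds)
    ]
    return len(underutilized) > number_of_nodes


# Illustrative values; usage is assumed to be a percentage of capacity.
thresholds = {"cpu": 10, "memory": 20, "pods": 30}
target_thresholds = {"cpu": 50, "memory": 40, "pods": 60}
node_usage = {
    "node-a": {"cpu": 5, "memory": 10, "pods": 8},    # underutilized
    "node-b": {"cpu": 70, "memory": 30, "pods": 20},  # CPU above target
    "node-c": {"cpu": 30, "memory": 30, "pods": 40},  # neither
}
```

With these values, only node-a counts as underutilized, so the strategy is active when NumberOfNodes is 0 and inactive when it is 3.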
Duplicate pods
The RemoveDuplicates strategy ensures that there is only one pod associated with a replica set, replication controller, deployment, or job running on the same node. If there are more, then those duplicate pods are evicted for better spreading of pods in a cluster.
This situation could occur after a node failure, when a pod is moved to another node, leading to more than one pod associated with a replica set, replication controller, deployment, or job on that node. After the failed node is ready again, this strategy evicts the duplicate pod.
This strategy has an optional parameter, ExcludeOwnerKinds, that allows you to specify a list of Kind types. If a pod has any of these types listed as an OwnerRef, that pod is not considered for eviction.
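The duplicate check amounts to grouping pods by node and owner and keeping one pod per group. The Python sketch below shows the idea under simplifying assumptions: pods are plain dictionaries with hypothetical keys (name, node, owner_kind, owner_name), each pod has a single owner, and the first pod seen in each group is the one that is kept.

```python
def find_duplicate_pods(pods, exclude_owner_kinds=()):
    """Return names of pods that share a node and an owner with an earlier pod."""
    seen = set()
    duplicates = []
    for pod in pods:
        # Pods owned by an excluded Kind are not considered for eviction.
        if pod["owner_kind"] in exclude_owner_kinds:
            continue
        key = (pod["node"], pod["owner_kind"], pod["owner_name"])
        if key in seen:
            duplicates.append(pod["name"])
        else:
            seen.add(key)
    return duplicates


pods = [
    {"name": "web-1", "node": "node-a", "owner_kind": "ReplicaSet", "owner_name": "web"},
    {"name": "web-2", "node": "node-a", "owner_kind": "ReplicaSet", "owner_name": "web"},
    {"name": "web-3", "node": "node-b", "owner_kind": "ReplicaSet", "owner_name": "web"},
]
```

Here web-2 is reported as a duplicate of web-1 on node-a. Passing exclude_owner_kinds=("ReplicaSet",) returns an empty list.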
Violation of inter-pod anti-affinity
The RemovePodsViolatingInterPodAntiAffinity strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.
This situation could occur when anti-affinity rules are created for pods that are already running on the same node.
Violation of node affinity
The RemovePodsViolatingNodeAffinity strategy ensures that pods violating node affinity are removed from nodes.
This situation could occur if a node no longer satisfies a pod’s affinity rule. If another node is available that satisfies the affinity rule, then the pod is evicted.
Violation of node taints
The RemovePodsViolatingNodeTaints strategy ensures that pods violating NoSchedule taints on nodes are removed.
This situation could occur if a pod is set to tolerate a taint key=value:NoSchedule and is running on a tainted node. If the node’s taint is updated or removed, the taint is no longer satisfied by the pod’s tolerations and the pod is evicted.
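The following Python sketch illustrates the case where a node's taint is updated after the pod is already running. It is a simplified model, not the descheduler's implementation: the function names and dictionary keys are hypothetical, and only exact key, value, and effect matches are handled. Real Kubernetes tolerations also support the Exists operator and an empty effect, which this sketch ignores.

```python
def tolerates(toleration, taint):
    """Simplified match: key, value, and effect must all be equal."""
    return (
        toleration["key"] == taint["key"]
        and toleration.get("value") == taint.get("value")
        and toleration["effect"] == taint["effect"]
    )


def violates_node_taints(tolerations, node_taints):
    """True if any NoSchedule taint on the node is not tolerated by the pod."""
    no_schedule = [t for t in node_taints if t["effect"] == "NoSchedule"]
    return any(
        not any(tolerates(tol, taint) for tol in tolerations)
        for taint in no_schedule
    )


pod_tolerations = [{"key": "key", "value": "value", "effect": "NoSchedule"}]
original_taints = [{"key": "key", "value": "value", "effect": "NoSchedule"}]
updated_taints = [{"key": "key", "value": "other", "effect": "NoSchedule"}]
```

With the original taint the pod is left alone. After the taint value changes, the toleration no longer matches and the pod becomes a candidate for eviction.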
Too many restarts
The RemovePodsHavingTooManyRestarts strategy ensures that pods that have been restarted too many times are removed from nodes.
This situation could occur if a pod is scheduled on a node that is unable to start it. For example, if the node is having network issues and is unable to mount a networked persistent volume, then the pod should be evicted so that it can be scheduled on another node. Another example is a pod that is crash looping.
This strategy has two configurable parameters: PodRestartThreshold and IncludingInitContainers. If a pod is restarted more than the configured PodRestartThreshold value, then the pod is evicted. You can use the IncludingInitContainers parameter to specify whether restarts of init containers count toward the PodRestartThreshold value.
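The interaction of the two parameters can be sketched as follows. This is a hypothetical Python model, not the descheduler's code: the dictionary keys are invented for illustration, restart counts are assumed to be summed across all containers in the pod, and the comparison follows the wording above (evict when the count is more than the threshold).

```python
def restart_count(pod, including_init_containers=False):
    """Sum container restarts, optionally adding init container restarts."""
    total = sum(pod.get("container_restarts", []))
    if including_init_containers:
        total += sum(pod.get("init_container_restarts", []))
    return total


def has_too_many_restarts(pod, pod_restart_threshold, including_init_containers=False):
    """True if the pod has restarted more than pod_restart_threshold times."""
    return restart_count(pod, including_init_containers) > pod_restart_threshold


pod = {"container_restarts": [6, 3], "init_container_restarts": [4]}
```

With a threshold of 10, this pod stays in place when init containers are ignored (9 restarts) and is evicted when they are included (13 restarts).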
Pod life time
The PodLifeTime strategy evicts pods that are too old.
After a pod reaches the age, in seconds, set by the MaxPodLifeTimeSeconds parameter, it is evicted.
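As a small illustration, the age check can be written as a comparison between the pod's creation time and the current time. The function name and arguments in this Python sketch are hypothetical, and it assumes the pod's age is measured from its creation timestamp.

```python
from datetime import datetime, timedelta, timezone


def is_too_old(created_at, max_pod_life_time_seconds, now=None):
    """True once the pod's age reaches max_pod_life_time_seconds."""
    now = now or datetime.now(timezone.utc)
    return (now - created_at).total_seconds() >= max_pod_life_time_seconds


now = datetime(2021, 1, 2, 12, 0, tzinfo=timezone.utc)
two_days_old = now - timedelta(days=2)
one_hour_old = now - timedelta(hours=1)
```

With MaxPodLifeTimeSeconds set to 86400 (one day), the two-day-old pod is evicted and the one-hour-old pod is not.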
Installing the descheduler
The descheduler is not available by default. To enable the descheduler, you must install the Kube Descheduler Operator from OperatorHub. After the Kube Descheduler Operator is installed, you can then configure the eviction strategies.
Prerequisites
Cluster administrator privileges.
Access to the OKD web console.
Ensure that you have downloaded the pull secret from the Red Hat OpenShift Cluster Manager site as shown in Obtaining the installation program in the installation documentation for your platform.
If you have the pull secret, add the redhat-operators catalog to the OperatorHub custom resource (CR) as shown in Configuring OKD to use Red Hat Operators.
Procedure
Log in to the OKD web console.
Create the required namespace for the Kube Descheduler Operator.
Navigate to Administration → Namespaces and click Create Namespace.
Enter openshift-kube-descheduler-operator in the Name field and click Create.
Install the Kube Descheduler Operator.
Navigate to Operators → OperatorHub.
Type Kube Descheduler Operator into the filter box.
Select the Kube Descheduler Operator and click Install.
On the Install Operator page, select A specific namespace on the cluster. Select openshift-kube-descheduler-operator from the drop-down menu.
Adjust the values for the Update Channel and Approval Strategy to the desired values.
Click Install.
Create a descheduler instance.
From the Operators → Installed Operators page, click the Kube Descheduler Operator.
Select the Kube Descheduler tab and click Create KubeDescheduler.
Edit the settings as necessary and click Create.
You can now configure the strategies for the descheduler. There are no strategies enabled by default.
Configuring descheduler strategies
You can configure which strategies the descheduler uses to evict pods.
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the KubeDescheduler object:
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Specify one or more strategies in the spec.strategies section.
apiVersion: operator.openshift.io/v1beta1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600
  strategies:
    - name: "LowNodeUtilization" # (1)
      params:
        - name: "CPUThreshold"
          value: "10"
        - name: "MemoryThreshold"
          value: "20"
        - name: "PodsThreshold"
          value: "30"
        - name: "MemoryTargetThreshold"
          value: "40"
        - name: "CPUTargetThreshold"
          value: "50"
        - name: "PodsTargetThreshold"
          value: "60"
        - name: "NumberOfNodes"
          value: "3"
    - name: "RemoveDuplicates" # (2)
      params:
        - name: "ExcludeOwnerKinds"
          value: "ReplicaSet"
    - name: "RemovePodsHavingTooManyRestarts" # (3)
      params:
        - name: "PodRestartThreshold"
          value: "10"
        - name: "IncludingInitContainers"
          value: "false"
    - name: "RemovePodsViolatingInterPodAntiAffinity" # (4)
    - name: "PodLifeTime" # (5)
      params:
        - name: "MaxPodLifeTimeSeconds"
          value: "86400"
(1) The LowNodeUtilization strategy provides additional parameters, such as CPUThreshold and MemoryThreshold, that you can optionally configure.
(2) The RemoveDuplicates strategy provides an optional parameter, ExcludeOwnerKinds.
(3) The RemovePodsHavingTooManyRestarts strategy requires the PodRestartThreshold parameter to be set. It also provides the optional IncludingInitContainers parameter.
(4) The RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingNodeTaints strategies do not have any additional parameters to configure.
(5) The PodLifeTime strategy requires the MaxPodLifeTimeSeconds parameter to be set.
You can enable multiple strategies, and the order in which the strategies are specified is not important.
Save the file to apply the changes.
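Because two of the strategies require a parameter, it can be useful to check a strategy list before saving it. The Python sketch below is a hypothetical helper, not part of the Operator or the oc client. It assumes the spec.strategies section has already been loaded into a list of dictionaries with the same name and params keys used in the example above.

```python
# Required parameters, as stated in the callouts for the example above.
REQUIRED_PARAMS = {
    "RemovePodsHavingTooManyRestarts": {"PodRestartThreshold"},
    "PodLifeTime": {"MaxPodLifeTimeSeconds"},
}


def missing_required_params(strategies):
    """Map each strategy name to the required parameters it lacks."""
    problems = {}
    for strategy in strategies:
        given = {p["name"] for p in strategy.get("params", [])}
        missing = REQUIRED_PARAMS.get(strategy["name"], set()) - given
        if missing:
            problems[strategy["name"]] = sorted(missing)
    return problems


strategies = [
    {"name": "RemovePodsViolatingInterPodAntiAffinity"},
    {"name": "PodLifeTime", "params": []},
    {
        "name": "RemovePodsHavingTooManyRestarts",
        "params": [{"name": "PodRestartThreshold", "value": "10"}],
    },
]
```

For this list, the helper reports that PodLifeTime is missing MaxPodLifeTimeSeconds and accepts the other two strategies.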
Filtering pods by namespace
You can configure whether or not pods are considered for eviction based on their namespace. Only the following descheduler strategies support namespace filtering:
PodLifeTime
RemovePodsHavingTooManyRestarts
RemovePodsViolatingInterPodAntiAffinity
RemovePodsViolatingNodeAffinity
RemovePodsViolatingNodeTaints
You can use the IncludeNamespaces parameter to specify the namespaces that a descheduler strategy runs on, or you can use the ExcludeNamespaces parameter to specify the namespaces that a descheduler strategy does not run on.
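The effect of the two parameters on a single pod can be sketched in Python. The helper names are hypothetical and this is not the descheduler's implementation. The sketch assumes that parameter values are comma-separated namespace lists and that a strategy with neither parameter considers every namespace, and it rejects a strategy that sets both parameters.

```python
def parse_namespaces(value):
    """Split a comma-separated namespace list into a set of names."""
    return {ns.strip() for ns in value.split(",") if ns.strip()}


def namespace_is_considered(namespace, params):
    """Decide whether pods in a namespace are candidates for a strategy."""
    include = params.get("IncludeNamespaces")
    exclude = params.get("ExcludeNamespaces")
    if include and exclude:
        raise ValueError("IncludeNamespaces and ExcludeNamespaces cannot both be set")
    if include:
        return namespace in parse_namespaces(include)
    if exclude:
        return namespace not in parse_namespaces(exclude)
    # Assumption: with no filter, all namespaces are considered.
    return True
```

For example, with IncludeNamespaces set to "my-project, staging", pods in my-project are considered and pods in my-other-project are skipped.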
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the KubeDescheduler object:
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Add either the IncludeNamespaces or ExcludeNamespaces parameter to one or more strategies:
apiVersion: operator.openshift.io/v1beta1
kind: KubeDescheduler
metadata:
  ...
spec:
  deschedulingIntervalSeconds: 3600
  strategies:
    - name: "RemovePodsHavingTooManyRestarts"
      params:
        - name: "PodRestartThreshold"
          value: "10"
        - name: "IncludingInitContainers"
          value: "false"
        - name: "IncludeNamespaces" # (1)
          value: "my-project" # (2)
    - name: "PodLifeTime"
      params:
        - name: "MaxPodLifeTimeSeconds"
          value: "86400"
        - name: "ExcludeNamespaces" # (1)
          value: "my-other-project" # (2)
(1) You cannot specify both IncludeNamespaces and ExcludeNamespaces for the same strategy.
(2) Separate multiple namespaces with commas.
Save the file to apply the changes.
Filtering pods by priority
You can configure descheduler strategies to consider pods for eviction only if their priority is lower than a specified priority level. Pods whose priority is higher than the specified priority threshold are not considered for eviction.
You can use the ThresholdPriority parameter to set a numerical priority threshold, or you can use the ThresholdPriorityClassName parameter to specify a certain priority class name.
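The two parameters resolve to a single numerical threshold that is compared against each pod's priority. The following Python sketch illustrates that resolution. The function names are hypothetical, the priority_classes mapping stands in for the priority classes that exist in the cluster, and the sketch assumes that a strategy with neither parameter applies no priority filter.

```python
def resolve_threshold(params, priority_classes):
    """Return the numerical threshold for a strategy, or None if unset."""
    has_value = "ThresholdPriority" in params
    has_class = "ThresholdPriorityClassName" in params
    if has_value and has_class:
        raise ValueError(
            "ThresholdPriority and ThresholdPriorityClassName cannot both be set"
        )
    if has_value:
        return int(params["ThresholdPriority"])
    if has_class:
        name = params["ThresholdPriorityClassName"]
        # The priority class must already exist.
        if name not in priority_classes:
            raise KeyError(f"priority class {name!r} does not exist")
        return priority_classes[name]
    return None


def priority_allows_eviction(pod_priority, threshold):
    """Only pods with a priority lower than the threshold are considered."""
    return threshold is None or pod_priority < threshold


# Hypothetical priority classes and their numerical values.
priority_classes = {"my-priority-class-name": 5000}
```

With ThresholdPriority set to "10000", a pod with priority 100 is considered for eviction and a pod with priority 20000 is not.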
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the KubeDescheduler object:
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Add either the ThresholdPriority or ThresholdPriorityClassName parameter to one or more strategies:
apiVersion: operator.openshift.io/v1beta1
kind: KubeDescheduler
metadata:
  ...
spec:
  deschedulingIntervalSeconds: 3600
  strategies:
    - name: "RemovePodsHavingTooManyRestarts"
      params:
        - name: "PodRestartThreshold"
          value: "10"
        - name: "IncludingInitContainers"
          value: "false"
        - name: "ThresholdPriority" # (1)
          value: "10000"
    - name: "PodLifeTime"
      params:
        - name: "MaxPodLifeTimeSeconds"
          value: "86400"
        - name: "ThresholdPriorityClassName" # (1)
          value: "my-priority-class-name" # (2)
(1) You cannot specify both ThresholdPriority and ThresholdPriorityClassName for the same strategy.
(2) The numerical priority value associated with this priority class name is used as the threshold. The priority class must already exist or the descheduler will throw an error.
Save the file to apply the changes.
Configuring additional descheduler settings
You can configure additional settings for the descheduler, such as how frequently it runs.
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the KubeDescheduler object:
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Configure additional settings as necessary:
apiVersion: operator.openshift.io/v1beta1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600 # (1)
  flags:
    - --dry-run # (2)
  image: quay.io/openshift/origin-descheduler:4.6 # (3)
  ...
(1) Set the number of seconds between descheduler runs. A value of 0 in this field runs the descheduler once and exits.
(2) Set one or more flags to append to the descheduler pod. This flag must be in the format ready to pass to the binary.
(3) Set the descheduler container image to deploy.
Save the file to apply the changes.
Uninstalling the descheduler
You can remove the descheduler from your cluster by removing the descheduler instance and uninstalling the Kube Descheduler Operator. This procedure also cleans up the KubeDescheduler CRD and openshift-kube-descheduler-operator namespace.
Prerequisites
Cluster administrator privileges.
Access to the OKD web console.
Procedure
Log in to the OKD web console.
Delete the descheduler instance.
From the Operators → Installed Operators page, click Kube Descheduler Operator.
Select the Kube Descheduler tab.
Click the Options menu next to the cluster entry and select Delete KubeDescheduler.
In the confirmation dialog, click Delete.
Uninstall the Kube Descheduler Operator.
Navigate to Operators → Installed Operators.
Click the Options menu next to the Kube Descheduler Operator entry and select Uninstall Operator.
In the confirmation dialog, click Uninstall.
Delete the openshift-kube-descheduler-operator namespace.
Navigate to Administration → Namespaces.
Enter openshift-kube-descheduler-operator into the filter box.
Click the Options menu next to the openshift-kube-descheduler-operator entry and select Delete Namespace.
In the confirmation dialog, enter openshift-kube-descheduler-operator and click Delete.
Delete the KubeDescheduler CRD.
Navigate to Administration → Custom Resource Definitions.
Enter KubeDescheduler into the filter box.
Click the Options menu next to the KubeDescheduler entry and select Delete CustomResourceDefinition.
In the confirmation dialog, click Delete.