Applying autoscaling to an OKD cluster

Applying autoscaling to an OKD cluster involves deploying a cluster autoscaler and then deploying machine autoscalers for each machine type in your cluster.

You can configure the cluster autoscaler only in clusters where the machine API is operational.

About the cluster autoscaler

The cluster autoscaler adjusts the size of an OKD cluster to meet its current deployment needs. It uses declarative, Kubernetes-style arguments to provide infrastructure management that does not rely on objects of a specific cloud provider. The cluster autoscaler has a cluster scope, and is not associated with a particular namespace.

The cluster autoscaler increases the size of the cluster when pods fail to schedule on any of the current nodes because of insufficient resources, or when another node is necessary to meet deployment needs. The cluster autoscaler does not increase the cluster resources beyond the limits that you specify.

Ensure that the maxNodesTotal value in the ClusterAutoscaler resource definition that you create is large enough to account for the total possible number of machines in your cluster. This value must encompass the number of control plane machines and the possible number of compute machines that you might scale to.
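
As a rough illustration of this sizing rule, assume a hypothetical cluster with three control plane machines and two autoscaled machine sets that can grow to 12 and 6 machines. These counts are assumptions chosen only to show the arithmetic. Any compute machines that the autoscaler does not manage would also count toward the total.

```yaml
# Hypothetical sizing sketch; every count below is an assumption for illustration.
#   control plane machines:                       3
#   compute machines, machine set A (maximum):   12
#   compute machines, machine set B (maximum):    6
#   total possible machines:                     21
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    maxNodesTotal: 21   # at least 3 + 12 + 6
```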

The cluster autoscaler decreases the size of the cluster when some nodes are consistently not needed for a significant period, such as when a node has low resource use and all of its important pods can fit on other nodes.

If the following types of pods are present on a node, the cluster autoscaler will not remove the node:

  • Pods with restrictive pod disruption budgets (PDBs).

  • Kube-system pods that do not run on the node by default.

  • Kube-system pods that do not have a PDB or have a PDB that is too restrictive.

  • Pods that are not backed by a controller object such as a deployment, replica set, or stateful set.

  • Pods with local storage.

  • Pods that cannot be moved elsewhere because of a lack of resources, incompatible node selectors or affinity, matching anti-affinity, and so on.

  • Pods that have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation.
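
For example, a pod that must not be evicted during scale down could carry the safe-to-evict annotation with a value of "false". The following is a minimal sketch; the pod name, namespace, and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-batch-worker        # hypothetical name
  namespace: example                # hypothetical namespace
  annotations:
    # Tells the cluster autoscaler not to remove the node that runs this pod.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
  - name: worker
    image: registry.example.com/example/worker:latest   # placeholder image
```

Because this annotation blocks node removal for as long as the pod runs, use it sparingly.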

If you configure the cluster autoscaler, additional usage restrictions apply:

  • Do not modify the nodes that are in autoscaled node groups directly. All nodes within the same node group have the same capacity and labels and run the same system pods.

  • Specify requests for your pods.

  • If you have to prevent pods from being deleted too quickly, configure appropriate PDBs.

  • Confirm that your cloud provider quota is large enough to support the maximum node pools that you configure.

  • Do not run additional node group autoscalers, especially the ones offered by your cloud provider.
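
Two of these restrictions, specifying requests and configuring PDBs, can be sketched together. The example below assumes a hypothetical deployment named example-app in a hypothetical example namespace, and it assumes that the policy/v1 API is available in your cluster; older Kubernetes versions serve PodDisruptionBudget from policy/v1beta1 instead.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                 # hypothetical name
  namespace: example                # hypothetical namespace
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.example.com/example/app:latest   # placeholder image
        resources:
          requests:                 # the autoscaler relies on requests to judge fit
            cpu: 250m
            memory: 256Mi
---
apiVersion: policy/v1               # assumption: policy/v1 is served by the cluster
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
  namespace: example
spec:
  minAvailable: 2                   # keep at least two replicas during node drains
  selector:
    matchLabels:
      app: example-app
```

With three replicas and minAvailable set to 2, a drain can evict only one pod at a time, so nodes can still be removed, but not all replicas at once.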

The horizontal pod autoscaler (HPA) and the cluster autoscaler modify cluster resources in different ways. The HPA changes the number of replicas in a deployment or replica set based on the current CPU load. If the load increases, the HPA creates new replicas, regardless of the amount of resources available to the cluster. If there are not enough resources, the cluster autoscaler adds resources so that the HPA-created pods can run. If the load decreases, the HPA stops some replicas. If this action causes some nodes to be underutilized or completely empty, the cluster autoscaler deletes the unnecessary nodes.
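
To make the interaction concrete, the following sketch shows an HPA that scales a hypothetical deployment named example-app on CPU load. The names, replica counts, and CPU target are assumptions for illustration. It uses the autoscaling/v1 API, which supports only a CPU utilization target.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa             # hypothetical name
  namespace: example                # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app               # hypothetical deployment to scale
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 75
```

If the HPA raises the replica count and some of the new pods cannot be scheduled, those pending pods are what prompt the cluster autoscaler to add nodes.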

The cluster autoscaler takes pod priorities into account. The Pod Priority and Preemption feature enables scheduling pods based on priorities if the cluster does not have enough resources, but the cluster autoscaler ensures that the cluster has resources to run all pods. To honor the intention of both features, the cluster autoscaler includes a priority cutoff function. You can use this cutoff to schedule “best-effort” pods, which do not cause the cluster autoscaler to increase resources but instead run only when spare resources are available.

Pods with priority lower than the cutoff value do not cause the cluster to scale up or prevent the cluster from scaling down. No new nodes are added to run the pods, and nodes running these pods might be deleted to free resources.
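
As a sketch of how the cutoff might be used, assume that the ClusterAutoscaler resource sets podPriorityThreshold to -10. A priority class with a lower value, such as the hypothetical best-effort-batch class below, then marks pods that do not trigger a scale up. All names and the image are placeholders.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort-batch           # hypothetical name
value: -20                          # lower than the assumed podPriorityThreshold of -10
globalDefault: false
description: "Hypothetical class for pods that run only on spare capacity."
---
apiVersion: v1
kind: Pod
metadata:
  name: example-low-priority        # hypothetical name
  namespace: example                # hypothetical namespace
spec:
  priorityClassName: best-effort-batch
  containers:
  - name: job
    image: registry.example.com/example/job:latest   # placeholder image
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
```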

About the machine autoscaler

The machine autoscaler adjusts the number of machines in the machine sets that you deploy in an OKD cluster. You can scale both the default worker machine set and any other machine sets that you create. The machine autoscaler creates more machines when the cluster runs out of resources to support more deployments. Any changes to the values in MachineAutoscaler resources, such as the minimum or maximum number of instances, are immediately applied to the machine set they target.

You must deploy a machine autoscaler for the cluster autoscaler to scale your machines. The cluster autoscaler uses the annotations on machine sets that the machine autoscaler sets to determine the resources that it can scale. If you define a cluster autoscaler without also defining machine autoscalers, the cluster autoscaler will never scale your cluster.
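
The fragment below sketches what such annotations might look like on a machine set that a machine autoscaler targets with a minimum of 1 and a maximum of 12 replicas. The annotation keys are an assumption based on the OpenShift cluster autoscaler operator and are not confirmed by this document, so verify them against a machine set in your own cluster. The machine autoscaler manages these values, so this is shown for inspection only, not as something to set by hand.

```yaml
# Illustrative fragment of a MachineSet. The annotation keys are assumed
# and should be verified in your cluster.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-us-east-1a
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "12"
```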

Configuring the cluster autoscaler

First, deploy the cluster autoscaler to manage automatic resource scaling in your OKD cluster.

Because the cluster autoscaler is scoped to the entire cluster, you can create only one cluster autoscaler for the cluster.

ClusterAutoscaler resource definition

This ClusterAutoscaler resource definition shows the parameters and sample values for the cluster autoscaler.

    apiVersion: "autoscaling.openshift.io/v1"
    kind: "ClusterAutoscaler"
    metadata:
      name: "default"
    spec:
      podPriorityThreshold: -10 # (1)
      resourceLimits:
        maxNodesTotal: 24 # (2)
        cores:
          min: 8 # (3)
          max: 128 # (4)
        memory:
          min: 4 # (5)
          max: 256 # (6)
        gpus:
          - type: nvidia.com/gpu # (7)
            min: 0 # (8)
            max: 16 # (9)
          - type: amd.com/gpu # (7)
            min: 0 # (8)
            max: 4 # (9)
      scaleDown: # (10)
        enabled: true # (11)
        delayAfterAdd: 10m # (12)
        delayAfterDelete: 5m # (13)
        delayAfterFailure: 30s # (14)
        unneededTime: 5m # (15)

(1) Specify the priority that a pod must exceed to cause the cluster autoscaler to deploy additional nodes. Enter a 32-bit integer value. The podPriorityThreshold value is compared to the value of the PriorityClass that you assign to each pod.
(2) Specify the maximum number of nodes to deploy. This value is the total number of machines that are deployed in your cluster, not just the ones that the autoscaler controls. Ensure that this value is large enough to account for all of your control plane and compute machines and the total number of replicas that you specify in your MachineAutoscaler resources.
(3) Specify the minimum number of cores to deploy in the cluster.
(4) Specify the maximum number of cores to deploy in the cluster.
(5) Specify the minimum amount of memory, in GiB, in the cluster.
(6) Specify the maximum amount of memory, in GiB, in the cluster.
(7) Optionally, specify the type of GPU node to deploy. Only nvidia.com/gpu and amd.com/gpu are valid types.
(8) Specify the minimum number of GPUs to deploy in the cluster.
(9) Specify the maximum number of GPUs to deploy in the cluster.
(10) In this section, you can specify the period to wait for each action by using any valid ParseDuration interval, including ns, us, ms, s, m, and h.
(11) Specify whether the cluster autoscaler can remove unnecessary nodes.
(12) Optionally, specify the period to wait before deleting a node after a node has recently been added. If you do not specify a value, the default value of 10m is used.
(13) Specify the period to wait before deleting a node after a node has recently been deleted. If you do not specify a value, the default value of 10s is used.
(14) Specify the period to wait before deleting a node after a scale down failure occurred. If you do not specify a value, the default value of 3m is used.
(15) Specify the period before an unnecessary node is eligible for deletion. If you do not specify a value, the default value of 10m is used.

When performing a scaling operation, the cluster autoscaler remains within the ranges set in the ClusterAutoscaler resource definition, such as the minimum and maximum number of cores to deploy or the amount of memory in the cluster. However, the cluster autoscaler does not correct the current values in your cluster to be within those ranges.

Deploying the cluster autoscaler

To deploy the cluster autoscaler, you create an instance of the ClusterAutoscaler resource.

Procedure

  1. Create a YAML file for the ClusterAutoscaler resource that contains the customized resource definition.

  2. Create the resource in the cluster:

    $ oc create -f <filename>.yaml

    Replace <filename> with the name of the resource file that you customized.

Next steps

  • After you configure the cluster autoscaler, you must configure at least one machine autoscaler.

Configuring the machine autoscalers

After you deploy the cluster autoscaler, deploy MachineAutoscaler resources that reference the machine sets that are used to scale the cluster.

You must deploy at least one MachineAutoscaler resource after you deploy the ClusterAutoscaler resource.

You must configure separate resources for each machine set. Remember that machine sets are different in each region, so consider whether you want to enable machine scaling in multiple regions. The machine set that you scale must have at least one machine in it.

MachineAutoscaler resource definition

This MachineAutoscaler resource definition shows the parameters and sample values for the machine autoscaler.

    apiVersion: "autoscaling.openshift.io/v1beta1"
    kind: "MachineAutoscaler"
    metadata:
      name: "worker-us-east-1a" # (1)
      namespace: "openshift-machine-api"
    spec:
      minReplicas: 1 # (2)
      maxReplicas: 12 # (3)
      scaleTargetRef: # (4)
        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet # (5)
        name: worker-us-east-1a # (6)

(1) Specify the machine autoscaler name. To make it easier to identify which machine set this machine autoscaler scales, specify or include the name of the machine set to scale. The machine set name takes the following form: <clusterid>-<machineset>-<aws-region-az>
(2) Specify the minimum number of machines of the specified type that must remain in the specified zone after the cluster autoscaler initiates cluster scaling. If the cluster runs on AWS, GCP, or Azure, this value can be set to 0. For other providers, do not set this value to 0.
(3) Specify the maximum number of machines of the specified type that the cluster autoscaler can deploy in the specified AWS zone after it initiates cluster scaling. Ensure that the maxNodesTotal value in the ClusterAutoscaler resource definition is large enough to allow the machine autoscaler to deploy this number of machines.
(4) In this section, provide values that describe the existing machine set to scale.
(5) The kind parameter value is always MachineSet.
(6) The name value must match the name of an existing machine set, as shown in the metadata.name parameter value.
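
Because each machine set needs its own resource, a cluster that scales in two zones would need two MachineAutoscaler resources. The following sketch assumes a second, hypothetical machine set named worker-us-east-1b alongside the sample machine set; the replica limits are illustrative only.

```yaml
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-us-east-1a"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-us-east-1a
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-us-east-1b"         # hypothetical second machine set
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-us-east-1b
```

With these limits, the two machine sets can grow to 18 compute machines in total, so the maxNodesTotal value in the ClusterAutoscaler resource would need to cover those machines in addition to the control plane machines and any other machines in the cluster.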

Deploying the machine autoscaler

To deploy the machine autoscaler, you create an instance of the MachineAutoscaler resource.

Procedure

  1. Create a YAML file for the MachineAutoscaler resource that contains the customized resource definition.

  2. Create the resource in the cluster:

    $ oc create -f <filename>.yaml

    Replace <filename> with the name of the resource file that you customized.

Additional resources