Schedule based on Cluster Resource Modeling

Overview

When scheduling an application to a specific cluster, the resource status of the destination cluster is a factor that cannot be ignored. When a cluster's resources are insufficient to run a given replica, we want the scheduler to avoid that placement as much as possible. This article describes how Karmada schedules based on cluster resource modeling.

Cluster Resource Modeling

During the scheduling process, karmada-scheduler makes decisions based on a number of factors, one of which is the state of each cluster's resources. Karmada currently has two ways of modeling cluster resources for scheduling: general cluster resource modeling and customized cluster resource modeling.

General Cluster Modeling

Start to use General Cluster Resource Models

For that purpose, we introduced ResourceSummary to the Cluster API.

For example:

  resourceSummary:
    allocatable:
      cpu: "4"
      ephemeral-storage: 206291924Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 16265856Ki
      pods: "110"
    allocated:
      cpu: 950m
      memory: 290Mi
      pods: "11"

From the example above, we can see the cluster's allocatable and allocated resources.

Schedule based on General Cluster Resource Models

Assume that there is a Pod which will be scheduled to one of the member clusters managed by Karmada.

Member1:

  resourceSummary:
    allocatable:
      cpu: "4"
      ephemeral-storage: 206291924Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 16265856Ki
      pods: "110"
    allocated:
      cpu: 950m
      memory: 290Mi
      pods: "11"

Member2:

  resourceSummary:
    allocatable:
      cpu: "4"
      ephemeral-storage: 206291924Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 16265856Ki
      pods: "110"
    allocated:
      cpu: "2"
      memory: 290Mi
      pods: "11"

Member3:

  resourceSummary:
    allocatable:
      cpu: "4"
      ephemeral-storage: 206291924Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 16265856Ki
      pods: "110"
    allocated:
      cpu: "2"
      memory: 290Mi
      pods: "110"

Assume that the Pod's request is 500m CPU. Member1 and Member2 have sufficient resources to run this replica, but Member3 has no remaining Pod quota. Considering the amount of available resources, the scheduler prefers to schedule the Pod to member1.

  Cluster           | member1                | member2           | member3
  AvailableReplicas | (4 - 0.95) / 0.5 = 6.1 | (4 - 2) / 0.5 = 4 | 0
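This per-cluster arithmetic can be sketched in a few lines of Go. The snippet below is illustrative only (it is not Karmada's actual implementation and uses plain float64 quantities instead of Kubernetes resource types): it divides each free resource by the Pod's request and takes the minimum across resources.

  package main

  import "fmt"

  // availableReplicas estimates how many replicas with the given requests fit
  // into a cluster, using only the cluster-wide totals from a ResourceSummary.
  // CPU is in cores, and the implicit "pods" resource is counted as well.
  func availableReplicas(allocatable, allocated, request map[string]float64) int64 {
      replicas := int64(-1)
      for name, req := range request {
          free := allocatable[name] - allocated[name]
          n := int64(free / req) // floor of free/request
          if replicas < 0 || n < replicas {
              replicas = n
          }
      }
      if replicas < 0 {
          return 0
      }
      return replicas
  }

  func main() {
      request := map[string]float64{"cpu": 0.5, "pods": 1}
      // member1: min((4 - 0.95) / 0.5, (110 - 11) / 1) = min(6.1, 99) => 6
      fmt.Println(availableReplicas(
          map[string]float64{"cpu": 4, "pods": 110},
          map[string]float64{"cpu": 0.95, "pods": 11},
          request))
      // member3: the Pod quota is exhausted, (110 - 110) / 1 = 0 => 0
      fmt.Println(availableReplicas(
          map[string]float64{"cpu": 4, "pods": 110},
          map[string]float64{"cpu": 2, "pods": 110},
          request))
  }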

Customized Cluster Modeling

Background

ResourceSummary describes the overall available resources of a cluster. However, it is not precise enough: it mechanically sums the resources on all nodes while ignoring how fragmented those resources are across nodes. For example, take a cluster with 2,000 nodes, each with only 1 CPU core left. From the ResourceSummary we would conclude that the cluster has 2,000 CPU cores available, yet it cannot run a single Pod that requests more than 1 CPU core.

Therefore, we introduce a CustomizedClusterResourceModeling for each cluster that records the resource profile of every node. Karmada collects node and Pod information from each member cluster and computes which of the user-configured resource models each node falls into.

Start to use Customized Cluster Resource Models

The CustomizedClusterResourceModeling feature gate has been Beta since Karmada v1.4 and is enabled by default. If you use Karmada v1.3, you need to enable this feature gate in karmada-scheduler, karmada-aggregated-apiserver and karmada-controller-manager.

For example, you can use the command below to turn on the feature gate in the karmada-controller-manager.

  kubectl --kubeconfig ~/.kube/karmada.config --context karmada-host edit deploy/karmada-controller-manager -nkarmada-system

  - command:
    - /bin/karmada-controller-manager
    - --kubeconfig=/etc/kubeconfig
    - --bind-address=0.0.0.0
    - --cluster-status-update-frequency=10s
    - --secure-port=10357
    - --feature-gates=CustomizedClusterResourceModeling=true
    - --v=4

After that, when a cluster is registered to the Karmada control plane, Karmada automatically sets up a general resource model for the cluster. You can see it in cluster.spec.
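For example, assuming a member cluster named member1 is registered, you can inspect the generated model with:

  kubectl --kubeconfig ~/.kube/karmada.config --context karmada-apiserver get cluster member1 -o yaml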

By default, a resource model is as follows:

  resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  - grade: 4
    ranges:
    - max: "16"
      min: "8"
      name: cpu
    - max: 128Gi
      min: 64Gi
      name: memory
  - grade: 5
    ranges:
    - max: "32"
      min: "16"
      name: cpu
    - max: 256Gi
      min: 128Gi
      name: memory
  - grade: 6
    ranges:
    - max: "64"
      min: "32"
      name: cpu
    - max: 512Gi
      min: 256Gi
      name: memory
  - grade: 7
    ranges:
    - max: "128"
      min: "64"
      name: cpu
    - max: 1Ti
      min: 512Gi
      name: memory
  - grade: 8
    ranges:
    - max: "9223372036854775807"
      min: "128"
      name: cpu
    - max: "9223372036854775807"
      min: 1Ti
      name: memory

Customize your cluster resource models

In some cases, the default cluster resource model may not match your cluster. You can adjust the granularity of the cluster resource model so that replicas are distributed to your cluster more accurately.

For example, you can use the command below to customize the cluster resource models of member1.

  kubectl --kubeconfig ~/.kube/karmada.config --context karmada-apiserver edit cluster/member1

A customized resource model should meet the following requirements (a sketch of how these rules could be checked appears after the list):

  • The grade of each model should not be the same.
  • The number of resource types in each model should be the same.
  • Currently, only four resource types are supported: cpu, memory, storage, and ephemeral-storage.
  • The max value of each resource must be greater than the min value.
  • The min value of each resource in the first model should be 0.
  • The max value of each resource in the last model should be MaxInt64.
  • The resource types of each model should be the same.
  • Model intervals for resources must be contiguous and non-overlapping from low-grade to high-grade models.
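The Go sketch below shows one way these structural rules could be checked. It is purely illustrative, uses simplified float64 quantities, and is not Karmada's actual validation logic:

  package main

  import "fmt"

  // Interval is a model's [min, max) range for one resource, in simplified units.
  type Interval struct {
      Name     string
      Min, Max float64
  }

  // Model is one grade of a cluster resource model.
  type Model struct {
      Grade  int
      Ranges []Interval
  }

  // maxInt64 stands in for the MaxInt64 sentinel required in the last model.
  const maxInt64 = float64(9223372036854775807)

  // validate checks models sorted by grade: per resource, intervals must start
  // at 0, have max > min, be contiguous across grades, and end at MaxInt64.
  func validate(models []Model) error {
      prevMax := map[string]float64{} // resource name -> previous grade's max
      for i, m := range models {
          for _, r := range m.Ranges {
              switch {
              case r.Max <= r.Min:
                  return fmt.Errorf("grade %d: max of %s must exceed min", m.Grade, r.Name)
              case i == 0 && r.Min != 0:
                  return fmt.Errorf("grade %d: %s must start at 0", m.Grade, r.Name)
              case i > 0 && r.Min != prevMax[r.Name]:
                  return fmt.Errorf("grade %d: %s intervals must be contiguous", m.Grade, r.Name)
              }
              prevMax[r.Name] = r.Max
          }
      }
      for name, max := range prevMax {
          if max != maxInt64 {
              return fmt.Errorf("last grade: %s must end at MaxInt64", name)
          }
      }
      return nil
  }

  func main() {
      models := []Model{
          {Grade: 0, Ranges: []Interval{{"cpu", 0, 1}, {"memory", 0, 4}}},
          {Grade: 1, Ranges: []Interval{{"cpu", 1, maxInt64}, {"memory", 4, maxInt64}}},
      }
      fmt.Println(validate(models)) // <nil>
  }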

For example, a customized cluster resource model is given below:

  resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "9223372036854775807"
      min: "2"
      name: cpu
    - max: "9223372036854775807"
      min: 16Gi
      name: memory

The above is a cluster resource model with three grades; each grade defines the resource ranges for two resources, CPU and memory. A node with 0.5 CPU cores and 2Gi of memory remaining will be classified into grade 0, while a node with 1.5 CPU cores and 10Gi of memory remaining will be classified into grade 1.
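The classification can be sketched in Go as follows. This is an illustration of the behavior described above, not Karmada's source code; in particular, placing a node into the lowest grade matched by any of its resources is one conservative reading, consistent with the scheduling rule in the next section:

  package main

  import "fmt"

  // Interval is a grade's [min, max) range for one resource.
  type Interval struct{ Min, Max float64 }

  // Grade maps a resource name to its interval (CPU in cores, memory in Gi).
  type Grade map[string]Interval

  // gradeOf places a node into a grade based on its remaining available
  // resources; when resources match different grades, the lowest grade wins.
  func gradeOf(models []Grade, available map[string]float64) int {
      lowest := len(models) - 1
      for name, avail := range available {
          for i, g := range models {
              iv, ok := g[name]
              if !ok {
                  continue
              }
              if avail >= iv.Min && avail < iv.Max {
                  if i < lowest {
                      lowest = i
                  }
                  break
              }
          }
      }
      return lowest
  }

  func main() {
      // The three-grade model from the example above (memory in Gi,
      // 1 << 62 standing in for MaxInt64).
      models := []Grade{
          {"cpu": {0, 1}, "memory": {0, 4}},
          {"cpu": {1, 2}, "memory": {4, 16}},
          {"cpu": {2, 1 << 62}, "memory": {16, 1 << 62}},
      }
      fmt.Println(gradeOf(models, map[string]float64{"cpu": 0.5, "memory": 2}))  // 0
      fmt.Println(gradeOf(models, map[string]float64{"cpu": 1.5, "memory": 10})) // 1
  }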

Schedule based on Customized Cluster Resource Models

Cluster Resource Model classifies a cluster’s nodes into different grades based on their available resources. This classification helps the scheduler determine the number of replicas that can fit into each cluster. By categorizing nodes based on their available resources, the Cluster Resource Model enables the scheduler to make decisions about where to propagate the application replicas using various scheduling policies.

Assume that there is a Pod to be scheduled to one of the member clusters managed by Karmada, all of which use the same cluster resource models. The remaining available resources of these member clusters are as follows:

Member1:

  spec:
    ...
    - grade: 2
      ranges:
      - max: "4"
        min: "2"
        name: cpu
      - max: 32Gi
        min: 16Gi
        name: memory
    - grade: 3
      ranges:
      - max: "8"
        min: "4"
        name: cpu
      - max: 64Gi
        min: 32Gi
        name: memory
    ...
    ...
  status:
    - count: 1
      grade: 2
    - count: 6
      grade: 3

Member2:

  spec:
    ...
    - grade: 2
      ranges:
      - max: "4"
        min: "2"
        name: cpu
      - max: 32Gi
        min: 16Gi
        name: memory
    - grade: 3
      ranges:
      - max: "8"
        min: "4"
        name: cpu
      - max: 64Gi
        min: 32Gi
        name: memory
    ...
    ...
  status:
    - count: 4
      grade: 2
    - count: 4
      grade: 3

Member3:

  spec:
    ...
    - grade: 6
      ranges:
      - max: "64"
        min: "32"
        name: cpu
      - max: 512Gi
        min: 256Gi
        name: memory
    ...
    ...
  status:
    - count: 1
      grade: 6

Suppose the Pod's resource request is 3 CPU cores and 20Gi of memory. All nodes classified into grade 3 and above can fulfill this request, since a grade's minimum resource values must be at least as large as the requested values. Nodes in grade 2, for example, may have less than 3C or 20Gi available; since we cannot know for sure, the entire grade is eliminated. The cluster resource model then calculates how many replicas fit into each cluster based on this information:

  Cluster           | member1                                       | member2                                       | member3
  AvailableReplicas | 1 * min(2/3, 16/20) + 6 * min(4/3, 32/20) = 6 | 4 * min(2/3, 16/20) + 4 * min(4/3, 32/20) = 4 | 1 * min(32/3, 256/20) = 10
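The table's arithmetic can be sketched as follows; the snippet is illustrative Go rather than Karmada's actual estimator. Each grade contributes its node count multiplied by the number of replicas the grade's guaranteed minimum can hold, flooring each resource ratio:

  package main

  import (
      "fmt"
      "math"
  )

  // gradeCount pairs a grade's per-resource minimums (its guaranteed free
  // resources per node) with the node count reported in the cluster status.
  type gradeCount struct {
      min   map[string]float64
      count int
  }

  // estimateReplicas sums, over all grades, count * min over resources of
  // floor(grade min / request); grades whose minimum cannot hold even one
  // replica contribute zero.
  func estimateReplicas(grades []gradeCount, request map[string]float64) int {
      total := 0
      for _, g := range grades {
          perNode := math.MaxFloat64
          for name, req := range request {
              perNode = math.Min(perNode, math.Floor(g.min[name]/req))
          }
          total += g.count * int(perNode)
      }
      return total
  }

  func main() {
      request := map[string]float64{"cpu": 3, "memory": 20} // 3C, 20Gi
      member1 := []gradeCount{
          {min: map[string]float64{"cpu": 2, "memory": 16}, count: 1}, // grade 2
          {min: map[string]float64{"cpu": 4, "memory": 32}, count: 6}, // grade 3
      }
      member3 := []gradeCount{
          {min: map[string]float64{"cpu": 32, "memory": 256}, count: 1}, // grade 6
      }
      fmt.Println(estimateReplicas(member1, request)) // 1*0 + 6*1 = 6
      fmt.Println(estimateReplicas(member3, request)) // 1 * min(10, 12) = 10
  }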

Suppose now that the Pod requires 5C and 60Gi. In this case, not even grade 3 nodes satisfy the resource request (some individual nodes might, but since we cannot know for sure, the entire grade has to be eliminated), because 5C > 4C and 60Gi > 32Gi. The number of replicas that fit in each cluster is calculated below:

  Cluster           | member1 | member2 | member3
  AvailableReplicas | 0       | 0       | 1 * min(32/5, 256/60) = 4

Disable Cluster Resource Modeling

The scheduler always uses resource modeling to make scheduling decisions when replicas are assigned dynamically based on clusters' free resources. In the process of resource modeling, Karmada collects node and Pod information from all clusters it manages, which imposes a considerable performance burden in large-scale scenarios.

You can disable cluster resource modeling by setting --enable-cluster-resource-modeling to false in karmada-controller-manager and karmada-agent.
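For example, the relevant container argument in the karmada-controller-manager deployment would look like the sketch below, following the args layout shown earlier (the neighboring flags are carried over from that earlier snippet):

  - command:
    - /bin/karmada-controller-manager
    - --kubeconfig=/etc/kubeconfig
    - --enable-cluster-resource-modeling=false
    ...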