Application-level failover

In a multi-cluster scenario, user workloads may be deployed across multiple clusters to improve service availability. Karmada already supports multi-cluster failover when it detects a cluster fault, but that mechanism works from the cluster's perspective. Some cluster failures affect only specific applications, so it may be necessary to distinguish affected applications from unaffected ones. Moreover, an application can become unavailable even while the cluster's control plane remains healthy. Therefore, Karmada also needs to provide fault migration from the application's perspective.

Why Application-level Failover is Required

The following describes some scenarios of application-level failover:

  • An administrator deploys an application in multiple clusters with preemptive scheduling. When cluster resources are in short supply, low-priority applications that were running normally are preempted and cannot run for a long time. Since these applications cannot self-heal within the cluster, users want to reschedule them to another cluster to keep the service available.
  • An administrator uses a cloud vendor's spot instances to deploy the application. The application may fail to run when the spot resources are reclaimed. In this scenario, the amount of resources perceived by the scheduler is the size of the resource quota, not the actual available resources. Users then want to schedule the application to a cluster other than the one where it previously failed.
  • ….

How Do I Enable the Feature?

When an application migrates from one cluster to another, its dependencies need to be migrated along with it. Therefore, you need to ensure that the PropagateDeps feature gate is enabled and that propagateDeps: true is set in the propagation policy. The PropagateDeps feature gate has been Beta since Karmada v1.4 and is enabled by default.
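For example, the policy only needs to carry propagateDeps: true alongside its other fields (a minimal excerpt; the policy name is illustrative):

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-propagation
spec:
  propagateDeps: true
  #...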

Also, whether an application needs to be migrated depends on its health status. Karmada's Resource Interpreter Framework is designed for interpreting resource structure. It provides users with an interpreter operation to tell Karmada how to figure out the health status of a specific object; it is up to users to decide when to reschedule. Before you use this feature, make sure the interpretHealth rules for the application are configured.
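Karmada ships default health interpretation for common built-in workloads such as Deployment; for custom resources you can declare the rule yourself with a ResourceInterpreterCustomization. The following is a minimal sketch, assuming a Deployment-like object should be considered healthy once all replicas are ready (the object name and the health condition are illustrative):

apiVersion: config.karmada.io/v1alpha1
kind: ResourceInterpreterCustomization
metadata:
  name: declarative-health-example
spec:
  target:
    apiVersion: apps/v1
    kind: Deployment
  customizations:
    healthInterpretation:
      # InterpretHealth returns true when the object should be treated as healthy.
      luaScript: >
        function InterpretHealth(observedObj)
          return observedObj.status ~= nil and observedObj.status.readyReplicas == observedObj.spec.replicas
        end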

If you use the purge mode with graceful eviction, GracefulEviction feature gate should be enabled. GracefulEviction feature gate has also evolved to Beta since Karmada v1.4 and is enabled by default.

Configure Application Failover

The .spec.failover.application field of PropagationPolicy defines the rules for application failover.

It has three fields to set:

  • DecisionConditions
  • PurgeMode
  • GracePeriodSeconds

Configure Decision Conditions

DecisionConditions indicates the conditions for deciding whether to perform failover. The failover process starts only when all conditions are met. Currently, the only condition is the toleration time for the application's unhealthy state (tolerationSeconds), which defaults to 300s.

PropagationPolicy can be configured as follows:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-propagation
spec:
  #...
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 300
  #...

Configure PurgeMode

PurgeMode represents how to deal with the legacy application on the cluster from which it is migrated. Karmada supports three different purge modes for eviction:

  • Immediately represents that Karmada will immediately evict the legacy application.
  • Graciously represents that Karmada will wait for the application to become healthy on the new cluster, or for a timeout to be reached, before evicting the legacy application. You also need to configure GracePeriodSeconds in this case: if the application on the new cluster cannot reach a Healthy state within GracePeriodSeconds (600s by default), Karmada deletes the legacy application anyway.
  • Never represents that Karmada will not evict the application; users decide manually how to clean up the redundant copies.

PropagationPolicy can be configured as follows:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-propagation
spec:
  #...
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 300
      gracePeriodSeconds: 600
      purgeMode: Graciously
  #...

Or

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-propagation
spec:
  #...
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 300
      purgeMode: Never
  #...

Example

Assume that you have configured a propagationPolicy:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Never
  propagateDeps: true # application failover is set, propagateDeps must be true
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
        - member3
    spreadConstraints:
      - maxGroups: 1
        minGroups: 1
        spreadByField: cluster
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - image: nginx
          name: nginx

Suppose the application is scheduled to member2 and its two replicas run normally. Now cordon all nodes in member2 and delete the pods to put the application into an abnormal state.

# mark node "member2-control-plane" as unschedulable in cluster member2
kubectl --context member2 cordon member2-control-plane
# delete the pod in cluster member2
kubectl --context member2 delete pod -l app=nginx

You can immediately see from the ResourceBinding that the deployment is now unhealthy.
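To inspect it, query the ResourceBinding on the Karmada control plane. Assuming the default binding name (<resource-name>-<kind>, lowercased) and a kubeconfig context named karmada-apiserver, the command would look like the following, with the relevant part of the status shown below it:

kubectl --context karmada-apiserver get resourcebinding nginx-deployment -o yaml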

#...
status:
  aggregatedStatus:
    - applied: true
      clusterName: member2
      health: Unhealthy
      status:
        availableReplicas: 0
        readyReplicas: 0
        replicas: 2

After tolerationSeconds is reached, you will find that the deployment has been rescheduled to member1 and a graceful eviction task for member2 has been recorded in the ResourceBinding.

#...
spec:
  clusters:
    - name: member1
      replicas: 2
  gracefulEvictionTasks:
    - creationTimestamp: "2023-05-08T09:29:02Z"
      fromCluster: member2
      producer: resource-binding-application-failover-controller
      reason: ApplicationFailure
      suppressDeletion: true

After you confirm the failure, you can set suppressDeletion to false in gracefulEvictionTasks to evict the application from the failed cluster.
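For example, assuming the ResourceBinding is named nginx-deployment in the default namespace and the control-plane context is karmada-apiserver, a JSON patch like the following would lift the suppression (the task index 0 is illustrative):

kubectl --context karmada-apiserver patch resourcebinding nginx-deployment --type='json' \
  -p='[{"op": "replace", "path": "/spec/gracefulEvictionTasks/0/suppressDeletion", "value": false}]'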

Stateful Application Failover Support

Starting from v1.12, the application-level failover feature supports stateful applications: it provides a generalized way for users to define how application state is preserved in the context of cluster-to-cluster failovers.

In releases prior to v1.12, Karmada's scheduling logic ran on the assumption that scheduled and rescheduled resources are stateless. In some cases, users may want to preserve a certain state so that applications can resume from where they left off in the previous cluster. For CRDs dealing with data processing (such as Flink or Spark), it can be particularly useful to restart applications from a previous checkpoint, so that they can seamlessly resume processing data while avoiding double processing.

Defining StatePreservation

StatePreservation is a field under .spec.failover.application. It defines the policy for preserving and restoring state data during failover events for stateful applications. When an application fails over from one cluster to another, this policy enables the extraction of critical data from the original resource configuration.

It contains a list of StatePreservationRule configurations. Each rule specifies a JSONPath expression targeting the specific piece of state data to be preserved during failover events. An AliasLabelName is associated with each rule, serving as a label key when the preserved data is passed to the new cluster. You can define the state preservation policy as follows:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-propagation
spec:
  #...
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: pre-updated-replicas
            jsonPath: "{ .updatedReplicas }"

The above configuration parses the updatedReplicas field from the application's .status before migration. Upon successful migration, the extracted data is re-injected into the new resource, keyed by the configured aliasLabelName, ensuring that the application can resume operation with its previous state intact.
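For illustration, assuming the Deployment reported 2 updated replicas before the failure, the resource propagated to the new cluster would carry a label roughly like the following (the actual value depends on the state captured at failover time):

metadata:
  labels:
    pre-updated-replicas: "2"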

This capability requires enabling the StatefulFailoverInjection feature gate. StatefulFailoverInjection is currently in Alpha and is turned off by default.
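To turn it on, assuming karmada-controller-manager runs as a Deployment on the host cluster, you can add the gate to its --feature-gates flag (a sketch; adjust to how your control plane is deployed):

# excerpt from the karmada-controller-manager container args
- --feature-gates=StatefulFailoverInjection=true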

There are currently some restrictions on the use of this feature; please pay attention to them when using it:

  1. Only the scenario where an application is deployed in one cluster and migrated to another cluster is considered.
  2. If consecutive failovers occur, for example, an application is migrated from clusterA to clusterB and then to clusterC, the PreservedLabelState before the last failover is used for injection. If the PreservedLabelState is empty, the injection is skipped.
  3. The injection operation is performed only when PurgeMode is set to Immediately.

Note:

Application failover is still a work in progress. We are in the process of gathering use cases. If you are interested in this feature, please feel free to start an enhancement issue to let us know.