Configuring leader election
- Operator leader election examples
  - Leader-for-life election
  - Leader-with-lease election

Configuring leader election

During the lifecycle of an Operator, it is possible that there may be more than one instance running at any given time, for example when rolling out an upgrade for the Operator. In such a scenario, it is necessary to avoid contention between multiple Operator instances using leader election. This ensures only one leader instance handles the reconciliation while the other instances are inactive but ready to take over when the leader steps down.

There are two different leader election implementations to choose from, each with its own trade-off:

Leader-for-life

The leader pod only gives up leadership, using garbage collection, when it is deleted. This implementation precludes the possibility of two instances mistakenly running as leaders, a state also known as split brain. However, this method can be subject to a delay in electing a new leader. For example, when the leader pod is on an unresponsive or partitioned node, the pod-eviction-timeout dictates long how it takes for the leader pod to be deleted from the node and step down, with a default of 5m. See the Leader-for-life Go documentation for more.

Leader-with-lease

The leader pod periodically renews the leader lease and gives up leadership when it cannot renew the lease. This implementation allows for a faster transition to a new leader when the existing leader is isolated, but there is a possibility of split brain in certain situations. See the Leader-with-lease Go documentation for more.

By default, the Operator SDK enables the Leader-for-life implementation. Consult the related Go documentation for both approaches to consider the trade-offs that make sense for your use case.

Operator leader election examples

The following examples illustrate how to use the two leader election options for an Operator, Leader-for-life and Leader-with-lease.

Leader-for-life election

With the Leader-for-life election implementation, a call to leader.Become() blocks the Operator as it retries until it can become the leader by creating the config map named memcached-operator-lock:

import (
  ...
  "github.com/operator-framework/operator-sdk/pkg/leader"
)
func main() {
  ...
  err = leader.Become(context.TODO(), "memcached-operator-lock")
  if err != nil {
    log.Error(err, "Failed to retry for leader lock")
    os.Exit(1)
  }
  ...
}

If the Operator is not running inside a cluster, leader.Become() simply returns without error to skip the leader election since it cannot detect the name of the Operator.

Leader-with-lease election

The Leader-with-lease implementation can be enabled using the Manager Options for leader election:

import (
  ...
  "sigs.k8s.io/controller-runtime/pkg/manager"
)
func main() {
  ...
  opts := manager.Options{
    ...
    LeaderElection: true,
    LeaderElectionID: "memcached-operator-lock"
  }
  mgr, err := manager.New(cfg, opts)
  ...
}

When the Operator is not running in a cluster, the Manager returns an error when starting because it cannot detect the namespace of the Operator to create the config map for leader election. You can override this namespace by setting the LeaderElectionNamespace option for the Manager.