Automatic Multicluster Failover
The Linkerd Failover extension is a controller which automatically shifts traffic from a primary service to one or more fallback services whenever the primary becomes unavailable. This can help add resiliency when you have a service which is replicated in multiple clusters. If the local service is unavailable, the failover controller can shift that traffic to the backup cluster.
Let’s see a simple example of how to use this extension by installing the Emojivoto application on two Kubernetes clusters and simulating a failure in one cluster. We will see the failover controller shift traffic to the other cluster to ensure the service remains available.
Prerequisites
You will need two clusters, each with Linkerd installed, linked together with the multicluster extension. Follow the steps in the multicluster guide to generate a shared trust root, install Linkerd, Linkerd Viz, and Linkerd Multicluster, and link the clusters together. For the remainder of this guide, we will assume the cluster context names are “east” and “west”. Please substitute your cluster context names where appropriate.
Installing the Failover Extension
Failovers are described using SMI TrafficSplit resources, so we need both the Linkerd SMI extension and the Linkerd Failover extension. These can be installed in both clusters, but since we’ll only be initiating failover from the “west” cluster in this example, we’ll only install them in that cluster:
# Install linkerd-smi in west cluster
> helm --kube-context=west repo add linkerd-smi https://linkerd.github.io/linkerd-smi
> helm --kube-context=west repo up
> helm --kube-context=west install linkerd-smi -n linkerd-smi --create-namespace linkerd-smi/linkerd-smi
# Install linkerd-failover in west cluster
> helm --kube-context=west repo add linkerd-edge https://helm.linkerd.io/edge
> helm --kube-context=west repo up
> helm --kube-context=west install linkerd-failover -n linkerd-failover --create-namespace --devel linkerd-edge/linkerd-failover
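Before continuing, it can be worth confirming that both extensions came up. The following is a minimal sketch rather than part of the official instructions: the function name check_failover_install is hypothetical, and it assumes that the charts create Deployments in the linkerd-smi and linkerd-failover namespaces used above and that the SMI chart registers a CRD named trafficsplits.split.smi-spec.io.

```shell
# Hypothetical helper: confirm that the TrafficSplit CRD exists and that every
# Deployment in the two extension namespaces reports the Available condition.
# Assumption: the namespaces match the helm install commands above.
check_failover_install() {
  ctx="${1:-west}"
  # The failover controller acts on TrafficSplit resources, so the CRD must exist.
  kubectl --context="$ctx" get crd trafficsplits.split.smi-spec.io >/dev/null || {
    echo "TrafficSplit CRD not found in $ctx" >&2
    return 1
  }
  for ns in linkerd-smi linkerd-failover; do
    kubectl --context="$ctx" -n "$ns" wait --for=condition=available \
      --timeout=120s deployment --all || return 1
  done
  echo "failover extension ready in $ctx"
}
```

Running check_failover_install west should return a non-zero status if the CRD is missing or a Deployment does not become available within the timeout.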
Installing and Exporting Emojivoto
We’ll now install the Emojivoto example application into both clusters:
> linkerd --context=west inject https://run.linkerd.io/emojivoto.yml | kubectl --context=west apply -f -
> linkerd --context=east inject https://run.linkerd.io/emojivoto.yml | kubectl --context=east apply -f -
Next we’ll “export” the web-svc in the east cluster by setting the mirror.linkerd.io/exported=true label. This will instruct the multicluster extension to create a mirror service called web-svc-east in the west cluster, making the east Emojivoto application available in the west cluster:
> kubectl --context=east -n emojivoto label svc/web-svc mirror.linkerd.io/exported=true
> kubectl --context=west -n emojivoto get svc
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
emoji-svc      ClusterIP   10.96.41.137    <none>        8080/TCP,8801/TCP   13m
voting-svc     ClusterIP   10.96.247.68    <none>        8080/TCP,8801/TCP   13m
web-svc        ClusterIP   10.96.222.169   <none>        80/TCP              13m
web-svc-east   ClusterIP   10.96.244.245   <none>        80/TCP              92s
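The mirror service appearing in the list does not by itself show that it has anywhere to send traffic. As a hedged sketch, the hypothetical helper below (mirror_endpoints is not a Linkerd command) prints the endpoint addresses behind a service. It assumes the default gateway-based multicluster setup, in which the mirror service's Endpoints object points at the remote cluster's gateway.

```shell
# Hypothetical helper: print the endpoint IPs behind a (mirror) service.
# Arguments: kube context, namespace, service name.
# Assumption: the mirror service is backed by a regular Endpoints object.
mirror_endpoints() {
  ctx="$1"; ns="$2"; svc="$3"
  kubectl --context="$ctx" -n "$ns" get endpoints "$svc" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}
```

For this guide the call would be mirror_endpoints west emojivoto web-svc-east. Empty output would suggest that the link between the clusters is not healthy yet.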
Creating the Failover TrafficSplit
To tell the failover controller how to fail over traffic, we need to create a TrafficSplit resource in the west cluster with the failover.linkerd.io/controlled-by: linkerd-failover label. The failover.linkerd.io/primary-service annotation indicates that the web-svc backend is the primary; all other backends are treated as fallbacks:
kubectl --context=west apply -f - <<EOF
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: web-svc-failover
  namespace: emojivoto
  labels:
    failover.linkerd.io/controlled-by: linkerd-failover
  annotations:
    failover.linkerd.io/primary-service: web-svc
spec:
  service: web-svc
  backends:
    - service: web-svc
      weight: 1
    - service: web-svc-east
      weight: 0
EOF
This TrafficSplit indicates that the local (west) web-svc should be used as the primary, but traffic should be shifted to the remote (east) web-svc-east if the primary becomes unavailable.
Testing the Failover
We can use the linkerd viz stat command to see that the vote-bot traffic generator in the west cluster is sending traffic to the local primary service, web-svc:
> linkerd --context=west viz stat -n emojivoto svc --from deploy/vote-bot
NAME           MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
web-svc             -    96.67%   2.0rps           2ms           3ms           5ms          1
web-svc-east        -         -        -             -             -             -          -
Now we’ll simulate the local service becoming unavailable by scaling it down:
> kubectl --context=west -n emojivoto scale deploy/web --replicas=0
We can immediately see that the TrafficSplit has been adjusted to send traffic to the backup. Notice that the web-svc backend now has weight 0 and the web-svc-east backend now has weight 1.
> kubectl --context=west -n emojivoto get ts/web-svc-failover -o yaml
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  annotations:
    failover.linkerd.io/primary-service: web-svc
  creationTimestamp: "2022-03-22T23:47:11Z"
  generation: 4
  labels:
    failover.linkerd.io/controlled-by: linkerd-failover
  name: web-svc-failover
  namespace: emojivoto
  resourceVersion: "10817806"
  uid: 77039fb3-5e39-48ad-b7f7-638d187d7a28
spec:
  backends:
  - service: web-svc
    weight: 0
  - service: web-svc-east
    weight: 1
  service: web-svc
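Reading the full YAML is fine for a one-off check, but the weights are the only part that changes during a failover. As a rough sketch, the hypothetical filter below (active_backends is an illustrative name) reads tab-separated "service, weight" lines on standard input and prints only the services whose weight is greater than zero. The kubectl invocation in the comment is one way to produce that input and assumes the TrafficSplit name and namespace used in this guide.

```shell
# Hypothetical filter: print the backends that currently receive traffic.
# Input: one "<service><TAB><weight>" pair per line on stdin.
#
# Possible usage against the TrafficSplit from this guide:
#   kubectl --context=west -n emojivoto get ts/web-svc-failover \
#     -o jsonpath='{range .spec.backends[*]}{.service}{"\t"}{.weight}{"\n"}{end}' \
#     | active_backends
active_backends() {
  # "$2 + 0" forces a numeric comparison of the weight column.
  awk -F '\t' '$2 + 0 > 0 { print $1 }'
}
```

While the primary is scaled down, piping the weights shown above through this filter should leave only the fallback backend.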
We can also confirm that this traffic is going to the fallback using the viz stat command:
> linkerd --context=west viz stat -n emojivoto svc --from deploy/vote-bot
NAME           MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
web-svc             -         -        -             -             -             -          -
web-svc-east        -    93.04%   1.9rps          25ms          30ms          30ms          1
Finally, we can restore the primary by scaling its deployment back up and observe the traffic shift back to it:
> kubectl --context=west -n emojivoto scale deploy/web --replicas=1
deployment.apps/web scaled
> linkerd --context=west viz stat -n emojivoto svc --from deploy/vote-bot
NAME           MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
web-svc             -    89.29%   1.9rps           2ms           4ms           5ms          1
web-svc-east        -         -        -             -             -             -          -
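If you want to script this check instead of watching the stats by hand, one option is to poll the TrafficSplit until the primary backend has a non-zero weight again. The sketch below is only an illustration: wait_for_primary and the POLL_INTERVAL variable are hypothetical names, and the retry count and interval are arbitrary defaults.

```shell
# Hypothetical helper: poll a TrafficSplit until the given backend has weight > 0.
# Arguments: kube context, namespace, TrafficSplit name, backend service, [max tries].
# POLL_INTERVAL (seconds between attempts, default 2) is an illustrative knob.
wait_for_primary() {
  ctx="$1"; ns="$2"; name="$3"; primary="$4"; tries="${5:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Select the weight of the backend whose service matches $primary.
    w="$(kubectl --context="$ctx" -n "$ns" get "ts/$name" \
      -o jsonpath="{.spec.backends[?(@.service==\"$primary\")].weight}")"
    if [ "${w:-0}" -gt 0 ] 2>/dev/null; then
      echo "$primary weight=$w"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "timed out waiting for $primary" >&2
  return 1
}
```

With the names from this guide, the call would be wait_for_primary west emojivoto web-svc-failover web-svc.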