Topology Aware Hints

Topology Aware Hints provides a mechanism to help keep network traffic within the zone where it originated. Preferring same-zone traffic between Pods in your cluster can help with reliability, performance (network latency and throughput), or cost.

FEATURE STATE: Kubernetes v1.23 [beta]

Topology Aware Hints enable topology aware routing by including suggestions for how clients should consume endpoints. This approach adds metadata to enable consumers of EndpointSlice (or Endpoints) objects, so that traffic to those network endpoints can be routed closer to where it originated.

For example, you can route traffic within a locality to reduce costs, or to improve network performance.

Motivation

Kubernetes clusters are increasingly deployed in multi-zone environments. Topology Aware Hints provides a mechanism to help keep traffic within the zone it originated from. This concept is commonly referred to as “Topology Aware Routing”. When calculating the endpoints for a Service, the EndpointSlice controller considers the topology (region and zone) of each endpoint and populates the hints field to allocate it to a zone. Cluster components such as the kube-proxy can then consume those hints, and use them to influence how the traffic is routed (favoring topologically closer endpoints).

Using Topology Aware Hints

You can activate Topology Aware Hints for a Service by setting the service.kubernetes.io/topology-aware-hints annotation to auto. This tells the EndpointSlice controller to set topology hints if it is deemed safe. Importantly, this does not guarantee that hints will always be set.

How it works

The functionality enabling this feature is split into two components: The EndpointSlice controller and the kube-proxy. This section provides a high level overview of how each component implements this feature.

EndpointSlice controller

The EndpointSlice controller is responsible for setting hints on EndpointSlices when this feature is enabled. The controller allocates a proportional amount of endpoints to each zone. This proportion is based on the allocatable CPU cores for nodes running in that zone. For example, if one zone had 2 CPU cores and another zone only had 1 CPU core, the controller would allocate twice as many endpoints to the zone with 2 CPU cores.

The following example shows what an EndpointSlice looks like when hints have been populated:

  1. apiVersion: discovery.k8s.io/v1
  2. kind: EndpointSlice
  3. metadata:
  4. name: example-hints
  5. labels:
  6. kubernetes.io/service-name: example-svc
  7. addressType: IPv4
  8. ports:
  9. - name: http
  10. protocol: TCP
  11. port: 80
  12. endpoints:
  13. - addresses:
  14. - "10.1.2.3"
  15. conditions:
  16. ready: true
  17. hostname: pod-1
  18. zone: zone-a
  19. hints:
  20. forZones:
  21. - name: "zone-a"

kube-proxy

The kube-proxy component filters the endpoints it routes to based on the hints set by the EndpointSlice controller. In most cases, this means that the kube-proxy is able to route traffic to endpoints in the same zone. Sometimes the controller allocates endpoints from a different zone to ensure more even distribution of endpoints between zones. This would result in some traffic being routed to other zones.

Safeguards

The Kubernetes control plane and the kube-proxy on each node apply some safeguard rules before using Topology Aware Hints. If these don’t check out, the kube-proxy selects endpoints from anywhere in your cluster, regardless of the zone.

  1. Insufficient number of endpoints: If there are less endpoints than zones in a cluster, the controller will not assign any hints.

  2. Impossible to achieve balanced allocation: In some cases, it will be impossible to achieve a balanced allocation of endpoints among zones. For example, if zone-a is twice as large as zone-b, but there are only 2 endpoints, an endpoint allocated to zone-a may receive twice as much traffic as zone-b. The controller does not assign hints if it can’t get this “expected overload” value below an acceptable threshold for each zone. Importantly this is not based on real-time feedback. It is still possible for individual endpoints to become overloaded.

  3. One or more Nodes has insufficient information: If any node does not have a topology.kubernetes.io/zone label or is not reporting a value for allocatable CPU, the control plane does not set any topology-aware endpoint hints and so kube-proxy does not filter endpoints by zone.

  4. One or more endpoints does not have a zone hint: When this happens, the kube-proxy assumes that a transition from or to Topology Aware Hints is underway. Filtering endpoints for a Service in this state would be dangerous so the kube-proxy falls back to using all endpoints.

  5. A zone is not represented in hints: If the kube-proxy is unable to find at least one endpoint with a hint targeting the zone it is running in, it falls to using endpoints from all zones. This is most likely to happen as you add a new zone into your existing cluster.

Constraints

  • Topology Aware Hints are not used when either externalTrafficPolicy or internalTrafficPolicy is set to Local on a Service. It is possible to use both features in the same cluster on different Services, just not on the same Service.

  • This approach will not work well for Services that have a large proportion of traffic originating from a subset of zones. Instead this assumes that incoming traffic will be roughly proportional to the capacity of the Nodes in each zone.

  • The EndpointSlice controller ignores unready nodes as it calculates the proportions of each zone. This could have unintended consequences if a large portion of nodes are unready.

  • The EndpointSlice controller does not take into account tolerations when deploying or calculating the proportions of each zone. If the Pods backing a Service are limited to a subset of Nodes in the cluster, this will not be taken into account.

  • This may not work well with autoscaling. For example, if a lot of traffic is originating from a single zone, only the endpoints allocated to that zone will be handling that traffic. That could result in Horizontal Pod Autoscaler either not picking up on this event, or newly added pods starting in a different zone.

What’s next