GEP-1324: Service Mesh in Gateway API
- Issue: #1324
- Status: Memorandum
Note: This GEP is exempt from the Probationary Period rules of our GEP overview as it existed before those rules did, and so it has been explicitly grandfathered in.
Overview
Gateway API represents the next generation of traffic routing APIs in Kubernetes. While most of the current work for Gateway API is focused on the ingress use-case, support for service mesh implementations within the spec would be advantageous to the greater community. Therefore, this GEP serves as the genesis for developing common patterns for using Gateway API for east/west traffic within a mesh implementation.
Goals
- Augment the Gateway API specification with the necessary models, resources, and modifications so that the xRoute primitives are usable by service mesh implementations
- Make sure that Gateway API ingress implementations can coexist and interoperate with mesh implementations
- Remain agnostic as to the topology of mesh implementations’ data planes (i.e sidecar vs. non-sidecar)
- Enable mesh implementations to have vendor-specific extensions for their own policies, etc, in line with existing extensibility offered by Gateway API
- Cooperate with the greater Gateway API community on shared patterns/policies that are applicable for both use-cases (e.g. authorization policy)
- Maintain the integrity of the namespace boundary for traffic routing and policy, making explicit those instances where users define exceptions
Non-Goals
- Represent every service mesh feature within Gateway API
- Duplicate concerns for the sake of “mesh-specific” resources
- Overload existing Gateway API mechanisms when a new mechanism is more appropriate
- Preemptively standardizing arbitrary policy attachment resources
- Will instead let these policies stay vendor specific and move for standardization once a pattern emerges organically
- N.B - vendors who implement common policies should strongly consider namespacing their resources. More guidance to come
Versioning
Features or other modifications to the Gateway API spec that fall under this GEP will be subject to the same versioning guidelines as the rest of the Gateway API. For example, to move changes concerning a beta feature (e.g. HTTPRoute) from experimental to standard, all of the beta criteria must be met (e.g. implemented by several implementations).
Use-Cases
These use-cases are presented as an aid for discussion, and as frames of reference for how users may attempt to utilize the outputs of this GEP. They are not an exhaustive list of features for mesh support in Gateway API nor does a particular use-cases inclusion necessarily imply that it is a goal.
- As a service producer…
- I want to deploy a canary version of my application that splits traffic based on HTTP properties.
- I want to change the behavior (such as timeouts, retries, header manipulation) of my application through configuration, rather than modifying my application.
- I want to apply authorization policies, using client identities and/or HTTP properties.
- I want to collect HTTP metrics for my application without modifying my application.
- I want to run a mix of protocols (HTTP, TCP, TLS, gRPC, …) within my application.
- I want to be able to gradually opt-in to a mesh (no mesh, L4 only, L7 enabled) so I can choose the right fit for my application’s performance and compatibility goals.
- I want to define access policies for my service
- As a service consumer…
- I want to change the behavior (such as timeouts, retries, header manipulation) when my application connects to services through configuration, rather than modifying my application.
- I want to collect HTTP metrics for services I connect to.
- I want to be able to connect to Kubernetes Services and external services.
- I want to override the destination of my traffic, for example, to send requests to external services to an internal replica, or to send all requests to an egress proxy.
- As a mesh administrator…
- I want to enforce that all traffic within my cluster is encrypted.
- I want to have strict isolation and control at namespace level, so a bug/malicious user can’t impact other namespaces
- I want to be able to allow app owners to gradually opt-in to a mesh (no mesh, L4 only, L7 enabled) so they can choose the right fit for their applications’ performance and compatibility goals.
- Since mesh can be multi-tenant and hosting multiple services (e.g. foo or bar), as a mesh administrator I need to make sure a client can discover different services. Here are a few possible ways:
- Each service is allocated a unique IP and port
- Or Each service must use a unique hostname
- Or a unique port and protocol (80:http, 443:tls)
Glossary
When discussing design, it’s helpful to have a shared vocabulary on the various relevant concepts. Many of these terms are highly overloaded or have many different meanings in different communities; the goal here is not to make any new standards or definitions, but rather to ensure readers are on the same page with important terms.
Service Mesh
This is both core to the design and highly overloaded. Many folks have wildly different views of what “service mesh” means, and emotional responses to the word. For the purposes of this document, a “mesh” just means some component in a cluster that is capable of adding functionality to users network requests. The Gateway API for ingress traffic has been described as something that translates from things without context (outside of the cluster) to things with context (inside the cluster). Similarly, a service mesh could be described as something that uses that context to modify cluster network requests.
This is intentionally broad - while most see a service mesh as “userspace sidecar proxies using iptables to redirect traffic”, this document doesn’t depend on this narrow definition, and includes architectures such as Kubernetes itself, “proxyless” architectures, and more.
Scope of Functionality
For the purposes of this feature, targeted service mesh functionality is limited to the following:
- Traffic routing (e.g. traffic splitting)
- Traffic encryption
- Note that the implementation details of traffic encryption (e.g. mTLS, WireGuard) are out of the scope of this GEP. No specific implementation will be assumed.
Policy
Policy is a broad term for any resource that can attach to another resource to configure it. This is defined in the Policy Attachment GEP. Examples of (hypothetical) policies could be resources like TimeoutPolicy
or AuthorizationPolicy
.
Routes
Routes are a collective term for a number of types in gateway-api: HTTPRoute
, TLSRoute
, TCPRoute
, UDPRoute
, etc. In the future there may be more defined, in core or extensions (such as SQLRoute
).This document focuses on routes, rather than “Policy”, as these are the primary existing core resources in gateway, and help form a foundation for policy attachment. Future work from GAMMA will likely look into policies, such as authorization policies.
Service
“service” is a key component to meshes. In general mesh parlance, the term “service” represents an arbitrary cluster of related workloads accessible via the network as a single entity. In a Kubernetes world, this compute can be a typical Service
, but it can also be something like a ServiceImport
(from the MCS spec), or a custom type (a made up example would be a CloudService
abstracting cloud endpoints). Due to the naming overlap with the Kubernetes Service
resource, it becomes necessary to use more precise terms when describing these concerns, reserving Service
exclusively for the Kubernetes resource.
While a Service
in Kubernetes is a single resource, logically it handles two different roles - a “service frontend“ and “service backend“.
Resource level view of Service
Decomposed view of Service
The “service frontend” refers to how we call a service. In a Service
, this is an automatically allocated DNS name (name.namespace.svc.cluster.local
) and IP address (ClusterIP
). Not all services have frontends - for example, “headless” Service
s.
The “service backend” refers to the target of a service. In Service
, this is Endpoints
(or EndpointSlice
).
In the Gateway API today, Service
is used as a “service backend”. For example, in a route we would refer to Service
as the (aptly named) backendRef
:
A user connecting to a Gateway, which has a Route forwarding to “Service Backend”
However, there are other ways that services could be used in the API by utilizing the “service frontend”.
Below shows a hypothetical way to model splitting calls to the a
service frontend to backends composed of a-v1
and a-v2
(see real world example). In this case, service is used as both a frontend and a backend. The route attaches to “Service a frontend”, and any calls to that will be directed to the a-v1
and a-v2
backends (as determined by the logic within the route.
A user connecting to a service (frontend), which has a Route forwarding to two different services (backends)
Producer and Consumer
A service Producer and service Consumer describe the two roles in a mesh.
A service producer is someone that authors and operates a service. Some example actions they may take are deploy a Service
and Deployment
, setup authorization policies to limit access to their workloads, and setup routing rules to canary their workloads.
A service consumer is someone that utilizes a service by communicating with it. For example, they may want to take actions like curl hello-world
in their application - we could call this “consuming the hello-world
service”.
Consumer and producer are personas, not concrete workloads or resources. One important aspect of these personas is that they may live in different trust boundaries (typically namespace).
Gateway
We use the term “gateway” in reference to an abstract concept capturing the idea of a component which translates network traffic.
We are using the term in a deliberately abstract way, therefore it is important that readers do not confuse this abstract interpretation of the term - and associated terms - with more concrete interpretations in more specific contexts. For example, we are specifically not referring to the meaning of “gateway” in IP packet routing, nor even the meaning of “translation” in that context.
We can think of a gateway as being a component (aka “network element”) which implements a layer of indirection between a sender of traffic (e.g. a consumer) and its receivers (e.g. a producer). Senders and receivers are all known as “network endpoints”. Gateways consume “reachability information” describing endpoints in order to gain knowledge of available receivers.
We can describe the capabilities of this layer of indirection in various ways:
- Routing - choosing a receiver endpoint based on data (routing discriminators like IP addresses, TCP/UDP ports, TLS SNI, HTTP headers, etc.) in the traffic and rules configured owners (or privileged users) of the gateway.
- Load-balancing - choosing a receiver endpoint from several valid endpoints based on knowledge of previous traffic from this or other sender endpoints, or knowledge of receiver endpoints status.
- Filtering - modifying the traffic in various ways before forwarding, for example modifying headers or URLs, or redirecting/mirroring the traffic to alternative endpoints.
- Encryption - some combination of decrypting traffic from sender endpoints and/or encrypting traffic before forwarding to receiver endpoints.
A concrete implementation of these abstract ideas might modify traffic in the following ways:
- Translation of IP addresses or TCP/UDP ports - the packet is forwarded after some address modifications (aka Network Address Translation, or NAT).
- Proxying of TCP/UDP packets or HTTP request/responses - the payload to be separated from the original protocol headers and new headers are constructed. A proxy acts as a server to receive traffic and as a client to forward the traffic onwards.
- Encapsulation of IP/TCP/UP packets in another protocol - the packet is wrapped in new protocol headers and must be de-encapsulated (“terminated”) by another gateway before being forwarded to the receiver endpoint.
While a gateway may be a single-instance component through which traffic must flow - e.g. an IP router, or HTTP proxy - a gateway can also be deployed with multiple, redundant instances, or it can be a “logical” entity such that the traffic forwarding rules are distributed to many different parts of a network. Similarly, many “logical” gateway descriptions could potentially be combined or collapsed into a single concrete gateway configuration.
Gateways are often used to “segment” endpoints on a network in various ways - we sometimes use “North/South” to describe traffic flowing across a segment boundary, and “East/West” to describe traffic flowing within a segment. Segmentation can be useful, for example, to separate tenants from one another to increase security. The concepts of “micro-segmentation” and “zero trust” describes the idea that security should be enforced even between trusted endpoints with a system, leading to smaller “micro” segments, even to the point of each endpoint occupying its own segment and trusting “zero” other endpoints. A gateway implements segmentation by being responsible for forwarding traffic between segments, and enforcing segmentation policy at that point.
Gateways whose rule set describes how traffic should be translated as it “enters” a segment are known as “ingress gateways”. In contrast, gateways who provide some translation function to traffic leaving a segment are known as “egress gateways”. In other words, an ingress gateway is located on a segment boundary in front of service producers and an egress gateway is located on a segment boundary in front of service consumers.
Deferred Goals
These are goals that we aren’t explicitly excluding, but will reconsider at a later point in time
- Service mesh activation/enrollment (e.g. sidecar injection, namespace enrollment)
- Egress (e.g. off-mesh destinations including legacy services or third-party APIs)
- Multicluster use-cases (e.g. Kubernetes MCS or cluster federation)
- Heterogenous deployment targets (e.g VMs, serverless)
- Deployment models (e.g. installation, mesh upgrades)