- Antrea OVS Pipeline
- Introduction
- Terminology
- Dumping the Flows / Groups
- OVS Registers
- OVS Ct Mark
- OVS Ct Label
- OVS Ct Zone
- Kubernetes NetworkPolicy Implementation
- Kubernetes Service Implementation
- Antrea-native NetworkPolicy Implementation
- Antrea-native L7 NetworkPolicy Implementation
- TrafficControl Implementation
- Egress Implementation
- OVS Tables
- PipelineRootClassifier
- ARPSpoofGuard
- ARPResponder
- Classifier
- SpoofGuard
- UnSNAT
- ConntrackZone
- ConntrackState
- PreRoutingClassifier
- NodePortMark
- SessionAffinity
- ServiceLB
- EndpointDNAT
- AntreaPolicyEgressRule
- EgressRule
- EgressDefaultRule
- EgressMetric
- L3Forwarding
- EgressMark
- L3DecTTL
- SNATMark
- SNAT
- L2ForwardingCalc
- TrafficControl
- IngressSecurityClassifier
- AntreaPolicyIngressRule
- IngressRule
- IngressDefaultRule
- IngressMetric
- ConntrackCommit
- Output
Antrea OVS Pipeline
Introduction
This document outlines the Open vSwitch (OVS) pipeline Antrea uses to implement its networking functionalities. The following assumptions are currently in place:
- Antrea is deployed in encap mode, establishing an overlay network across all Nodes.
- All the Nodes are Linux Nodes.
- IPv6 is disabled.
- Option `antreaProxy.proxyAll` (referred to as `proxyAll` later in this document) is enabled.
- Two Alpha features `TrafficControl` and `L7NetworkPolicy` are enabled.
- Default settings are maintained for other features and options.
The document references version v1.15 of Antrea.
Terminology
Antrea / Kubernetes
- Node Route Controller: the Kubernetes controller which is a part of antrea-agent and watches for updates to Nodes. When a Node is added, it updates the local networking configurations (e.g. configure the tunnel to the new Node). When a Node is deleted, it performs the necessary clean-ups.
- peer Node: this is how we refer to other Nodes in the cluster, to which the local Node is connected through a Geneve, VXLAN, GRE, or STT tunnel.
- Antrea-native NetworkPolicy: Antrea ClusterNetworkPolicy and Antrea NetworkPolicy CRDs, as documented here.
- Service session affinity: a Service attribute that selects the same backend Pods for connections from a particular client. For a K8s Service, session affinity can be enabled by setting `service.spec.sessionAffinity` to `ClientIP` (the default is `None`). See Kubernetes Service for more information about session affinity.
OpenFlow
- table-miss flow: a “catch-all” flow in an OpenFlow table, which is used if no other flow is matched. If the table-miss flow does not exist, by default packets unmatched by flows are dropped (discarded).
- action `conjunction`: an efficient way in OVS to implement conjunctive matches, that is, a match for which multiple fields are required to match conjunctively, each within a set of acceptable values. See OVS fields for more information.
- action `normal`: OpenFlow defines this action to submit a packet to “the traditional non-OpenFlow pipeline of the switch”. In other words, if a flow uses this action, the packets matched by the flow traverse the switch in the same manner as they would if OpenFlow were not configured on the switch. Antrea uses this action to process ARP packets as a regular learning L2 switch would.
- action `group`: an action used to process forwarding decisions on multiple OVS ports. Examples include: load-balancing, multicast, and active/standby. See OVS group action for more information.
- action `IN_PORT`: an action to output packets to the port on which they were received. This is the only standard way to output packets to the input port.
- action `ct`: an action to commit connections to the connection tracking module, which OVS can use to match on the state of a TCP, UDP, ICMP, etc., connection. See the OVS Conntrack tutorial for more information.
- reg mark: a value stored in an OVS register, conveying information for a packet across the pipeline. Explore all reg marks used in the pipeline in the OVS Registers section.
- ct mark: a value stored in the field `ct_mark` of OVS conntrack, conveying information for a connection throughout its entire lifecycle across the pipeline. Explore all values used in the pipeline in the Ct Marks section.
- ct label: a value stored in the field `ct_label` of OVS conntrack, conveying information for a connection throughout its entire lifecycle across the pipeline. Explore all values used in the pipeline in the Ct Labels section.
- ct zone: a zone isolates the connection tracking rules stored in the field `ct_zone` of OVS conntrack. It is conceptually similar to the more generic Linux network namespace, but is specific to conntrack and has less overhead. Explore all the zones used in the pipeline in the Ct Zones section.
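To make the `conjunction` action more concrete, here is a minimal Python sketch (illustrative only, not Antrea or OVS code) of how a conjunctive match behaves: a packet matches the overall conjunction only if every dimension is satisfied by at least one acceptable value. The dimension contents below are hypothetical.

```python
# Hypothetical sketch of a 2-dimension conjunctive match, similar in
# spirit to the OVS "conjunction" action (not actual Antrea/OVS code).

# Dimension 1: acceptable source IPs; dimension 2: acceptable destination ports.
dimensions = [
    {"10.10.0.26", "10.10.0.27"},  # e.g. IPs of Pods with label app: client
    {80, 443},                     # e.g. allowed TCP ports
]

def matches_conjunction(src_ip: str, dst_port: int) -> bool:
    """A packet matches only if every dimension is satisfied."""
    values = [src_ip, dst_port]
    return all(v in dim for v, dim in zip(values, dimensions))

print(matches_conjunction("10.10.0.26", 80))   # both dimensions match
print(matches_conjunction("10.10.0.30", 80))   # source IP not in dimension 1
```

The benefit in OVS is that each dimension is expressed as its own set of flows, so adding a new source IP or port requires one extra flow rather than the cross-product of all combinations.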
Misc
- dmac table: a traditional L2 switch has a “dmac” table that maps the learned destination MAC address to the appropriate egress port. It is often the same physical table as the “smac” table (which matches the source MAC address and initiates MAC learning if the address is unknown).
- Global Virtual MAC: a virtual MAC address that is used as the destination MAC for all tunneled traffic across all Nodes. This simplifies networking by enabling all Nodes to use this MAC address instead of the actual MAC address of the appropriate remote gateway. This allows each OVS to act as a “proxy” for the local gateway when receiving tunneled traffic and directly take care of the packet forwarding. Currently, we use a hard-coded value of `aa:bb:cc:dd:ee:ff`.
- Virtual Service IP: a virtual IP address used as the source IP address for hairpin Service connections through the Antrea gateway port. Currently, we use a hard-coded value of `169.254.0.253`.
- Virtual NodePort DNAT IP: a virtual IP address used as a DNAT IP address for NodePort Service connections through the Antrea gateway port. Currently, we use a hard-coded value of `169.254.0.252`.
Dumping the Flows / Groups
This guide includes a representative flow dump for every table in the pipeline, to illustrate the function of each table. If you have a cluster running Antrea, you can dump the flows or groups on a given Node as follows:
```bash
# Dump all flows.
kubectl exec -n kube-system <ANTREA_AGENT_POD_NAME> -c antrea-ovs -- ovs-ofctl dump-flows <BRIDGE_NAME> -O Openflow15 [--no-stats] [--names]

# Dump all groups.
kubectl exec -n kube-system <ANTREA_AGENT_POD_NAME> -c antrea-ovs -- ovs-ofctl dump-groups <BRIDGE_NAME> -O Openflow15 [--names]
```
where `<ANTREA_AGENT_POD_NAME>` is the name of the antrea-agent Pod running on that Node, and `<BRIDGE_NAME>` is the name of the bridge created by Antrea (`br-int` by default).
You can also dump the flows for a specific table or group as follows:
```bash
# Dump flows of a table.
kubectl exec -n kube-system <ANTREA_AGENT_POD_NAME> -c antrea-ovs -- ovs-ofctl dump-flows <BRIDGE_NAME> table=<TABLE_NAME> -O Openflow15 [--no-stats] [--names]

# Dump a group.
kubectl exec -n kube-system <ANTREA_AGENT_POD_NAME> -c antrea-ovs -- ovs-ofctl dump-groups <BRIDGE_NAME> <GROUP_ID> -O Openflow15 [--names]
```
where `<TABLE_NAME>` is the name of a table in the pipeline, and `<GROUP_ID>` is the ID of a group.
OVS Registers
We use some OVS registers to carry information throughout the pipeline. To enhance usability, we assign friendly names to the registers we use.
Register | Field Range | Field Name | Reg Mark Value | Reg Mark Name | Description |
---|---|---|---|---|---|
NXM_NX_REG0 | bits 0-3 | PktSourceField | 0x1 | FromTunnelRegMark | Packet source is tunnel port. |
0x2 | FromGatewayRegMark | Packet source is the local Antrea gateway port. | |||
0x3 | FromPodRegMark | Packet source is local Pod port. | |||
0x4 | FromUplinkRegMark | Packet source is uplink port. | |||
0x5 | FromBridgeRegMark | Packet source is local bridge port. | |||
0x6 | FromTCReturnRegMark | Packet source is TrafficControl return port. | |||
bits 4-7 | PktDestinationField | 0x1 | ToTunnelRegMark | Packet destination is tunnel port. | |
0x2 | ToGatewayRegMark | Packet destination is the local Antrea gateway port. | |||
0x3 | ToLocalRegMark | Packet destination is local Pod port. | |||
0x4 | ToUplinkRegMark | Packet destination is uplink port. | |||
0x5 | ToBridgeRegMark | Packet destination is local bridge port. | |||
bit 9 | 0b0 | NotRewriteMACRegMark | Packet’s source/destination MAC address does not need to be rewritten. | ||
0b1 | RewriteMACRegMark | Packet’s source/destination MAC address needs to be rewritten. | |||
bit 10 | 0b1 | APDenyRegMark | Packet denied (Drop/Reject) by Antrea NetworkPolicy. | ||
bits 11-12 | APDispositionField | 0b00 | DispositionAllowRegMark | Indicates Antrea NetworkPolicy disposition: allow. | |
0b01 | DispositionDropRegMark | Indicates Antrea NetworkPolicy disposition: drop. | |||
0b11 | DispositionPassRegMark | Indicates Antrea NetworkPolicy disposition: pass. | |||
bit 13 | 0b1 | GeneratedRejectPacketOutRegMark | Indicates packet is a generated reject response packet-out. | ||
bit 14 | 0b1 | SvcNoEpRegMark | Indicates packet towards a Service without Endpoint. | ||
bit 19 | 0b1 | RemoteSNATRegMark | Indicates packet needs SNAT on a remote Node. | ||
bit 22 | 0b1 | L7NPRedirectRegMark | Indicates L7 Antrea NetworkPolicy disposition of redirect. | ||
bits 21-22 | OutputRegField | 0b01 | OutputToOFPortRegMark | Output packet to an OVS port. | |
0b10 | OutputToControllerRegMark | Send packet to Antrea Agent. | |||
bits 25-32 | PacketInOperationField | Field to store NetworkPolicy packetIn operation. | |||
NXM_NX_REG1 | bits 0-31 | TargetOFPortField | Egress OVS port of packet. | ||
NXM_NX_REG2 | bits 0-31 | SwapField | Swap values in flow fields in OpenFlow actions. | ||
bits 0-7 | PacketInTableField | OVS table where it was decided to send packets to the controller (Antrea Agent). | |||
NXM_NX_REG3 | bits 0-31 | EndpointIPField | Field to store IPv4 address of the selected Service Endpoint. | ||
APConjIDField | Field to store Conjunction ID for Antrea Policy. | ||||
NXM_NX_REG4 | bits 0-15 | EndpointPortField | Field to store TCP/UDP/SCTP port of a Service’s selected Endpoint. | |
bits 16-18 | ServiceEPStateField | 0b001 | EpToSelectRegMark | Packet needs to do Service Endpoint selection. | |
bits 16-18 | ServiceEPStateField | 0b010 | EpSelectedRegMark | Packet has done Service Endpoint selection. | |
bits 16-18 | ServiceEPStateField | 0b011 | EpToLearnRegMark | Packet has done Service Endpoint selection and the selected Endpoint needs to be cached. | |
bits 0-18 | EpUnionField | The union value of EndpointPortField and ServiceEPStateField. | |||
bit 19 | 0b1 | ToNodePortAddressRegMark | Packet is destined for a Service of type NodePort. | ||
bit 20 | 0b1 | AntreaFlexibleIPAMRegMark | Packet is from local Antrea IPAM Pod. | ||
bit 20 | 0b0 | NotAntreaFlexibleIPAMRegMark | Packet is not from local Antrea IPAM Pod. | ||
bit 21 | 0b1 | ToExternalAddressRegMark | Packet is destined for a Service’s external IP. | ||
bits 22-23 | TrafficControlActionField | 0b01 | TrafficControlMirrorRegMark | Indicates packet needs to be mirrored (used by TrafficControl). | |
0b10 | TrafficControlRedirectRegMark | Indicates packet needs to be redirected (used by TrafficControl). | |||
bit 24 | 0b1 | NestedServiceRegMark | Packet is destined for a Service using other Services as Endpoints. | ||
bit 25 | 0b1 | DSRServiceRegMark | Packet is destined for a Service working in DSR mode. | ||
0b0 | NotDSRServiceRegMark | Packet is destined for a Service working in non-DSR mode. | |||
bit 26 | 0b1 | RemoteEndpointRegMark | Packet is destined for a Service selecting a remote non-hostNetwork Endpoint. | ||
bit 27 | 0b1 | FromExternalRegMark | Packet is from Antrea gateway, but its source IP is not the gateway IP. | ||
bit 28 | 0b1 | FromLocalRegMark | Packet is from a local Pod or the Node. | ||
NXM_NX_REG5 | bits 0-31 | TFEgressConjIDField | Egress conjunction ID hit by TraceFlow packet. | ||
NXM_NX_REG6 | bits 0-31 | TFIngressConjIDField | Ingress conjunction ID hit by TraceFlow packet. | ||
NXM_NX_REG7 | bits 0-31 | ServiceGroupIDField | GroupID corresponding to the Service. | ||
NXM_NX_REG8 | bits 0-11 | VLANIDField | VLAN ID. | ||
bits 12-15 | CtZoneTypeField | 0b0001 | IPCtZoneTypeRegMark | Ct zone type is IPv4. | |
0b0011 | IPv6CtZoneTypeRegMark | Ct zone type is IPv6. | |||
bits 0-15 | CtZoneField | Ct zone ID is a combination of VLANIDField and CtZoneTypeField. | |||
NXM_NX_REG9 | bits 0-31 | TrafficControlTargetOFPortField | Field to cache the OVS port to output packets to be mirrored or redirected (used by TrafficControl). | ||
NXM_NX_XXREG3 | bits 0-127 | EndpointIP6Field | Field to store IPv6 address of the selected Service Endpoint. |
Note that fields with overlapping bits, such as `SwapField` and `PacketInTableField`, will not be used at the same time.
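As an illustration of how these named fields map onto a raw 32-bit register, here is a small Python sketch (plain bit arithmetic, not Antrea code) that writes and reads bit ranges of NXM_NX_REG0, using the layout from the table above (PktSourceField in bits 0-3, PktDestinationField in bits 4-7, RewriteMACRegMark in bit 9).

```python
# Sketch of bit-range fields inside a 32-bit OVS register (illustrative only).

def set_field(reg: int, lo: int, hi: int, value: int) -> int:
    """Write `value` into bits lo..hi (inclusive) of `reg`."""
    width = hi - lo + 1
    mask = ((1 << width) - 1) << lo
    return (reg & ~mask) | ((value << lo) & mask)

def get_field(reg: int, lo: int, hi: int) -> int:
    """Read bits lo..hi (inclusive) of `reg`."""
    width = hi - lo + 1
    return (reg >> lo) & ((1 << width) - 1)

reg0 = 0
reg0 = set_field(reg0, 0, 3, 0x1)   # PktSourceField = FromTunnelRegMark (0x1)
reg0 = set_field(reg0, 4, 7, 0x2)   # PktDestinationField = ToGatewayRegMark (0x2)
reg0 = set_field(reg0, 9, 9, 0b1)   # RewriteMACRegMark

print(hex(reg0))                    # packed register value
print(get_field(reg0, 0, 3))        # PktSourceField -> FromTunnelRegMark
```

This value/mask style mirrors the `set_field:0x1/0xf->reg0` notation you will see in the flow dumps later in this document.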
OVS Ct Mark
We use some bits of the `ct_mark` field of OVS conntrack to carry information throughout the pipeline. To enhance usability, we assign friendly names to the bits we use.
Field Range | Field Name | Ct Mark Value | Ct Mark Name | Description |
---|---|---|---|---|
bits 0-3 | ConnSourceCTMarkField | 0b0010 | FromGatewayCTMark | Connection source is the Antrea gateway port. |
0b0101 | FromBridgeCTMark | Connection source is the local bridge port. | ||
bit 4 | 0b1 | ServiceCTMark | Connection is for Service. | |
0b0 | NotServiceCTMark | Connection is not for Service. | ||
bit 5 | 0b1 | ConnSNATCTMark | SNAT’d connection for Service. | |
bit 6 | 0b1 | HairpinCTMark | Hair-pin connection. | |
bit 7 | 0b1 | L7NPRedirectCTMark | Connection should be redirected to an application-aware engine. |
OVS Ct Label
We use some bits of the `ct_label` field of OVS conntrack to carry information throughout the pipeline. To enhance usability, we assign friendly names to the bits we use.
Field Range | Field Name | Description |
---|---|---|
bits 0-31 | IngressRuleCTLabel | Ingress rule ID. |
bits 32-63 | EgressRuleCTLabel | Egress rule ID. |
bits 64-75 | L7NPRuleVlanIDCTLabel | VLAN ID for L7 NetworkPolicy rule. |
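Given the layout above, the rule IDs and the L7 VLAN ID can be recovered from a 128-bit `ct_label` value with plain shifts and masks. The following Python sketch is illustrative only (not Antrea code), with made-up sample values:

```python
# Illustrative extraction of fields from a 128-bit ct_label value.

def ingress_rule_id(ct_label: int) -> int:
    return ct_label & 0xFFFF_FFFF            # bits 0-31

def egress_rule_id(ct_label: int) -> int:
    return (ct_label >> 32) & 0xFFFF_FFFF    # bits 32-63

def l7_vlan_id(ct_label: int) -> int:
    return (ct_label >> 64) & 0xFFF          # bits 64-75

# Build a sample label with ingress rule 5, egress rule 9, VLAN 3.
label = 5 | (9 << 32) | (3 << 64)
print(ingress_rule_id(label), egress_rule_id(label), l7_vlan_id(label))  # 5 9 3
```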
OVS Ct Zone
We use some OVS conntrack zones to isolate connection tracking rules. To enhance usability, we assign friendly names to the ct zones.
Zone ID | Zone Name | Description |
---|---|---|
65520 | CtZone | Tracking IPv4 connections that don’t require SNAT. |
65521 | SNATCtZone | Tracking IPv4 connections that require SNAT. |
Kubernetes NetworkPolicy Implementation
Several tables of the pipeline are dedicated to Kubernetes NetworkPolicy implementation (tables EgressRule, EgressDefaultRule, IngressRule, and IngressDefaultRule).
Throughout this document, the following K8s NetworkPolicy example is used to demonstrate how simple ingress and egress policy rules are mapped to OVS flows.
This K8s NetworkPolicy is applied to Pods with the label `app: web` in the `default` Namespace. For these Pods, only TCP traffic on port 80 from Pods with the label `app: client` and to Pods with the label `app: db` is allowed. Because Antrea will only install OVS flows for this K8s NetworkPolicy on Nodes that have Pods selected by the policy, we have scheduled an `app: web` Pod on the current Node, from which the sample flows in this document are dumped. The Pod has been assigned the IP address `10.10.0.19` by the Antrea CNI, so you will see this IP address in the associated flows.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-db-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: client
      ports:
        - protocol: TCP
          port: 80
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: db
      ports:
        - protocol: TCP
          port: 3306
```
Kubernetes Service Implementation
Like K8s NetworkPolicy, several tables of the pipeline are dedicated to Kubernetes Service implementation (tables NodePortMark, SessionAffinity, ServiceLB, and EndpointDNAT).
By enabling `proxyAll`, ClusterIP, NodePort, LoadBalancer, and ExternalIP are all handled by AntreaProxy. Otherwise, only in-cluster ClusterIP is handled. In this document, we use the sample K8s Services below. These Services select Pods with the label `app: web` as Endpoints.
ClusterIP without Endpoint
A sample Service with `clusterIP` set to `10.101.255.29`, which does not have any associated Endpoint.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-clusterip-no-ep
spec:
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  clusterIP: 10.101.255.29
```
ClusterIP
A sample ClusterIP Service with `clusterIP` set to `10.105.31.235`.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-clusterip
spec:
  selector:
    app: web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  clusterIP: 10.105.31.235
```
NodePort
A sample NodePort Service with `nodePort` set to `30004`.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-nodeport
spec:
  selector:
    app: web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
      nodePort: 30004
  type: NodePort
```
LoadBalancer
A sample LoadBalancer Service with ingress IP `192.168.77.150` assigned by a load balancer controller.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-loadbalancer
spec:
  selector:
    app: web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
      - ip: 192.168.77.150
```
Service with ExternalIP
A sample Service with external IP `192.168.77.200`.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-service-externalip
spec:
  selector:
    app: web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  externalIPs:
    - 192.168.77.200
```
Service with Session Affinity
A sample Service configured with session affinity.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-service-session-affinity
spec:
  selector:
    app: web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  clusterIP: 10.96.76.15
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300
```
Service with ExternalTrafficPolicy Local
A sample Service with `externalTrafficPolicy` set to `Local`. Note that `externalTrafficPolicy` can only be set to `Local` for NodePort and LoadBalancer Services.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-service-etp-local
spec:
  selector:
    app: web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer
  externalTrafficPolicy: Local
status:
  loadBalancer:
    ingress:
      - ip: 192.168.77.151
```
Antrea-native NetworkPolicy Implementation
In addition to the tables created for K8s NetworkPolicy, Antrea creates additional dedicated tables to support Antrea-native NetworkPolicy (tables AntreaPolicyEgressRule and AntreaPolicyIngressRule).
Consider the following Antrea ClusterNetworkPolicy (ACNP) in the Application Tier as an example for the remainder of this document.
This ACNP is applied to all Pods with the label `app: web` in all Namespaces. For these Pods, only TCP traffic on port 80 from the Pods with the label `app: client` and to the Pods with the label `app: db` is allowed. Similar to K8s NetworkPolicy, Antrea will only install OVS flows for this policy on Nodes that have Pods selected by the policy.
This policy has very similar rules as the K8s NetworkPolicy example shown previously. This is intentional to simplify this document and to allow easier comparison between the flows generated for both types of policies. Additionally, we should emphasize that this policy applies to Pods across all Namespaces, while a K8s NetworkPolicy is always scoped to a specific Namespace (in the case of our example, the default Namespace).
```yaml
apiVersion: crd.antrea.io/v1beta1
kind: ClusterNetworkPolicy
metadata:
  name: web-app-db-network-policy
spec:
  priority: 5
  tier: application
  appliedTo:
    - podSelector:
        matchLabels:
          app: web
  ingress:
    - action: Allow
      from:
        - podSelector:
            matchLabels:
              app: client
      ports:
        - protocol: TCP
          port: 80
      name: AllowFromClient
    - action: Drop
  egress:
    - action: Allow
      to:
        - podSelector:
            matchLabels:
              app: db
      ports:
        - protocol: TCP
          port: 3306
      name: AllowToDB
    - action: Drop
```
Antrea-native L7 NetworkPolicy Implementation
In addition to the layer 3 and layer 4 policies mentioned above, Antrea also supports Antrea-native Layer 7 NetworkPolicy. The main difference is that an Antrea-native L7 NetworkPolicy filters traffic based on layer 7 protocol attributes, rather than only layer 3 or layer 4 ones.
Consider the following Antrea-native L7 NetworkPolicy in the Application Tier as an example for the remainder of this document.
This ACNP is applied to all Pods with the label `app: web` in all Namespaces. It allows only HTTP ingress traffic on port 8080 from Pods with the label `app: client`, limited to the `GET` method and `/api/v2/*` path. Any other HTTP ingress traffic on port 8080 from Pods with the label `app: client` will be dropped.
```yaml
apiVersion: crd.antrea.io/v1beta1
kind: ClusterNetworkPolicy
metadata:
  name: ingress-allow-http-request-to-api-v2
spec:
  priority: 4
  tier: application
  appliedTo:
    - podSelector:
        matchLabels:
          app: web
  ingress:
    - name: AllowFromClientL7
      action: Allow
      from:
        - podSelector:
            matchLabels:
              app: client
      ports:
        - protocol: TCP
          port: 8080
      l7Protocols:
        - http:
            path: "/api/v2/*"
            method: "GET"
```
TrafficControl Implementation
TrafficControl is a CRD API that manages and manipulates the transmission of Pod traffic. Antrea creates a dedicated table TrafficControl to implement the `TrafficControl` feature. We will use the following TrafficControls as examples for the remainder of this document.
TrafficControl for Packet Redirecting
This is a TrafficControl applied to Pods with the label `app: web`. For these Pods, both ingress and egress traffic will be redirected to port `antrea-tc-tap0` and returned through port `antrea-tc-tap1`.
```yaml
apiVersion: crd.antrea.io/v1alpha2
kind: TrafficControl
metadata:
  name: redirect-web-to-local
spec:
  appliedTo:
    podSelector:
      matchLabels:
        app: web
  direction: Both
  action: Redirect
  targetPort:
    ovsInternal:
      name: antrea-tc-tap0
  returnPort:
    ovsInternal:
      name: antrea-tc-tap1
```
TrafficControl for Packet Mirroring
This is a TrafficControl applied to Pods with the label `app: db`. For these Pods, both ingress and egress traffic will be mirrored (duplicated) to port `antrea-tc-tap2`.
```yaml
apiVersion: crd.antrea.io/v1alpha2
kind: TrafficControl
metadata:
  name: mirror-db-to-local
spec:
  appliedTo:
    podSelector:
      matchLabels:
        app: db
  direction: Both
  action: Mirror
  targetPort:
    ovsInternal:
      name: antrea-tc-tap2
```
Egress Implementation
Table EgressMark is dedicated to the implementation of the `Egress` feature.
Consider the following Egresses as examples for the remainder of this document.
Egress Applied to Web Pods
This is an Egress applied to Pods with the label `app: web`. For these Pods, all egress traffic (traffic leaving the cluster) will be SNAT’d on the Node `k8s-node-control-plane` using the Egress IP `192.168.77.112`. In this context, `k8s-node-control-plane` is known as the “Egress Node” for this Egress resource. Note that the flows presented in the rest of this document were dumped on Node `k8s-node-control-plane`. Egress flows are different on the “source Node” (the Node running a workload Pod to which the Egress resource is applied) and on the “Egress Node” (the Node enforcing the SNAT policy).
```yaml
apiVersion: crd.antrea.io/v1beta1
kind: Egress
metadata:
  name: egress-web
spec:
  appliedTo:
    podSelector:
      matchLabels:
        app: web
  egressIP: 192.168.77.112
status:
  egressNode: k8s-node-control-plane
```
Egress Applied to Client Pods
This is an Egress applied to Pods with the label `app: client`. For these Pods, all egress traffic will be SNAT’d on the Node `k8s-node-worker-1` using the Egress IP `192.168.77.113`.
```yaml
apiVersion: crd.antrea.io/v1beta1
kind: Egress
metadata:
  name: egress-client
spec:
  appliedTo:
    podSelector:
      matchLabels:
        app: client
  egressIP: 192.168.77.113
status:
  egressNode: k8s-node-worker-1
```
OVS Tables
PipelineRootClassifier
This table serves as the primary entry point in the pipeline, forwarding packets to different tables based on their respective protocols.
If you dump the flows of this table, you may see the following:
```text
1. table=PipelineRootClassifier, priority=200,arp actions=goto_table:ARPSpoofGuard
2. table=PipelineRootClassifier, priority=200,ip actions=goto_table:Classifier
3. table=PipelineRootClassifier, priority=0 actions=drop
```
Flow 1 forwards ARP packets to table ARPSpoofGuard.
Flow 2 forwards IP packets to table Classifier.
Flow 3 is the table-miss flow for dropping packets of other, unsupported protocols; it should not normally be used.
ARPSpoofGuard
This table is designed to drop ARP spoofing packets from local Pods or the local Antrea gateway. We ensure that the advertised IP and MAC addresses are correct, meaning they match the values configured on the interface when Antrea sets up networking for a local Pod or the local Antrea gateway.
If you dump the flows of this table, you may see the following:
```text
1. table=ARPSpoofGuard, priority=200,arp,in_port="antrea-gw0",arp_spa=10.10.0.1,arp_sha=ba:5e:d1:55:aa:c0 actions=goto_table:ARPResponder
2. table=ARPSpoofGuard, priority=200,arp,in_port="client-6-3353ef",arp_spa=10.10.0.26,arp_sha=5e:b5:e3:a6:90:b7 actions=goto_table:ARPResponder
3. table=ARPSpoofGuard, priority=200,arp,in_port="web-7975-274540",arp_spa=10.10.0.24,arp_sha=fa:b7:53:74:21:a6 actions=goto_table:ARPResponder
4. table=ARPSpoofGuard, priority=200,arp,in_port="db-755c6-5080e3",arp_spa=10.10.0.25,arp_sha=36:48:21:a2:9d:b4 actions=goto_table:ARPResponder
5. table=ARPSpoofGuard, priority=0 actions=drop
```
Flow 1 matches legitimate ARP packets from the local Antrea gateway.
Flows 2-4 match legitimate ARP packets from local Pods.
Flow 5 is the table-miss flow to drop ARP spoofing packets, which are not matched by flows 1-4.
ARPResponder
The purpose of this table is to handle ARP requests from the local Antrea gateway or local Pods, addressing specific cases:
- Responding to ARP requests from the local Antrea gateway seeking the MAC address of a remote Antrea gateway located on a different Node. This ensures that the local Node can reach any remote Pods.
- Ensuring the normal layer 2 (L2) learning among local Pods and the local Antrea gateway.
If you dump the flows of this table, you may see the following:
```text
1. table=ARPResponder, priority=200,arp,arp_tpa=10.10.1.1,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:aa:bb:cc:dd:ee:ff->eth_src,set_field:2->arp_op,move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],set_field:aa:bb:cc:dd:ee:ff->arp_sha,move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],set_field:10.10.1.1->arp_spa,IN_PORT
2. table=ARPResponder, priority=190,arp actions=NORMAL
3. table=ARPResponder, priority=0 actions=drop
```
Flow 1 is designed for case 1, matching ARP request packets for the MAC address of a remote Antrea gateway with IP address `10.10.1.1`. It programs an ARP reply packet and sends it back to the port where the request packet was received. Note that both the source hardware address and the source MAC address in the ARP reply packet are set to the Global Virtual MAC `aa:bb:cc:dd:ee:ff`, not the actual MAC address of the remote Antrea gateway. This ensures that once the traffic is received by the remote OVS bridge, it can be directly forwarded to the appropriate Pod without actually going through the local Antrea gateway. The Global Virtual MAC is used as the destination MAC address for all the traffic being tunneled or routed.
This flow serves as the “ARP responder” for the peer Node whose local Pod subnet is `10.10.1.0/24`. If we were to look at the routing table for the local Node, we would find the following “onlink” route:

```text
10.10.1.0/24 via 10.10.1.1 dev antrea-gw0 onlink
```

A similar route is installed on the local Antrea gateway (antrea-gw0) interface every time the Antrea Node Route Controller is notified that a new Node has joined the cluster. The route must be marked as “onlink” since the kernel does not have a route to the peer gateway `10.10.1.1`. We “trick” the kernel into believing that `10.10.1.1` is directly connected to the local Node, even though it is on the other side of the tunnel.
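The sequence of `move`/`set_field` actions in flow 1 can be sketched in Python as follows. This is a simplified model using dictionaries rather than real packet processing; `VIRTUAL_MAC` and the peer gateway IP come from the flow above, and the sample request fields are hypothetical.

```python
# Simplified model of the ARP responder actions in flow 1 (illustrative only).
VIRTUAL_MAC = "aa:bb:cc:dd:ee:ff"  # Global Virtual MAC
PEER_GW_IP = "10.10.1.1"           # remote Antrea gateway IP matched by the flow

def arp_respond(req: dict) -> dict:
    """Turn an ARP request into the reply programmed by flow 1."""
    return {
        "eth_dst": req["eth_src"],   # move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[]
        "eth_src": VIRTUAL_MAC,      # set_field:aa:bb:cc:dd:ee:ff->eth_src
        "arp_op": 2,                 # set_field:2->arp_op (reply)
        "arp_tha": req["arp_sha"],   # move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[]
        "arp_sha": VIRTUAL_MAC,      # set_field:aa:bb:cc:dd:ee:ff->arp_sha
        "arp_tpa": req["arp_spa"],   # move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[]
        "arp_spa": PEER_GW_IP,       # set_field:10.10.1.1->arp_spa
        "out_port": req["in_port"],  # IN_PORT: send back where it came from
    }

# Hypothetical ARP request from the local Antrea gateway.
request = {
    "in_port": "antrea-gw0",
    "eth_src": "ba:5e:d1:55:aa:c0",
    "arp_sha": "ba:5e:d1:55:aa:c0",
    "arp_spa": "10.10.0.1",
    "arp_op": 1,
}
reply = arp_respond(request)
print(reply["arp_sha"], reply["arp_spa"])  # aa:bb:cc:dd:ee:ff 10.10.1.1
```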
Flow 2 is designed for case 2, ensuring that OVS handles the remainder of ARP traffic as a regular L2 learning switch (using the `normal` action). In particular, this takes care of forwarding ARP requests and replies among local Pods.
Flow 3 is the table-miss flow, which should never be used since ARP packets will be matched by either flow 1 or 2.
Classifier
This table is designed to determine the “category” of IP packets by matching on their ingress port. It addresses specific cases:
- Packets originating from the local Node through the local Antrea gateway port, requiring IP spoof legitimacy verification.
- Packets originating from the external network through the Antrea gateway port.
- Packets received through an overlay tunnel.
- Packets received through a return port defined in a user-provided TrafficControl CR (for the `TrafficControl` feature).
- Packets returned from an application-aware engine through a specific port (for the `L7NetworkPolicy` feature).
- Packets originating from local Pods, requiring IP spoof legitimacy verification.
If you dump the flows of this table, you may see the following:
```text
1. table=Classifier, priority=210,ip,in_port="antrea-gw0",nw_src=10.10.0.1 actions=set_field:0x2/0xf->reg0,set_field:0x10000000/0x10000000->reg4,goto_table:SpoofGuard
2. table=Classifier, priority=200,in_port="antrea-gw0" actions=set_field:0x2/0xf->reg0,set_field:0x8000000/0x8000000->reg4,goto_table:SpoofGuard
3. table=Classifier, priority=200,in_port="antrea-tun0" actions=set_field:0x1/0xf->reg0,set_field:0x200/0x200->reg0,goto_table:UnSNAT
4. table=Classifier, priority=200,in_port="antrea-tc-tap2" actions=set_field:0x6/0xf->reg0,goto_table:L3Forwarding
5. table=Classifier, priority=200,in_port="antrea-l7-tap1",vlan_tci=0x1000/0x1000 actions=pop_vlan,set_field:0x6/0xf->reg0,goto_table:L3Forwarding
6. table=Classifier, priority=190,in_port="client-6-3353ef" actions=set_field:0x3/0xf->reg0,set_field:0x10000000/0x10000000->reg4,goto_table:SpoofGuard
7. table=Classifier, priority=190,in_port="web-7975-274540" actions=set_field:0x3/0xf->reg0,set_field:0x10000000/0x10000000->reg4,goto_table:SpoofGuard
8. table=Classifier, priority=190,in_port="db-755c6-5080e3" actions=set_field:0x3/0xf->reg0,set_field:0x10000000/0x10000000->reg4,goto_table:SpoofGuard
9. table=Classifier, priority=0 actions=drop
```
Flow 1 is designed for case 1, matching the source IP address `10.10.0.1` to ensure that the packets are originating from the local Antrea gateway. The following reg marks are loaded:

- `FromGatewayRegMark`, indicating that the packets are received on the local Antrea gateway port, which will be consumed in tables L3Forwarding, L3DecTTL, SNATMark, and SNAT.
- `FromLocalRegMark`, indicating that the packets are from the local Node, which will be consumed in table ServiceLB.
Flow 2 is designed for case 2, matching packets originating from the external network through the Antrea gateway port and forwarding them to table SpoofGuard. Since packets originating from the local Antrea gateway are matched by flow 1, flow 2 can only match packets originating from the external network. The following reg marks are loaded:

- `FromGatewayRegMark`, the same as in flow 1.
- `FromExternalRegMark`, indicating that the packets are from the external network, not the local Node.
Flow 3 is for case 3, matching packets received through an overlay tunnel (i.e., from another Node) and forwarding them to table UnSNAT. This approach is based on the understanding that these packets originate from remote Nodes, potentially bearing varying source IP addresses. These packets undergo legitimacy verification before being tunneled. As a consequence, packets from the tunnel should be seamlessly forwarded to table UnSNAT. The following reg marks are loaded:

- `FromTunnelRegMark`, indicating that the packets are received on a tunnel, which will be consumed in table L3Forwarding.
- `RewriteMACRegMark`, indicating that the source and destination MAC addresses of the packets should be rewritten, which will be consumed in table L3Forwarding.
Flow 4 is for case 4, matching packets from a TrafficControl return port and forwarding them to table L3Forwarding to decide the egress port. It’s important to note that a forwarding decision for these packets was already made before redirecting them to the TrafficControl target port in table Output, and at this point, the source and destination MAC addresses of these packets have already been set to the correct values. The only purpose of forwarding the packets to table L3Forwarding is to load the tunnel destination IP for packets destined for remote Nodes. This ensures that the returned packets destined for remote Nodes are forwarded through the tunnel. `FromTCReturnRegMark`, which will be used in table TrafficControl, is loaded to mark the packet source.
Flow 5 is for case 5, matching packets returned back from an application-aware engine through a specific port, stripping the VLAN ID used by the application-aware engine, and forwarding them to table L3Forwarding to decide the egress port. Like flow 4, the purpose of forwarding the packets to table L3Forwarding is to load the tunnel destination IP for packets destined for remote Nodes, and `FromTCReturnRegMark` is also loaded.
Flows 6-8 are for case 6, matching packets from local Pods and forwarding them to table SpoofGuard to do legitimacy verification. The following reg marks are loaded:
- `FromPodRegMark`, indicating that the packets are received on the ports connected to the local Pods, which will be consumed in tables L3Forwarding and SNATMark.
- `FromLocalRegMark`, indicating that the packets are from the local Pods, which will be consumed in table ServiceLB.
Flow 9 is the table-miss flow to drop packets that are not matched by flows 1-8.
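The reg marks loaded in this table are value/mask writes into 32-bit OVS registers, which lets several independent marks share one register. A minimal Python sketch (not Antrea code) of the load and match semantics; the `0x2/0xf` layout below is an assumed example for illustration, while `0x200/0x200` matches the `RewriteMACRegMark` seen in the flows of this document:

```python
# Sketch of OVS value/mask register semantics, e.g. "set_field:0x200/0x200->reg0".
# Only the masked bits are written, so independent marks coexist in one register.

def load_mark(reg: int, value: int, mask: int) -> int:
    """Set the masked bits of a register to `value`, leaving other bits intact."""
    return (reg & ~mask) | (value & mask)

def match_mark(reg: int, value: int, mask: int) -> bool:
    """Check a value/mask match, as in a flow condition like "reg0=0x200/0x200"."""
    return (reg & mask) == (value & mask)

reg0 = 0
reg0 = load_mark(reg0, 0x2, 0xF)      # an assumed packet-source mark in bits 0-3
reg0 = load_mark(reg0, 0x200, 0x200)  # RewriteMACRegMark in bit 9
assert match_mark(reg0, 0x2, 0xF)
assert match_mark(reg0, 0x200, 0x200)
```

This is why a flow matching `reg0=0x200/0x200` fires regardless of what the other bits of `reg0` contain.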
SpoofGuard
This table is crafted to prevent IP spoofing from local Pods. It addresses specific cases:
- Allowing all packets from the local Antrea gateway. We do not perform checks for this interface as we need to accept external traffic with a source IP address that does not match the gateway IP.
- Ensuring that the source IP and MAC addresses are correct, i.e., matching the values configured on the interface when Antrea sets up networking for a Pod.
If you dump the flows of this table, you may see the following:
1. table=SpoofGuard, priority=200,ip,in_port="antrea-gw0" actions=goto_table:UnSNAT
2. table=SpoofGuard, priority=200,ip,in_port="client-6-3353ef",dl_src=5e:b5:e3:a6:90:b7,nw_src=10.10.0.26 actions=goto_table:UnSNAT
3. table=SpoofGuard, priority=200,ip,in_port="web-7975-274540",dl_src=fa:b7:53:74:21:a6,nw_src=10.10.0.24 actions=goto_table:UnSNAT
4. table=SpoofGuard, priority=200,ip,in_port="db-755c6-5080e3",dl_src=36:48:21:a2:9d:b4,nw_src=10.10.0.25 actions=goto_table:UnSNAT
5. table=SpoofGuard, priority=0 actions=drop
Flow 1 is for case 1, matching packets received on the local Antrea gateway port without checking the source IP and MAC addresses. There are some cases where the source IP of the packets through the local Antrea gateway port is not the local Antrea gateway IP address:
- When Antrea is deployed with kube-proxy, and `AntreaProxy` is not enabled, packets from local Pods destined for Services will first go through the gateway port, get load-balanced by the kube-proxy data path (undergoing DNAT), then re-enter the OVS pipeline through the gateway port (via an “onlink” route, installed by Antrea, directing the DNAT’d packets to the gateway port), resulting in the source IP being that of a local Pod.
- When Antrea is deployed without kube-proxy, and both `AntreaProxy` and `proxyAll` are enabled, packets from the external network destined for Services will be routed to OVS through the gateway port without masquerading the source IP.
- When Antrea is deployed with kube-proxy, packets from the external network destined for Services whose `externalTrafficPolicy` is set to `Local` will get load-balanced by the kube-proxy data path (undergoing DNAT with a local Endpoint selected by kube-proxy) and then enter the OVS pipeline through the gateway port (via an “onlink” route, installed by Antrea, directing the DNAT’d packets to the gateway port) without masquerading the source IP.
Flows 2-4 are for case 2, matching legitimate IP packets from local Pods.
Flow 5 is the table-miss flow to drop IP spoofing packets.
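The per-Pod checks in flows 2-4 amount to a lookup of the (MAC, IP) pair that Antrea configured for each Pod port. An illustrative sketch (not Antrea code), using values taken from the sample flows above:

```python
# Sketch of the SpoofGuard logic: each Pod port is bound to the source MAC and
# IP configured when Antrea set up the Pod's network; anything else is dropped.
# The gateway port is exempt from the check (flow 1).

GATEWAY_PORT = "antrea-gw0"
# port -> (expected source MAC, expected source IP), from the sample flows
ALLOWED = {
    "client-6-3353ef": ("5e:b5:e3:a6:90:b7", "10.10.0.26"),
    "web-7975-274540": ("fa:b7:53:74:21:a6", "10.10.0.24"),
    "db-755c6-5080e3": ("36:48:21:a2:9d:b4", "10.10.0.25"),
}

def spoof_guard(in_port: str, dl_src: str, nw_src: str) -> str:
    if in_port == GATEWAY_PORT:
        return "goto:UnSNAT"        # flow 1: no check on the gateway port
    if ALLOWED.get(in_port) == (dl_src, nw_src):
        return "goto:UnSNAT"        # flows 2-4: legitimate Pod traffic
    return "drop"                   # flow 5: table-miss

assert spoof_guard("antrea-gw0", "aa:bb:cc:dd:ee:ff", "172.16.0.9") == "goto:UnSNAT"
assert spoof_guard("web-7975-274540", "fa:b7:53:74:21:a6", "10.10.0.24") == "goto:UnSNAT"
assert spoof_guard("web-7975-274540", "fa:b7:53:74:21:a6", "10.10.0.99") == "drop"
```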
UnSNAT
This table is used to undo SNAT on reply packets by invoking action `ct` on them. The packets are from SNAT’d Service connections that have been committed to `SNATCtZone` in table SNAT. After invoking action `ct`, the packets will be in a “tracked” state, restoring all connection tracking fields (such as `ct_state`, `ct_mark`, `ct_label`, etc.) to their original values. The packets with a “tracked” state are then forwarded to table ConntrackZone.
If you dump the flows of this table, you may see the following:
1. table=UnSNAT, priority=200,ip,nw_dst=169.254.0.253 actions=ct(table=ConntrackZone,zone=65521,nat)
2. table=UnSNAT, priority=200,ip,nw_dst=10.10.0.1 actions=ct(table=ConntrackZone,zone=65521,nat)
3. table=UnSNAT, priority=0 actions=goto_table:ConntrackZone
Flow 1 matches reply packets for Service connections which were SNAT’d with the Virtual Service IP `169.254.0.253` and invokes action `ct` on them.
Flow 2 matches packets for Service connections which were SNAT’d with the local Antrea gateway IP `10.10.0.1` and invokes action `ct` on them. This flow also matches request packets destined for the local Antrea gateway IP from local Pods by accident. However, this is harmless since such connections will never be committed to `SNATCtZone`, and therefore, connection tracking fields for the packets are unset.
Flow 3 is the table-miss flow.
For reply packets from SNAT’d connections, whose destination IP is the translated SNAT IP, after invoking action `ct`, the destination IP of the packets will be restored to the original IP before SNAT, stored in the connection tracking field `ct_nw_dst`.
ConntrackZone
The main purpose of this table is to invoke action `ct` on packets from all connections. After invoking action `ct`, packets will be in a “tracked” state, restoring all connection tracking fields to their appropriate values. When action `ct` is invoked with `CtZone` on packets that already have a “tracked” state associated with `SNATCtZone`, the state associated with `SNATCtZone` becomes inaccessible, because the “tracked” state shifts to the one associated with `CtZone`. A ct zone is similar in spirit to the more generic Linux network namespaces: the “tracked” state is unique within each ct zone.
If you dump the flows of this table, you may see the following:
1. table=ConntrackZone, priority=200,ip actions=ct(table=ConntrackState,zone=65520,nat)
2. table=ConntrackZone, priority=0 actions=goto_table:ConntrackState
Flow 1 invokes action `ct` on packets from all connections, and the packets are then forwarded to table ConntrackState with the “tracked” state associated with `CtZone`. Note that for packets of an established Service (DNAT’d) connection, i.e., any packet other than the first packet of the connection, DNAT or un-DNAT is performed on them before they are forwarded.
Flow 2 is the table-miss flow that should remain unused.
ConntrackState
This table handles packets from the connections that have a “tracked” state associated with CtZone
. It addresses specific cases:
- Dropping invalid packets reported by conntrack.
- Forwarding tracked packets from all connections to table AntreaPolicyEgressRule directly, bypassing the tables like PreRoutingClassifier, NodePortMark, SessionAffinity, ServiceLB, and EndpointDNAT for Service Endpoint selection.
- Forwarding packets from new connections to table PreRoutingClassifier to start Service Endpoint selection since Service connections are not identified at this stage.
If you dump the flows of this table, you may see the following:
1. table=ConntrackState, priority=200,ct_state=+inv+trk,ip actions=drop
2. table=ConntrackState, priority=190,ct_state=-new+trk,ct_mark=0/0x10,ip actions=goto_table:AntreaPolicyEgressRule
3. table=ConntrackState, priority=190,ct_state=-new+trk,ct_mark=0x10/0x10,ip actions=set_field:0x200/0x200->reg0,goto_table:AntreaPolicyEgressRule
4. table=ConntrackState, priority=0 actions=goto_table:PreRoutingClassifier
Flow 1 is for case 1, dropping invalid packets.
Flow 2 is for case 2, matching packets from non-Service connections with NotServiceCTMark
and forwarding them to table AntreaPolicyEgressRule directly, bypassing the tables for Service Endpoint selection.
Flow 3 is also for case 2, matching packets from Service connections with ServiceCTMark
loaded in table EndpointDNAT and forwarding them to table AntreaPolicyEgressRule, bypassing the tables for Service Endpoint selection. RewriteMACRegMark
, which is used in table L3Forwarding, is loaded in this flow, indicating that the source and destination MAC addresses of the packets should be rewritten.
Flow 4 is the table-miss flow for case 3, matching packets from all new connections and forwarding them to table PreRoutingClassifier to start the processing of Service Endpoint selection.
PreRoutingClassifier
This table handles the first packet from uncommitted Service connections before Service Endpoint selection. It sequentially resubmits the packets to tables NodePortMark and SessionAffinity to do some pre-processing, including the loading of specific reg marks. Subsequently, it forwards the packets to table ServiceLB to perform Service Endpoint selection.
If you dump the flows of this table, you may see the following:
1. table=PreRoutingClassifier, priority=200,ip actions=resubmit(,NodePortMark),resubmit(,SessionAffinity),resubmit(,ServiceLB)
2. table=PreRoutingClassifier, priority=0 actions=goto_table:NodePortMark
Flow 1 sequentially resubmits packets to tables NodePortMark, SessionAffinity, and ServiceLB. Note that packets are ultimately forwarded to table ServiceLB. In tables NodePortMark and SessionAffinity, only reg marks are loaded.
Flow 2 is the table-miss flow that should remain unused.
NodePortMark
This table is designed to potentially mark packets destined for NodePort Services. It is only created when proxyAll
is enabled.
If you dump the flows of this table, you may see the following:
1. table=NodePortMark, priority=200,ip,nw_dst=192.168.77.102 actions=set_field:0x80000/0x80000->reg4
2. table=NodePortMark, priority=200,ip,nw_dst=169.254.0.252 actions=set_field:0x80000/0x80000->reg4
3. table=NodePortMark, priority=0 actions=goto_table:SessionAffinity
Flow 1 matches packets destined for the local Node from local Pods. NodePortRegMark
is loaded, indicating that the packets are potentially destined for NodePort Services. We assume only one valid IP address, 192.168.77.102
(the Node’s transport IP), can serve as the host IP address for NodePort based on the option antreaProxy.nodePortAddresses
. If there are multiple valid IP addresses specified in the option, a flow similar to flow 1 will be installed for each IP address.
Flow 2 matches packets destined for the Virtual NodePort DNAT IP. Packets destined for NodePort Services from the local Node or the external network are DNAT’d to the Virtual NodePort DNAT IP by iptables before entering the pipeline.
Flow 3 is the table-miss flow.
Note that packets destined for NodePort Services have not been fully identified in this table, since only the destination IP address has been matched. The final identification of NodePort Services will be done in table ServiceLB by matching `NodePortRegMark` and the specific destination port of a NodePort.
SessionAffinity
This table is designed to implement Service session affinity. The learned flows that cache the information of the selected Endpoints are installed here.
If you dump the flows of this table, you may see the following:
1. table=SessionAffinity, hard_timeout=300, priority=200,tcp,nw_src=10.10.0.1,nw_dst=10.96.76.15,tp_dst=80 \
actions=set_field:0x50/0xffff->reg4,set_field:0/0x4000000->reg4,set_field:0xa0a0001->reg3,set_field:0x20000/0x70000->reg4,set_field:0x200/0x200->reg0
2. table=SessionAffinity, priority=0 actions=set_field:0x10000/0x70000->reg4
Flow 1 is a learned flow generated by flow 3 in table ServiceLB, designed for the sample Service [ClusterIP with Session Affinity], to implement Service session affinity. Here are some details about the flow:
- The “hard timeout” of the learned flow should be equal to the value of `service.spec.sessionAffinityConfig.clientIP.timeoutSeconds` defined in the Service. This means that until the hard timeout expires, this flow is present in the pipeline, and the session affinity of the Service takes effect. Unlike an “idle timeout”, the “hard timeout” does not reset whenever the flow is matched.
- Source IP address, destination IP address, destination port, and transport protocol are used to match packets of connections sourced from the same client and destined for the Service during the affinity time window.
- Endpoint IP address and Endpoint port are loaded into `EndpointIPField` and `EndpointPortField` respectively.
- `EpSelectedRegMark` is loaded, indicating that the Service Endpoint selection is done, and ensuring that the packets will only match the last flow in table ServiceLB.
- `RewriteMACRegMark`, which will be consumed in table L3Forwarding, is loaded here, indicating that the source and destination MAC addresses of the packets should be rewritten.
Flow 2 is the table-miss flow to match the first packet of connections destined for Services. `EpToSelectRegMark`, which will be consumed in table ServiceLB, is loaded, indicating that the packet needs to go through Service Endpoint selection.
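The “hard timeout” semantics of the learned flows can be illustrated with a small cache sketch (not Antrea code): an entry expires a fixed interval after it was created, and lookups do not refresh it, unlike an idle timeout.

```python
# Sketch of session-affinity caching with a hard timeout: the entry lives for a
# fixed window after creation, regardless of how often it is hit.

class AffinityCache:
    def __init__(self, hard_timeout: float):
        self.hard_timeout = hard_timeout
        # (src_ip, dst_ip, dst_port, proto) -> (endpoint, created_at)
        self.entries = {}

    def learn(self, key, endpoint, now: float):
        self.entries[key] = (endpoint, now)

    def lookup(self, key, now: float):
        hit = self.entries.get(key)
        if hit is None:
            return None
        endpoint, created_at = hit
        if now - created_at >= self.hard_timeout:  # expired; hits never refresh
            del self.entries[key]
            return None
        return endpoint

cache = AffinityCache(hard_timeout=300)  # timeoutSeconds from the sample Service
key = ("10.10.0.1", "10.96.76.15", 80, "tcp")
cache.learn(key, ("10.10.0.24", 80), now=0)
assert cache.lookup(key, now=299) == ("10.10.0.24", 80)  # within the window
assert cache.lookup(key, now=301) is None                # hard timeout elapsed
```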
ServiceLB
This table is used to implement Service Endpoint selection. It addresses specific cases:
- ClusterIP, as demonstrated in the examples ClusterIP without Endpoint and ClusterIP.
- NodePort, as demonstrated in the example NodePort.
- LoadBalancer, as demonstrated in the example LoadBalancer.
- Service configured with external IPs, as demonstrated in the example Service with ExternalIP.
- Service configured with session affinity, as demonstrated in the example Service with session affinity.
- Service configured with `externalTrafficPolicy` set to `Local`, as demonstrated in the example Service with ExternalTrafficPolicy Local.
If you dump the flows of this table, you may see the following:
1. table=ServiceLB, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=10.101.255.29,tp_dst=80 actions=set_field:0x200/0x200->reg0,set_field:0x20000/0x70000->reg4,set_field:0x9->reg7,group:9
2. table=ServiceLB, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=10.105.31.235,tp_dst=80 actions=set_field:0x200/0x200->reg0,set_field:0x20000/0x70000->reg4,set_field:0xc->reg7,group:10
3. table=ServiceLB, priority=200,tcp,reg4=0x90000/0xf0000,tp_dst=30004 actions=set_field:0x200/0x200->reg0,set_field:0x20000/0x70000->reg4,set_field:0x200000/0x200000->reg4,set_field:0xc->reg7,group:12
4. table=ServiceLB, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=192.168.77.150,tp_dst=80 actions=set_field:0x200/0x200->reg0,set_field:0x20000/0x70000->reg4,set_field:0xe->reg7,group:14
5. table=ServiceLB, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=192.168.77.200,tp_dst=80 actions=set_field:0x200/0x200->reg0,set_field:0x20000/0x70000->reg4,set_field:0x10->reg7,group:16
6. table=ServiceLB, priority=200,tcp,reg4=0x10000/0x70000,nw_dst=10.96.76.15,tp_dst=80 actions=set_field:0x200/0x200->reg0,set_field:0x30000/0x70000->reg4,set_field:0xa->reg7,group:11
7. table=ServiceLB, priority=190,tcp,reg4=0x30000/0x70000,nw_dst=10.96.76.15,tp_dst=80 actions=learn(table=SessionAffinity,hard_timeout=300,priority=200,delete_learned,cookie=0x203000000000a,\
eth_type=0x800,nw_proto=6,NXM_OF_TCP_DST[],NXM_OF_IP_DST[],NXM_OF_IP_SRC[],load:NXM_NX_REG4[0..15]->NXM_NX_REG4[0..15],load:NXM_NX_REG4[26]->NXM_NX_REG4[26],load:NXM_NX_REG3[]->NXM_NX_REG3[],load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[9]),\
set_field:0x20000/0x70000->reg4,goto_table:EndpointDNAT
8. table=ServiceLB, priority=210,tcp,reg4=0x10010000/0x10070000,nw_dst=192.168.77.151,tp_dst=80 actions=set_field:0x200/0x200->reg0,set_field:0x20000/0x70000->reg4,set_field:0x11->reg7,group:17
9. table=ServiceLB, priority=200,tcp,nw_dst=192.168.77.151,tp_dst=80 actions=set_field:0x200/0x200->reg0,set_field:0x20000/0x70000->reg4,set_field:0x12->reg7,group:18
10. table=ServiceLB, priority=0 actions=goto_table:EndpointDNAT
Flow 1 and flow 2 are designed for case 1, matching the first packet of connections destined for the sample ClusterIP without Endpoint or ClusterIP. This is achieved by matching EpToSelectRegMark
loaded in table SessionAffinity, clusterIP, and port. The target of the packet matched by the flow is an OVS group where the Endpoint will be selected. Before forwarding the packet to the OVS group, RewriteMACRegMark
, which will be consumed in table L3Forwarding, is loaded, indicating that the source and destination MAC addresses of the packets should be rewritten. EpSelectedRegMark
, which will be consumed in table EndpointDNAT, is also loaded, indicating that the Endpoint is selected. Note that the Service Endpoint selection is not completed yet, as it will be done in the target OVS group.
Flow 3 is for case 2, matching the first packet of connections destined for the sample NodePort. This is achieved by matching EpToSelectRegMark
loaded in table SessionAffinity, NodePortRegMark
loaded in table NodePortMark, and NodePort port. Similar to flows 1-2, RewriteMACRegMark
and EpSelectedRegMark
are also loaded.
Flow 4 is for case 3, processing the first packet of connections destined for the ingress IP of the sample LoadBalancer, similar to flow 1.
Flow 5 is for case 4, processing the first packet of connections destined for the external IP of the sample Service with ExternalIP, similar to flow 1.
Flow 6 is the initial process for case 5, matching the first packet of connections destined for the sample Service with Session Affinity. This is achieved by matching the conditions similar to flow 1. Like flow 1, the target of the flow is also an OVS group, and RewriteMACRegMark
is loaded. The difference is that EpToLearnRegMark
is loaded, rather than EpSelectedRegMark
, indicating that the selected Endpoint needs to be cached.
Flow 7 is the final process for case 5, matching the packet previously matched by flow 6, resubmitted back from the target OVS group after selecting an Endpoint. Then a learned flow will be generated in table SessionAffinity to match the packets of the subsequent connections from the same client IP, ensuring that the packets are always forwarded to the same Endpoint selected the first time. EpSelectedRegMark
, which will be consumed in table EndpointDNAT, is loaded, indicating that Service Endpoint selection has been done.
Flow 8 and flow 9 are for case 6. Flow 8 has a higher priority than flow 9, prioritizing the first packet of connections sourced from a local Pod or the local Node, matched with `FromLocalRegMark` loaded in table Classifier and destined for the sample Service with ExternalTrafficPolicy Local. The target of flow 8 is an OVS group that has all the Endpoints across the cluster, ensuring accessibility for Service connections originating from local Pods or Nodes, even though `externalTrafficPolicy` is set to `Local` for the Service. Due to the existence of flow 8, flow 9 exclusively matches packets sourced from the external network, resembling the pattern of flow 1. The target of flow 9 is an OVS group that has only the local Endpoints since `externalTrafficPolicy` of the Service is `Local`.
Flow 10 is the table-miss flow.
As mentioned above, the Service Endpoint selection is performed within OVS groups. 3 typical OVS groups are listed below:
1. group_id=9,type=select,\
bucket=bucket_id:0,weight:100,actions=set_field:0x4000/0x4000->reg0,resubmit(,EndpointDNAT)
2. group_id=10,type=select,\
bucket=bucket_id:0,weight:100,actions=set_field:0xa0a0018->reg3,set_field:0x50/0xffff->reg4,resubmit(,EndpointDNAT),\
bucket=bucket_id:1,weight:100,actions=set_field:0x4000000/0x4000000->reg4,set_field:0xa0a0106->reg3,set_field:0x50/0xffff->reg4,resubmit(,EndpointDNAT)
3. group_id=11,type=select,\
bucket=bucket_id:0,weight:100,actions=set_field:0xa0a0018->reg3,set_field:0x50/0xffff->reg4,resubmit(,ServiceLB),\
bucket=bucket_id:1,weight:100,actions=set_field:0x4000000/0x4000000->reg4,set_field:0xa0a0106->reg3,set_field:0x50/0xffff->reg4,resubmit(,ServiceLB)
The first group with `group_id` 9 is the destination of packets matched by flow 1, designed for a Service without Endpoints. The group only has a single bucket where `SvcNoEpRegMark`, which will be used in table EndpointDNAT, is loaded, indicating that the Service has no Endpoint; the packets are then forwarded to table EndpointDNAT.
The second group with group_id
10 is the destination of packets matched by flow 2, designed for a Service with Endpoints. The group has 2 buckets, indicating the availability of 2 selectable Endpoints. Each bucket has an equal chance of being chosen since they have the same weights. For every bucket, the Endpoint IP and Endpoint port are loaded into EndpointIPField
and EndpointPortField
, respectively. These loaded values will be consumed in table EndpointDNAT to which the packets are forwarded and in which DNAT will be performed. RemoteEndpointRegMark
is loaded for remote Endpoints, like the bucket with bucket_id
1 in this group.
The third group with group_id
11 is the destination of packets matched by flow 6, designed for a Service that has Endpoints and is configured with session affinity. The group closely resembles the group with group_id
10, except that the destination of the packets is table ServiceLB, rather than table EndpointDNAT. After being resubmitted back to table ServiceLB, they will be matched by flow 7.
EndpointDNAT
The table implements DNAT for Service connections after Endpoint selection is performed in table ServiceLB.
If you dump the flows of this table, you may see the following:
1. table=EndpointDNAT, priority=200,reg0=0x4000/0x4000 actions=controller(reason=no_match,id=62373,userdata=04)
2. table=EndpointDNAT, priority=200,tcp,reg3=0xa0a0018,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,nat(dst=10.10.0.24:80),exec(set_field:0x10/0x10->ct_mark,move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
3. table=EndpointDNAT, priority=200,tcp,reg3=0xa0a0106,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,nat(dst=10.10.1.6:80),exec(set_field:0x10/0x10->ct_mark,move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
4. table=EndpointDNAT, priority=190,reg4=0x20000/0x70000 actions=set_field:0x10000/0x70000->reg4,resubmit(,ServiceLB)
5. table=EndpointDNAT, priority=0 actions=goto_table:AntreaPolicyEgressRule
Flow 1 is designed for Services without Endpoints. It identifies the first packet of connections destined for such Service by matching SvcNoEpRegMark
. Subsequently, the packet is forwarded to the OpenFlow controller (Antrea Agent). For TCP Service traffic, the controller will send a TCP RST, and for all other cases the controller will send an ICMP Destination Unreachable message.
Flows 2-3 are designed for Services that have selected an Endpoint. These flows identify the first packet of connections destined for such Services by matching `EndpointIPField`, which stores the Endpoint IP, and `EpUnionField` (a combination of `EndpointPortField` storing the Endpoint port and `EpSelectedRegMark`). Then action `ct` is invoked on the packet, performing DNAT and forwarding it to table AntreaPolicyEgressRule with the “tracked” state associated with `CtZone`. Some bits of ct mark are persisted:
- `ServiceCTMark`, to be consumed in tables L3Forwarding and ConntrackCommit, indicating that the current packet and subsequent packets of the connection are for a Service.
- The value of `PktSourceField` is persisted to `ConnSourceCTMarkField`, storing the source of the connection for the current packet and subsequent packets of the connection.
Flow 4 is to resubmit the packets which are not matched by flows 1-3 back to table ServiceLB to select Endpoint again.
Flow 5 is the table-miss flow to match non-Service packets.
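Note that `reg3` values such as `0xa0a0018` in flows 2-3 are simply Endpoint IPv4 addresses encoded as 32-bit integers. A quick Python illustration:

```python
# The Endpoint IP travels between tables as a 32-bit integer in reg3.
import ipaddress

def ip_to_reg3(ip: str) -> int:
    """Encode a dotted-quad IPv4 address as the 32-bit reg3 value."""
    return int(ipaddress.IPv4Address(ip))

def reg3_to_ip(value: int) -> str:
    """Decode a reg3 value back to a dotted-quad IPv4 address."""
    return str(ipaddress.IPv4Address(value))

assert ip_to_reg3("10.10.0.24") == 0x0A0A0018  # matches flow 2 in EndpointDNAT
assert reg3_to_ip(0x0A0A0106) == "10.10.1.6"   # matches flow 3
```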
AntreaPolicyEgressRule
This table is used to implement the egress rules across all Antrea-native NetworkPolicies, except for NetworkPolicies that are created in the Baseline Tier. Antrea-native NetworkPolicies created in the Baseline Tier will be enforced after K8s NetworkPolicies and their egress rules are installed in tables EgressDefaultRule and EgressRule respectively, i.e.
Antrea-native NetworkPolicy other Tiers -> AntreaPolicyEgressRule
K8s NetworkPolicy -> EgressRule
Antrea-native NetworkPolicy Baseline Tier -> EgressDefaultRule
Antrea-native NetworkPolicy relies on the OVS built-in conjunction
action to implement policies efficiently. This enables us to do a conjunctive match across multiple dimensions (source IP, destination IP, port, etc.) efficiently without “exploding” the number of flows. For our use case, we have at most 3 dimensions.
The only requirement of `conj_id` is to be a unique 32-bit integer within the table. At the moment we use a single custom allocator, which is common to all tables that can have NetworkPolicy flows installed (AntreaPolicyEgressRule, EgressRule, EgressDefaultRule, AntreaPolicyIngressRule, IngressRule, and IngressDefaultRule).
For this table, you will need to keep in mind the Antrea-native NetworkPolicy specification. Since the sample egress policy resides in the Application Tier, if you dump the flows of this table, you may see the following:
1. table=AntreaPolicyEgressRule, priority=64990,ct_state=-new+est,ip actions=goto_table:EgressMetric
2. table=AntreaPolicyEgressRule, priority=64990,ct_state=-new+rel,ip actions=goto_table:EgressMetric
3. table=AntreaPolicyEgressRule, priority=14500,ip,nw_src=10.10.0.24 actions=conjunction(7,1/3)
4. table=AntreaPolicyEgressRule, priority=14500,ip,nw_dst=10.10.0.25 actions=conjunction(7,2/3)
5. table=AntreaPolicyEgressRule, priority=14500,tcp,tp_dst=3306 actions=conjunction(7,3/3)
6. table=AntreaPolicyEgressRule, priority=14500,conj_id=7,ip actions=set_field:0x7->reg5,ct(commit,table=EgressMetric,zone=65520,exec(set_field:0x700000000/0xffffffff00000000->ct_label))
7. table=AntreaPolicyEgressRule, priority=14499,ip,nw_src=10.10.0.24 actions=conjunction(5,1/2)
8. table=AntreaPolicyEgressRule, priority=14499,ip actions=conjunction(5,2/2)
9. table=AntreaPolicyEgressRule, priority=14499,conj_id=5 actions=set_field:0x5->reg3,set_field:0x400/0x400->reg0,goto_table:EgressMetric
10. table=AntreaPolicyEgressRule, priority=0 actions=goto_table:EgressRule
Flows 1-2, which are installed by default with the highest priority, match non-new and “tracked” packets and forward them to table EgressMetric to bypass the check from egress rules. This means that if a connection is established, its packets go straight to table EgressMetric, with no other match required. In particular, this ensures that reply traffic is never dropped because of an Antrea-native NetworkPolicy or K8s NetworkPolicy rule. However, this also means that ongoing connections are not affected if the Antrea-native NetworkPolicy or the K8s NetworkPolicy is updated.
The priorities of flows 3-9 installed for the egress rules are decided by the following:
- The `spec.tier` value in an Antrea-native NetworkPolicy determines the primary level for flow priority.
- The `spec.priority` value in an Antrea-native NetworkPolicy determines the secondary level for flow priority within the same `spec.tier`. A lower value in this field corresponds to a higher priority for the flow.
- The rule’s position within an Antrea-native NetworkPolicy also influences flow priority. Rules positioned closer to the beginning have higher priority for the flow.
Flows 3-6, whose priorities are all 14500, are installed for the egress rule AllowToDB
in the sample policy. These flows are described as follows:
- Flow 3 is used to match packets with the source IP address in set {10.10.0.24}, which has all IP addresses of the Pods selected by the label `app: web`, constituting the first dimension for `conjunction` with `conj_id` 7.
- Flow 4 is used to match packets with the destination IP address in set {10.10.0.25}, which has all IP addresses of the Pods selected by the label `app: db`, constituting the second dimension for `conjunction` with `conj_id` 7.
- Flow 5 is used to match packets with the destination TCP port in set {3306} specified in the rule, constituting the third dimension for `conjunction` with `conj_id` 7.
- Flow 6 is used to match packets meeting all the three dimensions of `conjunction` with `conj_id` 7 and forward them to table EgressMetric, persisting `conj_id` to `EgressRuleCTLabel`, which will be consumed in table EgressMetric.
Flows 7-9, whose priorities are all 14499, are installed for the egress rule with a Drop
action defined after the rule AllowToDB
in the sample policy, and serves as a default rule. Antrea-native NetworkPolicy does not have the same default isolated behavior as K8s NetworkPolicy (implemented in the EgressDefaultRule table). As soon as a rule is matched, we apply the corresponding action. If no rule is matched, there is no implicit drop for Pods to which an Antrea-native NetworkPolicy applies. These flows are described as follows:
- Flow 7 is used to match packets with the source IP address in set {10.10.0.24}, which is from the Pods selected by the label `app: web`, constituting the first dimension for `conjunction` with `conj_id` 5.
- Flow 8 is used to match any IP packets, constituting the second dimension for `conjunction` with `conj_id` 5. This flow, which matches all IP packets, exists because we need at least 2 dimensions for a conjunctive match.
- Flow 9 is used to match packets meeting both dimensions of `conjunction` with `conj_id` 5. `APDenyRegMark` is loaded and will be consumed in table EgressMetric, to which the packets are forwarded.
Flow 10 is the table-miss flow to forward packets not matched by other flows to table EgressMetric.
EgressRule
For this table, you will need to keep in mind the K8s NetworkPolicy specification that we are using.
This table is used to implement the egress rules across all K8s NetworkPolicies. If you dump the flows for this table, you may see the following:
1. table=EgressRule, priority=200,ip,nw_src=10.10.0.24 actions=conjunction(2,1/3)
2. table=EgressRule, priority=200,ip,nw_dst=10.10.0.25 actions=conjunction(2,2/3)
3. table=EgressRule, priority=200,tcp,tp_dst=3306 actions=conjunction(2,3/3)
4. table=EgressRule, priority=190,conj_id=2,ip actions=set_field:0x2->reg5,ct(commit,table=EgressMetric,zone=65520,exec(set_field:0x200000000/0xffffffff00000000->ct_label))
5. table=EgressRule, priority=0 actions=goto_table:EgressDefaultRule
Flows 1-4 are installed for the egress rule in the sample K8s NetworkPolicy. These flows are described as follows:
- Flow 1 is to match packets with the source IP address in set {10.10.0.24}, which has all IP addresses of the Pods selected by the label `app: web` in the `default` Namespace, constituting the first dimension for `conjunction` with `conj_id` 2.
- Flow 2 is to match packets with the destination IP address in set {10.10.0.25}, which has all IP addresses of the Pods selected by the label `app: db` in the `default` Namespace, constituting the second dimension for `conjunction` with `conj_id` 2.
- Flow 3 is to match packets with the destination TCP port in set {3306} specified in the rule, constituting the third dimension for `conjunction` with `conj_id` 2.
- Flow 4 is to match packets meeting all the three dimensions of `conjunction` with `conj_id` 2 and forward them to table EgressMetric, persisting `conj_id` to `EgressRuleCTLabel`.
Flow 5 is the table-miss flow to forward packets not matched by other flows to table EgressDefaultRule.
EgressDefaultRule
This table complements table EgressRule for K8s NetworkPolicy egress rule implementation. When a NetworkPolicy is applied to a set of Pods, then the default behavior for egress connections for these Pods becomes “deny” (they become isolated Pods). This table is in charge of dropping traffic originating from Pods to which a NetworkPolicy (with an egress rule) is applied, and which did not match any of the “allowed” list rules.
If you dump the flows of this table, you may see the following:
1. table=EgressDefaultRule, priority=200,ip,nw_src=10.10.0.24 actions=drop
2. table=EgressDefaultRule, priority=0 actions=goto_table:EgressMetric
Flow 1, based on our sample K8s NetworkPolicy, is to drop traffic originating from 10.10.0.24, an IP address associated with a Pod selected by the label app: web
. If there are multiple Pods being selected by the label app: web
, you will see multiple similar flows for each IP address.
Flow 2 is the table-miss flow to forward packets to table EgressMetric.
This table is also used to implement Antrea-native NetworkPolicy egress rules that are created in the Baseline Tier. Since the Baseline Tier is meant to be enforced after K8s NetworkPolicies, the corresponding flows will be created at a lower priority than K8s NetworkPolicy default drop flows. These flows are similar to flows 3-9 in table AntreaPolicyEgressRule. For the sake of simplicity, we have not defined any example Baseline policies in this document.
EgressMetric
This table is used to collect egress metrics for Antrea-native NetworkPolicies and K8s NetworkPolicies.
If you dump the flows of this table, you may see the following:
1. table=EgressMetric, priority=200,ct_state=+new,ct_label=0x200000000/0xffffffff00000000,ip actions=goto_table:L3Forwarding
2. table=EgressMetric, priority=200,ct_state=-new,ct_label=0x200000000/0xffffffff00000000,ip actions=goto_table:L3Forwarding
3. table=EgressMetric, priority=200,ct_state=+new,ct_label=0x700000000/0xffffffff00000000,ip actions=goto_table:L3Forwarding
4. table=EgressMetric, priority=200,ct_state=-new,ct_label=0x700000000/0xffffffff00000000,ip actions=goto_table:L3Forwarding
5. table=EgressMetric, priority=200,reg0=0x400/0x400,reg3=0x5 actions=drop
6. table=EgressMetric, priority=0 actions=goto_table:L3Forwarding
Flows 1-2, matching packets with `EgressRuleCTLabel` set to 2, the `conj_id` allocated for the sample K8s NetworkPolicy egress rule and loaded in table EgressRule flow 4, are used to collect metrics for the egress rule.
Flows 3-4, matching packets with `EgressRuleCTLabel` set to 7, the `conj_id` allocated for the sample Antrea-native NetworkPolicy egress rule and loaded in table AntreaPolicyEgressRule flow 6, are used to collect metrics for the egress rule.
Flow 5 serves as the drop rule for the sample Antrea-native NetworkPolicy egress rule. It drops packets by matching `APDenyRegMark` loaded in table AntreaPolicyEgressRule flow 9 and `APConjIDField` set to 5, which is the `conj_id` allocated for the egress rule and loaded in table AntreaPolicyEgressRule flow 9.
These flows have no explicit action besides the `goto_table` action. This is because we rely on the “implicit” flow counters to keep track of connection / packet statistics.
A ct label is used in flows 1-4, while a reg mark is used in flow 5. The distinction lies in the fact that the value persisted in a ct label can be read throughout the entire lifecycle of a connection, while a reg mark is only valid for the current packet. For a connection permitted by a rule, all its packets should be counted for metrics, thus a ct label is used. For a connection denied or dropped by a rule, only the first packet and the subsequent retry packets will be blocked, therefore a reg mark is enough.
Flow 6 is the table-miss flow.
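As a quick illustration of the ct label arithmetic in flows 1-4: `EgressRuleCTLabel` occupies bits 32-63 of the label, which is why the flows match with mask `0xffffffff00000000`. A minimal Python sketch, using only values from the dump above:

```python
# EgressRuleCTLabel sits in bits 32-63 of the 128-bit ct label.
EGRESS_RULE_CT_LABEL_MASK = 0xFFFFFFFF00000000

def egress_rule_id(ct_label: int) -> int:
    """Recover the conj_id persisted to EgressRuleCTLabel."""
    return (ct_label & EGRESS_RULE_CT_LABEL_MASK) >> 32

assert egress_rule_id(0x200000000) == 2  # sample K8s NetworkPolicy egress rule
assert egress_rule_id(0x700000000) == 7  # sample Antrea-native NetworkPolicy egress rule
```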
L3Forwarding
This table, designated as the L3 routing table, serves to assign suitable source and destination MAC addresses to packets based on their destination IP addresses, as well as their reg marks or ct marks.
If you dump the flows of this table, you may see the following:
1. table=L3Forwarding, priority=210,ip,nw_dst=10.10.0.1 actions=set_field:ba:5e:d1:55:aa:c0->eth_dst,set_field:0x20/0xf0->reg0,goto_table:L3DecTTL
2. table=L3Forwarding, priority=210,ct_state=+rpl+trk,ct_mark=0x2/0xf,ip actions=set_field:ba:5e:d1:55:aa:c0->eth_dst,set_field:0x20/0xf0->reg0,goto_table:L3DecTTL
3. table=L3Forwarding, priority=200,ip,reg0=0/0x200,nw_dst=10.10.0.0/24 actions=goto_table:L2ForwardingCalc
4. table=L3Forwarding, priority=200,ip,nw_dst=10.10.1.0/24 actions=set_field:ba:5e:d1:55:aa:c0->eth_src,set_field:aa:bb:cc:dd:ee:ff->eth_dst,set_field:192.168.77.103->tun_dst,set_field:0x10/0xf0->reg0,goto_table:L3DecTTL
5. table=L3Forwarding, priority=200,ip,reg0=0x200/0x200,nw_dst=10.10.0.24 actions=set_field:ba:5e:d1:55:aa:c0->eth_src,set_field:fa:b7:53:74:21:a6->eth_dst,goto_table:L3DecTTL
6. table=L3Forwarding, priority=200,ip,reg0=0x200/0x200,nw_dst=10.10.0.25 actions=set_field:ba:5e:d1:55:aa:c0->eth_src,set_field:36:48:21:a2:9d:b4->eth_dst,goto_table:L3DecTTL
7. table=L3Forwarding, priority=200,ip,reg0=0x200/0x200,nw_dst=10.10.0.26 actions=set_field:ba:5e:d1:55:aa:c0->eth_src,set_field:5e:b5:e3:a6:90:b7->eth_dst,goto_table:L3DecTTL
8. table=L3Forwarding, priority=190,ct_state=-rpl+trk,ip,reg0=0x3/0xf,reg4=0/0x100000 actions=goto_table:EgressMark
9. table=L3Forwarding, priority=190,ct_state=-rpl+trk,ip,reg0=0x1/0xf actions=set_field:ba:5e:d1:55:aa:c0->eth_dst,goto_table:EgressMark
10. table=L3Forwarding, priority=190,ct_mark=0x10/0x10,reg0=0x202/0x20f actions=set_field:ba:5e:d1:55:aa:c0->eth_dst,set_field:0x20/0xf0->reg0,goto_table:L3DecTTL
11. table=L3Forwarding, priority=0 actions=set_field:0x20/0xf0->reg0,goto_table:L2ForwardingCalc
Flow 1 matches packets destined for the local Antrea gateway IP, rewrites their destination MAC address to that of the local Antrea gateway, loads `ToGatewayRegMark`, and forwards them to table L3DecTTL to decrement the TTL value. Rewriting the destination MAC address is unnecessary but harmless for Pod-to-gateway request packets, because the destination MAC address is already the local gateway MAC address. In short, the action is only required for `AntreaIPAM` Pods, not for the sample NodeIPAM Pods in this document.
Flow 2 matches reply packets with the corresponding ct “tracked” states and `FromGatewayCTMark` from connections initiated through the local Antrea gateway. In other words, these are connections for which the first packet of the connection (SYN packet for TCP) was received through the local Antrea gateway. It rewrites the destination MAC address to that of the local Antrea gateway, loads `ToGatewayRegMark`, and forwards them to table L3DecTTL. This ensures that reply packets can be forwarded back to the local Antrea gateway in subsequent tables. This flow is required to handle the following cases when AntreaProxy is not enabled:
- Reply traffic for connections from a local Pod to a ClusterIP Service, which are handled by kube-proxy and go through DNAT. In this case, the destination IP address of the reply traffic is that of the Pod which initiated the connection to the Service (no SNAT by kube-proxy). These packets should be forwarded back to the local Antrea gateway so that the third-party module (e.g., kube-proxy) can complete the DNAT process. The destination MAC of the packets is rewritten in this table to avoid forwarding them to the original client Pod by mistake.
- When hairpin is involved, i.e. connections between 2 local Pods, for which NAT is performed. One example is a Pod accessing a NodePort Service for which externalTrafficPolicy is set to `Local` using the local Node’s IP address, as there will be no SNAT for such traffic. Another example could be hostPort support, depending on how the feature is implemented.
Flow 3 matches packets from intra-Node connections (excluding Service connections) that are marked with `NotRewriteMACRegMark`, indicating that the destination and source MACs of the packets should not be overwritten, and forwards them to table L2ForwardingCalc instead of table L3DecTTL. The deviation is due to local Pod connections not traversing any router device or undergoing NAT. For packets from Service or inter-Node connections, `RewriteMACRegMark`, mutually exclusive with `NotRewriteMACRegMark`, is loaded, so those packets will not be matched by this flow.
Flow 4 is designed to match packets destined for a remote Pod CIDR. This involves installing a separate flow for each remote Node, with each flow matching the destination IP address of the packets against the Pod subnet for the respective Node. For the matched packets, the source MAC address is set to that of the local Antrea gateway MAC, and the destination MAC address is set to the Global Virtual MAC. The OpenFlow `tun_dst` field is set to the appropriate value (i.e. the IP address of the remote Node). Additionally, `ToTunnelRegMark` is loaded, signifying that the packets will be forwarded to remote Nodes through a tunnel. The matched packets are then forwarded to table L3DecTTL to decrement the TTL value.
Flows 5-7 match packets destined for local Pods and marked by `RewriteMACRegMark`, which signifies that the packets may originate from Service or inter-Node connections. For the matched packets, the source MAC address is set to that of the local Antrea gateway MAC, and the destination MAC address is set to the associated local Pod MAC address. The matched packets are then forwarded to table L3DecTTL to decrement the TTL value.
Flow 8 matches request packets originating from local Pods and destined for the external network, and then forwards them to table EgressMark, dedicated to the feature `Egress`. In table EgressMark, SNAT IPs for Egress are looked up for the packets. To match the expected packets, `FromPodRegMark` is used to exclude packets that are not from local Pods. Additionally, `NotAntreaFlexibleIPAMRegMark`, mutually exclusive with `AntreaFlexibleIPAMRegMark` which is used to mark packets from Antrea IPAM Pods, is used since Egress can only be applied to Node IPAM Pods.
It’s worth noting that packets sourced from local Pods and destined for the Services listed in the option `antreaProxy.skipServices` are unexpectedly matched by flow 8, because there is no flow in table ServiceLB to handle these Services, so the destination IP address of the packets, allocated from the Service CIDR, is considered part of the “external network”. The mismatch is not a concern, as flow 3 in table EgressMark is designed to match these packets and prevent them from undergoing SNAT by Egress.
Flow 9 matches request packets originating from remote Pods and destined for the external network, and then forwards them to table EgressMark, dedicated to the feature `Egress`. To match the expected packets, `FromTunnelRegMark` is used to include packets that are from remote Pods through a tunnel. Considering that the packets from remote Pods traverse a tunnel, the destination MAC address of the packets, represented by the Global Virtual MAC, needs to be rewritten to the MAC address of the local Antrea gateway.
Flow 10 matches packets from Service connections that are originating from the local Antrea gateway and destined for the external network. This is accomplished by matching `RewriteMACRegMark`, `FromGatewayRegMark`, and `ServiceCTMark`. The destination MAC address is then set to that of the local Antrea gateway. Additionally, `ToGatewayRegMark`, which will be used together with `FromGatewayRegMark` to identify hairpin connections in table SNATMark, is loaded. Finally, the packets are forwarded to table L3DecTTL.
Flow 11 is the table-miss flow. It matches packets originating from local Pods and destined for the external network, and forwards them to table L2ForwardingCalc. `ToGatewayRegMark` is loaded, as the matched packets traverse the local Antrea gateway.
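The reg0 marks referenced throughout this table can be decoded mechanically. The following Python sketch uses only values visible in the flow dumps in this document (source marks in bits 0-3, destination marks in bits 4-7, `RewriteMACRegMark` at bit 0x200); the human-readable names follow this document’s terminology:

```python
# reg0 layout as used by the flows above (values taken from the dumps):
#   bits 0-3: packet source (0x1 tunnel, 0x2 gateway, 0x3 local Pod)
#   bits 4-7: packet destination (0x1 tunnel, 0x2 gateway)
#   bit 0x200: RewriteMACRegMark
SOURCE_MARKS = {0x1: "FromTunnel", 0x2: "FromGateway", 0x3: "FromPod"}
DEST_MARKS = {0x1: "ToTunnel", 0x2: "ToGateway"}
REWRITE_MAC_BIT = 0x200

def decode_reg0(reg0: int):
    """Return (source mark, destination mark, rewrite-MAC flag)."""
    return (
        SOURCE_MARKS.get(reg0 & 0xF, "unknown"),
        DEST_MARKS.get((reg0 & 0xF0) >> 4, "unset"),
        bool(reg0 & REWRITE_MAC_BIT),
    )

# Flow 10 matches reg0=0x202/0x20f: FromGatewayRegMark plus RewriteMACRegMark.
assert decode_reg0(0x202) == ("FromGateway", "unset", True)
# set_field:0x20/0xf0->reg0 (e.g. flow 1) loads ToGatewayRegMark.
assert decode_reg0(0x23)[1] == "ToGateway"
```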
EgressMark
This table is dedicated to the feature `Egress`. It includes flows to select the right SNAT IPs for egress traffic originating from Pods and destined for the external network.
If you dump the flows of this table, you may see the following:
1. table=EgressMark, priority=210,ip,nw_dst=192.168.77.102 actions=set_field:0x20/0xf0->reg0,goto_table:L2ForwardingCalc
2. table=EgressMark, priority=210,ip,nw_dst=192.168.77.103 actions=set_field:0x20/0xf0->reg0,goto_table:L2ForwardingCalc
3. table=EgressMark, priority=210,ip,nw_dst=10.96.0.0/12 actions=set_field:0x20/0xf0->reg0,goto_table:L2ForwardingCalc
4. table=EgressMark, priority=200,ip,in_port="client-6-3353ef" actions=set_field:ba:5e:d1:55:aa:c0->eth_src,set_field:aa:bb:cc:dd:ee:ff->eth_dst,set_field:192.168.77.113->tun_dst,set_field:0x10/0xf0->reg0,set_field:0x80000/0x80000->reg0,goto_table:L2ForwardingCalc
5. table=EgressMark, priority=200,ct_state=+new+trk,ip,tun_dst=192.168.77.112 actions=set_field:0x1/0xff->pkt_mark,set_field:0x20/0xf0->reg0,goto_table:L2ForwardingCalc
6. table=EgressMark, priority=200,ct_state=+new+trk,ip,in_port="web-7975-274540" actions=set_field:0x1/0xff->pkt_mark,set_field:0x20/0xf0->reg0,goto_table:L2ForwardingCalc
7. table=EgressMark, priority=190,ct_state=+new+trk,ip,reg0=0x1/0xf actions=drop
8. table=EgressMark, priority=0 actions=set_field:0x20/0xf0->reg0,goto_table:L2ForwardingCalc
Flows 1-2 match packets originating from local Pods and destined for the transport IPs of remote Nodes, and then forward them to table L2ForwardingCalc to bypass Egress SNAT. `ToGatewayRegMark` is loaded, indicating that the output port of the packets is the local Antrea gateway.
Flow 3 matches packets originating from local Pods and destined for the Services listed in the option `antreaProxy.skipServices`, and then forwards them to table L2ForwardingCalc to bypass Egress SNAT. Similar to flows 1-2, `ToGatewayRegMark` is also loaded.
The packets, matched by flows 1-3, are forwarded to this table by flow 8 in table L3Forwarding, as they are classified as part of traffic destined for the external network. However, these packets are not intended to undergo Egress SNAT. Consequently, flows 1-3 are used to bypass Egress SNAT for these packets.
Flow 4 matches packets originating from local Pods selected by the sample Egress egress-client, whose SNAT IP is configured on a remote Node, which means that the matched packets should be forwarded to the remote Node through a tunnel. Before sending the packets to the tunnel, the source and destination MAC addresses are set to the local Antrea gateway MAC and the Global Virtual MAC respectively. Additionally, `ToTunnelRegMark`, indicating that the output port is a tunnel, and `EgressSNATRegMark`, indicating that the packets should undergo SNAT on a remote Node, are loaded. Finally, the packets are forwarded to table L2ForwardingCalc.
Flow 5 matches the first packet of connections originating from remote Pods selected by the sample Egress egress-web, whose SNAT IP is configured on the local Node, and then loads an 8-bit ID allocated for the associated SNAT IP defined in the sample Egress to the `pkt_mark`, which will be consumed by iptables on the local Node to perform SNAT with the SNAT IP. Subsequently, `ToGatewayRegMark`, indicating that the output port is the local Antrea gateway, is loaded. Finally, the packets are forwarded to table L2ForwardingCalc.
Flow 6 matches the first packet of connections originating from local Pods selected by the sample Egress egress-web, whose SNAT IP is configured on the local Node. Similar to flow 5, the 8-bit ID allocated for the SNAT IP is loaded to `pkt_mark`, `ToGatewayRegMark` is loaded, and finally the packets are forwarded to table L2ForwardingCalc.
Flow 7 drops all other packets tunneled from remote Nodes, identified with `FromTunnelRegMark`, indicating that the packets are from remote Pods through a tunnel. Such packets are not matched by any of flows 1-6, which means that they are unexpected here and should be dropped.
Flow 8 is the table-miss flow, which matches “tracked” and non-new packets from Egress connections and forwards them to table L2ForwardingCalc. `ToGatewayRegMark` is also loaded for these packets.
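Flows 5-6 effectively delegate the SNAT itself to iptables by tagging packets with an 8-bit ID. A hedged Python sketch of that tagging (the SNAT-IP-to-ID mapping below is hypothetical; the real allocation is done by antrea-agent):

```python
# Hypothetical mapping from an Egress SNAT IP to its allocated 8-bit ID.
# In flows 5-6 above, the ID (0x1) is loaded with set_field:0x1/0xff->pkt_mark,
# and iptables on the Node matches that mark to pick the SNAT source address.
SNAT_IP_IDS = {"192.0.2.10": 0x1}  # placeholder SNAT IP, for illustration only

def load_snat_pkt_mark(pkt_mark: int, snat_ip: str) -> int:
    """Set the low 8 bits of pkt_mark to the ID allocated for snat_ip,
    leaving the other bits untouched (the /0xff mask in the flow)."""
    return (pkt_mark & ~0xFF) | SNAT_IP_IDS[snat_ip]

assert load_snat_pkt_mark(0x0, "192.0.2.10") == 0x1
assert load_snat_pkt_mark(0xAB00, "192.0.2.10") == 0xAB01  # upper bits preserved
```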
L3DecTTL
This is the table to decrement TTL for IP packets.
If you dump the flows of this table, you may see the following:
1. table=L3DecTTL, priority=210,ip,reg0=0x2/0xf actions=goto_table:SNATMark
2. table=L3DecTTL, priority=200,ip actions=dec_ttl,goto_table:SNATMark
3. table=L3DecTTL, priority=0 actions=goto_table:SNATMark
Flow 1 matches packets with `FromGatewayRegMark`, which means that these packets enter the OVS pipeline from the local Antrea gateway. Since the host IP stack should have already decremented the TTL for such packets, the TTL should not be decremented again.
Flow 2 is to decrement TTL for packets which are not matched by flow 1.
Flow 3 is the table-miss flow that should remain unused.
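The table’s logic reduces to a single condition. A small Python sketch, using the reg0 value from flow 1 (`FromGatewayRegMark`, reg0=0x2/0xf):

```python
FROM_GATEWAY = 0x2  # source mark in bits 0-3 of reg0, per the dumps above

def l3_dec_ttl(ttl: int, reg0: int) -> int:
    """Decrement TTL unless the packet entered via the local Antrea gateway,
    where the host IP stack has already decremented it (flow 1)."""
    if reg0 & 0xF == FROM_GATEWAY:
        return ttl          # flow 1: skip dec_ttl
    return ttl - 1          # flow 2: dec_ttl

assert l3_dec_ttl(64, 0x2) == 64
assert l3_dec_ttl(64, 0x3) == 63
```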
SNATMark
This table marks connections requiring SNAT within the OVS pipeline, distinct from Egress SNAT handled by iptables.
If you dump the flows of this table, you may see the following:
1. table=SNATMark, priority=200,ct_state=+new+trk,ip,reg0=0x22/0xff actions=ct(commit,table=SNAT,zone=65520,exec(set_field:0x20/0x20->ct_mark,set_field:0x40/0x40->ct_mark))
2. table=SNATMark, priority=200,ct_state=+new+trk,ip,reg0=0x12/0xff,reg4=0x200000/0x2200000 actions=ct(commit,table=SNAT,zone=65520,exec(set_field:0x20/0x20->ct_mark))
3. table=SNATMark, priority=190,ct_state=+new+trk,ip,nw_src=10.10.0.23,nw_dst=10.10.0.23 actions=ct(commit,table=SNAT,zone=65520,exec(set_field:0x20/0x20->ct_mark,set_field:0x40/0x40->ct_mark))
4. table=SNATMark, priority=190,ct_state=+new+trk,ip,nw_src=10.10.0.24,nw_dst=10.10.0.24 actions=ct(commit,table=SNAT,zone=65520,exec(set_field:0x20/0x20->ct_mark,set_field:0x40/0x40->ct_mark))
5. table=SNATMark, priority=0 actions=goto_table:SNAT
Flow 1 matches the first packet of hairpin Service connections, identified by `FromGatewayRegMark` and `ToGatewayRegMark`, indicating that both the input and output ports of the connections are the local Antrea gateway port. Such hairpin connections will undergo SNAT with the Virtual Service IP in table SNAT. Before forwarding the packets to table SNAT, `ConnSNATCTMark`, indicating that the connection requires SNAT, and `HairpinCTMark`, indicating that this is a hairpin connection, are persisted to mark the connections. These two ct marks will be consumed in table SNAT.
Flow 2 matches the first packet of Service connections requiring SNAT, identified by `FromGatewayRegMark` and `ToTunnelRegMark`, indicating that the input port is the local Antrea gateway and the output port is a tunnel. Such connections will undergo SNAT with the IP address of the local Antrea gateway in table SNAT. Before forwarding the packets to table SNAT, `ToExternalAddressRegMark` and `NotDSRServiceRegMark` are loaded, indicating that the packets are destined for a Service’s external IP, like a NodePort, LoadBalancer IP or external IP, but not in DSR mode. Additionally, `ConnSNATCTMark`, indicating that the connection requires SNAT, is persisted to mark the connections.
It’s worth noting that flows 1-2 are specific to `proxyAll`, but they are harmless when `proxyAll` is disabled since these flows should never be matched by in-cluster Service traffic.
Flows 3-4 match the first packet of hairpin Service connections, identified by the same source and destination Pod IP addresses. Such hairpin connections will undergo SNAT with the IP address of the local Antrea gateway in table SNAT. Similar to flow 1, `ConnSNATCTMark` and `HairpinCTMark` are persisted to mark the connections.
Flow 5 is the table-miss flow.
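The two ways this table detects hairpin connections (flow 1 via the reg0 marks, flows 3-4 via identical source and destination IPs) can be sketched as a single predicate. Illustrative Python, with mark values taken from the dumps above:

```python
FROM_GATEWAY, TO_GATEWAY = 0x2, 0x2  # bits 0-3 and 4-7 of reg0 respectively

def needs_hairpin_snat(src_mark: int, dst_mark: int, nw_src: str, nw_dst: str) -> bool:
    """True if the first packet should be committed with HairpinCTMark."""
    via_gateway_both_ways = src_mark == FROM_GATEWAY and dst_mark == TO_GATEWAY  # flow 1
    pod_to_itself = nw_src == nw_dst                                             # flows 3-4
    return via_gateway_both_ways or pod_to_itself

assert needs_hairpin_snat(0x2, 0x2, "1.2.3.4", "5.6.7.8")            # flow 1 (reg0=0x22/0xff)
assert needs_hairpin_snat(0x3, 0x1, "10.10.0.24", "10.10.0.24")      # flow 4 case
assert not needs_hairpin_snat(0x3, 0x1, "10.10.0.24", "10.10.0.25")  # ordinary Pod traffic
```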
SNAT
This table performs SNAT for connections requiring SNAT within the pipeline.
If you dump the flows of this table, you may see the following:
1. table=SNAT, priority=200,ct_state=+new+trk,ct_mark=0x40/0x40,ip,reg0=0x2/0xf actions=ct(commit,table=L2ForwardingCalc,zone=65521,nat(src=169.254.0.253),exec(set_field:0x10/0x10->ct_mark,set_field:0x40/0x40->ct_mark))
2. table=SNAT, priority=200,ct_state=+new+trk,ct_mark=0x40/0x40,ip,reg0=0x3/0xf actions=ct(commit,table=L2ForwardingCalc,zone=65521,nat(src=10.10.0.1),exec(set_field:0x10/0x10->ct_mark,set_field:0x40/0x40->ct_mark))
3. table=SNAT, priority=200,ct_state=-new-rpl+trk,ct_mark=0x20/0x20,ip actions=ct(table=L2ForwardingCalc,zone=65521,nat)
4. table=SNAT, priority=190,ct_state=+new+trk,ct_mark=0x20/0x20,ip,reg0=0x2/0xf actions=ct(commit,table=L2ForwardingCalc,zone=65521,nat(src=10.10.0.1),exec(set_field:0x10/0x10->ct_mark))
5. table=SNAT, priority=0 actions=goto_table:L2ForwardingCalc
Flow 1 matches the first packet of hairpin Service connections through the local Antrea gateway, identified by `HairpinCTMark` and `FromGatewayRegMark`. It performs SNAT with the Virtual Service IP `169.254.0.253` and forwards the SNAT’d packets to table L2ForwardingCalc. Before SNAT, the “tracked” state of the packets is associated with `CtZone`. After SNAT, their “tracked” state is associated with `SNATCtZone`, so `ServiceCTMark` and `HairpinCTMark` persisted in `CtZone` are no longer accessible. As a result, `ServiceCTMark` and `HairpinCTMark` need to be persisted once again, but this time in `SNATCtZone`, for subsequent tables to consume.
Flow 2 matches the first packet of hairpin Service connections originating from local Pods, identified by `HairpinCTMark` and `FromPodRegMark`. It performs SNAT with the IP address of the local Antrea gateway and forwards the SNAT’d packets to table L2ForwardingCalc. Similar to flow 1, `ServiceCTMark` and `HairpinCTMark` are persisted in `SNATCtZone`.
Flow 3 matches the subsequent request packets of connections for which SNAT was performed for the first packet, and then invokes the `ct` action on the packets again to restore the “tracked” state in `SNATCtZone`. The packets with the appropriate “tracked” state are forwarded to table L2ForwardingCalc.
Flow 4 matches the first packet of Service connections requiring SNAT, identified by `ConnSNATCTMark` and `FromGatewayRegMark`, indicating that the connection is destined for an external Service IP, initiated through the local Antrea gateway, and that the Endpoint is a remote Pod. It performs SNAT with the IP address of the local Antrea gateway and forwards the SNAT’d packets to table L2ForwardingCalc. Similar to flows 1 and 2, `ServiceCTMark` is persisted in `SNATCtZone`.
Flow 5 is the table-miss flow.
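The source-address choice made by flows 1, 2 and 4 can be summarized in a few lines. An illustrative Python sketch, with the Virtual Service IP and gateway IP taken from the dumps above:

```python
VIRTUAL_SERVICE_IP = "169.254.0.253"  # used by flow 1
GATEWAY_IP = "10.10.0.1"              # used by flows 2 and 4

def select_snat_ip(hairpin: bool, conn_snat: bool, source: str):
    """Pick the SNAT source IP for the first packet of a connection.
    'hairpin' is HairpinCTMark, 'conn_snat' is ConnSNATCTMark,
    'source' is the reg0 source mark ('gateway' or 'pod')."""
    if hairpin and source == "gateway":    # flow 1
        return VIRTUAL_SERVICE_IP
    if hairpin and source == "pod":        # flow 2
        return GATEWAY_IP
    if conn_snat and source == "gateway":  # flow 4
        return GATEWAY_IP
    return None                            # table-miss: no SNAT in the pipeline

assert select_snat_ip(True, True, "gateway") == "169.254.0.253"
assert select_snat_ip(True, True, "pod") == "10.10.0.1"
assert select_snat_ip(False, True, "gateway") == "10.10.0.1"
assert select_snat_ip(False, False, "pod") is None
```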
L2ForwardingCalc
This is essentially the “dmac” table of the switch. We program one flow for each port (tunnel port, the local Antrea gateway port, and local Pod ports).
If you dump the flows of this table, you may see the following:
1. table=L2ForwardingCalc, priority=200,dl_dst=ba:5e:d1:55:aa:c0 actions=set_field:0x2->reg1,set_field:0x200000/0x600000->reg0,goto_table:TrafficControl
2. table=L2ForwardingCalc, priority=200,dl_dst=aa:bb:cc:dd:ee:ff actions=set_field:0x1->reg1,set_field:0x200000/0x600000->reg0,goto_table:TrafficControl
3. table=L2ForwardingCalc, priority=200,dl_dst=5e:b5:e3:a6:90:b7 actions=set_field:0x24->reg1,set_field:0x200000/0x600000->reg0,goto_table:TrafficControl
4. table=L2ForwardingCalc, priority=200,dl_dst=fa:b7:53:74:21:a6 actions=set_field:0x25->reg1,set_field:0x200000/0x600000->reg0,goto_table:TrafficControl
5. table=L2ForwardingCalc, priority=200,dl_dst=36:48:21:a2:9d:b4 actions=set_field:0x26->reg1,set_field:0x200000/0x600000->reg0,goto_table:TrafficControl
6. table=L2ForwardingCalc, priority=0 actions=goto_table:TrafficControl
Flow 1 matches packets destined for the local Antrea gateway, identified by the destination MAC address being that of the local Antrea gateway. It loads `OutputToOFPortRegMark`, indicating that the packets should be output to an OVS port, and also loads the port number of the local Antrea gateway to `TargetOFPortField`. Both of these values will be consumed in table Output.
Flow 2 matches packets destined for a tunnel, identified by the destination MAC address being the Global Virtual MAC. Similar to flow 1, `OutputToOFPortRegMark` is loaded, and the port number of the tunnel is loaded to `TargetOFPortField`.
Flows 3-5 match packets destined for local Pods, identified by the destination MAC address being that of one of the local Pods. Similar to flow 1, `OutputToOFPortRegMark` is loaded, and the port number of the local Pod is loaded to `TargetOFPortField`.
Flow 6 is the table-miss flow.
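Since this is essentially a “dmac” table, it behaves like a dictionary from destination MAC to OVS port. A Python sketch with the entries taken from the flow dump above:

```python
# Destination MAC -> OVS port (the value loaded into TargetOFPortField).
DMAC_TABLE = {
    "ba:5e:d1:55:aa:c0": 0x2,   # local Antrea gateway
    "aa:bb:cc:dd:ee:ff": 0x1,   # tunnel (Global Virtual MAC)
    "5e:b5:e3:a6:90:b7": 0x24,  # local Pod
    "fa:b7:53:74:21:a6": 0x25,  # local Pod
    "36:48:21:a2:9d:b4": 0x26,  # local Pod
}

def l2_forwarding_calc(dl_dst: str):
    """Return (target port, OutputToOFPortRegMark loaded) or (None, False) on miss."""
    port = DMAC_TABLE.get(dl_dst)
    return (port, port is not None)

assert l2_forwarding_calc("aa:bb:cc:dd:ee:ff") == (0x1, True)
assert l2_forwarding_calc("00:00:00:00:00:01") == (None, False)  # table-miss flow
```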
TrafficControl
This table is dedicated to TrafficControl
.
If you dump the flows of this table, you may see the following:
1. table=TrafficControl, priority=210,reg0=0x200006/0x60000f actions=goto_table:Output
2. table=TrafficControl, priority=200,reg1=0x25 actions=set_field:0x22->reg9,set_field:0x800000/0xc00000->reg4,goto_table:IngressSecurityClassifier
3. table=TrafficControl, priority=200,in_port="web-7975-274540" actions=set_field:0x22->reg9,set_field:0x800000/0xc00000->reg4,goto_table:IngressSecurityClassifier
4. table=TrafficControl, priority=200,reg1=0x26 actions=set_field:0x27->reg9,set_field:0x400000/0xc00000->reg4,goto_table:IngressSecurityClassifier
5. table=TrafficControl, priority=200,in_port="db-755c6-5080e3" actions=set_field:0x27->reg9,set_field:0x400000/0xc00000->reg4,goto_table:IngressSecurityClassifier
6. table=TrafficControl, priority=0 actions=goto_table:IngressSecurityClassifier
Flow 1 matches packets returned from TrafficControl return ports and forwards them to table Output, where the packets are output to the port to which they are destined. To identify such packets, `OutputToOFPortRegMark`, indicating that the packets should be output to an OVS port, and `FromTCReturnRegMark`, loaded in table Classifier and indicating that the packets are from a TrafficControl return port, are used.
Flows 2-3 are installed for the sample TrafficControl redirect-web-to-local to mark the packets associated with the Pods labeled by `app: web`, using `TrafficControlRedirectRegMark`. Flow 2 handles the ingress direction, while flow 3 handles the egress direction. In table Output, these packets will be redirected to the TrafficControl target port specified in `TrafficControlTargetOFPortField`, whose value is loaded in these 2 flows.
Flows 4-5 are installed for the sample TrafficControl mirror-db-to-local to mark the packets associated with the Pods labeled by `app: db`, using `TrafficControlMirrorRegMark`. Similar to flows 2-3, flows 4-5 also handle the two directions. In table Output, these packets will be mirrored (duplicated) to the TrafficControl target port specified in `TrafficControlTargetOFPortField`, whose value is loaded in these 2 flows.
Flow 6 is the table-miss flow.
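The difference between the two actions prepared here and applied later in table Output is simply which ports a packet leaves through. An illustrative Python sketch, using the port numbers from the sample flows (web port 0x25 with target 0x22, db port 0x26 with target 0x27):

```python
def output_ports(original_port: int, target_port: int, action: str):
    """Ports a packet is sent to in table Output for a TrafficControl action."""
    if action == "redirect":   # e.g. the sample redirect-web-to-local
        return [target_port]                  # original delivery is replaced
    if action == "mirror":     # e.g. the sample mirror-db-to-local
        return [original_port, target_port]   # packet is duplicated
    return [original_port]                    # no TrafficControl mark

assert output_ports(0x25, 0x22, "redirect") == [0x22]
assert output_ports(0x26, 0x27, "mirror") == [0x26, 0x27]
```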
IngressSecurityClassifier
This table is to classify packets before they enter the tables for ingress security.
If you dump the flows of this table, you may see the following:
1. table=IngressSecurityClassifier, priority=210,pkt_mark=0x80000000/0x80000000,ct_state=-rpl+trk,ip actions=goto_table:ConntrackCommit
2. table=IngressSecurityClassifier, priority=201,reg4=0x80000/0x80000 actions=goto_table:AntreaPolicyIngressRule
3. table=IngressSecurityClassifier, priority=200,reg0=0x20/0xf0 actions=goto_table:IngressMetric
4. table=IngressSecurityClassifier, priority=200,reg0=0x10/0xf0 actions=goto_table:IngressMetric
5. table=IngressSecurityClassifier, priority=200,reg0=0x40/0xf0 actions=goto_table:IngressMetric
6. table=IngressSecurityClassifier, priority=200,ct_mark=0x40/0x40 actions=goto_table:ConntrackCommit
7. table=IngressSecurityClassifier, priority=0 actions=goto_table:AntreaPolicyIngressRule
Flow 1 matches locally generated request packets for liveness/readiness probes from kubelet, identified by `pkt_mark`, which is set by iptables in the host network namespace. It forwards the packets to table ConntrackCommit directly, bypassing all tables for ingress security.
Flow 2 matches packets destined for NodePort Services and forwards them to table AntreaPolicyIngressRule to enforce Antrea-native NetworkPolicies applied to NodePort Services. Without this flow, if the selected Endpoint is not a local Pod, the packets might be matched by one of the flows 3-5, skipping table AntreaPolicyIngressRule.
Flows 3-5 match packets destined for the local Antrea gateway, the tunnel, or the uplink port, with `ToGatewayRegMark`, `ToTunnelRegMark`, or `ToUplinkRegMark` respectively, and forward them to table IngressMetric directly, bypassing all tables for ingress security.
Flow 6 matches packets from hairpin connections with `HairpinCTMark` and forwards them to table ConntrackCommit directly, bypassing all tables for ingress security. Refer to PR #5687 for more information.
Flow 7 is the table-miss flow.
AntreaPolicyIngressRule
This table is very similar to table AntreaPolicyEgressRule but implements the ingress rules of Antrea-native NetworkPolicies. Depending on the tier to which the policy belongs, the rules will be installed in a table corresponding to that tier. The mapping from ingress rules to tables is as follows:
Antrea-native NetworkPolicy other Tiers -> AntreaPolicyIngressRule
K8s NetworkPolicy -> IngressRule
Antrea-native NetworkPolicy Baseline Tier -> IngressDefaultRule
Again for this table, you will need to keep in mind the Antrea-native NetworkPolicy specification and Antrea-native L7 NetworkPolicy specification that we are using. Since these sample ingress policies reside in the Application Tier, if you dump the flows of this table, you may see the following:
1. table=AntreaPolicyIngressRule, priority=64990,ct_state=-new+est,ip actions=goto_table:IngressMetric
2. table=AntreaPolicyIngressRule, priority=64990,ct_state=-new+rel,ip actions=goto_table:IngressMetric
3. table=AntreaPolicyIngressRule, priority=14500,reg1=0x7 actions=conjunction(14,2/3)
4. table=AntreaPolicyIngressRule, priority=14500,ip,nw_src=10.10.0.26 actions=conjunction(14,1/3)
5. table=AntreaPolicyIngressRule, priority=14500,tcp,tp_dst=8080 actions=conjunction(14,3/3)
6. table=AntreaPolicyIngressRule, priority=14500,conj_id=14,ip actions=set_field:0xd->reg6,ct(commit,table=IngressMetric,zone=65520,exec(set_field:0xd/0xffffffff->ct_label,set_field:0x80/0x80->ct_mark,set_field:0x20000000000000000/0xfff0000000000000000->ct_label))
7. table=AntreaPolicyIngressRule, priority=14600,ip,nw_src=10.10.0.26 actions=conjunction(6,1/3)
8. table=AntreaPolicyIngressRule, priority=14600,reg1=0x25 actions=conjunction(6,2/3)
9. table=AntreaPolicyIngressRule, priority=14600,tcp,tp_dst=80 actions=conjunction(6,3/3)
10. table=AntreaPolicyIngressRule, priority=14600,conj_id=6,ip actions=set_field:0x6->reg6,ct(commit,table=IngressMetric,zone=65520,exec(set_field:0x6/0xffffffff->ct_label))
11. table=AntreaPolicyIngressRule, priority=14600,ip actions=conjunction(4,1/2)
12. table=AntreaPolicyIngressRule, priority=14599,reg1=0x25 actions=conjunction(4,2/2)
13. table=AntreaPolicyIngressRule, priority=14599,conj_id=4 actions=set_field:0x4->reg3,set_field:0x400/0x400->reg0,goto_table:IngressMetric
14. table=AntreaPolicyIngressRule, priority=0 actions=goto_table:IngressRule
Flows 1-2, which are installed by default with the highest priority, match non-new and “tracked” packets and forward them to table IngressMetric to bypass the check from ingress rules. This means that if a connection is established, its packets go straight to table IngressMetric, with no other match required. In particular, this ensures that reply traffic is never dropped because of an Antrea-native NetworkPolicy or K8s NetworkPolicy rule. However, this also means that ongoing connections are not affected if the Antrea-native NetworkPolicy or the K8s NetworkPolicy is updated.
Similar to table AntreaPolicyEgressRule, the priorities of flows 3-13 installed for the ingress rules are decided by the following:
- The `spec.tier` value in an Antrea-native NetworkPolicy determines the primary level for flow priority.
- The `spec.priority` value in an Antrea-native NetworkPolicy determines the secondary level for flow priority within the same `spec.tier`. A lower value in this field corresponds to a higher priority for the flow.
- The rule’s position within an Antrea-native NetworkPolicy also influences flow priority. Rules positioned closer to the beginning have a higher priority for the flow.
Flows 3-6, whose priorities are all 14500, are installed for the ingress rule `AllowFromClientL7` in the sample policy. These flows are described as follows:
- Flow 3 is used to match packets with the output OVS port in set {0x7}, which has all the ports of the Pods selected by the label `app: web`, constituting the second dimension for `conjunction` with `conj_id` 14.
- Flow 4 is used to match packets with the source IP address in set {10.10.0.26}, which has all the IP addresses of the Pods selected by the label `app: client`, constituting the first dimension for `conjunction` with `conj_id` 14.
- Flow 5 is used to match packets with the destination TCP port in set {8080} specified in the rule, constituting the third dimension for `conjunction` with `conj_id` 14.
- Flow 6 is used to match packets meeting all the three dimensions of `conjunction` with `conj_id` 14 and forward them to table IngressMetric, persisting `conj_id` to `IngressRuleCTLabel`, consumed in table IngressMetric. Additionally, for the L7 protocol:
  - `L7NPRedirectCTMark` is persisted, indicating that the packets should be redirected to an application-aware engine to be filtered according to L7 rules, such as method `GET` and path `/api/v2/*` in the sample policy.
  - A VLAN ID allocated for the Antrea-native L7 NetworkPolicy is persisted in `L7NPRuleVlanIDCTLabel`, which will be consumed in table Output.
Flows 7-10, whose priorities are all 14600, are installed for the ingress rule `AllowFromClient` in the sample policy. These flows are described as follows:
- Flow 7 is used to match packets with the source IP address in set {10.10.0.26}, which has all the IP addresses of the Pods selected by the label `app: client`, constituting the first dimension for `conjunction` with `conj_id` 6.
- Flow 8 is used to match packets with the output OVS port in set {0x25}, which has all the ports of the Pods selected by the label `app: web`, constituting the second dimension for `conjunction` with `conj_id` 6.
- Flow 9 is used to match packets with the destination TCP port in set {80} specified in the rule, constituting the third dimension for `conjunction` with `conj_id` 6.
- Flow 10 is used to match packets meeting all the three dimensions of `conjunction` with `conj_id` 6 and forward them to table IngressMetric, persisting `conj_id` to `IngressRuleCTLabel`, consumed in table IngressMetric.
Flows 11-13, whose priorities are all 14599, are installed for the ingress rule with a `Drop` action, defined after the rule `AllowFromClient` in the sample policy, serving as a default rule. Unlike K8s NetworkPolicies, Antrea-native NetworkPolicies have no implicit default rule, and all rules must be defined explicitly. Hence, they are evaluated as-is, and there is no need for a table `AntreaPolicyIngressDefaultRule`. These flows are described as follows:

- Flow 11 is used to match any IP packets, constituting the second dimension for `conjunction` with `conj_id` 4. This flow, which matches all IP packets, exists because at least 2 dimensions are required for a conjunctive match.
- Flow 12 is used to match packets with the output OVS port in set {0x25}, which has all the ports of the Pods selected by the label `app: web`, constituting the first dimension for `conjunction` with `conj_id` 4.
- Flow 13 is used to match packets meeting both dimensions of `conjunction` with `conj_id` 4. It loads `APDenyRegMark`, which will be consumed in table IngressMetric, to which the packets are forwarded.
Flow 14 is the table-miss flow to forward packets not matched by other flows to table IngressMetric.
IngressRule
This table is very similar to table EgressRule but implements ingress rules for K8s NetworkPolicies. Once again, you will need to keep in mind the K8s NetworkPolicy specification that we are using.
If you dump the flows of this table, you should see something like this:
1. table=IngressRule, priority=200,ip,nw_src=10.10.0.26 actions=conjunction(3,1/3)
2. table=IngressRule, priority=200,reg1=0x25 actions=conjunction(3,2/3)
3. table=IngressRule, priority=200,tcp,tp_dst=80 actions=conjunction(3,3/3)
4. table=IngressRule, priority=190,conj_id=3,ip actions=set_field:0x3->reg6,ct(commit,table=IngressMetric,zone=65520,exec(set_field:0x3/0xffffffff->ct_label))
5. table=IngressRule, priority=0 actions=goto_table:IngressDefaultRule
Flows 1-4 are installed for the ingress rule in the sample K8s NetworkPolicy. These flows are described as follows:
- Flow 1 is used to match packets with the source IP address in set {10.10.0.26}, which is from the Pods selected by the label `app: client` in the `default` Namespace, constituting the first dimension for `conjunction` with `conj_id` 3.
- Flow 2 is used to match packets with the output OVS port in set {0x25}, which has all ports of the Pods selected by the label `app: web` in the `default` Namespace, constituting the second dimension for `conjunction` with `conj_id` 3.
- Flow 3 is used to match packets with the destination TCP port in set {80} specified in the rule, constituting the third dimension for `conjunction` with `conj_id` 3.
- Flow 4 is used to match packets meeting all the three dimensions of `conjunction` with `conj_id` 3 and forward them to table IngressMetric, persisting `conj_id` to `IngressRuleCTLabel`.
Flow 5 is the table-miss flow to forward packets not matched by other flows to table IngressDefaultRule.
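The conjunctive match used by flows 1-4 can be modeled in a few lines. This is a toy sketch, not OVS internals: a packet hits `conj_id` 3 only if it matches at least one flow in each of the three dimensions (source IP set, output OVS port set, destination TCP port set), mirroring the sets in the sample flows above.

```python
# Toy model of OVS conjunctive matching for conj_id 3.
# A packet matches the conjunction only if every dimension matches.
DIMENSIONS = {
    1: {"10.10.0.26"},   # source IP addresses (dimension 1/3)
    2: {0x25},           # output OVS ports (dimension 2/3)
    3: {80},             # destination TCP ports (dimension 3/3)
}

def matches_conjunction(src_ip, out_port, dst_port):
    return (src_ip in DIMENSIONS[1]
            and out_port in DIMENSIONS[2]
            and dst_port in DIMENSIONS[3])
```

This is why OVS can express "any of these sources, to any of these ports, on any of these destination ports" with a number of flows linear in the set sizes, instead of enumerating the full cross product.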
IngressDefaultRule
This table is similar in its purpose to table EgressDefaultRule, and it complements table IngressRule for K8s NetworkPolicy ingress rule implementation. In Kubernetes, when a NetworkPolicy is applied to a set of Pods, then the default behavior for ingress connections for these Pods becomes “deny” (they become isolated Pods). This table is in charge of dropping traffic destined for Pods to which a NetworkPolicy (with an ingress rule) is applied, and which did not match any of the “allow” list rules.
If you dump the flows of this table, you may see the following:
1. table=IngressDefaultRule, priority=200,reg1=0x25 actions=drop
2. table=IngressDefaultRule, priority=0 actions=goto_table:IngressMetric
Flow 1, based on our sample K8s NetworkPolicy, is to drop traffic destined for OVS port 0x25, the port number associated with a Pod selected by the label `app: web`.
Flow 2 is the table-miss flow to forward packets to table IngressMetric.
This table is also used to implement Antrea-native NetworkPolicy ingress rules created in the Baseline Tier. Since the Baseline Tier is meant to be enforced after K8s NetworkPolicies, the corresponding flows will be created at a lower priority than K8s NetworkPolicy default drop flows. These flows are similar to flows 3-9 in table AntreaPolicyIngressRule.
IngressMetric
This table is very similar to table EgressMetric, but used to collect ingress metrics for Antrea-native NetworkPolicies.
If you dump the flows of this table, you may see the following:
1. table=IngressMetric, priority=200,ct_state=+new,ct_label=0x3/0xffffffff,ip actions=goto_table:ConntrackCommit
2. table=IngressMetric, priority=200,ct_state=-new,ct_label=0x3/0xffffffff,ip actions=goto_table:ConntrackCommit
3. table=IngressMetric, priority=200,ct_state=+new,ct_label=0x6/0xffffffff,ip actions=goto_table:ConntrackCommit
4. table=IngressMetric, priority=200,ct_state=-new,ct_label=0x6/0xffffffff,ip actions=goto_table:ConntrackCommit
5. table=IngressMetric, priority=200,reg0=0x400/0x400,reg3=0x4 actions=drop
6. table=IngressMetric, priority=0 actions=goto_table:ConntrackCommit
Flows 1-2, matching packets with `IngressRuleCTLabel` set to 3 (the `conj_id` allocated for the sample K8s NetworkPolicy ingress rule and loaded in table IngressRule flow 4), are used to collect metrics for the ingress rule.
Flows 3-4, matching packets with `IngressRuleCTLabel` set to 6 (the `conj_id` allocated for the sample Antrea-native NetworkPolicy ingress rule and loaded in table AntreaPolicyIngressRule flow 10), are used to collect metrics for the ingress rule.
Flow 5 is the drop rule for the sample Antrea-native NetworkPolicy ingress rule. It drops the packets by matching `APDenyRegMark` loaded in table AntreaPolicyIngressRule flow 13 and `APConjIDField` set to 4, which is the `conj_id` allocated for the ingress rule and loaded in table AntreaPolicyIngressRule flow 13.
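Matches like `reg0=0x400/0x400` in flow 5 are value/mask tests: the register ANDed with the mask must equal the value. A minimal sketch of that semantics, using the sample values (the bit assignments for `APDenyRegMark` and `APConjIDField` are taken from the flow dump above):

```python
def reg_matches(reg_value, match_value, mask):
    """Emulate an OVS value/mask register match such as reg0=0x400/0x400."""
    return (reg_value & mask) == match_value

# A packet hitting flow 5: APDenyRegMark (bit 0x400) set in reg0, and
# reg3 (carrying APConjIDField) holding conj_id 4. A match with no
# explicit mask (reg3=0x4) is an exact match, i.e. a full-width mask.
dropped = (reg_matches(0x400, 0x400, 0x400)
           and reg_matches(0x4, 0x4, 0xffffffff))
```

Other bits of `reg0` are ignored by the masked match, which is how many independent flags and fields share a single 32-bit register.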
Flow 6 is the table-miss flow.
ConntrackCommit
This table is in charge of committing non-Service connections in `CtZone`.
If you dump the flows of this table, you may see the following:
1. table=ConntrackCommit, priority=200,ct_state=+new+trk-snat,ct_mark=0/0x10,ip actions=ct(commit,table=Output,zone=65520,exec(move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
2. table=ConntrackCommit, priority=0 actions=goto_table:Output
Flow 1 is designed to match the first packet of non-Service connections with the "tracked" state and `NotServiceCTMark`. Then it commits the relevant connections in `CtZone`, persisting the value of `PktSourceField` to `ConnSourceCTMarkField`, and forwards the packets to table Output.
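The `move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]` action in flow 1 copies a 4-bit field between OVS fields. A sketch of that bit-range copy, using OVS's LSB-0 bit numbering (the helper name and sample values are illustrative only):

```python
def move_bits(src, dst, lo, hi):
    """Copy bits lo..hi (inclusive, LSB 0) from src into dst,
    emulating an OVS move action between same-positioned ranges."""
    width_mask = (1 << (hi - lo + 1)) - 1
    field = (src >> lo) & width_mask
    return (dst & ~(width_mask << lo)) | (field << lo)

# e.g. a PktSourceField value held in reg0 bits 0..3 lands in
# ct_mark bits 0..3 (ConnSourceCTMarkField), leaving other bits intact.
new_ct_mark = move_bits(0x2, 0x0, 0, 3)
```

Because the copy happens inside the `ct(commit, ...)` action, the mark is persisted on the conntrack entry and can be matched on every subsequent packet of the connection.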
Flow 2 is the table-miss flow.
Output
This is the final table in the pipeline, responsible for handling the output of packets from OVS. It addresses the following cases:
- Output packets to an application-aware engine for further L7 protocol processing.
- Output packets to a target port and a mirroring port defined in a TrafficControl CR with the `Mirror` action.
- Output packets to a port defined in a TrafficControl CR with the `Redirect` action.
- Output packets from hairpin connections to the ingress port where they were received.
- Output packets to a target port.
- Output packets to the OpenFlow controller (Antrea Agent).
- Drop packets.
If you dump the flows of this table, you may see the following:
1. table=Output, priority=212,ct_mark=0x80/0x80,reg0=0x200000/0x600000 actions=push_vlan:0x8100,move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[],output:"antrea-l7-tap0"
2. table=Output, priority=211,reg0=0x200000/0x600000,reg4=0x400000/0xc00000 actions=output:NXM_NX_REG1[],output:NXM_NX_REG9[]
3. table=Output, priority=211,reg0=0x200000/0x600000,reg4=0x800000/0xc00000 actions=output:NXM_NX_REG9[]
4. table=Output, priority=210,ct_mark=0x40/0x40 actions=IN_PORT
5. table=Output, priority=200,reg0=0x200000/0x600000 actions=output:NXM_NX_REG1[]
6. table=Output, priority=200,reg0=0x2400000/0xfe600000 actions=meter:256,controller(reason=no_match,id=62373,userdata=01.01)
7. table=Output, priority=200,reg0=0x4400000/0xfe600000 actions=meter:256,controller(reason=no_match,id=62373,userdata=01.02)
8. table=Output, priority=0 actions=drop
Flow 1 is for case 1. It matches packets with `L7NPRedirectCTMark` and `OutputToOFPortRegMark`, and then outputs them to the port `antrea-l7-tap0`, specifically created for connecting to an application-aware engine. Notably, these packets are pushed with an 802.1Q header and loaded with the VLAN ID value persisted in `L7NPRuleVlanIDCTLabel` before being output, due to the implementation of Antrea-native L7 NetworkPolicy. The application-aware engine enforcing L7 policies (e.g., Suricata) can leverage the VLAN ID to determine which set of rules to apply to the packet.
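Flow 1's `move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[]` action extracts a 12-bit VLAN ID from the 128-bit conntrack label. A sketch of that extraction, assuming `L7NPRuleVlanIDCTLabel` occupies bits 64..75 as the flow's bit range indicates (the helper and sample label value are illustrative):

```python
def vlan_id_from_ct_label(ct_label):
    """Extract bits 64..75 of a 128-bit ct_label as a 12-bit VLAN ID,
    mirroring move:NXM_NX_CT_LABEL[64..75]->OXM_OF_VLAN_VID[]."""
    return (ct_label >> 64) & 0xfff

# A label with VLAN ID 5 stored in the upper half of ct_label.
label = 0x5 << 64
vid = vlan_id_from_ct_label(label)
```

Since a VLAN VID is 12 bits wide, this scheme can distinguish up to 4094 L7 rule contexts for the application-aware engine.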
Flow 2 is for case 2. It matches packets with `TrafficControlMirrorRegMark` and `OutputToOFPortRegMark`, and then outputs them to the port specified in `TargetOFPortField` and the port specified in `TrafficControlTargetOFPortField`. Unlike the `Redirect` action, the `Mirror` action creates an additional copy of the packet.
Flow 3 is for case 3. It matches packets with `TrafficControlRedirectRegMark` and `OutputToOFPortRegMark`, and then outputs them to the port specified in `TrafficControlTargetOFPortField`.
Flow 4 is for case 4. It matches packets from hairpin connections by matching `HairpinCTMark` and outputs them back to the port where they were received.
Flow 5 is for case 5. It matches packets by matching `OutputToOFPortRegMark` and outputs them to the OVS port specified by the value stored in `TargetOFPortField`.
Flows 6-7 are for case 6. They match packets by matching `OutputToControllerRegMark` and the value stored in `PacketInOperationField`, then output them to the OpenFlow controller (Antrea Agent) with the corresponding user data. In practice, you will see additional flows similar to these ones to accommodate different scenarios (different `PacketInOperationField` values). Note that packets sent to the controller are metered to avoid overrunning the antrea-agent and using too many resources.
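The combined value/mask matches in flows 6-7 pack both fields into one `reg0` test. Assuming the bit layout implied by the masks above (`OutputToControllerRegMark` in bits 21..22, `PacketInOperationField` in bits 25..31 — an inference from `0xfe600000`, not a statement of Antrea's authoritative register map), a decoder sketch:

```python
def decode_controller_match(reg0_value):
    """Split a reg0 value like 0x2400000 into its assumed fields:
    OutputToControllerRegMark (bits 21..22, value 0b10) and
    PacketInOperationField (bits 25..31)."""
    to_controller = ((reg0_value >> 21) & 0x3) == 0x2
    operation = (reg0_value >> 25) & 0x7f
    return to_controller, operation

# flow 6: reg0=0x2400000 -> operation 1 (userdata=01.01)
# flow 7: reg0=0x4400000 -> operation 2 (userdata=01.02)
```

This is how two different packet-in operations can be distinguished while sharing the single `OutputToControllerRegMark` flag.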
Flow 8 is the table-miss flow for case 7. It drops packets that do not match any of the flows in this table.