PersistentPodState
FEATURE STATE: Kruise v1.2.0
As cloud native adoption grows, more and more companies deploy stateful services (e.g., Etcd, MQ) on Kubernetes. The K8S StatefulSet is a workload for managing stateful services, and it accounts for many of their deployment characteristics. However, StatefulSet persists only limited Pod state, such as ordered, stable Pod names and PVCs, and does not cover other state, e.g. Pod IP retention or priority scheduling to previously deployed Nodes. Typical cases:
Service discovery middleware is exceptionally sensitive to the Pod IP after deployment and requires that the IP does not change.
Database services persist data to the host disk, so moving a Pod to a different Node results in data loss.
To address these cases, Kruise defines the PersistentPodState CRD, which can persist other Pod state, such as “IP Retention”.
For detailed design, please refer to: PPS Proposal.
Usage
Annotation Auto Generate PersistentPodState
apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet
metadata:
  annotations:
    # auto-generate PersistentPodState
    kruise.io/auto-generate-persistent-pod-state: "true"
    # preferred node affinity: a rebuilt Pod is preferentially scheduled to the same node
    kruise.io/preferred-persistent-topology: kubernetes.io/hostname[,other node labels]
    # required node affinity: a rebuilt Pod is forced onto a node in the same zone
    kruise.io/required-persistent-topology: failure-domain.beta.kubernetes.io/zone[,other node labels]
Annotations can generate the common PersistentPodState configurations, which cover most scenarios. For more complex scenarios, define a PersistentPodState resource directly.
Define PersistentPodState CRD
apiVersion: apps.kruise.io/v1alpha1
kind: PersistentPodState
metadata:
  name: echoserver
  namespace: echoserver
spec:
  targetRef:
    # native k8s or kruise StatefulSet
    # only StatefulSet is supported
    apiVersion: apps.kruise.io/v1beta1
    kind: StatefulSet
    name: echoserver
  # required node affinity: a rebuilt Pod is forced onto a node in the same zone
  requiredPersistentTopology:
    nodeTopologyKeys:
    - failure-domain.beta.kubernetes.io/zone
    # - other node label keys
  # preferred node affinity: a rebuilt Pod is preferentially scheduled to the same node
  preferredPersistentTopology:
  - preference:
      nodeTopologyKeys:
      - kubernetes.io/hostname
      # - other node label keys
    # integer in the range [1, 100]
    weight: 100
IP Retention Practice
“IP Retention” is a common requirement when deploying stateful services on K8S. It does not mean “specifying the Pod IP”; it means that the Pod IP does not change after the first deployment, whether the Pod is rebuilt by a service release or by a machine eviction. To achieve this, the K8S network component must support Pod IP retention and keep the IP unchanged as far as possible. For this article, we modified the host-local plugin used by the flannel network component so that a Pod keeps its IP when it is rebuilt on the same Node. The underlying principles are not covered here; please refer to the code: host-local.
If IP retention is provided by the network component, how is it related to PersistentPodState? The network component's implementation of “Pod IP unchanged” has limitations. For example, flannel can keep the Pod IP unchanged only on the same Node. However, K8S scheduling is inherently non-deterministic, so “how to ensure that a rebuilt Pod is scheduled to the same Node” is the problem that PersistentPodState solves.
1. Deploy stateful service echoserver, declaring “IP Retention” via annotations, as follows:
apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet
metadata:
  name: echoserver
  labels:
    app: echoserver
  annotations:
    kruise.io/auto-generate-persistent-pod-state: "true"
    kruise.io/preferred-persistent-topology: kubernetes.io/hostname
spec:
  serviceName: echoserver
  replicas: 2
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
      annotations:
        # Tell the flannel network component to keep the IP unchanged when the Pod is rebuilt.
        # "10" is the maximum time, in minutes, allowed between the Pod's deletion and its next successful scheduling.
        # The limit mainly accounts for scenarios such as deletion and scale-in.
        io.kubernetes.cri/reserved-ip-duration: "10"
    spec:
      terminationGracePeriodSeconds: 5
      containers:
      - name: echoserver
        image: cilium/echoserver:latest
        imagePullPolicy: IfNotPresent
2. Based on such a configuration, Kruise automatically generates a PersistentPodState and records the node on which each Pod was first deployed in PersistentPodState.Status. The following example is for a StatefulSet named configserver:
apiVersion: apps.kruise.io/v1alpha1
kind: PersistentPodState
metadata:
  name: configserver
  namespace: configserver
spec:
  targetRef:
    apiVersion: apps.kruise.io/v1beta1
    kind: StatefulSet
    name: configserver
  preferredPersistentTopology:
  - preference:
      nodeTopologyKeys:
      - kubernetes.io/hostname
    weight: 100
status:
  podStates:
    # configserver-0 is deployed on node worker2 and configserver-1 on node worker1
    configserver-0:
      nodeName: worker2
      nodeTopologyLabels:
        kubernetes.io/hostname: worker2
    configserver-1:
      nodeName: worker1
      nodeTopologyLabels:
        kubernetes.io/hostname: worker1
3. After a Pod is rebuilt due to a service release, Node eviction, etc., Kruise injects the recorded node information into the Pod's NodeAffinity, which in turn allows the Pod IP to remain unchanged, as follows:
apiVersion: v1
kind: Pod
metadata:
  name: configserver-0
  namespace: configserver
  annotations:
    # annotation values must be strings, so the number is quoted
    io.kubernetes.cri/reserved-ip-duration: "10"
spec:
  # injected by the kruise webhook
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker2
        weight: 100
  containers:
  ...
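The Pod above shows the result for a preferred topology. For a StatefulSet that declares kruise.io/required-persistent-topology, the injected rule would presumably be a hard node affinity instead. The fragment below is a minimal sketch of that case, assuming the webhook maps a required topology key to requiredDuringSchedulingIgnoredDuringExecution in the same way it maps a preferred key to preferredDuringSchedulingIgnoredDuringExecution. The zone value zone-a is hypothetical, and this is an illustration rather than captured output.

```yaml
# Fragment of a rebuilt Pod's spec (sketch under the assumptions stated above)
spec:
  affinity:
    nodeAffinity:
      # a hard constraint: the scheduler places the Pod only on matching nodes
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - zone-a  # hypothetical zone label value recorded for this Pod
```

Because a required rule leaves the Pod Pending when no node matches, a zone-level key is usually a safer choice for required topology than kubernetes.io/hostname, which would pin the Pod to a single machine.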