Optimize the pods recovery efficiency when edge nodes restart

1. Requirement Analysis

OpenYurt extends cloud-native capabilities to edge computing and IoT scenarios. The cloud nodes deploy services to the edge nodes and manage the whole life cycle of the applications. However, the network connection between edge nodes and the cloud is often weak.

Edge nodes may also restart frequently, and each restart involves restarting the OS, the Kubernetes components, and the OpenYurt components. Recovering the service applications currently takes almost 1 minute. To recover edge nodes faster, we should optimize pod recovery efficiency when edge nodes restart.

The specific requirements of the optimization are summarized as follows:

The restart process is not blocked, and all components restart successfully
All pods restart successfully and the applications reach a stable state
Improve restart efficiency so that the total recovery time is less than 30s
The solution works on most hardware

2. Problem Analysis

Kubelet is the node agent on a Kubernetes worker node. It is responsible for creating and managing Pods on that node. In an OpenYurt cluster, the OpenYurt components themselves run as pods in the Kubernetes cluster. When the worker node restarts, Kubelet recovers both the OpenYurt component pods and the service pods. Therefore, the analysis focuses on measuring the time delay of the Kubelet operations.

The operations of Kubelet when the worker node restarts are shown in Figure X. After the node restarts, Kubelet initializes first. Then Kubelet starts the static pods, for instance YurtHub. After that, YurtHub starts up and loads its local cache. When the network connection is weak, Kubelet lists and watches the pods from the YurtHub cache. Lastly, Kubelet recovers the other pods according to the cache.

According to the analysis above, YurtHub is a key component in the worker node restart. YurtHub recovery takes a period of time. Moreover, Kubelet recovers the service pods according to the local cache, and this process depends on the CacheManager and StorageManager of YurtHub.

Figure X shows the YurtHub initialization process. According to the YurtHub source code, the start-up process is serialized: the components start one by one. The time delay of each step can be measured by adding log statements.
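One way to measure the delay of each serialized start-up stage is to time every stage and log the result. The sketch below is written in Python for brevity (YurtHub itself is a Go program). The stage names and the `run_serialized_startup` helper are hypothetical and are not taken from the YurtHub source.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("startup-timing")


def run_serialized_startup(stages):
    """Run the stages one by one and return (name, seconds) for each stage."""
    durations = []
    for name, start_fn in stages:
        begin = time.monotonic()
        start_fn()  # the next stage cannot begin until this one returns
        elapsed = time.monotonic() - begin
        durations.append((name, elapsed))
        log.info("stage %s took %.3fs", name, elapsed)
    return durations


if __name__ == "__main__":
    # Hypothetical stages; time.sleep stands in for real initialization work.
    stages = [
        ("load-local-cache", lambda: time.sleep(0.02)),
        ("start-proxy-server", lambda: time.sleep(0.01)),
    ]
    total = sum(seconds for _, seconds in run_serialized_startup(stages))
    log.info("total start-up time: %.3fs", total)
```

Because the stages run strictly in sequence, the total start-up time is the sum of the per-stage durations, so the slowest stages are the first candidates for optimization.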

In summary, the optimization focuses on the YurtHub start-up and on how Kubelet recovers the service pods from the YurtHub local cache.

3. Experiments

3.0 Experiment Environment

Test Environment:

1 master node, 1 worker node

Alibaba ECS 4C8G
OS: CentOS 8
K8s Version: v1.19.4

Instructions:

*The 10 service pods are created from the latest nginx image.

*The tables record the elapsed time since the `reboot` command was entered, in ms.

*The image pull policy is IfNotPresent.

*First Service Pod Recovery is the time at which the first nginx pod recovers.

*Last Service Pod Recovery is the time at which the last nginx pod recovers.

3.1 Native Kubernetes Start-up Time Test

3.1.1 Static pods and 10 service pods

Test Index	`reboot` command entered	Kubelet Start	Static Pod (YurtHub) Start	First Service Pod Recovery	Last Service Pod Recovery
1	0	20000	29021	35627	36401
2	0	21000	30922	37712	38521
3	0	24000*	34010	40109	40821
Avg	0	21667	31318	37816	38581
Diff	-	21667	9651	6498	765

*The system restart time is not stable; it ranges from 19s to 24s.
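The Avg and Diff rows of the table in 3.1.1 can be reproduced from the three runs. The following is a small sketch of that arithmetic; the function names are illustrative only.

```python
# Raw measurements from the table in 3.1.1, in ms since `reboot` was entered.
# Columns: Kubelet start, static pod start, first service pod, last service pod.
runs = [
    [20000, 29021, 35627, 36401],
    [21000, 30922, 37712, 38521],
    [24000, 34010, 40109, 40821],
]


def column_averages(rows):
    """Average each column over all runs, rounded to the nearest ms."""
    return [round(sum(column) / len(column)) for column in zip(*rows)]


def stage_durations(timestamps):
    """Turn cumulative timestamps into per-stage durations (the Diff row)."""
    previous = [0] + timestamps[:-1]
    return [current - before for current, before in zip(timestamps, previous)]


avg = column_averages(runs)
diff = stage_durations(avg)
print("Avg ", avg)   # Avg  [21667, 31318, 37816, 38581]
print("Diff", diff)  # Diff [21667, 9651, 6498, 765]
```

Each Diff value is the time spent in one stage, which is what the period comparison in the later section is built on.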

3.2 OpenYurt Start-up Time Test

3.2.1 Edge Network Connected, 10 service pods

Test Index	`reboot` command entered	Kubelet Start	Static Pod (YurtHub) Start	First Service Pod Recovery	Last Service Pod Recovery
1	0	26000	35012	47030	47890
2	0	25000	34921	48041	48892
3	0	25000	34829	47843	48602
Avg	0	25333	34921	47638	48461
Diff	-	25333	9588	12717	823

3.2.2 Edge Network Disconnected, 10 service pods, OpenYurt Edge

Test Index	`reboot` command entered	Kubelet Start	Static Pod (YurtHub) Start	First Service Pod Recovery	Last Service Pod Recovery
1	0	26000	35219	48203	49260
2	0	25000	35621	48721	49422
3	0	27000	37015	50126	50921
Avg	0	26000	35952	49016	49867
Diff	-	26000	9951	13064	852

*According to the experiments, the time between Kubelet start and First Service Pod Recovery varies between runs.

4. Time Delay Comparison between OpenYurt and Native Kubernetes

In this section, the time delays of the OpenYurt cluster (network connected and disconnected) and native Kubernetes are compared in the table below.

Four time periods are defined for comparison.

Period 1: From the beginning of the OS restart to Kubelet start.

Period 2: From Kubelet start to static pod (YurtHub) start.

Period 3: From static pod (YurtHub) start to first service pod recovery.

Period 4: From first service pod recovery to last service pod recovery.

Whole Recovery Period: From the beginning of the OS restart to last service pod recovery.

	Native Kubernetes	OpenYurt Cluster (Network Connected)	OpenYurt Cluster (Network Disconnected)
Period 1	21.67s	25.33s	26.00s
Period 2	9.65s	9.58s	9.95s
Period 3	6.50s	12.71s	13.06s
Period 4	0.76s	0.82s	0.85s
Whole Recovery Period	38.58s	48.46s	49.86s

According to the comparison table, the OpenYurt cluster takes about 3.7 to 4.3s longer than native Kubernetes from the beginning of the OS restart to Kubelet start (Period 1), although the OS restart time itself varies between runs. From Kubelet start to static pod (YurtHub) start (Period 2), the OpenYurt cluster takes 9.6 to 10s, which is about the same as the native Kubernetes cluster (9.65s). In Period 3, the OpenYurt cluster takes 6.2 to 6.6s longer than the native Kubernetes cluster.

Overall, when an OpenYurt edge node restarts, pod recovery takes 9.88s (network connected) to 11.28s (network disconnected) longer than on a native Kubernetes node.
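The per-period overhead of OpenYurt relative to native Kubernetes follows directly from the comparison table. A short sketch of the subtraction, using the table values as they are printed:

```python
# Period durations in seconds, copied from the comparison table.
periods = ["Period 1", "Period 2", "Period 3", "Period 4", "Whole"]
native = [21.67, 9.65, 6.50, 0.76, 38.58]
connected = [25.33, 9.58, 12.71, 0.82, 48.46]
disconnected = [26.00, 9.95, 13.06, 0.85, 49.86]


def overhead(openyurt, baseline):
    """Extra seconds spent by OpenYurt in each period, rounded to 10 ms."""
    return [round(o - b, 2) for o, b in zip(openyurt, baseline)]


rows = zip(periods, overhead(connected, native), overhead(disconnected, native))
for name, conn, disc in rows:
    print(f"{name}: connected {conn:+.2f}s, disconnected {disc:+.2f}s")
```

Period 3 accounts for most of the difference (6.21s connected, 6.56s disconnected), followed by Period 1; Periods 2 and 4 are nearly identical across the three setups.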

5. Detailed Analysis and Optimization

Based on detailed tests on the OpenYurt edge node, this section analyzes two periods: from Kubelet start to YurtHub start, and from the YurtHub server starting to work to first service pod recovery.

5.1 Detailed analysis of the period from Kubelet start to static pod (YurtHub) Start

The period from Kubelet start to the start of the first static pod (YurtHub) takes 9.5 to 10s in all three experiment scenarios. The invocation chain of the Kubelet start process is summarized below.

main                                                                             // cmd/kubelet/kubelet.go
 |--NewKubeletCommand                                                            // cmd/kubelet/app/server.go
   |--Run                                                                        // cmd/kubelet/app/server.go
      |--initForOS                                                               // cmd/kubelet/app/server.go
      |--run                                                                     // cmd/kubelet/app/server.go
        |--initConfigz                                                           // cmd/kubelet/app/server.go
        |--BuildAuth
        |--cm.NodeAllocatableRoot
        |--cadvisor.NewImageFsInfoProvider
        |--NewContainerManager
        |--ApplyOOMScoreAdj
        |--PreInitRuntimeService
        |--RunKubelet                                                            // cmd/kubelet/app/server.go
        | |--k = createAndInitKubelet                                            // cmd/kubelet/app/server.go
        | |  |--NewMainKubelet
        | |  |  |--watch k8s Service
        | |  |  |--watch k8s Node
        | |  |  |--klet := &Kubelet{}
        | |  |  |--init klet fields
        | |  |
        | |  |--k.BirthCry()
        | |  |--k.StartGarbageCollection()
        | |
        | |--startKubelet(k)                                                     // cmd/kubelet/app/server.go
        |    |--go k.Run()                                                       // -> pkg/kubelet/kubelet.go
        |    |  |--go cloudResourceSyncManager.Run()
        |    |  |--initializeModules
        |    |  |--go volumeManager.Run()
        |    |  |--go nodeLeaseController.Run()
        |    |  |--initNetworkUtil() // setup iptables
        |    |  |--go Until(PerformPodKillingWork, 1*time.Second, neverStop)
        |    |  |--statusManager.Start()
        |    |  |--runtimeClassManager.Start
        |    |  |--pleg.Start()
        |    |  |--syncLoop(updates, kl)                                         // pkg/kubelet/kubelet.go
        |    |
        |    |--k.ListenAndServe
        |
        |--go http.ListenAndServe(healthz)

According to the Kubelet logs, it takes almost 2.5s from systemd starting Kubelet to the Run() function. After that, it takes 0.5s from Run() to watching the k8s Service in NewMainKubelet(). From watching the API server to the completion of startKubelet(), it takes 6.5s.

The total time is stable at 9.5s to 10s across the native Kubernetes cluster and the OpenYurt cluster, both network connected and disconnected.
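Gaps like these can be read from the Kubelet journal by parsing the klog header of each line (severity letter, `mmdd`, `hh:mm:ss.uuuuuu`, thread id, `file:line]`). The helper below is a minimal sketch; the function names are illustrative, and because klog does not record the year, the `year` default is an arbitrary placeholder that only matters when differences span a year boundary.

```python
import re
from datetime import datetime

# klog header, e.g. "E0826 16:04:28.209598    1193 pod_workers.go:191]"
KLOG_HEADER = re.compile(
    r"[IWEF](?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ [^\s:]+:\d+\]"
)


def klog_time(line, year=2000):
    """Return the klog timestamp of a log line, or None without a klog header."""
    match = KLOG_HEADER.search(line)
    if match is None:
        return None
    stamp = f"{year}{match.group('mmdd')} {match.group('time')}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def seconds_between(first_line, second_line):
    """Elapsed seconds between two log lines that both carry a klog header."""
    return (klog_time(second_line) - klog_time(first_line)).total_seconds()


if __name__ == "__main__":
    # The second line is a made-up example, not taken from a real log.
    first = "kubelet[1193]: E0826 16:04:28.209598    1193 pod_workers.go:191] Error syncing pod"
    second = "kubelet[1193]: I0826 16:04:30.709598    1193 example.go:1] hypothetical later event"
    print(f"{seconds_between(first, second):.3f}s")
```

Picking the first log line emitted by each function in the invocation chain and differencing the timestamps gives the per-step durations.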

5.2 Detailed analysis of the period from YurtHub server work to first service pod recovery

After Kubelet starts to work (startKubelet() has finished and Kubelet listens on port 10250), the static pods and service pods are started one by one. The syncLoop invocation chain through which Kubelet starts pods is summarized below.

kubelet.syncLoop    // pkg/kubelet/kubelet.go
|--kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh)
    |--u, open := <-configCh
    |--handler.HandlePodAdditions(u.Pods) // Kubelet.HandlePodAdditions
        |--sort.Sort(sliceutils.PodsByCreationTime(pods))
        |--kl.handleMirrorPod(pod, start)
            |--kl.dispatchWork
        |--kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
            |--kl.podWorkers.UpdatePod // podWorkers.UpdatePod    pkg/kubelet/pod_workers.go
                |--p.managePodLoop
                    |--p.syncPodFn

In the Kubelet syncLoop() function, syncTicker fires every 1s to check which pod workers need to sync. housekeepingTicker fires every 2s to check whether any pods need to be cleaned up. plegCh delivers pod lifecycle events.

In the HandlePodAdditions(u.Pods) function, the pods are sorted by creation timestamp. The static pods are handled first, and the other service pods are filtered (to check whether a pod is terminated and whether it can be admitted). Then kl.dispatchWork() is executed. kl.podWorkers.UpdatePod invokes p.managePodLoop concurrently, one goroutine per pod, and p.syncPodFn is executed to set up the pod.
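The ordering and filtering described above can be illustrated with a simplified model. This is not the Kubelet implementation (which is Go and handles mirror pods, rejection status, and more); the `Pod` type, the `can_admit` callback, and the pod names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    created: float  # creation timestamp, seconds since epoch
    terminated: bool = False


def handle_pod_additions(pods, dispatch, can_admit=lambda pod: True):
    """Dispatch pods oldest-first; skip terminated or non-admitted pods."""
    for pod in sorted(pods, key=lambda p: p.created):
        if pod.terminated:
            continue  # nothing to start for a pod that has already finished
        if not can_admit(pod):
            continue  # simplified: a rejected pod is just skipped here
        dispatch(pod)


if __name__ == "__main__":
    started = []
    pods = [
        Pod("nginx-02", created=30.0),
        Pod("yurt-hub", created=10.0),
        Pod("finished-job", created=5.0, terminated=True),
        Pod("nginx-01", created=20.0),
    ]
    handle_pod_additions(pods, started.append)
    print([pod.name for pod in started])
```

In this model the dispatch order depends only on the creation timestamp, so an older pod such as YurtHub is dispatched before the service pods created after it.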

The execution steps of the Kubelet.syncPod method are summarized as follows:

Step 1: Record pod worker start latency if the pod is being created
Step 2: Create the v1.PodStatus object
Step 3: Generate the PodStatus
Step 4: Run the admission handlers and check whether the pod is allowed to run (including privileged mode)
Step 5: Create the cgroups of the pod
Step 6: Make the data directories of the pod
Step 7: Wait for the volumes to be attached and mounted
Step 8: Fetch the ImagePullSecrets
Step 9: Call the container runtime kl.containerRuntime.SyncPod(pod, podStatus, pullSecrets, kl.backOff)

The steps of kubeGenericRuntimeManager.SyncPod are summarized as follows:

Step 1: Compute sandbox and container changes.
Step 2: Kill the pod if the sandbox has changed.
Step 3: Kill any running containers in this pod that should not be kept.
Step 4: Create a sandbox for the pod if necessary.
Step 5: Start ephemeral containers.
Step 6: Start the init container.
Step 7: Start containers in podContainerChanges.ContainersToStart.

Step 4 and Step 7 are discussed in detail below.

The detailed steps of creating the sandbox for the pod, RunPodSandbox(), are as follows:

Step 1: Pull the image for the sandbox.
Step 2: Create the sandbox container.
Step 3: Create Sandbox Checkpoint.
Step 4: Start the sandbox container.
Step 5: Setup networking for the sandbox.

The detailed steps of starting a container, startContainer(), are as follows:

Step 1: Pull the image.
Step 2: Create the container.
Step 3: Start the container.
Step 4: Execute the post start hook.

Following the pod start procedure above, when the edge node restarts and Kubelet has initialized and started, YurtHub starts to work first. Because YurtHub uses the host network, it can start before the CNI plugin is ready. It takes about 1s from Kubelet starting to work to YurtHub start, and another 1.5s until the YurtHub server is working. Once the YurtHub server is working, it plays the role of the API server under weak network conditions.

The recovery of the nginx pods is blocked in createSandBox because they rely on CNI, and flannel, the CNI plugin, is not ready yet.

Aug 26 16:04:28 openyurt-node-02 kubelet[1193]: E0826 16:04:28.209598    1193 pod_workers.go:191] Error syncing pod 464fc7d4-2a53-4a20-abc3-c51a919f1b1a ("nginx-06-78df84cfc7-b8fc2_default(464fc7d4-2a53-4a20-abc3-c51a919f1b1a)"), skipping: failed to "CreatePodSandbox" for "nginx-06-78df84cfc7-b8fc2_default(464fc7d4-2a53-4a20-abc3-c51a919f1b1a)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nginx-06-78df84cfc7-b8fc2_default(464fc7d4-2a53-4a20-abc3-c51a919f1b1a)\" failed: rpc error: code = Unknown desc = failed to set up sandbox container \"ec15044992d3d0df0185a41d00adaca0fa7895f8ac717399b00f24a68ae3fa3e\" network for pod \"nginx-06-78df84cfc7-b8fc2\": networkPlugin cni failed to set up pod \"nginx-06-78df84cfc7-b8fc2_default\" network: open /run/flannel/subnet.env: no such file or directory"
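The error in the log above comes from the network setup step of sandbox creation. The following simplified model (not Kubelet or CNI code; `create_pod_sandbox` and `SandboxError` are hypothetical names) shows why a host-network pod such as YurtHub can start while a CNI-backed pod is blocked until flannel has written `/run/flannel/subnet.env`.

```python
import os
import tempfile


class SandboxError(Exception):
    """Raised when the sandbox network cannot be set up."""


def create_pod_sandbox(pod_name, host_network, subnet_env="/run/flannel/subnet.env"):
    """Model only the network setup step of sandbox creation.

    Host-network pods skip CNI setup. Other pods need the subnet file that
    flannel writes once it is running.
    """
    if host_network:
        return f"{pod_name}: sandbox ready (host network)"
    if not os.path.exists(subnet_env):
        raise SandboxError(f"{pod_name}: open {subnet_env}: no such file or directory")
    return f"{pod_name}: sandbox ready (CNI)"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        subnet_env = os.path.join(run_dir, "subnet.env")  # stand-in for the real path
        print(create_pod_sandbox("yurt-hub", True, subnet_env))
        try:
            create_pod_sandbox("nginx-06", False, subnet_env)
        except SandboxError as err:
            print("blocked:", err)
        open(subnet_env, "w").close()  # flannel becomes ready
        print(create_pod_sandbox("nginx-06", False, subnet_env))
```

In the real system Kubelet retries the failed sandbox creation, so the service pods recover only after flannel itself has started.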

The time between the YurtHub server starting to work and flannel starting is 9s, whereas in the native Kubernetes cluster it is only 3s. The OpenYurt cluster therefore takes 6s longer than the native cluster in this period. According to the Kubelet logs, these 6s are spent blocked in the following procedure.

The likely cause is that the first attempt to start the flannel-cni plugin, 3s after the YurtHub server starts working, fails. The third attempt (after 9s) succeeds.

Aug 28 23:57:12 openyurt-node-02 kubelet[1185]: E0828 23:57:12.447073    1185 kuberuntime_manager.go:815] init container &Container{Name:install-cni-plugin,Image:docker.io/rancher/mirrored-flannelcni-flannel-cni-plugin:v1.1.0,Command:[cp],Args:[-f /flannel /opt/cni/bin/flannel],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:cni-plugin,ReadOnly:false,MountPath:/opt/cni/bin,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:flannel-token-2m92x,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod kube-flannel-ds-dskwd_kube-flannel(9f07d412-f6f4-40f6-a6b2-01d013004dd1): CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars
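The effect of the failed attempts on the total delay can be sketched with a small retry model. It assumes evenly spaced attempts every 3s, which the 3s and 9s observations suggest but the logs shown here do not confirm. The moment at which the precondition is met (`ready_at`, here the time at which Kubelet has read the services at least once) is a hypothetical input, not a measured value.

```python
def first_successful_attempt(ready_at, interval=3.0, max_attempts=10):
    """Return (attempt number, time) of the first attempt at or after ready_at.

    Attempts happen at interval, 2 * interval, ... seconds after the YurtHub
    server starts working. An attempt succeeds only if the precondition is
    already met at that moment. Returns None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        attempt_time = attempt * interval
        if attempt_time >= ready_at:
            return attempt, attempt_time
    return None


if __name__ == "__main__":
    # Hypothetical: the precondition is met 7s after the YurtHub server works.
    print(first_successful_attempt(ready_at=7.0))  # (3, 9.0)
    # If it were met before the first attempt, flannel would start at 3s.
    print(first_successful_attempt(ready_at=2.0))  # (1, 3.0)
```

Under these assumptions the delay is quantized by the retry interval: missing the first attempt costs at least one full interval, so making the precondition true earlier would remove whole 3s steps from Period 3.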

According to the experiment, flannel takes 2.5s to start. After that, the nginx pods restart within 1s. The whole period takes about 12 to 13s in the network-disconnected OpenYurt cluster.

6. Optimization Strategy

Switch the networking of the service pods on edge nodes from CNI to host network.
- This saves 8-9s because host-network service pods do not wait for the CNI plugin (flannel) to be ready.
Set the image pull policy to IfNotPresent.
- The images have already been pulled to the edge node. With IfNotPresent, the container runtime uses the local image and does not pull from the registry again. This may save 10s-20s.
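Both settings map to standard Pod spec fields: `hostNetwork` and the per-container `imagePullPolicy`. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The pod name and the `edge_pod_manifest` helper are illustrative only.

```python
import json


def edge_pod_manifest(name, image):
    """Build a Pod manifest that uses host networking and locally cached images."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Share the node's network namespace, so no CNI setup is needed.
            "hostNetwork": True,
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Use the local image if present instead of pulling again.
                    "imagePullPolicy": "IfNotPresent",
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(edge_pod_manifest("nginx-01", "nginx:latest"), indent=2))
```

Two points are worth keeping in mind. Host-network pods share the node's port space, so several replicas listening on the same port cannot run on one node. For images tagged `:latest`, Kubernetes defaults the pull policy to Always when the field is omitted, so it has to be set explicitly.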