Optimize the pods recovery efficiency when edge nodes restart

1. Requirement Analysis

OpenYurt extends cloud native capabilities to edge computing and IoT scenarios. The cloud nodes make it possible to deploy services on the edge nodes and to manage the whole life cycle of the applications. However, the network connection of edge nodes is weak and unstable.

Edge nodes also restart frequently, and each restart involves restarting the OS, the Kubernetes components, and the OpenYurt components. Recovering the service applications takes almost 1 minute. To recover edge nodes faster, we should optimize the pod recovery efficiency when edge nodes restart.

The specific requirements of the optimization are summarized as follows:

  • The restart process is not blocked, and all components restart successfully
  • All pods restart successfully and the applications reach a stable state
  • The restart is faster, with a total recovery time of less than 30s
  • The solution works on most hardware

2. Problem Analysis

Kubelet is the controller of the Kubernetes work node. It is responsible for creating and managing Pods on the work node. In an OpenYurt cluster, the OpenYurt components are pods deployed on the Kubernetes cluster. When the work node restarts, Kubelet recovers the OpenYurt component pods and the service pods. Therefore, measuring the time delay of the Kubelet operations should be the focus.

[Figure 1: Kubelet operations when the work node restarts]

The operations of Kubelet when the work node restarts are shown in Figure 1. After the work node restarts, Kubelet initializes first. Then Kubelet starts the static pods, for instance YurtHub. After that, YurtHub starts up and loads the local cache. Under weak network conditions, Kubelet lists/watches the pods from the YurtHub cache. Lastly, Kubelet recovers the other pods according to the cache.

According to the analysis above, YurtHub is an important component in the work node restart. YurtHub recovery takes a period of time. Moreover, Kubelet recovers the service pods according to the local cache, a process that depends on the CacheManager and StorageManager of YurtHub.

[Figure 2: YurtHub initialization process]

Figure 2 shows the YurtHub initialization process. According to the YurtHub source code, the start-up process is serialized: each component starts one by one. The time delay of each step can be measured by adding logs.

From the above, the optimization focuses on YurtHub start-up and on how Kubelet recovers the service pods from the YurtHub local cache.

3. Experiments

3.0 Experiment Environment

Test Environment:

1 master node, 1 work node

  • Alibaba ECS 4C8G
  • OS: CentOS 8
  • K8s Version: v1.19.4

Instructions:

*The 10 service pods are created from the nginx:latest image.

*The tables record the elapsed time since the reboot command was entered, in milliseconds (ms).

*The image pull policy is IfNotPresent.

*First Service Pod Recovery means the time point when the first nginx pod recovers.

*Last Service Pod Recovery means the time point when the last nginx pod recovers.

3.1 Native Kubernetes Start-up Time Test

3.1.1 Static pods and 10 service pods

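| Test Index | Reboot Command Entered | Kubelet Start | Static Pod (YurtHub) Start | First Service Pod Recovery | Last Service Pod Recovery |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 20000 | 29021 | 35627 | 36401 |
| 2 | 0 | 21000 | 30922 | 37712 | 38521 |
| 3 | 0 | 24000* | 34010 | 40109 | 40821 |
| Avg | 0 | 21667 | 31318 | 37816 | 38581 |
| Diff | - | 21667 | 9651 | 6498 | 765 |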

*The system restart time is not stable; it ranges from 19s to 24s.

3.2 OpenYurt Start-up Time Test

3.2.1 Edge Network Connected, 10 service pods

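| Test Index | Reboot Command Entered | Kubelet Start | Static Pod (YurtHub) Start | First Service Pod Recovery | Last Service Pod Recovery |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 26000 | 35012 | 47030 | 47890 |
| 2 | 0 | 25000 | 34921 | 48041 | 48892 |
| 3 | 0 | 25000 | 34829 | 47843 | 48602 |
| Avg | 0 | 25333 | 34921 | 47638 | 48461 |
| Diff | - | 25333 | 9588 | 12717 | 823 |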

3.2.2 Edge Network Disconnected, 10 service pods, OpenYurt Edge

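| Test Index | Reboot Command Entered | Kubelet Start | Static Pod (YurtHub) Start | First Service Pod Recovery | Last Service Pod Recovery |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 26000 | 35219 | 48203 | 49260 |
| 2 | 0 | 25000 | 35621 | 48721 | 49422 |
| 3 | 0 | 27000 | 37015 | 50126 | 50921 |
| Avg | 0 | 26000 | 35952 | 49016 | 49867 |
| Diff | - | 26000 | 9951 | 13064 | 852 |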

*According to the experiment, the time between Kubelet start and First Service Pod Recovery varies between runs.

4. Time Delay Comparison between OpenYurt and Native Kubernetes

In this section, the time delays of the OpenYurt cluster (network connected and disconnected) and native Kubernetes are compared in the table below.

Four time periods are defined for comparison.

Period 1: From the start of the OS restart to Kubelet start.

Period 2: From Kubelet start to static pod (YurtHub) start.

Period 3: From static pod (YurtHub) start to first service pod recovery.

Period 4: From first service pod recovery to last service pod recovery.

Whole Recovery Period: From the start of the OS restart to last service pod recovery.

| Period | Native Kubernetes | OpenYurt Cluster (Network Connected) | OpenYurt Cluster (Network Disconnected) |
| --- | --- | --- | --- |
| Period 1 | 21.67s | 25.33s | 26.00s |
| Period 2 | 9.65s | 9.58s | 9.95s |
| Period 3 | 6.50s | 12.71s | 13.06s |
| Period 4 | 0.76s | 0.82s | 0.85s |
| Whole Recovery Period | 38.58s | 48.46s | 49.86s |

According to the comparison table, the OpenYurt cluster takes about 4s longer than native Kubernetes from the start of the OS restart to Kubelet start (Period 1: 25.33s and 26.00s versus 21.67s), and this OS restart time varies between runs. From Kubelet start to static pod (YurtHub) start (Period 2), the OpenYurt cluster takes 9.6-10.0s, which is about the same as the native Kubernetes cluster (9.65s). In Period 3, the OpenYurt cluster takes 6.2-6.6s longer than the native Kubernetes cluster.

Overall, when an OpenYurt edge node restarts, pod recovery takes 9.88s (network connected) to 11.28s (network disconnected) longer than on a native Kubernetes node.
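As a worked example, the periods can be recomputed from the averaged timestamps (the Avg rows in Section 3). The sketch below is illustrative only; the timeline type and the helper names are hypothetical. Results can differ from the tables in the last digit because the tables appear to have been rounded separately.

```go
package main

import "fmt"

// timeline holds averaged timestamps in ms after the reboot command,
// copied from the Avg rows of the tables in Section 3.
type timeline struct {
	name         string
	kubeletStart int
	yurthubStart int
	firstPod     int
	lastPod      int
}

// periods splits a timeline into Period 1 to Period 4, in ms.
func periods(t timeline) [4]int {
	return [4]int{
		t.kubeletStart,                  // Period 1: OS restart to Kubelet start
		t.yurthubStart - t.kubeletStart, // Period 2: Kubelet start to static pod start
		t.firstPod - t.yurthubStart,     // Period 3: static pod start to first service pod
		t.lastPod - t.firstPod,          // Period 4: first to last service pod
	}
}

// toSeconds converts milliseconds to seconds.
func toSeconds(ms int) float64 { return float64(ms) / 1000 }

func main() {
	runs := []timeline{
		{"Native Kubernetes", 21667, 31318, 37816, 38581},
		{"OpenYurt (connected)", 25333, 34921, 47638, 48461},
		{"OpenYurt (disconnected)", 26000, 35952, 49016, 49867},
	}
	native := runs[0]
	for _, r := range runs {
		p := periods(r)
		fmt.Printf("%-24s P1=%.2fs P2=%.2fs P3=%.2fs P4=%.2fs total=%.2fs (+%.2fs vs native)\n",
			r.name, toSeconds(p[0]), toSeconds(p[1]), toSeconds(p[2]), toSeconds(p[3]),
			toSeconds(r.lastPod), toSeconds(r.lastPod-native.lastPod))
	}
}
```

The four periods of a run always add up to its whole recovery period, so a change in any single period translates directly into the total.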

5. Detailed Analysis and Optimization

Based on the detailed tests on the OpenYurt edge node, the period from Kubelet start to YurtHub start and the period from YurtHub server work to first service pod recovery are analyzed in detail.

5.1 Detailed analysis of the period from Kubelet start to static pod (YurtHub) Start

From Kubelet start to the start of the first static pod (YurtHub), this period takes 9.5-10s in all three experiment scenarios. The Kubelet start-up invocation chain can be summarized as follows.

  main // cmd/kubelet/kubelet.go
  |--NewKubeletCommand // cmd/kubelet/app/server.go
  |--Run // cmd/kubelet/app/server.go
  |--initForOS // cmd/kubelet/app/server.go
  |--run // cmd/kubelet/app/server.go
  |--initConfigz // cmd/kubelet/app/server.go
  |--BuildAuth
  |--cm.NodeAllocatableRoot
  |--cadvisor.NewImageFsInfoProvider
  |--NewContainerManager
  |--ApplyOOMScoreAdj
  |--PreInitRuntimeService
  |--RunKubelet // cmd/kubelet/app/server.go
  | |--k = createAndInitKubelet // cmd/kubelet/app/server.go
  | | |--NewMainKubelet
  | | | |--watch k8s Service
  | | | |--watch k8s Node
  | | | |--klet := &Kubelet{}
  | | | |--init klet fields
  | | |
  | | |--k.BirthCry()
  | | |--k.StartGarbageCollection()
  | |
  | |--startKubelet(k) // cmd/kubelet/app/server.go
  | |--go k.Run() // -> pkg/kubelet/kubelet.go
  | | |--go cloudResourceSyncManager.Run()
  | | |--initializeModules
  | | |--go volumeManager.Run()
  | | |--go nodeLeaseController.Run()
  | | |--initNetworkUtil() // setup iptables
  | | |--go Until(PerformPodKillingWork, 1*time.Second, neverStop)
  | | |--statusManager.Start()
  | | |--runtimeClassManager.Start
  | | |--pleg.Start()
  | | |--syncLoop(updates, kl) // pkg/kubelet/kubelet.go
  | |
  | |--k.ListenAndServe
  |
  |--go http.ListenAndServe(healthz)

According to the Kubelet logs, it takes almost 2.5s from systemd starting Kubelet to the Run() function. It then takes 0.5s from Run() to watching the k8s Service in NewMainKubelet(). From watching the API server to the completion of startKubelet(), it takes 6.5s.

The total time is stable at 9.5s to 10s across the native Kubernetes cluster and the OpenYurt cluster with the network connected or disconnected.

5.2 Detailed analysis of the period from YurtHub server work to first service pod recovery

After Kubelet starts to work (startKubelet() has finished and Kubelet listens on port 10250), the static pods and service pods are started one by one. The syncLoop invocation chain through which Kubelet starts pods can be summarized as follows.

  kubelet.syncLoop /pkg/kubelet/kubelet.go
  |--kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh)
  |--u, open := <-configCh
  |--handler.HandlePodAdditions(u.Pods) //Kubelet.HandlePodAdditions
  |--sort.Sort(sliceutils.PodsByCreationTime(pods))
  |--kl.handleMirrorPod(pod, start)
  |--kl.dispatchWork
  |--kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
  |--kl.podWorkers.UpdatePod //podWorkers.UpdatePod /pkg/kubelet/pod_worker.go
  |--p.managePodLoop
  |--p.syncPodFn

In the Kubelet syncLoop() function, syncTicker fires every 1s to check which pod workers need to be synced. housekeepingTicker fires every 2s to check whether any pods need to be cleaned up. plegCh delivers pod lifecycle events.

In the HandlePodAdditions(u.Pods) function, the pods are sorted by creation timestamp. The static pods are handled first, and the other service pods are filtered (to check whether a pod is terminated and whether it can be admitted). Then kl.dispatchWork() is executed. kl.podWorkers.UpdatePod invokes p.managePodLoop concurrently, and p.syncPodFn is executed to set up the pods.

The execution steps of the syncPod method (Kubelet.syncPod) are summarized as follows:

  Step 1: Record the pod worker start latency if the pod is being created
  Step 2: Create the v1.PodStatus object
  Step 3: Generate the PodStatus
  Step 4: Run the admission handlers and check whether the pod is allowed to run (for example, whether privileged containers are permitted)
  Step 5: Create the cgroups of the pod
  Step 6: Make the data directories of the pod
  Step 7: Wait for the volumes to be attached
  Step 8: Fetch the ImagePullSecrets
  Step 9: Call the container runtime: kl.containerRuntime.SyncPod(pod, podStatus, pullSecrets, kl.backOff)

The steps that the container runtime performs (kubeGenericRuntimeManager.SyncPod) are summarized as follows:

  Step 1: Compute sandbox and container changes.
  Step 2: Kill the pod if the sandbox has changed.
  Step 3: Kill any running containers in this pod that should not be kept.
  Step 4: Create a sandbox for the pod if necessary.
  Step 5: Start ephemeral containers.
  Step 6: Start the init containers.
  Step 7: Start the containers in podContainerChanges.ContainersToStart.

Step 4 and Step 7 are discussed in detail below.

The detailed steps of creating the sandbox for the pod, RunPodSandbox(), are as follows:

  Step 1: Pull the image for the sandbox.
  Step 2: Create the sandbox container.
  Step 3: Create the sandbox checkpoint.
  Step 4: Start the sandbox container.
  Step 5: Set up networking for the sandbox.

Likewise, the detailed steps of starting a container, startContainer(), are as follows:

  Step 1: Pull the image.
  Step 2: Create the container.
  Step 3: Start the container.
  Step 4: Execute the post-start hook.

Based on the pod start procedure above, when the edge node restarts and Kubelet has initialized and started, YurtHub starts first. Because YurtHub uses the host network, it can start before the CNI plugin is ready. YurtHub starts about 1s after Kubelet starts, and the YurtHub server begins to work about 1.5s after that. Once the YurtHub server is working, it plays the role of the API server under weak network conditions.

The recovery of the nginx pods is blocked in createSandBox because they rely on CNI, and flannel, the CNI plugin, is not ready yet.

  Aug 26 16:04:28 openyurt-node-02 kubelet[1193]: E0826 16:04:28.209598 1193 pod_workers.go:191] Error syncing pod 464fc7d4-2a53-4a20-abc3-c51a919f1b1a ("nginx-06-78df84cfc7-b8fc2_default(464fc7d4-2a53-4a20-abc3-c51a919f1b1a)"), skipping: failed to "CreatePodSandbox" for "nginx-06-78df84cfc7-b8fc2_default(464fc7d4-2a53-4a20-abc3-c51a919f1b1a)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nginx-06-78df84cfc7-b8fc2_default(464fc7d4-2a53-4a20-abc3-c51a919f1b1a)\" failed: rpc error: code = Unknown desc = failed to set up sandbox container \"ec15044992d3d0df0185a41d00adaca0fa7895f8ac717399b00f24a68ae3fa3e\" network for pod \"nginx-06-78df84cfc7-b8fc2\": networkPlugin cni failed to set up pod \"nginx-06-78df84cfc7-b8fc2_default\" network: open /run/flannel/subnet.env: no such file or directory"

The time between the YurtHub server starting to work and flannel starting is 9s in the OpenYurt cluster, whereas the native Kubernetes cluster needs only 3s. The OpenYurt cluster therefore spends 6s more than the native cluster in this period. According to the Kubelet logs, these 6 seconds are spent blocked in the following procedure.

The extra time is probably caused by the first attempt to start the flannel-cni plugin failing 3s after the YurtHub server starts working. The third attempt (after 9s) succeeds.

  Aug 28 23:57:12 openyurt-node-02 kubelet[1185]: E0828 23:57:12.447073 1185 kuberuntime_manager.go:815] init container &Container{Name:install-cni-plugin,Image:docker.io/rancher/mirrored-flannelcni-flannel-cni-plugin:v1.1.0,Command:[cp],Args:[-f /flannel /opt/cni/bin/flannel],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:cni-plugin,ReadOnly:false,MountPath:/opt/cni/bin,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:flannel-token-2m92x,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod kube-flannel-ds-dskwd_kube-flannel(9f07d412-f6f4-40f6-a6b2-01d013004dd1): CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars

According to the experiment, flannel takes about 2.5s to start. After that, the nginx pods restart within 1s. The whole period takes roughly 12-13s in the OpenYurt cluster with the network disconnected.

6. Optimization Strategy

  • Switch the networking of the edge node service pods from CNI to the host network.
    • This saves 8-9s because host-network service pods do not wait for the CNI plugin (flannel) to become ready.
  • Set the image pull policy to IfNotPresent.
    • The images have already been pulled by the edge node. With the IfNotPresent policy, the container runtime uses the local image and does not sync from the registry again. This may save 10s-20s.
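Both suggestions map to two fields of the pod spec. The sketch below uses small local structs instead of the real k8s.io/api types so that it stays self-contained; the struct and function names are hypothetical, while the JSON keys hostNetwork and imagePullPolicy are the field names used by the Kubernetes API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// container and podSpec are local stand-ins for the Kubernetes types.
// Only the fields discussed above are modeled.
type container struct {
	Name            string `json:"name"`
	Image           string `json:"image"`
	ImagePullPolicy string `json:"imagePullPolicy"`
}

type podSpec struct {
	HostNetwork bool        `json:"hostNetwork"`
	Containers  []container `json:"containers"`
}

// edgePodSpec returns a spec that applies both suggestions: host
// networking and the IfNotPresent image pull policy.
func edgePodSpec(name, image string) podSpec {
	return podSpec{
		HostNetwork: true,
		Containers: []container{
			{Name: name, Image: image, ImagePullPolicy: "IfNotPresent"},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(edgePodSpec("nginx", "nginx:latest"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Two points are worth keeping in mind. Kubernetes defaults the pull policy to Always for images tagged latest, so IfNotPresent has to be set explicitly for the nginx:latest pods used in the tests. Host-network pods also share the node's port space, so several replicas listening on the same port cannot run on one node.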