Cluster boot sequence

Kubernetes Bootstrap

This is an overview of how a Kubernetes cluster comes up when using kOps.

From spec to complete configuration

The kOps tool itself takes the (minimal) spec of a cluster that the user specifies, and computes a complete configuration, setting defaults where values are not specified, and deriving appropriate dependencies. The “complete” specification includes the set of all flags that will be passed to all components. All decisions about how to install the cluster are made at this stage, and thus every decision can in theory be changed if the user specifies a value in the spec.
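
As a rough illustration of the defaulting step described above, the sketch below merges a minimal user spec over a table of defaults. It is not kOps code: the field names, the default values, and the `complete_spec` helper are hypothetical, and the real tool also derives dependencies between values, which this sketch does not attempt.

```python
import copy

# Hypothetical defaults; the real kOps defaults and field names differ.
DEFAULTS = {
    "kubernetesVersion": "1.8.0",
    "kubelet": {
        "logLevel": 2,
        "podManifestPath": "/etc/kubernetes/manifests",
    },
}


def complete_spec(user_spec, defaults=DEFAULTS):
    """Return a new dict: the defaults, overridden by whatever the user set."""
    result = copy.deepcopy(defaults)
    for key, value in user_spec.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Merge nested sections field by field instead of replacing them.
            result[key] = complete_spec(value, result[key])
        else:
            result[key] = copy.deepcopy(value)
    return result


# The user only specifies what they care about.
minimal = {"kubelet": {"logLevel": 4}}
full = complete_spec(minimal)
```

Here `full` keeps the user's `logLevel` and picks up every other value from the defaults, which mirrors the idea that any decision can be overridden by setting it in the spec.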

This complete specification is set in the LaunchTemplate for the AutoScaling Group (on AWS), or the Managed Instance Group (on GCE).

On both AWS & GCE, everything (nodes & masters) runs in an ASG/MIG; this means that failures (or the user) can terminate machines and the system will self-heal.

nodeup: from image to kubelet

nodeup is the component that installs packages and sets up the OS, sufficiently for kubelet to run. The core requirements are:

  • Docker must be installed. nodeup will install Docker 1.13.1, the version of Docker tested with Kubernetes 1.8
  • Kubelet, which is installed as a systemd service

In addition, nodeup installs:

  • Protokube, which is a kOps-specific component

/etc/kubernetes/manifests

kubelet starts pods as controlled by the files in /etc/kubernetes/manifests. These files are created by nodeup and protokube (ideally all by protokube, but currently split between the two).

These pods are declared using the standard k8s manifests, just as if they were stored in the API. Because kubelet reads them from disk rather than from the API, they break the circular dependency in the bring-up of our core components, such as etcd & kube-apiserver.
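
To make the idea of a manifest file concrete, here is a minimal sketch that builds a Pod definition as a plain dict and writes it to a directory as JSON (kubelet accepts static pod manifests in JSON or YAML). The pod name, image, command, file name, and the temporary directory standing in for /etc/kubernetes/manifests are all placeholders, not what nodeup or protokube actually write.

```python
import json
import os
import tempfile


def static_pod(name, image, command):
    """Build a minimal Pod manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": "kube-system"},
        "spec": {
            # Core components use the host network (see "kubelet start").
            "hostNetwork": True,
            "containers": [
                {"name": name, "image": image, "command": command},
            ],
        },
    }


def write_manifest(directory, pod):
    """Write the pod definition to <directory>/<name>.json and return the path."""
    path = os.path.join(directory, pod["metadata"]["name"] + ".json")
    with open(path, "w") as f:
        json.dump(pod, f, indent=2)
    return path


# A temporary directory stands in for /etc/kubernetes/manifests.
manifest_dir = tempfile.mkdtemp()
pod = static_pod("example", "example.invalid/image:tag", ["/bin/example"])
path = write_manifest(manifest_dir, pod)
```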

On masters:

  • kube-apiserver
  • kube-controller-manager (which runs miscellaneous controllers)
  • kube-scheduler (which assigns pods to nodes)
  • etcd (this is actually created by protokube though)
  • dns-controller

On nodes:

  • kube-proxy (which configures iptables so that the k8s-network will work)

It is possible to add custom static pods by using fileAssets in the cluster spec. This might be useful for any custom bootstrapping that doesn’t fit into additionalUserData or hooks.

kubelet start

Kubelet starts up, then starts (and restarts) all the containers in /etc/kubernetes/manifests.

It also tries to contact the API server (which the master kubelet will itself eventually start) and register the node. Once a node is registered, kube-controller-manager will allocate it a PodCIDR, which is an allocation of the k8s-network IP range. kube-controller-manager updates the node, setting the PodCIDR field. Once kubelet sees this allocation, it will set up the local bridge with this CIDR, which allows docker to start. Before this happens, only pods that have hostNetwork will work, so all the “core” containers run with hostNetwork=true.
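
The PodCIDR allocation can be pictured as carving per-node subnets out of the cluster's pod network. The sketch below does this with Python's `ipaddress` module. It is only an illustration: the 10.0.0.0/16 range, the /24 per-node size, and the node names are assumptions, and the allocator in kube-controller-manager is more involved than handing out subnets in order.

```python
import ipaddress


def allocate_pod_cidrs(cluster_cidr, node_names, node_prefix=24):
    """Hand each node the next free subnet of the cluster pod network."""
    network = ipaddress.ip_network(cluster_cidr)
    subnets = network.subnets(new_prefix=node_prefix)
    allocations = {}
    for name in node_names:
        try:
            allocations[name] = str(next(subnets))
        except StopIteration:
            raise ValueError("cluster CIDR exhausted") from None
    return allocations


# Hypothetical cluster range and node names.
cidrs = allocate_pod_cidrs("10.0.0.0/16", ["node-a", "node-b"])
# cidrs == {"node-a": "10.0.0.0/24", "node-b": "10.0.1.0/24"}
```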

api-server bringup

APIServer listens on the HTTPS port (443) on all interfaces. This is a secured endpoint, and requires valid authentication/authorization to use it. This is the endpoint that node kubelets will reach, and also that end-users will reach.

kOps uses DNS to allow nodes and end-users to discover the api-server. The apiserver pod manifest (in /etc/kubernetes/manifests) includes annotations that will cause the dns-controller to create the records. It creates api.internal.mycluster.com for use inside the cluster (using InternalIP addresses), and it creates api.mycluster.com for use outside the cluster (using ExternalIP addresses).
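
The mapping from address types to the two DNS names can be sketched as follows. This is only a model of the outcome, not how dns-controller is implemented: the `api_records` helper, the shape of the `masters` input, and the example addresses are assumptions made for illustration.

```python
def api_records(cluster_name, masters):
    """Map the two API DNS names to the master addresses they should point at."""
    internal = sorted(m["InternalIP"] for m in masters)
    external = sorted(m["ExternalIP"] for m in masters if "ExternalIP" in m)
    return {
        f"api.internal.{cluster_name}": internal,  # used inside the cluster
        f"api.{cluster_name}": external,           # used outside the cluster
    }


# Hypothetical master with one internal and one external address.
records = api_records(
    "mycluster.com",
    [{"InternalIP": "10.0.0.10", "ExternalIP": "203.0.113.10"}],
)
```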

etcd bringup

etcd is where we have put all of our synchronization logic, so it is more complicated than most other pieces, and we must be really careful when bringing it up.

kOps follows CoreOS’s recommended procedure for bring-up of etcd on clouds:

  • We have one EBS volume for each etcd cluster member (in different nodes)
  • We attach the EBS volume to a master, and bring up etcd on that master
  • We set up DNS names pointing to the etcd process
  • We set up etcd with a static cluster, with those DNS names
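
A static cluster is described to etcd through its `--initial-cluster` flag, which lists every member as `name=peer-url`. The sketch below builds that value from a fixed set of DNS names. The member names, the DNS names, and the use of plain http on the conventional peer port 2380 are assumptions for illustration, not the values kOps generates.

```python
def initial_cluster(member_hosts, peer_port=2380, scheme="http"):
    """Build the value of etcd's --initial-cluster flag from member DNS names."""
    return ",".join(
        f"{name}={scheme}://{host}:{peer_port}"
        for name, host in member_hosts.items()
    )


# Hypothetical member names and DNS names.
members = {
    "etcd-a": "etcd-a.internal.mycluster.com",
    "etcd-b": "etcd-b.internal.mycluster.com",
    "etcd-c": "etcd-c.internal.mycluster.com",
}
flag = "--initial-cluster=" + initial_cluster(members)
```

Since every member is started with the same list, the membership never has to be updated when a master is replaced; only the DNS record behind a name changes.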

Because the data is persistent and the cluster membership is a static set of DNS names, we don’t need to manage etcd directly. We just try to make sure that some master always has each volume mounted, with etcd running and DNS set correctly. That is the job of protokube.

Protokube:

  • discovers EBS volumes that hold etcd data (using tags)
  • tries to safe_format_and_mount them
  • if successful in mounting the volume, it will write a manifest for etcd into /etc/kubernetes/manifests
  • configures DNS for the etcd nodes (we can’t use dns-controller, because the API is not yet up)
  • kubelet then starts and runs etcd
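
The steps above can be sketched as a small simulation. Nothing here talks to a cloud API or mounts anything: the `Volume` class, the tag key, and the `claim_etcd_volumes` function are hypothetical stand-ins, and the real protokube logic (retries, DNS updates, full manifests) is considerably richer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    volume_id: str
    tags: dict
    attached_to: Optional[str] = None  # id of the master holding the volume


def claim_etcd_volumes(volumes, master_id, tag_key="example/etcd"):
    """Claim free etcd volumes for this master and return the manifests to write."""
    manifests = {}
    for vol in volumes:
        if tag_key not in vol.tags:
            continue  # not an etcd data volume
        if vol.attached_to not in (None, master_id):
            continue  # another master already holds it
        vol.attached_to = master_id  # stands in for attach, format and mount
        member = vol.tags[tag_key]
        # Only written once the "mount" has succeeded.
        manifests[f"{member}.json"] = {"kind": "Pod", "metadata": {"name": member}}
    return manifests


volumes = [
    Volume("vol-1", {"example/etcd": "etcd-a"}),
    Volume("vol-2", {"example/etcd": "etcd-b"}, attached_to="master-2"),
    Volume("vol-3", {"purpose": "other"}),
]
written = claim_etcd_volumes(volumes, "master-1")
```

In this run only `vol-1` is claimed: `vol-2` is held by another master and `vol-3` is not tagged as an etcd volume.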

node bringup

Most of this has focused on things that happen on the master. Node bringup is very similar, but simpler:

  • nodeup installs docker & kubelet
  • in /etc/kubernetes/manifests, we have kube-proxy

So kubelet will start up, as will kube-proxy. Kubelet will try to reach the api-server on the internal DNS name, and once the master is up it will succeed. Then:

  • kubelet creates a Node object representing itself
  • kube-controller-manager sees the node creation and assigns it a PodCIDR
  • kubelet sees the PodCIDR assignment and configures the local docker bridge (cbr0)
  • the node will be marked as Ready, and kube-scheduler will start assigning pods to the node
  • when kubelet sees assigned pods it will run them
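
The "keep trying until the master is up" behaviour amounts to a retry loop. The sketch below shows the pattern with a simulated probe instead of a real network call. The `wait_until` helper and its bounded attempt count are assumptions for the sake of a runnable example; kubelet itself simply keeps retrying.

```python
import time


def wait_until(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() until it returns True; return the number of tries used."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        sleep(delay)
    raise TimeoutError(f"gave up after {attempts} attempts")


# Simulated API server that only becomes reachable on the third probe.
probes = iter([False, False, True])
attempts_needed = wait_until(lambda: next(probes), delay=0.0)
```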