Kubernetes Infrastructure

Overview

Within OKD, Kubernetes manages containerized applications across a set of containers or hosts and provides mechanisms for deployment, maintenance, and application scaling. The container runtime packages, instantiates, and runs containerized applications. A Kubernetes cluster consists of one or more masters and a set of nodes.

You can optionally configure your masters for high availability (HA) to ensure that the cluster has no single point of failure.

OKD uses Kubernetes 1.11 and Docker 1.13.1.

Masters

The master is the host or hosts that contain the control plane components, including the API server, controller manager server, and etcd. The master manages nodes in its Kubernetes cluster and schedules pods to run on those nodes.

Table 1. Master Components
Component	Description
API Server	The Kubernetes API server validates and configures the data for pods, services, and replication controllers. It also assigns pods to nodes and synchronizes pod information with service configuration.
etcd	etcd stores the persistent master state while other components watch etcd for changes to bring themselves into the desired state. etcd can be optionally configured for high availability, typically deployed with 2n+1 peer services.
Controller Manager Server	The controller manager server watches etcd for changes to replication controller objects and then uses the API to enforce the desired state. Several such processes create a cluster with one active leader at a time.
HAProxy	Optional, used when configuring highly-available masters with the `native` method to balance load between API master endpoints. The cluster installation process can configure HAProxy for you with the `native` method. Alternatively, you can use the `native` method but pre-configure your own load balancer of choice.

Control Plane Static Pods

The core control plane components, the API server and the controller manager components, run as static pods operated by the kubelet.

For masters that have etcd co-located on the same host, etcd is also moved to static pods. RPM-based etcd is still supported on etcd hosts that are not also masters.

In addition, the node components openshift-sdn and openvswitch are now run using a DaemonSet instead of a systemd service.

Figure 1. Control plane host architecture changes

Even with control plane components running as static pods, master hosts still source their configuration from the /etc/origin/master/master-config.yaml file, as described in the Master and Node Configuration topic.

Startup Sequence Overview

Hyperkube is a binary that contains all of Kubernetes (kube-apiserver, controller-manager, scheduler, proxy, and kubelet). On startup, the kubelet creates the kubepods.slice. Next, the kubelet creates the QoS-level slices burstable.slice and best-effort.slice inside the kubepods.slice. When a pod starts, the kubelet creates a pod-level slice with the format pod<UUID-of-pod>.slice and passes that path to the runtime on the other side of the Container Runtime Interface (CRI). Docker or CRI-O then creates the container-level slices inside the pod-level slice.
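
The slice nesting described above can be modeled with a short sketch. The Python below is illustrative only: the function name `pod_slice_path` is hypothetical, the slice names follow the simplified naming used in this section, and placing Guaranteed pods directly under kubepods.slice is an assumption that the text above does not state. Actual cgroup names on a host depend on the cgroup driver and may differ.

```python
from pathlib import PurePosixPath

# QoS-level slices named in this section. Guaranteed pods are ASSUMED to sit
# directly under kubepods.slice; the section itself does not say so.
QOS_SLICES = {
    "Burstable": "burstable.slice",
    "BestEffort": "best-effort.slice",
    "Guaranteed": None,
}


def pod_slice_path(qos_class: str, pod_uid: str) -> str:
    """Return the illustrative pod-level slice path for a pod."""
    if qos_class not in QOS_SLICES:
        raise ValueError(f"unknown QoS class: {qos_class}")
    path = PurePosixPath("kubepods.slice")
    qos_slice = QOS_SLICES[qos_class]
    if qos_slice is not None:
        path = path / qos_slice
    # The container runtime would create container-level slices below this path.
    return str(path / f"pod{pod_uid}.slice")


print(pod_slice_path("Burstable", "1234-abcd"))
```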

Mirror Pods

The kubelet on master nodes automatically creates mirror pods on the API server for each of the control plane static pods so that they are visible in the cluster in the kube-system project. By default, the openshift-ansible installer installs the manifests for these static pods in the /etc/origin/node/pods directory on the master host.

These pods have the following hostPath volumes defined:

*/etc/origin/master*	Contains all certificates, configuration files, and the *admin.kubeconfig* file.
*/var/lib/origin*	Contains volumes and potential core dumps of the binary.
*/etc/origin/cloudprovider*	Contains cloud provider specific configuration (AWS, Azure, etc.).
*/usr/libexec/kubernetes/kubelet-plugins*	Contains additional third party volume plug-ins.
*/etc/origin/kubelet-plugins*	Contains additional third party volume plug-ins for system containers.

The set of operations you can do on the static pods is limited. For example:

$ oc logs master-api-<hostname> -n kube-system

returns the standard output from the API server. However:

$ oc delete pod master-api-<hostname> -n kube-system

will not actually delete the pod.

As another example, a cluster administrator might want to perform a common operation, such as increasing the loglevel of the API server to provide more verbose data if a problem occurs. You must edit the /etc/origin/master/master.env file, where the --loglevel parameter in the OPTIONS variable can be modified, because this value is passed to the process running inside the container. Changes require a restart of the process running inside the container.
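
As a rough illustration of that edit, the following Python sketch rewrites the `--loglevel` value on the `OPTIONS` line of a master.env-style file. It assumes the file uses plain `KEY=value` lines and that `OPTIONS` already contains a `--loglevel=<number>` parameter. The function name `set_loglevel` and the sample file content are hypothetical, not taken from a real host.

```python
import re


def set_loglevel(env_text: str, level: int) -> str:
    """Return env_text with --loglevel on the OPTIONS line set to level."""
    out = []
    for line in env_text.splitlines():
        # Only touch the OPTIONS variable; leave every other line unchanged.
        if line.startswith("OPTIONS="):
            line = re.sub(r"--loglevel=\d+", f"--loglevel={level}", line)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file content, for illustration only.
sample = "OPTIONS=--loglevel=2\nOTHER_SETTING=value\n"
print(set_loglevel(sample, 4), end="")
```

The result would still need to be written back to /etc/origin/master/master.env, and the new value takes effect only after the process inside the container restarts.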

Restarting Master Services

To restart control plane services running in control plane static pods, use the master-restart command on the master host.

To restart the master API:

# master-restart api

To restart the controllers:

# master-restart controllers

To restart etcd:

# master-restart etcd

Viewing Master Service Logs

To view logs for control plane services running in control plane static pods, use the master-logs command for the respective component:

# master-logs api api
# master-logs controllers controllers
# master-logs etcd etcd

High Availability Masters

You can optionally configure your masters for high availability (HA) to ensure that the cluster has no single point of failure.

To mitigate concerns about availability of the master, two activities are recommended:

- Create a runbook entry for reconstructing the master. A runbook entry is a necessary backstop for any highly-available service. Additional solutions merely control how often the runbook must be consulted. For example, a cold standby of the master host can adequately fulfill SLAs that require no more than minutes of downtime for creation of new applications or recovery of failed application components.
- Use a high availability solution to configure your masters and ensure that the cluster has no single point of failure. The cluster installation documentation provides specific examples using the native HA method and configuring HAProxy. You can also take the concepts and apply them towards your existing HA solutions using the native method instead of HAProxy.

In production OKD clusters, you must maintain high availability of the API Server load balancer. If the API Server load balancer is not available, nodes cannot report their status, all their pods are marked dead, and the pods’ endpoints are removed from the service.

In addition to configuring HA for OKD, you must separately configure HA for the API Server load balancer. The preferred approach is to integrate an enterprise load balancer (LB) such as an F5 Big-IP™ or a Citrix Netscaler™ appliance. If such solutions are not available, it is possible to run multiple HAProxy load balancers and use Keepalived to provide a floating virtual IP address for HA. However, this solution is not recommended for production instances.

When using the native HA method with HAProxy, master components have the following availability:

Table 2. Availability Matrix with HAProxy
Role	Style	Notes
etcd	Active-active	Fully redundant deployment with load balancing. Can be installed on separate hosts or collocated on master hosts.
API Server	Active-active	Managed by HAProxy.
Controller Manager Server	Active-passive	One instance is elected as a cluster leader at a time.
HAProxy	Active-passive	Balances load between API master endpoints.

While clustered etcd requires an odd number of hosts for quorum, the master services have no quorum or requirement that they have an odd number of hosts. However, since you need at least two master services for HA, it is common to maintain a uniform odd number of hosts when collocating master services and etcd.
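
The preference for an odd number of etcd hosts follows from majority-quorum arithmetic, sketched below in Python. The function names `quorum` and `failures_tolerated` are hypothetical helpers for illustration. The formulas are the standard majority rule, not values read from an OKD cluster.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


# Print members, quorum size, and tolerated failures for small clusters.
for n in range(1, 6):
    print(n, quorum(n), failures_tolerated(n))
```

Under this rule, going from three to four members raises the quorum size without raising the number of tolerated failures, which is why 2n+1 deployments are typical.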

Nodes

A node provides the runtime environments for containers. Each node in a Kubernetes cluster has the required services to be managed by the master. Nodes also have the required services to run pods, including the container runtime, a kubelet, and a service proxy.

OKD creates nodes from a cloud provider, physical systems, or virtual systems. Kubernetes interacts with node objects that are a representation of those nodes. The master uses the information from node objects to validate nodes with health checks. A node is ignored until it passes the health checks, and the master continues checking nodes until they are valid. The Kubernetes documentation has more information on node statuses and management.

Administrators can manage nodes in an OKD instance using the CLI. To define full configuration and security options when launching node servers, use dedicated node configuration files.

See the cluster limits section for the recommended maximum number of nodes.

Kubelet

Each node has a kubelet that updates the node as specified by a container manifest, which is a YAML file that describes a pod. The kubelet uses a set of manifests to ensure that its containers are started and that they continue to run.

A container manifest can be provided to a kubelet by:

A file path on the command line that is checked every 20 seconds.
An HTTP endpoint passed on the command line that is checked every 20 seconds.
The kubelet watching an etcd server, such as /registry/hosts/$(hostname -f), and acting on any changes.
The kubelet listening for HTTP and responding to a simple API to submit a new manifest.

Service Proxy

Each node also runs a simple network proxy that reflects the services defined in the API on that node. This allows the node to do simple TCP and UDP stream forwarding across a set of back ends.

Node Object Definition

The following is an example node object definition in Kubernetes:

apiVersion: v1 (1)
kind: Node (2)
metadata:
  creationTimestamp: null
  labels: (3)
    kubernetes.io/hostname: node1.example.com
  name: node1.example.com (4)
spec:
  externalID: node1.example.com (5)
status:
  nodeInfo:
    bootID: ""
    containerRuntimeVersion: ""
    kernelVersion: ""
    kubeProxyVersion: ""
    kubeletVersion: ""
    machineID: ""
    osImage: ""
    systemUUID: ""

1	`apiVersion` defines the API version to use.
2	`kind` set to `Node` identifies this as a definition for a node object.
3	`metadata.labels` lists any labels that have been added to the node.
4	`metadata.name` is a required value that defines the name of the node object. This value is shown in the `NAME` column when running the `oc get nodes` command.
5	`spec.externalID` defines the fully-qualified domain name where the node can be reached. Defaults to the `metadata.name` value when empty.
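
The defaulting rule in callout 5 can be expressed as a small sketch. This Python models the node object as a plain dictionary. The helper name `effective_external_id` is hypothetical and is not part of any Kubernetes client library.

```python
def effective_external_id(node: dict) -> str:
    """Return spec.externalID, falling back to metadata.name when it is empty."""
    name = node["metadata"]["name"]  # metadata.name is required
    spec = node.get("spec") or {}
    return spec.get("externalID") or name


node = {
    "apiVersion": "v1",
    "kind": "Node",
    "metadata": {"name": "node1.example.com"},
    "spec": {"externalID": ""},
}
print(effective_external_id(node))
```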

Node Bootstrapping

A node’s configuration is bootstrapped from the master, which means nodes pull their pre-defined configuration and client and server certificates from the master. This allows faster node start-up by reducing the differences between nodes, as well as centralizing more configuration and letting the cluster converge on the desired state. Certificate rotation and centralized certificate management are enabled by default.

Figure 2. Node bootstrapping workflow overview

When node services are started, the node checks if the /etc/origin/node/node.kubeconfig file and other node configuration files exist before joining the cluster. If they do not, the node pulls the configuration from the master, then joins the cluster.
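
The start-up check can be pictured as a simple file-existence test. The Python sketch below is a model of that decision, not the node service's actual implementation. The function name `needs_bootstrap` is hypothetical, and treating node-config.yaml as the only other required configuration file is an assumption.

```python
import tempfile
from pathlib import Path

# ASSUMPTION: the text names node.kubeconfig and "other node configuration
# files"; node-config.yaml is used here as a stand-in for those files.
REQUIRED_FILES = ("node.kubeconfig", "node-config.yaml")


def needs_bootstrap(node_dir: str) -> bool:
    """Return True when any required file is missing from node_dir."""
    base = Path(node_dir)
    return not all((base / name).is_file() for name in REQUIRED_FILES)


# Demonstrate with a temporary directory instead of /etc/origin/node.
with tempfile.TemporaryDirectory() as tmp:
    print(needs_bootstrap(tmp))  # files missing: configuration must be pulled
    for name in REQUIRED_FILES:
        (Path(tmp) / name).write_text("")
    print(needs_bootstrap(tmp))  # files present: node can join directly
```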

ConfigMaps are used to store the node configuration in the cluster, which populates the configuration file on the node host at /etc/origin/node/node-config.yaml. For definitions of the set of default node groups and their ConfigMaps, see Defining Node Groups and Host Mappings in Installing Clusters.

Node Bootstrap Workflow

The process for automatic node bootstrapping uses the following workflow:

By default during cluster installation, a set of clusterrole, clusterrolebinding and serviceaccount objects are created for use in node bootstrapping:

The system:node-bootstrapper cluster role is used for creating certificate signing requests (CSRs) during node bootstrapping:

# oc describe clusterrole.authorization.openshift.io/system:node-bootstrapper
Name:            system:node-bootstrapper
Created:        17 hours ago
Labels:            kubernetes.io/bootstrapping=rbac-defaults
Annotations:        authorization.openshift.io/system-only=true
            openshift.io/reconcile-protect=false
Verbs            Non-Resource URLs    Resource Names    API Groups        Resources
[create get list watch]    []            []        [certificates.k8s.io]    [certificatesigningrequests]

The following node-bootstrapper service account is created in the openshift-infra project:

# oc describe sa node-bootstrapper -n openshift-infra
Name:                node-bootstrapper
Namespace:           openshift-infra
Labels:              <none>
Annotations:         <none>
Image pull secrets:  node-bootstrapper-dockercfg-f2n8r
Mountable secrets:   node-bootstrapper-token-79htp
                     node-bootstrapper-dockercfg-f2n8r
Tokens:              node-bootstrapper-token-79htp
                     node-bootstrapper-token-mqn2q
Events:              <none>

The following system:node-bootstrapper cluster role binding is for the node bootstrapper cluster role and service account:

# oc describe clusterrolebindings system:node-bootstrapper
Name:            system:node-bootstrapper
Created:        17 hours ago
Labels:            <none>
Annotations:        openshift.io/reconcile-protect=false
Role:            /system:node-bootstrapper
Users:            <none>
Groups:            <none>
ServiceAccounts:    openshift-infra/node-bootstrapper
Subjects:        <none>
Verbs            Non-Resource URLs    Resource Names    API Groups        Resources
[create get list watch]    []            []        [certificates.k8s.io]    [certificatesigningrequests]

Also by default during cluster installation, the openshift-ansible installer creates an OKD certificate authority and various other certificates, keys, and kubeconfig files in the /etc/origin/master directory. Two files of note are:

/etc/origin/master/admin.kubeconfig
Uses the system:admin user.
/etc/origin/master/bootstrap.kubeconfig
Used for bootstrapping nodes other than masters.
1. The /etc/origin/master/bootstrap.kubeconfig is created when the installer uses the node-bootstrapper service account as follows:
```
$ oc --config=/etc/origin/master/admin.kubeconfig \
    serviceaccounts create-kubeconfig node-bootstrapper \
    -n openshift-infra
```
2. On master nodes, the /etc/origin/master/admin.kubeconfig file is used as the bootstrapping file and is copied to /etc/origin/node/bootstrap.kubeconfig. On non-master nodes, the /etc/origin/master/bootstrap.kubeconfig file is copied to /etc/origin/node/bootstrap.kubeconfig on each node host.
3. The bootstrap.kubeconfig file copied to each node host is then passed to the kubelet using the --bootstrap-kubeconfig flag as follows:
```
--bootstrap-kubeconfig=/etc/origin/node/bootstrap.kubeconfig
```

The kubelet is first started with the supplied /etc/origin/node/bootstrap.kubeconfig file. After initial connection internally, the kubelet creates certificate signing requests (CSRs) and sends them to the master.

The CSRs are verified and approved via the controller manager (specifically the certificate signing controller). If approved, the kubelet client and server certificates are created in the /etc/origin/node/certificates directory. For example:

# ls -al /etc/origin/node/certificates/
total 12
drwxr-xr-x. 2 root root  212 Jun 18 21:56 .
drwx------. 4 root root  213 Jun 19 15:18 ..
-rw-------. 1 root root 2826 Jun 18 21:53 kubelet-client-2018-06-18-21-53-15.pem
-rw-------. 1 root root 1167 Jun 18 21:53 kubelet-client-2018-06-18-21-53-45.pem
lrwxrwxrwx. 1 root root   68 Jun 18 21:53 kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2018-06-18-21-53-45.pem
-rw-------. 1 root root 1447 Jun 18 21:56 kubelet-server-2018-06-18-21-56-52.pem
lrwxrwxrwx. 1 root root   68 Jun 18 21:56 kubelet-server-current.pem -> /etc/origin/node/certificates/kubelet-server-2018-06-18-21-56-52.pem
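
In the listing above, each `-current.pem` symlink points at the file with the latest timestamp in its name. The Python sketch below picks that file from a list of names. It assumes the `kubelet-<kind>-YYYY-MM-DD-HH-MM-SS.pem` pattern seen in the listing, and the function name `newest_cert` is hypothetical. On a real host the symlink itself remains the authoritative pointer.

```python
import re
from datetime import datetime

# File name pattern taken from the example listing above.
CERT_NAME = re.compile(r"^kubelet-(client|server)-(\d{4}(?:-\d{2}){5})\.pem$")


def newest_cert(filenames, kind):
    """Return the most recently dated certificate file of the given kind."""
    dated = []
    for name in filenames:
        match = CERT_NAME.match(name)
        if match and match.group(1) == kind:
            stamp = datetime.strptime(match.group(2), "%Y-%m-%d-%H-%M-%S")
            dated.append((stamp, name))
    return max(dated)[1] if dated else None


listing = [
    "kubelet-client-2018-06-18-21-53-15.pem",
    "kubelet-client-2018-06-18-21-53-45.pem",
    "kubelet-client-current.pem",
    "kubelet-server-2018-06-18-21-56-52.pem",
    "kubelet-server-current.pem",
]
print(newest_cert(listing, "client"))
print(newest_cert(listing, "server"))
```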

After the CSR approval, the node.kubeconfig file is created at /etc/origin/node/node.kubeconfig.
The kubelet is restarted with the /etc/origin/node/node.kubeconfig file and the certificates in the /etc/origin/node/certificates/ directory, after which point it is ready to join the cluster.

Node Configuration Workflow

Sourcing a node’s configuration uses the following workflow:

Initially the node’s kubelet is started with the bootstrap configuration file, bootstrap-node-config.yaml in the /etc/origin/node/ directory, created at the time of node provisioning.
On each node, the node service file uses the local script openshift-node in the /usr/local/bin/ directory to start the kubelet with the supplied bootstrap-node-config.yaml.
On each master, the directory /etc/origin/node/pods contains pod manifests for apiserver, controller and etcd which are created as static pods on masters.
During cluster installation, a sync DaemonSet is created which creates a sync pod on each node. The sync pod monitors changes in the file /etc/sysconfig/atomic-openshift-node. It specifically watches for BOOTSTRAP_CONFIG_NAME to be set. BOOTSTRAP_CONFIG_NAME is set by the openshift-ansible installer and is the name of the ConfigMap based on the node configuration group the node belongs to.

By default, the installer creates the following node configuration groups:
- node-config-master
- node-config-infra
- node-config-compute
- node-config-all-in-one
- node-config-master-infra

A ConfigMap for each group is created in the **openshift-node** project.

The sync pod extracts the appropriate ConfigMap based on the value set in BOOTSTRAP_CONFIG_NAME.
The sync pod converts the ConfigMap data into kubelet configurations and creates a /etc/origin/node/node-config.yaml for that node host. If a change is made to this file (or it is the file’s initial creation), the kubelet is restarted.
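
The lookup that the sync pod performs on BOOTSTRAP_CONFIG_NAME can be approximated as follows. This Python sketch is not the sync pod's actual code. It assumes /etc/sysconfig/atomic-openshift-node uses shell-style `KEY=value` lines, and the function name `bootstrap_config_name` and the sample content are hypothetical.

```python
# Default node configuration groups created by the installer (from the list above).
DEFAULT_GROUPS = {
    "node-config-master",
    "node-config-infra",
    "node-config-compute",
    "node-config-all-in-one",
    "node-config-master-infra",
}


def bootstrap_config_name(sysconfig_text: str):
    """Return the BOOTSTRAP_CONFIG_NAME value, or None if it is not set."""
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "BOOTSTRAP_CONFIG_NAME":
            # Drop optional surrounding quotes; treat an empty value as unset.
            return value.strip().strip("\"'") or None
    return None


# Hypothetical file content, for illustration only.
sample = "# node settings\nBOOTSTRAP_CONFIG_NAME=node-config-compute\n"
name = bootstrap_config_name(sample)
print(name, name in DEFAULT_GROUPS)
```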

Modifying Node Configurations

A node’s configuration is modified by editing the appropriate ConfigMap in the openshift-node project. The /etc/origin/node/node-config.yaml must not be modified directly.

For example, for a node that is in the node-config-compute group, edit the ConfigMap using:

$ oc edit cm node-config-compute -n openshift-node
