Kubernetes Infrastructure
Overview
Within OKD, Kubernetes manages containerized applications across a set of containers or hosts and provides mechanisms for deployment, maintenance, and application-scaling. The container runtime packages, instantiates, and runs containerized applications. A Kubernetes cluster consists of one or more masters and a set of nodes.
You can optionally configure your masters for high availability (HA) to ensure that the cluster has no single point of failure.
OKD uses Kubernetes 1.11 and Docker 1.13.1.
Masters
The master is the host or hosts that contain the control plane components, including the API server, controller manager server, and etcd. The master manages nodes in its Kubernetes cluster and schedules pods to run on those nodes.
Component | Description |
---|---|
API Server | The Kubernetes API server validates and configures the data for pods, services, and replication controllers. It also assigns pods to nodes and synchronizes pod information with service configuration. |
etcd | etcd stores the persistent master state while other components watch etcd for changes to bring themselves into the desired state. etcd can be optionally configured for high availability, typically deployed with 2n+1 peer services. |
Controller Manager Server | The controller manager server watches etcd for changes to replication controller objects and then uses the API to enforce the desired state. Several such processes create a cluster with one active leader at a time. |
HAProxy | Optional, used when configuring highly-available masters with the native method to balance load between API master endpoints. |
Control Plane Static Pods
The core control plane components, the API server and the controller manager components, run as static pods operated by the kubelet.
For masters that have etcd co-located on the same host, etcd is also moved to static pods. RPM-based etcd is still supported on etcd hosts that are not also masters.
In addition, the node components openshift-sdn and openvswitch are now run using a DaemonSet instead of a systemd service.
Figure 1. Control plane host architecture changes
Even with control plane components running as static pods, master hosts still source their configuration from the /etc/origin/master/master-config.yaml file, as described in the Master and Node Configuration topic.
Startup Sequence Overview
Hyperkube is a binary that contains all of Kubernetes (kube-apiserver, controller-manager, scheduler, proxy, and kubelet). On startup, the kubelet creates the kubepods.slice. Next, the kubelet creates the QoS-level slices burstable.slice and best-effort.slice inside the kubepods.slice. When a pod starts, the kubelet creates a pod-level slice with the format pod<UUID-of-pod>.slice and passes that path to the runtime on the other side of the Container Runtime Interface (CRI). Docker or CRI-O then creates the container-level slices inside the pod-level slice.
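The slice nesting described above can be sketched as a small path-building helper. This is only an illustration using the slice names as they appear in this section; the function name and the QoS class keys are hypothetical, the placement of guaranteed pods directly under kubepods.slice is an assumption not stated here, and the names actually seen on a host depend on the cgroup driver in use.

```python
from pathlib import PurePosixPath

# QoS-level slices as named in this section. Guaranteed pods are assumed
# (not stated in the text) to sit directly under kubepods.slice.
QOS_SLICES = {
    "burstable": "burstable.slice",
    "besteffort": "best-effort.slice",
    "guaranteed": None,
}


def pod_slice_path(qos_class, pod_uid):
    """Build the pod-level slice path that is handed to the container runtime."""
    if qos_class not in QOS_SLICES:
        raise ValueError(f"unknown QoS class: {qos_class}")
    path = PurePosixPath("kubepods.slice")
    qos_slice = QOS_SLICES[qos_class]
    if qos_slice is not None:
        path = path / qos_slice
    # Pod-level slice uses the format pod<UUID-of-pod>.slice
    return str(path / f"pod{pod_uid}.slice")


if __name__ == "__main__":
    print(pod_slice_path("burstable", "1234"))
```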
Mirror Pods
The kubelet on master nodes automatically creates mirror pods on the API server for each of the control plane static pods so that they are visible in the cluster in the kube-system project. Manifests for these static pods are installed by default by the openshift-ansible installer, located in the /etc/origin/node/pods directory on the master host.
These pods have the following hostPath volumes defined:
/etc/origin/master | Contains all certificates, configuration files, and the admin.kubeconfig file. |
/var/lib/origin | Contains volumes and potential core dumps of the binary. |
/etc/origin/cloudprovider | Contains cloud provider specific configuration (AWS, Azure, etc.). |
/usr/libexec/kubernetes/kubelet-plugins | Contains additional third party volume plug-ins. |
/etc/origin/kubelet-plugins | Contains additional third party volume plug-ins for system containers. |
The set of operations you can do on the static pods is limited. For example:
$ oc logs master-api-<hostname> -n kube-system
returns the standard output from the API server. However:
$ oc delete pod master-api-<hostname> -n kube-system
will not actually delete the pod.
As another example, a cluster administrator might want to perform a common operation, such as increasing the loglevel of the API server to provide more verbose data if a problem occurs. To do so, edit the /etc/origin/master/master.env file and modify the --loglevel parameter in the OPTIONS variable, because this value is passed to the process running inside the container. Changes take effect only after the process running inside the container is restarted.
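As a rough sketch of that edit, the following helper rewrites the --loglevel value inside an OPTIONS line. It assumes that master.env holds shell-style KEY=value lines and that OPTIONS already contains a --loglevel=<number> parameter; the exact file layout is not shown in this section, and the function name is hypothetical. It works on text only, so writing the file back and restarting the process remain separate steps.

```python
import re


def set_loglevel(env_text, level):
    """Return env_text with the --loglevel value in OPTIONS replaced by level.

    Assumes a line of the form OPTIONS=... that already contains
    --loglevel=<number>; raises ValueError otherwise.
    """
    lines = env_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("OPTIONS=") and "--loglevel=" in line:
            lines[i] = re.sub(r"--loglevel=\d+", f"--loglevel={level}", line)
            return "\n".join(lines) + "\n"
    raise ValueError("no --loglevel parameter found in OPTIONS")


if __name__ == "__main__":
    print(set_loglevel("OPTIONS=--loglevel=2\n", 4), end="")
```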
Restarting Master Services
To restart control plane services running in control plane static pods, use the master-restart command on the master host.
To restart the master API:
# master-restart api
To restart the controllers:
# master-restart controllers
To restart etcd:
# master-restart etcd
Viewing Master Service Logs
To view logs for control plane services running in control plane static pods, use the master-logs command for the respective component:
# master-logs api api
# master-logs controllers controllers
# master-logs etcd etcd
High Availability Masters
You can optionally configure your masters for high availability (HA) to ensure that the cluster has no single point of failure.
To mitigate concerns about availability of the master, two activities are recommended:
A runbook entry should be created for reconstructing the master. A runbook entry is a necessary backstop for any highly-available service. Additional solutions merely control the frequency that the runbook must be consulted. For example, a cold standby of the master host can adequately fulfill SLAs that require no more than minutes of downtime for creation of new applications or recovery of failed application components.
Use a high availability solution to configure your masters and ensure that the cluster has no single point of failure. The cluster installation documentation provides specific examples using the native HA method and configuring HAProxy. You can also take the concepts and apply them towards your existing HA solutions using the native method instead of HAProxy.
In production OKD clusters, you must maintain high availability of the API Server load balancer. If the API Server load balancer is not available, nodes cannot report their status, all their pods are marked dead, and the pods’ endpoints are removed from the service. In addition to configuring HA for OKD, you must separately configure HA for the API Server load balancer. The strongly preferred approach is to integrate an enterprise load balancer (LB) such as an F5 Big-IP™ or a Citrix Netscaler™ appliance. If such solutions are not available, it is possible to run multiple HAProxy load balancers and use Keepalived to provide a floating virtual IP address for HA. However, this solution is not recommended for production instances.
When using the native HA method with HAProxy, master components have the following availability:
Role | Style | Notes |
---|---|---|
etcd | Active-active | Fully redundant deployment with load balancing. Can be installed on separate hosts or collocated on master hosts. |
API Server | Active-active | Managed by HAProxy. |
Controller Manager Server | Active-passive | One instance is elected as a cluster leader at a time. |
HAProxy | Active-passive | Balances load between API master endpoints. |
While clustered etcd requires an odd number of hosts for quorum, the master services have no quorum or requirement that they have an odd number of hosts. However, since you need at least two master services for HA, it is common to maintain a uniform odd number of hosts when collocating master services and etcd.
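The reason for the odd-number guidance follows from majority-quorum arithmetic, which can be checked with a short sketch. The function names are hypothetical; the formula is the standard majority rule (more than half of the members must agree).

```python
def quorum_size(members):
    """Smallest majority of a cluster with the given number of members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, quorum_size(n), failures_tolerated(n))
```

Going from three to four members raises the quorum from two to three without increasing the number of failures tolerated, which is why 2n+1 peers are the typical choice.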
Nodes
A node provides the runtime environments for containers. Each node in a Kubernetes cluster has the required services to be managed by the master. Nodes also have the required services to run pods, including the container runtime, a kubelet, and a service proxy.
OKD creates nodes from a cloud provider, physical systems, or virtual systems. Kubernetes interacts with node objects that are a representation of those nodes. The master uses the information from node objects to validate nodes with health checks. A node is ignored until it passes the health checks, and the master continues checking nodes until they are valid. The Kubernetes documentation has more information on node statuses and management.
Administrators can manage nodes in an OKD instance using the CLI. To define full configuration and security options when launching node servers, use dedicated node configuration files.
See the cluster limits section for the recommended maximum number of nodes.
Kubelet
Each node has a kubelet that updates the node as specified by a container manifest, which is a YAML file that describes a pod. The kubelet uses a set of manifests to ensure that its containers are started and that they continue to run.
A container manifest can be provided to a kubelet by:
A file path on the command line that is checked every 20 seconds.
An HTTP endpoint passed on the command line that is checked every 20 seconds.
The kubelet watching an etcd server, such as /registry/hosts/$(hostname -f), and acting on any changes.
The kubelet listening for HTTP and responding to a simple API to submit a new manifest.
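The first option, a file path that is re-checked on a fixed interval, can be pictured with the following polling sketch. It is not the kubelet's implementation; the function names and the callback signature are hypothetical, and only the 20-second default comes from the text above.

```python
import hashlib
import time
from pathlib import Path


def snapshot(manifest_dir):
    """Map each manifest file name to a digest of its current contents."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(manifest_dir).iterdir())
        if p.is_file()
    }


def diff_snapshots(old, new):
    """Return (added, changed, removed) file names between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(n for n in new if n in old and new[n] != old[n])
    return added, changed, removed


def watch(manifest_dir, on_change, interval=20.0):
    """Re-check manifest_dir every `interval` seconds and report differences."""
    previous = snapshot(manifest_dir)
    while True:
        time.sleep(interval)
        current = snapshot(manifest_dir)
        added, changed, removed = diff_snapshots(previous, current)
        if added or changed or removed:
            on_change(added, changed, removed)
        previous = current
```

Comparing content digests rather than modification times keeps the sketch independent of filesystem timestamp resolution.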
Service Proxy
Each node also runs a simple network proxy that reflects the services defined in the API on that node. This allows the node to do simple TCP and UDP stream forwarding across a set of back ends.
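To make "forwarding across a set of back ends" concrete, here is a minimal sketch of back-end selection. Round-robin is an assumption chosen for illustration, since this section does not say which selection policy the proxy uses, and the class name and addresses are hypothetical. The sketch covers only the choice of back end, not the socket forwarding itself.

```python
import itertools


class RoundRobinBackends:
    """Hand out back ends for one service in rotating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("a service needs at least one back end")
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        """Return the (host, port) pair that should receive the next stream."""
        return next(self._cycle)


if __name__ == "__main__":
    # Hypothetical pod addresses backing a single service.
    rr = RoundRobinBackends([("10.0.0.1", 8080), ("10.0.0.2", 8080)])
    for _ in range(3):
        print(rr.next_backend())
```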
Node Object Definition
The following is an example node object definition in Kubernetes:
apiVersion: v1 (1)
kind: Node (2)
metadata:
  creationTimestamp: null
  labels: (3)
    kubernetes.io/hostname: node1.example.com
  name: node1.example.com (4)
spec:
  externalID: node1.example.com (5)
status:
  nodeInfo:
    bootID: ""
    containerRuntimeVersion: ""
    kernelVersion: ""
    kubeProxyVersion: ""
    kubeletVersion: ""
    machineID: ""
    osImage: ""
    systemUUID: ""
1 | apiVersion defines the API version to use. |
2 | kind set to Node identifies this as a definition for a node object. |
3 | metadata.labels lists any labels that have been added to the node. |
4 | metadata.name is a required value that defines the name of the node object. This value is shown in the NAME column when running the oc get nodes command. |
5 | spec.externalID defines the fully-qualified domain name where the node can be reached. Defaults to the metadata.name value when empty. |
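The defaulting rule in callout 5 can be expressed as a short helper over a node object loaded into a dictionary. This is an illustrative sketch only; the function name is hypothetical and the dictionary simply mirrors the fields of the example definition.

```python
def effective_external_id(node):
    """Return spec.externalID, falling back to metadata.name when it is empty."""
    name = node["metadata"]["name"]  # metadata.name is required
    spec = node.get("spec") or {}
    return spec.get("externalID") or name


if __name__ == "__main__":
    node = {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {"name": "node1.example.com"},
        "spec": {"externalID": ""},
    }
    print(effective_external_id(node))
```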
Node Bootstrapping
A node’s configuration is bootstrapped from the master, which means nodes pull their pre-defined configuration and client and server certificates from the master. This allows faster node start-up by reducing the differences between nodes, as well as centralizing more configuration and letting the cluster converge on the desired state. Certificate rotation and centralized certificate management are enabled by default.
Figure 2. Node bootstrapping workflow overview
When node services are started, the node checks if the /etc/origin/node/node.kubeconfig file and other node configuration files exist before joining the cluster. If they do not, the node pulls the configuration from the master, then joins the cluster.
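The start-up check described above amounts to a simple decision, sketched below. The function name and the returned action labels are hypothetical. Only node.kubeconfig is listed by default because this section does not enumerate the "other node configuration files"; pass a longer tuple to cover them.

```python
import os

# Only the file named explicitly in this section; extend as needed.
REQUIRED_FILES = ("/etc/origin/node/node.kubeconfig",)


def startup_action(required_files=REQUIRED_FILES, exists=os.path.isfile):
    """Decide whether the node can join directly or must pull its config first.

    Returns an (action, missing_files) pair. `exists` is injectable so the
    decision can be exercised without touching the real filesystem.
    """
    missing = [path for path in required_files if not exists(path)]
    if missing:
        return "pull-config-then-join", missing
    return "join", []


if __name__ == "__main__":
    print(startup_action())
```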
ConfigMaps are used to store the node configuration in the cluster, which populates the configuration file on the node host at /etc/origin/node/node-config.yaml. For definitions of the set of default node groups and their ConfigMaps, see Defining Node Groups and Host Mappings in Installing Clusters.
Node Bootstrap Workflow
The process for automatic node bootstrapping uses the following workflow:
By default during cluster installation, a set of clusterrole, clusterrolebinding, and serviceaccount objects are created for use in node bootstrapping:
The system:node-bootstrapper cluster role is used for creating certificate signing requests (CSRs) during node bootstrapping:
# oc describe clusterrole.authorization.openshift.io/system:node-bootstrapper
Name: system:node-bootstrapper
Created: 17 hours ago
Labels: kubernetes.io/bootstrapping=rbac-defaults
Annotations: authorization.openshift.io/system-only=true
openshift.io/reconcile-protect=false
Verbs Non-Resource URLs Resource Names API Groups Resources
[create get list watch] [] [] [certificates.k8s.io] [certificatesigningrequests]
The following node-bootstrapper service account is created in the openshift-infra project:
# oc describe sa node-bootstrapper -n openshift-infra
Name: node-bootstrapper
Namespace: openshift-infra
Labels: <none>
Annotations: <none>
Image pull secrets: node-bootstrapper-dockercfg-f2n8r
Mountable secrets: node-bootstrapper-token-79htp
node-bootstrapper-dockercfg-f2n8r
Tokens: node-bootstrapper-token-79htp
node-bootstrapper-token-mqn2q
Events: <none>
The following system:node-bootstrapper cluster role binding is for the node bootstrapper cluster role and service account:
# oc describe clusterrolebindings system:node-bootstrapper
Name: system:node-bootstrapper
Created: 17 hours ago
Labels: <none>
Annotations: openshift.io/reconcile-protect=false
Role: /system:node-bootstrapper
Users: <none>
Groups: <none>
ServiceAccounts: openshift-infra/node-bootstrapper
Subjects: <none>
Verbs Non-Resource URLs Resource Names API Groups Resources
[create get list watch] [] [] [certificates.k8s.io] [certificatesigningrequests]
Also by default during cluster installation, the openshift-ansible installer creates an OKD certificate authority and various other certificates, keys, and kubeconfig files in the /etc/origin/master directory. Two files of note are:
/etc/origin/master/admin.kubeconfig Uses the system:admin user.
/etc/origin/master/bootstrap.kubeconfig Used for bootstrapping nodes other than masters.
The /etc/origin/master/bootstrap.kubeconfig is created when the installer uses the node-bootstrapper service account as follows:
$ oc --config=/etc/origin/master/admin.kubeconfig \
serviceaccounts create-kubeconfig node-bootstrapper \
-n openshift-infra
On master nodes, the /etc/origin/master/admin.kubeconfig is used as a bootstrapping file and is copied to /etc/origin/node/bootstrap.kubeconfig. On non-master nodes, the /etc/origin/master/bootstrap.kubeconfig file is copied to /etc/origin/node/bootstrap.kubeconfig on each node host.
The /etc/origin/master/bootstrap.kubeconfig is then passed to the kubelet using the --bootstrap-kubeconfig flag as follows:
--bootstrap-kubeconfig=/etc/origin/node/bootstrap.kubeconfig
The kubelet is first started with the supplied /etc/origin/node/bootstrap.kubeconfig file. After initial connection internally, the kubelet creates certificate signing requests (CSRs) and sends them to the master.
The CSRs are verified and approved via the controller manager (specifically the certificate signing controller). If approved, the kubelet client and server certificates are created in the /etc/origin/node/certificates directory. For example:
# ls -al /etc/origin/node/certificates/
total 12
drwxr-xr-x. 2 root root 212 Jun 18 21:56 .
drwx------. 4 root root 213 Jun 19 15:18 ..
-rw-------. 1 root root 2826 Jun 18 21:53 kubelet-client-2018-06-18-21-53-15.pem
-rw-------. 1 root root 1167 Jun 18 21:53 kubelet-client-2018-06-18-21-53-45.pem
lrwxrwxrwx. 1 root root 68 Jun 18 21:53 kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2018-06-18-21-53-45.pem
-rw-------. 1 root root 1447 Jun 18 21:56 kubelet-server-2018-06-18-21-56-52.pem
lrwxrwxrwx. 1 root root 68 Jun 18 21:56 kubelet-server-current.pem -> /etc/origin/node/certificates/kubelet-server-2018-06-18-21-56-52.pem
After the CSR approval, the node.kubeconfig file is created at /etc/origin/node/node.kubeconfig.
The kubelet is restarted with the /etc/origin/node/node.kubeconfig file and the certificates in the /etc/origin/node/certificates/ directory, after which point it is ready to join the cluster.
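When inspecting the certificates directory listed in the workflow above, it can help to sort the rotated files by the timestamp embedded in their names. The sketch below assumes the name pattern kubelet-<client|server>-YYYY-MM-DD-HH-MM-SS.pem, inferred only from that example listing; the function names are hypothetical. The authoritative current certificate is whatever the kubelet-client-current.pem and kubelet-server-current.pem symlinks point to.

```python
import re
from datetime import datetime

# Pattern inferred from the example listing; treated as an assumption.
CERT_NAME = re.compile(
    r"^kubelet-(client|server)-(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.pem$"
)


def parse_cert_name(filename):
    """Return (kind, timestamp) for a rotated certificate file, else None."""
    match = CERT_NAME.match(filename)
    if match is None:
        return None  # e.g. the *-current.pem symlinks
    stamp = datetime.strptime(match.group(2), "%Y-%m-%d-%H-%M-%S")
    return match.group(1), stamp


def newest(filenames, kind):
    """Return the most recently stamped file name of the given kind, or None."""
    dated = []
    for name in filenames:
        parsed = parse_cert_name(name)
        if parsed is not None and parsed[0] == kind:
            dated.append((parsed[1], name))
    return max(dated)[1] if dated else None
```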
Node Configuration Workflow
Sourcing a node’s configuration uses the following workflow:
Initially the node’s kubelet is started with the bootstrap configuration file, bootstrap-node-config.yaml in the /etc/origin/node/ directory, created at the time of node provisioning.
On each node, the node service file uses the local script openshift-node in the /usr/local/bin/ directory to start the kubelet with the supplied bootstrap-node-config.yaml.
On each master, the directory /etc/origin/node/pods contains pod manifests for apiserver, controller and etcd which are created as static pods on masters.
During cluster installation, a sync DaemonSet is created which creates a sync pod on each node. The sync pod monitors changes in the file /etc/sysconfig/atomic-openshift-node. It specifically watches for BOOTSTRAP_CONFIG_NAME to be set. BOOTSTRAP_CONFIG_NAME is set by the openshift-ansible installer and is the name of the ConfigMap based on the node configuration group the node belongs to.
By default, the installer creates the following node configuration groups:
node-config-master
node-config-infra
node-config-compute
node-config-all-in-one
node-config-master-infra
A ConfigMap for each group is created in the openshift-node project.
The sync pod extracts the appropriate ConfigMap based on the value set in BOOTSTRAP_CONFIG_NAME.
The sync pod converts the ConfigMap data into kubelet configurations and creates a /etc/origin/node/node-config.yaml for that node host. If a change is made to this file (or it is the file’s initial creation), the kubelet is restarted.
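The lookup step in this workflow, reading BOOTSTRAP_CONFIG_NAME and mapping it to a ConfigMap name, can be sketched as follows. This is not the sync pod's actual code. It assumes /etc/sysconfig/atomic-openshift-node uses shell-style KEY=value lines, and the function names are hypothetical. The group names are the installer defaults listed above.

```python
DEFAULT_GROUPS = (
    "node-config-master",
    "node-config-infra",
    "node-config-compute",
    "node-config-all-in-one",
    "node-config-master-infra",
)


def bootstrap_config_name(sysconfig_text):
    """Return the value of BOOTSTRAP_CONFIG_NAME, or None if it is not set."""
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "BOOTSTRAP_CONFIG_NAME":
            value = value.strip().strip('"').strip("'")
            return value or None
    return None


def is_default_group(name):
    """True when the name matches one of the installer's default groups."""
    return name in DEFAULT_GROUPS
```

A custom group name is still a valid value; the default-group check is only a convenience for spotting typos.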
Modifying Node Configurations
A node’s configuration is modified by editing the appropriate ConfigMap in the openshift-node project. The /etc/origin/node/node-config.yaml must not be modified directly.
For example, for a node that is in the node-config-compute group, edit the ConfigMap using:
$ oc edit cm node-config-compute -n openshift-node