Recommended Host Practices

Recommended Practices for OKD Master Hosts

In addition to pod traffic, the most-used data-path in an OKD infrastructure is between the OKD master hosts and etcd. The OKD API server (part of the master binary) consults etcd for node status, network configuration, secrets, and more.

Optimize this traffic path by:

  • Running etcd on master hosts. By default, etcd runs in a static pod on all master hosts.

  • Ensuring an uncongested, low latency LAN communication link between master hosts.

The OKD master caches deserialized versions of resources aggressively to ease CPU load. However, in smaller clusters of less than 1000 pods, this cache can waste a lot of memory for negligible CPU load reduction. The default cache size is 50,000 entries, which, depending on the size of your resources, can grow to occupy 1 to 2 GB of memory. This cache size can be reduced using the following setting the in /etc/origin/master/master-config.yaml:

  1. kubernetesMasterConfig:
  2. apiServerArguments:
  3. deserialization-cache-size:
  4. - "1000"

The number of client requests or API calls that are sent to the API server is determined by the Queries per second (QPS) value and the number of concurrent requests that can be processed by the API server is determined by the maxRequestsInFlight setting. The number of requests the client can make in excess of the QPS rate depends on the burst value, this is helpful for applications that are bursty in nature and can make irregular number of requests. The Response times for requests might have high latencies when there are large numbers of concurrent requests being handled by the API server especially for large and/or dense clusters. It is recommended to monitor the apiserver_request_count rate metric in Prometheus and adjust the maxRequestsInFlight and QPS accordingly.

There needs to be a good balance when changing the default values as the CPU and memory consumption of API server and etcd IOPS will increase when it is handling more requests in parallel. Also note that heavy non-watch requests might overload the API server as they get cancelled after a fixed 60 second timeout and the client starts retrying.

Provided sufficient CPU and memory resources are available on the API server system(s), the API server requests overloading issue can be safely alleviated by taking into account the factors mentioned above and bumping up the maxRequestsInFlight, API qps and burst values in the /etc/origin/master/master-config.yaml file:

  1. masterClients:
  2. openshiftLoopbackClientConnectionOverrides:
  3. burst: 600
  4. qps: 300
  5. servingInfo:
  6. maxRequestsInFlight: 500

The maxRequestsInFlight, qps and burst values above are defaults for OKD. The qps can be higher than maxRequestsInFlight value if the requests take less than a second. If `maxRequestsInFlight’ is set to zero, there is no limit on the number of concurrent requests that can be processed by the server.

Recommended Practices for OKD Node Hosts

The OKD node configuration file contains important options, such as the iptables synchronization period, the Maximum Transmission Unit (MTU) of the SDN network, and the proxy-mode. To configure your nodes, modify the appropriate node configuration map.

Do not edit the node-config.yaml file directly.

The node configuration file allows you to pass arguments to the kubelet (node) process. You can view a list of possible options by running kubelet --help.

Not all kubelet options are supported by OKD, and are used in the upstream Kubernetes. This means certain options are in limited support.

See the Cluster maximums page for the maximum supported limits for each version of OKD.

In the /etc/origin/node/node-config.yaml file, two parameters control the maximum number of pods that can be scheduled to a node: pods-per-core and max-pods. When both options are in use, the lower of the two limits the number of pods on a node. Exceeding these values can result in:

  • Increased CPU utilization on both OKD and Docker.

  • Slow pod scheduling.

  • Potential out-of-memory scenarios (depends on the amount of memory in the node).

  • Exhausting the pool of IP addresses.

  • Resource overcommitting, leading to poor user application performance.

In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running.

pods-per-core sets the number of pods the node can run based on the number of processor cores on the node. For example, if pods-per-core is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.

  1. kubeletArguments:
  2. pods-per-core:
  3. - "10"

Setting pods-per-core to 0 disables this limit.

max-pods sets the number of pods the node can run to a fixed value, regardless of the properties of the node. Cluster Limits documents maximum supported values for max-pods.

  1. kubeletArguments:
  2. max-pods:
  3. - "250"

Using the above example, the default value for pods-per-core is 10 and the default value for max-pods is 250. This means that unless the node has 25 cores or more, by default, pods-per-core will be the limiting factor.

See the Sizing Considerations section in the installation documentation for the recommended limits for an OKD cluster. The recommended sizing accounts for OKD and container engine coordination for container status updates. This coordination puts CPU pressure on the master and container engine processes, which can include writing a large amount of log data.

The rate at which kubelet talks to API server depends on qps and burst values. The default values are good enough if there are limited pods running on each node. Provided there are enough CPU and memory resources on the node, the qps and burst values can be tweaked in the /etc/origin/node/node-config.yaml file:

  1. kubeletArguments:
  2. kube-api-qps:
  3. - "20"
  4. kube-api-burst:
  5. - "40"

The qps and burst values above are defaults for OKD.

Recommended Practices for OKD etcd Hosts

etcd is a distributed key-value store that OKD uses for configuration.

OKD Version

etcd version

storage schema version

3.3 and earlier

2.x

v2

3.4 and 3.5

3.x

v2

3.6

3.x

v2 (upgrades)

3.6

3.x

v3 (new installations)

3.7 and later

3.x

v3

etcd 3.x introduces important scalability and performance improvements that reduce CPU, memory, network, and disk requirements for any size cluster. etcd 3.x also implements a backwards compatible storage API that facilitates a two-step migration of the on-disk etcd database. For migration purposes, the storage mode used by etcd 3.x in OKD 3.5 remained in v2 mode. As of OKD 3.6, new installations use storage mode v3. Upgrades from previous versions of OKD will not automatically migrate data from v2 to v3. You must use the supplied playbooks and follow the documented process to migrate the data.

Version 3 of etcd implements a backwards compatible storage API that facilitates a two-step migration of the on-disk etcd database. For migration purposes, the storage mode used by etcd 3.x in OKD 3.5 remained in v2 mode. As of OKD 3.6, new installations use storage mode v3. As part of the process to upgrade to OKD 3.7, you upgrade your etcd storage API to v3, if required. In versions 3.7 and later, you must use the v3 API.

In addition to changing the storage mode for new installs to v3, OKD 3.6 also begins enforcing quorum reads for all OKD types. This is done to ensure that queries against etcd do not return stale data. In single-node etcd clusters, stale data is not a concern. In highly available etcd deployments typically found in production clusters, quorum reads ensure valid query results. A quorum read is linearizable in database terms - every client sees the latest updated state of the cluster, and all clients see the same sequence of reads and writes. See the etcd 3.1 announcement for more information on performance improvements.

It is important to note that OKD uses etcd for storing additional information beyond what Kubernetes itself requires. For example, OKD stores information about images, builds, and other components in etcd, as is required by features that OKD adds on top of Kubernetes. Ultimately, this means that guidance around performance and sizing for etcd hosts will differ from Kubernetes and other recommendations in salient ways. Red Hat tests etcd scalability and performance with the OKD use-case and parameters in mind to generate the most accurate recommendations.

Performance improvements were quantified using a 300-node OKD 3.6 cluster using the cluster-loader utility. Comparing etcd 3.x (storage mode v2) versus etcd 3.x (storage mode v3), clear improvements are identified in the charts below.

Storage IOPS under load is significantly reduced:

Full Run IOPS

Storage IOPS in steady state is also significantly reduced:

Steady State IOPS

Viewing the same I/O data, plotting the average IOPS in both modes:

Read+Write IOPS

CPU utilization by both the API server (master) and etcd processes is reduced:

CPU Usage

Memory utilization by both the API server (master) and etcd processes is also reduced:

Memory Usage

After profiling etcd under OKD, etcd frequently performs small amounts of storage input and output. Using etcd with storage that handles small read/write operations quickly, such as SSD, is highly recommended.

Looking at the size I/O operations done by a 3-node cluster of etcd 3.1 (using storage v3 mode and with quorum reads enforced), read sizes are as follows:

Histogram of etcd I/O sizes

And writes:

Histogram of etcd I/O sizes

etcd processes are typically memory intensive. Master / API server processes are CPU intensive. This makes them a reasonable co-location pair within a single machine or virtual machine (VM). Optimize communication between etcd and master hosts either by co-locating them on the same host, or providing a dedicated network.

Providing Storage to an etcd Node Using PCI Passthrough with OpenStack

To provide fast storage to an etcd node so that etcd is stable at large scale, use PCI passthrough to pass a non-volatile memory express (NVMe) device directly to the etcd node. To set this up with Red Hat OpenStack 11 or later, complete the following on the OpenStack compute nodes where the PCI device exists.

  1. Ensure Intel Vt-x is enabled in BIOS.

  2. Enable the input–output memory management unit (IOMMU). In the /etc/sysconfig/grub file, add intel_iommu=on iommu=pt to the end of the GRUB_CMDLINX_LINUX line, within the quotation marks.

  3. Regenerate /etc/grub2.cfg by running:

    1. $ grub2-mkconfig -o /etc/grub2.cfg
  4. Reboot the system.

  5. On controllers in /etc/nova.conf:

    1. [filter_scheduler]
    2. enabled_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter
    3. available_filters=nova.scheduler.filters.all_filters
    4. [pci]
    5. alias = { "vendor_id":"144d", "product_id":"a820",
    6. "device_type":"type-PCI", "name":"nvme" }
  6. Restart nova-api and nova-scheduler on the controllers.

  7. On compute nodes in /etc/nova/nova.conf:

    1. [pci]
    2. passthrough_whitelist = { "address": "0000:06:00.0" }
    3. alias = { "vendor_id":"144d", "product_id":"a820",
    4. "device_type":"type-PCI", "name":"nvme" }

    To retrieve the required address, vendor_id, and product_id values of the NVMe device you want to passthrough, run:

    1. # lspci -nn | grep devicename
  8. Restart nova-compute on the compute nodes.

  9. Configure the OpenStack version you are running to use the NVMe and launch the etcd node.

Scaling Hosts Using the Tuned Profile

Tuned is a tuning profile delivery mechanism enabled by default in Red Hat Enterprise Linux (RHEL) and other Red Hat products. Tuned customizes Linux settings, such as sysctls, power management, and kernel command line options, to optimize the operating system for different workload performance and scalability requirements.

OKD leverages the tuned daemon and includes Tuned profiles called openshift, openshift-node and openshift-control-plane. These profiles safely increase some of the commonly encountered vertical scaling limits present in the kernel, and are automatically applied to your system during installation.

The Tuned profiles support inheritance between profiles. They also support an auto-parent functionality which selects a parent profile based on whether the profile is used in a virtual environment. The openshift profile uses both of these features and is a parent of openshift-node and openshift-control-plane profiles. It contains tuning relevant to both OKD application nodes and control plane nodes respectively. The openshift-node and openshift-control-plane profiles are set on application and control plane nodes respectively.

The profile hierarchy with the openshift profile as a parent ensures the tuning delivered to the OKD system is a union of throughput-performance (the default for RHEL) for bare metal hosts and virtual-guest for RHEL and atomic-guest for RHEL Atomic Host nodes.

To see which Tuned profile is enabled on your system, run:

  1. # tuned-adm active
  2. Current active profile: openshift-node

See the Red Hat Enterprise Linux Performance Tuning Guide for more information about Tuned.