Environment health checks
You are viewing documentation for a release that is no longer supported. The latest supported version of version 3 is [3.11]. For the most recent version 4, see [4]
This topic contains steps to verify the overall health of the OKD cluster and its various components, and describes the intended behavior.
Knowing the verification process for the various components is the first step to troubleshooting issues. If you experience issues, use the checks provided in this section to diagnose the problem.
Checking complete environment health
To verify the end-to-end functionality of an OKD cluster, build and deploy an example application.
Procedure
Create a new project named validate, as well as an example application from the cakephp-mysql-example template:
$ oc new-project validate
$ oc new-app cakephp-mysql-example
You can check the logs to follow the build:
$ oc logs -f bc/cakephp-mysql-example
Once the build is complete, two pods should be running: a database and an application:
$ oc get pods
NAME READY STATUS RESTARTS AGE
cakephp-mysql-example-1-build 0/1 Completed 0 1m
cakephp-mysql-example-2-247xm 1/1 Running 0 39s
mysql-1-hbk46 1/1 Running 0 1m
Visit the application URL. The Cake PHP framework welcome page should be visible. The URL should have the following format: cakephp-mysql-example-validate.<app_domain>
Once the functionality has been verified, the validate project can be deleted:
$ oc delete project validate
All resources within the project will be deleted as well.
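The pod status check in the procedure above can also be scripted. The following is a minimal sketch, not part of the product: pods_not_ok is a hypothetical helper name, and it assumes the default oc get pods column layout shown above (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `oc get pods` output on stdin and prints the
# name of every pod whose STATUS is neither Running nor Completed.
# Returns non-zero if at least one such pod is found.
pods_not_ok() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1; bad = 1 }
       END { exit bad + 0 }'
}

# Example use against the validate project:
#   oc get pods -n validate | pods_not_ok && echo "all pods healthy"
```

The helper only inspects the STATUS column, so a pod that is Running but not yet ready (for example 0/1 in the READY column) is not reported.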
Creating alerts using Prometheus
You can integrate OKD with Prometheus to create visuals and alerts that help diagnose environment issues before they arise. These issues can include a node going down, a pod consuming too much CPU or memory, and more.
See the Prometheus on OpenShift Container Platform section in the Installation and configuration guide for more information.
Prometheus on OKD is a Technology Preview feature only.
Host health
To verify that the cluster is up and running, connect to a master instance, and run the following:
$ oc get nodes
NAME STATUS AGE VERSION
ocp-infra-node-1clj Ready 1h v1.6.1+5115d708d7
ocp-infra-node-86qr Ready 1h v1.6.1+5115d708d7
ocp-infra-node-g8qw Ready 1h v1.6.1+5115d708d7
ocp-master-94zd Ready,SchedulingDisabled 1h v1.6.1+5115d708d7
ocp-master-gjkm Ready,SchedulingDisabled 1h v1.6.1+5115d708d7
ocp-master-wc8w Ready,SchedulingDisabled 1h v1.6.1+5115d708d7
ocp-node-c5dg Ready 1h v1.6.1+5115d708d7
ocp-node-ghxn Ready 1h v1.6.1+5115d708d7
ocp-node-w135 Ready 1h v1.6.1+5115d708d7
The above cluster example consists of three master hosts, three infrastructure node hosts, and three node hosts. All of them are running. All hosts in the cluster should be visible in this output.
The Ready status means that master hosts can communicate with node hosts and that the nodes are ready to run pods (excluding the nodes in which scheduling is disabled).
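To spot hosts that are not in the Ready state without reading the table by eye, the output can be filtered. This is a minimal sketch under the assumption that the STATUS value is the second column, as in the output above; nodes_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get nodes` output on stdin and prints the
# names of nodes whose STATUS does not start with "Ready".
# "Ready,SchedulingDisabled" is treated as ready; "NotReady" is reported.
nodes_not_ready() {
  awk 'NR > 1 { split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Example use on a master instance:
#   oc get nodes | nodes_not_ready
```

An empty result means every listed host reports Ready. It does not prove that all expected hosts are listed, so the host count still needs to be compared with the cluster inventory.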
Before you run etcd commands, source the etcd.conf file:
# source /etc/etcd/etcd.conf
You can check the basic etcd health status from any master instance with the etcdctl command:
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \
--ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS cluster-health
member 59df5107484b84df is healthy: got healthy result from https://10.156.0.5:2379
member 6df7221a03f65299 is healthy: got healthy result from https://10.156.0.6:2379
member fea6dfedf3eecfa3 is healthy: got healthy result from https://10.156.0.9:2379
cluster is healthy
To get more information about etcd hosts, including the associated master host, list the cluster members:
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \
--ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS member list
295750b7103123e0: name=ocp-master-zh8d peerURLs=https://10.156.0.7:2380 clientURLs=https://10.156.0.7:2379 isLeader=true
b097a72f2610aea5: name=ocp-master-qcg3 peerURLs=https://10.156.0.11:2380 clientURLs=https://10.156.0.11:2379 isLeader=false
fea6dfedf3eecfa3: name=ocp-master-j338 peerURLs=https://10.156.0.9:2380 clientURLs=https://10.156.0.9:2379 isLeader=false
All etcd hosts should contain the master host name if the etcd cluster is co-located with master services, or all etcd instances should be visible if etcd is running separately.
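The member list output above can be reduced to the name of the current leader. The following is a minimal sketch that assumes the name=... and isLeader=... fields appear exactly as in the example output; etcd_leader is a hypothetical helper name.

```shell
# Hypothetical helper: reads `etcdctl ... member list` output on stdin and
# prints the name of each member that reports isLeader=true.
etcd_leader() {
  awk '/isLeader=true/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^name=/) { sub(/^name=/, "", $i); print $i }
       }'
}

# Example use, with the same etcdctl options as in the command above:
#   etcdctl <options> member list | etcd_leader
```

A healthy cluster is expected to print exactly one name. No output, or more than one line, is worth investigating.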
Router and registry health
To check if a router service is running:
$ oc -n default get deploymentconfigs/router
NAME REVISION DESIRED CURRENT TRIGGERED BY
router 1 3 3 config
The values in the DESIRED and CURRENT columns should match the number of node hosts.
Use the same command to check the registry status:
$ oc -n default get deploymentconfigs/docker-registry
NAME REVISION DESIRED CURRENT TRIGGERED BY
docker-registry 1 3 3 config
Multiple running instances of the container registry require backend storage supporting writes by multiple processes. If the chosen infrastructure provider does not contain this ability, running a single instance of a container registry is acceptable.
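The comparison of the DESIRED and CURRENT columns for the router and the registry can be automated. This is a minimal sketch that assumes the column order shown in the outputs above (NAME, REVISION, DESIRED, CURRENT); dc_in_sync is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get deploymentconfigs/<name>` output on
# stdin and reports every row where DESIRED and CURRENT differ.
# Returns non-zero if any mismatch is found.
dc_in_sync() {
  awk 'NR > 1 && $3 != $4 { print $1 ": desired=" $3 " current=" $4; bad = 1 }
       END { exit bad + 0 }'
}

# Example use:
#   oc -n default get deploymentconfigs/router | dc_in_sync
#   oc -n default get deploymentconfigs/docker-registry | dc_in_sync
```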
To verify that all pods are running and on which hosts:
$ oc -n default get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
docker-registry-1-54nhl 1/1 Running 0 2d 172.16.2.3 ocp-infra-node-tl47
docker-registry-1-jsm2t 1/1 Running 0 2d 172.16.8.2 ocp-infra-node-62rc
docker-registry-1-qbt4g 1/1 Running 0 2d 172.16.14.3 ocp-infra-node-xrtz
registry-console-2-gbhcz 1/1 Running 0 2d 172.16.8.4 ocp-infra-node-62rc
router-1-6zhf8 1/1 Running 0 2d 10.156.0.4 ocp-infra-node-62rc
router-1-ffq4g 1/1 Running 0 2d 10.156.0.10 ocp-infra-node-tl47
router-1-zqxbl 1/1 Running 0 2d 10.156.0.8 ocp-infra-node-xrtz
If OKD is using an external container registry, the internal registry service does not need to be running.
Network connectivity
Network connectivity has two main networking layers: the cluster network for node interaction, and the software defined network (SDN) for pod interaction. OKD supports multiple network configurations, often optimized for a specific infrastructure provider.
Due to the complexity of networking, not all verification scenarios are covered in this section.
Connectivity on master hosts
etcd and master hosts
Master services keep their state synchronized using the etcd key-value store. Communication between master and etcd services is important, whether those etcd services are collocated on master hosts, or running on hosts designated only for the etcd service. This communication happens on TCP ports 2379 and 2380. See the Host health section for methods to check this communication.
SkyDNS
SkyDNS provides name resolution of local services running in OKD. This service uses TCP and UDP port 8053.
To verify the name resolution:
$ dig +short docker-registry.default.svc.cluster.local
172.30.150.7
If the answer matches the output of the following, the SkyDNS service is working correctly:
$ oc get svc/docker-registry -n default
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry 172.30.150.7 <none> 5000/TCP 3d
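The two outputs above can be compared in a script rather than by eye. The following is a minimal sketch that assumes the CLUSTER-IP value is the second column of the service listing; svc_cluster_ip is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get svc/<name>` output on stdin and prints
# the CLUSTER-IP column of the first service row.
svc_cluster_ip() {
  awk 'NR == 2 { print $2 }'
}

# Example use: compare the resolved address with the service address.
#   resolved=$(dig +short docker-registry.default.svc.cluster.local)
#   expected=$(oc get svc/docker-registry -n default | svc_cluster_ip)
#   [ "$resolved" = "$expected" ] && echo "SkyDNS answer matches"
```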
API service and web console
Both the API service and web console share the same port, usually TCP 8443 or 443, depending on the setup. This port needs to be available within the cluster and to everyone who needs to work with the deployed environment. The URLs under which this port is reachable may differ for the internal cluster and for external clients.
In the following example, the https://internal-master.example.com:443 URL is used by the internal cluster, and the https://master.example.com:443 URL is used by external clients. On any node host:
$ curl -k https://internal-master.example.com:443/version
{
"major": "1",
"minor": "6",
"gitVersion": "v1.6.1+5115d708d7",
"gitCommit": "fff65cf",
"gitTreeState": "clean",
"buildDate": "2017-10-11T22:44:25Z",
"goVersion": "go1.7.6",
"compiler": "gc",
"platform": "linux/amd64"
}
This must be reachable from the client's network:
$ curl -k https://master.example.com:443/healthz
ok
Connectivity on node instances
The SDN connecting pod communication on nodes uses UDP port 4789 by default.
To verify node host functionality, create a new application. The following example ensures the node reaches the docker registry, which is running on an infrastructure node:
Procedure
Create a new project:
$ oc new-project sdn-test
Deploy an httpd application:
$ oc new-app centos/httpd-24-centos7~https://github.com/sclorg/httpd-ex
Wait until the build is complete:
$ oc get pods
NAME READY STATUS RESTARTS AGE
httpd-ex-1-205hz 1/1 Running 0 34s
httpd-ex-1-build 0/1 Completed 0 1m
Connect to the running pod:
$ oc rsh po/<pod-name>
For example:
$ oc rsh po/httpd-ex-1-205hz
Check the healthz path of the internal registry service:
$ curl -kv https://docker-registry.default.svc.cluster.local:5000/healthz
* About to connect() to docker-registry.default.svc.cluster.local port 5000 (#0)
* Trying 172.30.150.7...
* Connected to docker-registry.default.svc.cluster.local (172.30.150.7) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=172.30.150.7
* start date: Nov 30 17:21:51 2017 GMT
* expire date: Nov 30 17:21:52 2019 GMT
* common name: 172.30.150.7
* issuer: CN=openshift-signer@1512059618
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: docker-registry.default.svc.cluster.local:5000
> Accept: */*
>
< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Date: Mon, 04 Dec 2017 16:26:49 GMT
< Content-Length: 0
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host docker-registry.default.svc.cluster.local left intact
sh-4.2$ exit
The HTTP/1.1 200 OK response means the node is correctly connecting.
Clean up the test project:
$ oc delete project sdn-test
project "sdn-test" deleted
The node host is listening on TCP port 10250. This port needs to be reachable by all master hosts on any node, and if monitoring is deployed in the cluster, the infrastructure nodes must have access to this port on all instances as well. Broken communication on this port can be detected with the following command:
$ oc get nodes
NAME STATUS AGE VERSION
ocp-infra-node-1clj Ready 4d v1.6.1+5115d708d7
ocp-infra-node-86qr Ready 4d v1.6.1+5115d708d7
ocp-infra-node-g8qw Ready 4d v1.6.1+5115d708d7
ocp-master-94zd Ready,SchedulingDisabled 4d v1.6.1+5115d708d7
ocp-master-gjkm Ready,SchedulingDisabled 4d v1.6.1+5115d708d7
ocp-master-wc8w Ready,SchedulingDisabled 4d v1.6.1+5115d708d7
ocp-node-c5dg Ready 4d v1.6.1+5115d708d7
ocp-node-ghxn Ready 4d v1.6.1+5115d708d7
ocp-node-w135 NotReady 4d v1.6.1+5115d708d7
In the output above, the node service on the ocp-node-w135 node is not reachable by the master services, which is represented by its NotReady status.
The last service is the router, which is responsible for routing connections to the correct services running in the OKD cluster. Routers listen on TCP ports 80 and 443 on infrastructure nodes for ingress traffic. Before routers can start working, DNS must be configured:
$ dig *.apps.example.com
; <<>> DiG 9.11.1-P3-RedHat-9.11.1-8.P3.fc27 <<>> *.apps.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45790
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;*.apps.example.com. IN A
;; ANSWER SECTION:
*.apps.example.com. 3571 IN CNAME apps.example.com.
apps.example.com. 3561 IN A 35.xx.xx.92
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Dec 05 16:03:52 CET 2017
;; MSG SIZE rcvd: 105
The IP address, in this case 35.xx.xx.92, should be pointing to the load balancer distributing ingress traffic to all infrastructure nodes. To verify the functionality of the routers, check the registry service once more, but this time from outside the cluster:
$ curl -kv https://docker-registry-default.apps.example.com/healthz
* Trying 35.xx.xx.92...
* TCP_NODELAY set
* Connected to docker-registry-default.apps.example.com (35.xx.xx.92) port 443 (#0)
...
< HTTP/2 200
< cache-control: no-cache
< content-type: text/plain; charset=utf-8
< content-length: 0
< date: Tue, 05 Dec 2017 15:13:27 GMT
<
* Connection #0 to host docker-registry-default.apps.example.com left intact
Storage
Master instances need at least 40 GB of hard disk space for the /var directory. Check the disk usage of a master host using the df command:
$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda1 xfs 45G 2.8G 43G 7% /
devtmpfs devtmpfs 3.6G 0 3.6G 0% /dev
tmpfs tmpfs 3.6G 0 3.6G 0% /dev/shm
tmpfs tmpfs 3.6G 63M 3.6G 2% /run
tmpfs tmpfs 3.6G 0 3.6G 0% /sys/fs/cgroup
tmpfs tmpfs 732M 0 732M 0% /run/user/1000
tmpfs tmpfs 732M 0 732M 0% /run/user/0
Node instances need at least 15 GB space for the /var directory, and at least another 15 GB for Docker storage (/var/lib/docker in this case). Depending on the size of the cluster and the amount of ephemeral storage desired for pods, a separate partition should be created for /var/lib/origin/openshift.local.volumes on the nodes.
$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda1 xfs 25G 2.4G 23G 10% /
devtmpfs devtmpfs 3.6G 0 3.6G 0% /dev
tmpfs tmpfs 3.6G 0 3.6G 0% /dev/shm
tmpfs tmpfs 3.6G 147M 3.5G 4% /run
tmpfs tmpfs 3.6G 0 3.6G 0% /sys/fs/cgroup
/dev/sdb xfs 25G 2.7G 23G 11% /var/lib/docker
/dev/sdc xfs 50G 33M 50G 1% /var/lib/origin/openshift.local.volumes
tmpfs tmpfs 732M 0 732M 0% /run/user/1000
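The size requirements above (40 GB for /var on masters, 15 GB for /var and 15 GB for Docker storage on nodes) can be checked with a small script. This is a minimal sketch that assumes the requirement refers to the total size of the file system and that 1 GB is counted as 1048576 KB; fs_size_at_least is a hypothetical helper name. It reads the POSIX df -Pk format, where the second column is the size in 1024-byte blocks.

```shell
# Hypothetical helper: reads `df -Pk <path>` output on stdin and succeeds
# only if the file system size is at least $1 GB (counted as 1048576 KB).
fs_size_at_least() {
  awk -v min="$1" 'NR == 2 { ok = ($2 >= min * 1024 * 1024) }
                   END { exit (ok ? 0 : 1) }'
}

# Example use on a master host:
#   df -Pk /var | fs_size_at_least 40 || echo "/var is smaller than 40 GB"
# Example use on a node host:
#   df -Pk /var/lib/docker | fs_size_at_least 15 || echo "Docker storage is smaller than 15 GB"
```

If the requirement is read as free space instead of total size, the same approach applies with the fourth column (Available) in place of the second.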
Persistent storage for pods should be handled outside of the instances running the OKD cluster. Persistent volumes for pods can be provisioned by the infrastructure provider, or with the use of container native storage or container ready storage.
Docker storage
Docker storage can be backed by one of two options. The first is a thin pool logical volume with device mapper. The second, available since Red Hat Enterprise Linux version 7.4, is an overlay2 file system. The overlay2 file system is generally recommended due to the ease of setup and increased performance.
The Docker storage disk is mounted as /var/lib/docker and formatted with the xfs file system. Docker storage is configured to use the overlay2 file system:
$ cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS='--storage-driver overlay2'
To verify this storage driver is used by Docker:
# docker info
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 4
Server Version: 1.12.6
Storage Driver: overlay2
Backing Filesystem: xfs
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
Volume: local
Network: overlay host bridge null
Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-693.11.1.el7.x86_64
Operating System: Employee SKU
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 2
Total Memory: 7.147 GiB
Name: ocp-infra-node-1clj
ID: T7T6:IQTG:WTUX:7BRU:5FI4:XUL5:PAAM:4SLW:NWKL:WU2V:NQOW:JPHC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
Registries: registry.access.redhat.com (secure), registry.access.redhat.com (secure), docker.io (secure)
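When only the storage driver matters, the docker info output above can be filtered down to that one value. The following is a minimal sketch that assumes the "Storage Driver:" label appears as in the example output; docker_storage_driver is a hypothetical helper name.

```shell
# Hypothetical helper: reads `docker info` output on stdin and prints the
# value of the first "Storage Driver:" line.
docker_storage_driver() {
  awk -F': ' '/^ *Storage Driver:/ && !seen { print $2; seen = 1 }'
}

# Example use (as root on a node host):
#   [ "$(docker info 2>/dev/null | docker_storage_driver)" = "overlay2" ] \
#     || echo "Docker is not using overlay2"
```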
API service status
The OpenShift API service runs on all master instances. To see the status of the service, view the master-api pods in the kube-system project:
$ oc get pod -n kube-system -l openshift.io/component=api
NAME READY STATUS RESTARTS AGE
master-api-myserver.com 1/1 Running 0 56d
The API service exposes a health check, which can be queried externally using the API host name:
$ oc get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE
master-api-myserver.com 1/1 Running 0 7h 10.240.0.16 myserver.com
$ curl -k https://myserver.com/healthz
ok
Controller role verification
The OKD controller service is available across all master hosts. The service runs in active/passive mode, meaning it should only be running on one master at any time.
The OKD controllers execute a procedure to choose which host runs the service. The current running value is stored in an annotation in a special configmap stored in the kube-system project.
Verify the master host running the controller service as a cluster-admin user:
$ oc get -n kube-system cm openshift-master-controllers -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
annotations:
control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"master-ose-master-0.example.com-10.19.115.212-dnwrtcl4","leaseDurationSeconds":15,"acquireTime":"2018-02-17T18:16:54Z","renewTime":"2018-02-19T13:50:33Z","leaderTransitions":16}'
creationTimestamp: 2018-02-02T10:30:04Z
name: openshift-master-controllers
namespace: kube-system
resourceVersion: "17349662"
selfLink: /api/v1/namespaces/kube-system/configmaps/openshift-master-controllers
uid: 08636843-0804-11e8-8580-fa163eb934f0
The command outputs the current master controller in the control-plane.alpha.kubernetes.io/leader annotation, within the holderIdentity property as:
master-<hostname>-<ip>-<8_random_characters>
Find the hostname of the master host by filtering the output using the following:
$ oc get -n kube-system cm openshift-master-controllers -o json | jq -r '.metadata.annotations[] | fromjson.holderIdentity | match("^master-(.*)-[0-9.]*-[0-9a-z]{8}$") | .captures[0].string'
ose-master-0.example.com
Verifying correct Maximum Transmission Unit (MTU) size
Verifying the maximum transmission unit (MTU) prevents a possible networking misconfiguration that can masquerade as an SSL certificate issue.
When a packet that is larger than the MTU size is transmitted over HTTP, the physical network router is able to break the packet into multiple packets to transmit the data. However, when a packet that is larger than the MTU size is transmitted over HTTPS, the router is forced to drop the packet.
Installation produces certificates that provide secure connections to multiple components that include:
master hosts
node hosts
infrastructure nodes
registry
router
These certificates can be found within the /etc/origin/master directory for the master nodes and the /etc/origin/node directory for the infra and app nodes.
After installation, you can verify connectivity to the REGISTRY_OPENSHIFT_SERVER_ADDR using the process outlined in the Network connectivity section.
Prerequisites
From a master host, get the HTTPS address:
$ oc -n default get dc docker-registry -o jsonpath='{.spec.template.spec.containers[].env[?(@.name=="REGISTRY_OPENSHIFT_SERVER_ADDR")].value}{"\n"}'
docker-registry.default.svc:5000
The above gives the output of docker-registry.default.svc:5000.
Append /healthz to the value given above, and use it to check on all hosts (master, infrastructure, node):
$ curl -v https://docker-registry.default.svc:5000/healthz
* About to connect() to docker-registry.default.svc port 5000 (#0)
* Trying 172.30.11.171...
* Connected to docker-registry.default.svc (172.30.11.171) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=172.30.11.171
* start date: Oct 18 05:30:10 2017 GMT
* expire date: Oct 18 05:30:11 2019 GMT
* common name: 172.30.11.171
* issuer: CN=openshift-signer@1508303629
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: docker-registry.default.svc:5000
> Accept: */*
>
< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Date: Tue, 24 Oct 2017 19:42:35 GMT
< Content-Length: 0
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host docker-registry.default.svc left intact
The above example output shows a working SSL connection, which indicates that the MTU size in use is correct. The connection attempt succeeds, NSS is initialized with the certpath, and the server certificate information for the docker-registry is returned.
An improper MTU size results in a timeout:
$ curl -v https://docker-registry.default.svc:5000/healthz
* About to connect() to docker-registry.default.svc port 5000 (#0)
* Trying 172.30.11.171...
* Connected to docker-registry.default.svc (172.30.11.171) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
The above example shows that the connection is established, but it cannot finish initializing NSS with the certpath. The cause is an improper MTU size set within the appropriate node configuration map.
To fix this issue, adjust the MTU size within the node configuration map to 50 bytes smaller than the MTU size that the OpenShift SDN Ethernet device uses.
View the MTU size of the desired Ethernet device (for example, eth0):
$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
link/ether fa:16:3e:92:6a:86 brd ff:ff:ff:ff:ff:ff
The above shows MTU set to 1500.
To change the MTU size, modify the appropriate node configuration map and set a value that is 50 bytes smaller than the output provided by the ip command.
For example, if the MTU size is set to 1500, adjust the MTU size to 1450 within the node configuration map:
networkConfig:
mtu: 1450
Save the changes and reboot the node.
You must change the MTU size on all masters and nodes that are part of the OKD SDN. Also, the MTU size of the tun0 interface must be the same across all nodes that are part of the cluster.
Once the node is back online, confirm the issue no longer exists by re-running the original curl command:
$ curl -v https://docker-registry.default.svc:5000/healthz
If the timeout persists, continue to adjust the MTU size in increments of 50 bytes and repeat the process.
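The arithmetic in the procedure above (device MTU minus 50 bytes) can be sketched as a small helper. This is an illustration only: sdn_mtu is a hypothetical name, and it assumes the "mtu <value>" pair appears in the ip link show output as in the example above.

```shell
# Hypothetical helper: reads `ip link show <device>` output on stdin and
# prints the device MTU minus 50, the starting value suggested above for
# the mtu setting in the node configuration map.
sdn_mtu() {
  awk '!seen { for (i = 1; i < NF; i++)
                 if ($i == "mtu") { print $(i + 1) - 50; seen = 1; break } }'
}

# Example use:
#   ip link show eth0 | sdn_mtu
```

For a device MTU of 1500 this prints 1450, which matches the example value in the node configuration map.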