Monitoring application health by using health checks

In software systems, components can become unhealthy due to transient issues such as temporary connectivity loss, configuration errors, or problems with external dependencies. OKD applications have a number of options to detect and handle unhealthy containers.

Understanding health checks

A health check periodically performs diagnostics on a running container using any combination of the readiness, liveness, and startup health checks.

You can include one or more probes in the specification for the pod that contains the container on which you want to perform the health checks.

If you want to add or edit health checks in an existing pod, you must edit the pod DeploymentConfig object or use the Developer perspective in the web console. You cannot use the CLI to add or edit health checks for an existing pod.

Readiness probe

A readiness probe determines if a container is ready to accept service requests. If the readiness probe fails for a container, the kubelet removes the pod from the list of available service endpoints.

After a failure, the probe continues to examine the pod. If the pod becomes available, the kubelet adds the pod to the list of available service endpoints.
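
As a rough illustration of the behavior described above, the following Python sketch models the list of service endpoints as a set of pod IP addresses. The function name `update_endpoints` and the addresses are hypothetical and are not part of any OKD or Kubernetes API; the cluster does this bookkeeping itself, not application code.

```python
from __future__ import annotations


def update_endpoints(endpoints: set[str], pod_ip: str, ready: bool) -> set[str]:
    """Return a new endpoint set that reflects the latest readiness result."""
    updated = set(endpoints)  # copy so the caller's set is not mutated
    if ready:
        updated.add(pod_ip)      # pod is ready: it can receive service traffic
    else:
        updated.discard(pod_ip)  # pod is not ready: stop routing traffic to it
    return updated


# Hypothetical addresses, for illustration only.
endpoints = {"10.128.0.5"}
endpoints = update_endpoints(endpoints, "10.128.0.6", ready=True)
endpoints = update_endpoints(endpoints, "10.128.0.6", ready=False)
print(sorted(endpoints))
```

The point of the sketch is that a failed readiness probe does not restart anything; it only changes whether the pod receives requests.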

Liveness probe

A liveness probe determines if a container is still running. If the liveness probe fails due to a condition such as a deadlock, the kubelet kills the container. The pod then responds based on its restart policy.

For example, a liveness probe on a pod with a restartPolicy of Always or OnFailure kills and restarts the container.

Startup probe

A startup probe indicates whether the application within a container is started. All other probes are disabled until the startup succeeds. If the startup probe does not succeed within a specified time period, the kubelet kills the container, and the container is subject to the pod restartPolicy.

Some applications can require additional startup time on their first initialization. You can use a startup probe with a liveness or readiness probe to delay that probe long enough to handle lengthy start-up time using the failureThreshold and periodSeconds parameters.

For example, you can add a startup probe, with a failureThreshold of 30 failures and a periodSeconds of 10 seconds (30 * 10s = 300s) for a maximum of 5 minutes, to a liveness probe. After the startup probe succeeds the first time, the liveness probe takes over.
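
The arithmetic in this example can be sketched in Python as follows. The helper name `startup_budget_seconds` is hypothetical, and the calculation is an approximation that ignores any initialDelaySeconds value and the time each probe attempt takes.

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time the application has to start before the container is killed."""
    return failure_threshold * period_seconds


budget = startup_budget_seconds(failure_threshold=30, period_seconds=10)
print(f"{budget}s = {budget / 60:g} minutes")  # prints "300s = 5 minutes"
```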

You can configure liveness, readiness, and startup probes with any of the following types of tests:

  • HTTP GET: When using an HTTP GET test, the test determines the healthiness of the container by using a web hook. The test is successful if the HTTP response code is between 200 and 399.

    You can use an HTTP GET test with applications that return HTTP status codes when completely initialized.

  • Container Command: When using a container command test, the probe executes a command inside the container. The probe is successful if the test exits with a 0 status.

  • TCP socket: When using a TCP socket test, the probe attempts to open a socket to the container. The container is only considered healthy if the probe can establish a connection. You can use a TCP socket test with applications that do not start listening until initialization is complete.
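
The success criteria for the three test types can be sketched in Python as follows. This is only an illustration of the rules stated above, not the kubelet implementation; the function names are hypothetical, and the command test runs on the local machine rather than inside a container.

```python
from __future__ import annotations

import socket
import subprocess
import sys


def http_get_succeeded(status_code: int) -> bool:
    """HTTP GET test: success if the response code is between 200 and 399."""
    return 200 <= status_code <= 399


def command_succeeded(command: list[str], timeout_seconds: float = 1.0) -> bool:
    """Container command test: success if the command exits with status 0."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds, capture_output=True)
    except (subprocess.TimeoutExpired, OSError):
        return False  # a timeout or a command that cannot be run counts as a failure
    return result.returncode == 0


def tcp_socket_succeeded(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """TCP socket test: success if a connection can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False  # refused, unreachable, or timed out


print(http_get_succeeded(204), http_get_succeeded(500))
print(command_succeeded([sys.executable, "-c", "raise SystemExit(0)"], timeout_seconds=30))
```

The default timeout of 1 second mirrors the timeoutSeconds default described in this section; the command example passes a longer value only because starting a Python interpreter can be slow.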

You can configure several fields to control the behavior of a probe:

  • initialDelaySeconds: The time, in seconds, after the container starts before the probe can be scheduled. The default is 0.

  • periodSeconds: The delay, in seconds, between performing probes. The default is 10. This value must be greater than timeoutSeconds.

  • timeoutSeconds: The number of seconds of inactivity after which the probe times out and the container is assumed to have failed. The default is 1. This value must be lower than periodSeconds.

  • successThreshold: The number of times that the probe must report success after a failure to reset the container status to successful. The value must be 1 for a liveness probe. The default is 1.

  • failureThreshold: The number of times that the probe is allowed to fail. The default is 3. After the specified attempts:

    • for a liveness probe, the container is restarted

    • for a readiness probe, the pod is marked Unready

    • for a startup probe, the container is killed and is subject to the pod’s restartPolicy
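
The way successThreshold and failureThreshold interact can be sketched with a small Python state tracker. This is a simplified model, not the kubelet implementation: the class name `ProbeTracker` is hypothetical, the thresholds are treated as counts of consecutive results, and the container is assumed to start in a healthy state.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    success_threshold: int = 1      # default successThreshold
    failure_threshold: int = 3      # default failureThreshold
    healthy: bool = True            # assumption: start in a healthy state
    consecutive_successes: int = 0
    consecutive_failures: int = 0

    def record(self, succeeded: bool) -> bool:
        """Record one probe result and return the resulting health status."""
        if succeeded:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if not self.healthy and self.consecutive_successes >= self.success_threshold:
                self.healthy = True   # enough successes after a failure
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.healthy and self.consecutive_failures >= self.failure_threshold:
                self.healthy = False  # threshold reached: restart, mark Unready, or kill
        return self.healthy


tracker = ProbeTracker()
print([tracker.record(ok) for ok in (False, False, False, True)])
# prints [True, True, False, True]
```

With the defaults, two failed probes in a row do not change the status; only the third consecutive failure does, and a single success afterward resets it.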

Example probes

The following are samples of different probes as they would appear in an object specification.

Sample readiness probe with a container command test in a pod spec

  apiVersion: v1
  kind: Pod
  metadata:
    labels:
      test: health-check
    name: my-application
  ...
  spec:
    containers:
    - name: goproxy-app # (1)
      args:
      image: k8s.gcr.io/goproxy:0.1 # (2)
      readinessProbe: # (3)
        exec: # (4)
          command: # (5)
          - cat
          - /tmp/healthy
  ...

(1) The container name.
(2) The container image to deploy.
(3) A readiness probe.
(4) A container command test.
(5) The commands to execute on the container.

Sample startup probe and liveness probe with HTTP GET tests in a pod spec

  apiVersion: v1
  kind: Pod
  metadata:
    labels:
      test: health-check
    name: my-application
  ...
  spec:
    containers:
    - name: goproxy-app # (1)
      args:
      image: k8s.gcr.io/goproxy:0.1 # (2)
      livenessProbe: # (3)
        httpGet: # (4)
          scheme: HTTPS # (5)
          path: /healthz
          port: 8080 # (6)
          httpHeaders:
          - name: X-Custom-Header
            value: Awesome
      startupProbe: # (7)
        httpGet: # (8)
          path: /healthz
          port: 8080 # (9)
        failureThreshold: 30 # (10)
        periodSeconds: 10 # (11)
  ...

(1) The container name.
(2) The container image to deploy.
(3) A liveness probe.
(4) An HTTP GET test.
(5) The internet scheme: HTTP or HTTPS. The default value is HTTP.
(6) The port on which the container is listening.
(7) A startup probe.
(8) An HTTP GET test.
(9) The port on which the container is listening.
(10) The number of times to try the probe after a failure.
(11) How often, in seconds, to perform the probe.

Sample liveness probe with a container command test that uses a timeout in a pod spec

  apiVersion: v1
  kind: Pod
  metadata:
    labels:
      test: health-check
    name: my-application
  ...
  spec:
    containers:
    - name: goproxy-app # (1)
      args:
      image: k8s.gcr.io/goproxy:0.1 # (2)
      livenessProbe: # (3)
        exec: # (4)
          command: # (5)
          - /bin/bash
          - '-c'
          - timeout 60 /opt/eap/bin/livenessProbe.sh
        periodSeconds: 10 # (6)
        successThreshold: 1 # (7)
        failureThreshold: 3 # (8)
  ...

(1) The container name.
(2) The container image to deploy.
(3) The liveness probe.
(4) The type of probe, here a container command probe.
(5) The command line to execute inside the container.
(6) How often, in seconds, to perform the probe.
(7) The number of consecutive successes needed to show success after a failure.
(8) The number of times to try the probe after a failure.

Sample readiness probe and liveness probe with a TCP socket test in a deployment

  kind: Deployment
  apiVersion: apps/v1
  ...
  spec:
  ...
    template:
      spec:
        containers:
        - resources: {}
          readinessProbe: # (1)
            tcpSocket:
              port: 8080
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: ruby-ex
          livenessProbe: # (2)
            tcpSocket:
              port: 8080
            initialDelaySeconds: 15
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
  ...

(1) The readiness probe.
(2) The liveness probe.

Configuring health checks using the CLI

To configure readiness, liveness, and startup probes, add one or more probes to the specification for the pod that contains the container on which you want to perform the health checks.

If you want to add or edit health checks in an existing pod, you must edit the pod DeploymentConfig object or use the Developer perspective in the web console. You cannot use the CLI to add or edit health checks for an existing pod.

Procedure

To add probes for a container:

  1. Create a Pod object to add one or more probes:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        test: health-check
      name: my-application
    spec:
      containers:
      - name: my-container # (1)
        args:
        image: k8s.gcr.io/goproxy:0.1 # (2)
        livenessProbe: # (3)
          tcpSocket: # (4)
            port: 8080 # (5)
          initialDelaySeconds: 15 # (6)
          periodSeconds: 20 # (7)
          timeoutSeconds: 10 # (8)
        readinessProbe: # (9)
          httpGet: # (10)
            host: my-host # (11)
            scheme: HTTPS # (12)
            path: /healthz
            port: 8080 # (13)
        startupProbe: # (14)
          exec: # (15)
            command: # (16)
            - cat
            - /tmp/healthy
          failureThreshold: 30 # (17)
          periodSeconds: 20 # (18)
          timeoutSeconds: 10 # (19)

    (1) Specify the container name.
    (2) Specify the container image to deploy.
    (3) Optional: Create a liveness probe.
    (4) Specify a test to perform, here a TCP socket test.
    (5) Specify the port on which the container is listening.
    (6) Specify the time, in seconds, after the container starts before the probe can be scheduled.
    (7) Specify how often, in seconds, to perform the probe. The default is 10. This value must be greater than timeoutSeconds.
    (8) Specify the number of seconds of inactivity after which the probe is assumed to have failed. The default is 1. This value must be lower than periodSeconds.
    (9) Optional: Create a readiness probe.
    (10) Specify the type of test to perform, here an HTTP GET test.
    (11) Specify a host IP address. When host is not defined, the PodIP is used.
    (12) Specify HTTP or HTTPS. When scheme is not defined, the HTTP scheme is used.
    (13) Specify the port on which the container is listening.
    (14) Optional: Create a startup probe.
    (15) Specify the type of test to perform, here a container command test.
    (16) Specify the commands to execute on the container.
    (17) Specify the number of times to try the probe after a failure.
    (18) Specify how often, in seconds, to perform the probe. The default is 10. This value must be greater than timeoutSeconds.
    (19) Specify the number of seconds of inactivity after which the probe is assumed to have failed. The default is 1. This value must be lower than periodSeconds.

    If the initialDelaySeconds value is lower than the periodSeconds value, the first Readiness probe occurs at some point between the two periods due to an issue with timers.

    The timeoutSeconds value must be lower than the periodSeconds value.

  2. Create the Pod object:

    $ oc create -f <file-name>.yaml

  3. Verify the state of the health check pod:

    $ oc describe pod my-application

    Example output

    Events:
      Type    Reason     Age   From                                   Message
      ----    ------     ----  ----                                   -------
      Normal  Scheduled  9s    default-scheduler                      Successfully assigned openshift-logging/liveness-exec to ip-10-0-143-40.ec2.internal
      Normal  Pulling    2s    kubelet, ip-10-0-143-40.ec2.internal   pulling image "k8s.gcr.io/liveness"
      Normal  Pulled     1s    kubelet, ip-10-0-143-40.ec2.internal   Successfully pulled image "k8s.gcr.io/liveness"
      Normal  Created    1s    kubelet, ip-10-0-143-40.ec2.internal   Created container
      Normal  Started    1s    kubelet, ip-10-0-143-40.ec2.internal   Started container

    The following is the output of a failed probe that restarted a container:

    Sample Liveness check output with unhealthy container

    $ oc describe pod pod1

    Example output

    ....
    Events:
      Type     Reason          Age                From                                                Message
      ----     ------          ----               ----                                                -------
      Normal   Scheduled       <unknown>                                                              Successfully assigned aaa/liveness-http to ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj
      Normal   AddedInterface  47s                multus                                              Add eth0 [10.129.2.11/23]
      Normal   Pulled          46s                kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Successfully pulled image "k8s.gcr.io/liveness" in 773.406244ms
      Normal   Pulled          28s                kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Successfully pulled image "k8s.gcr.io/liveness" in 233.328564ms
      Normal   Created         10s (x3 over 46s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Created container liveness
      Normal   Started         10s (x3 over 46s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Started container liveness
      Warning  Unhealthy       10s (x6 over 34s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Liveness probe failed: HTTP probe failed with statuscode: 500
      Normal   Killing         10s (x2 over 28s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Container liveness failed liveness probe, will be restarted
      Normal   Pulling         10s (x3 over 47s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Pulling image "k8s.gcr.io/liveness"
      Normal   Pulled          10s                kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj   Successfully pulled image "k8s.gcr.io/liveness" in 244.116568ms
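
The constraints stated in this procedure, that timeoutSeconds must be lower than periodSeconds and that successThreshold must be 1 for a liveness probe, can be checked before you create the object. The following Python sketch is one way to do that, assuming the probe has already been loaded into a dictionary; the function name `probe_problems` is hypothetical and is not part of the `oc` CLI or any OKD API.

```python
from __future__ import annotations


def probe_problems(kind: str, probe: dict) -> list[str]:
    """Return a list of human-readable problems found in one probe definition."""
    # Defaults taken from the field descriptions in this document.
    period = probe.get("periodSeconds", 10)
    timeout = probe.get("timeoutSeconds", 1)
    success = probe.get("successThreshold", 1)

    problems = []
    if timeout >= period:
        problems.append(
            f"{kind}: timeoutSeconds ({timeout}) must be lower than periodSeconds ({period})"
        )
    if kind == "livenessProbe" and success != 1:
        problems.append(f"{kind}: successThreshold must be 1, found {success}")
    return problems


# Values from the liveness probe in the sample Pod object in this procedure.
liveness = {"tcpSocket": {"port": 8080}, "initialDelaySeconds": 15,
            "periodSeconds": 20, "timeoutSeconds": 10}
print(probe_problems("livenessProbe", liveness))  # prints []
```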

Monitoring application health using the Developer perspective

You can use the Developer perspective to add three types of health probes to your container to ensure that your application is healthy:

  • Use the Readiness probe to check if the container is ready to handle requests.

  • Use the Liveness probe to check if the container is running.

  • Use the Startup probe to check if the application within the container has started.

You can add health checks either while creating and deploying an application, or after you have deployed an application.

Adding health checks using the Developer perspective

You can use the Topology view to add health checks to your deployed application.

Prerequisites:

  • You have switched to the Developer perspective in the web console.

  • You have created and deployed an application on OKD using the Developer perspective.

Procedure

  1. In the Topology view, click on the application node to see the side panel. If the container does not have health checks added to ensure the smooth running of your application, a Health Checks notification is displayed with a link to add health checks.

  2. In the displayed notification, click the Add Health Checks link.

  3. Alternatively, click the Actions drop-down list and select Add Health Checks. If the container already has health checks, the Edit Health Checks option is displayed instead of the add option.

  4. In the Add Health Checks form, if you have deployed multiple containers, use the Container drop-down list to ensure that the appropriate container is selected.

  5. Click the required health probe links to add them to the container. Default data for the health checks is prepopulated. You can add the probes with the default data or further customize the values and then add them. For example, to add a Readiness probe that checks if your container is ready to handle requests:

    1. Click Add Readiness Probe to see a form containing the parameters for the probe.

    2. Click the Type drop-down list to select the request type you want to add. For example, in this case, select Container Command to select the command that will be executed inside the container.

    3. In the Command field, add the argument cat. You can add multiple arguments for the check in the same way. For example, add another argument, /tmp/healthy.

    4. Retain or modify the default values for the other parameters as required.

      The Timeout value must be lower than the Period value. The Timeout default value is 1. The Period default value is 10.

    5. Click the check mark at the bottom of the form. The Readiness Probe Added message is displayed.

  6. Click Add to add the health check. You are redirected to the Topology view and the container is restarted.

  7. In the side panel, verify that the probes have been added by clicking on the deployed pod under the Pods section.

  8. In the Pod Details page, click the listed container in the Containers section.

  9. In the Container Details page, verify that the Readiness probe - Exec Command cat /tmp/healthy has been added to the container.

Editing health checks using the Developer perspective

You can use the Topology view to edit or remove health checks added to your application, or to add more health checks.

Prerequisites:

  • You have switched to the Developer perspective in the web console.

  • You have created and deployed an application on OKD using the Developer perspective.

  • You have added health checks to your application.

Procedure

  1. In the Topology view, right-click your application and select Edit Health Checks. Alternatively, in the side panel, click the Actions drop-down list and select Edit Health Checks.

  2. In the Edit Health Checks page:

    • To remove a previously added health probe, click the minus sign adjoining it.

    • To edit the parameters of an existing probe:

      1. Click the Edit Probe link next to a previously added probe to see the parameters for the probe.

      2. Modify the parameters as required, and click the check mark to save your changes.

    • To add a new health probe, in addition to existing health checks, click the add probe links. For example, to add a Liveness probe that checks if your container is running:

      1. Click Add Liveness Probe to see a form containing the parameters for the probe.

      2. Edit the probe parameters as required.

        The Timeout value must be lower than the Period value. The Timeout default value is 1. The Period default value is 10.

      3. Click the check mark at the bottom of the form. The Liveness Probe Added message is displayed.

  3. Click Save to save your modifications and add the additional probes to your container. You are redirected to the Topology view.

  4. In the side panel, verify that the probes have been added by clicking on the deployed pod under the Pods section.

  5. In the Pod Details page, click the listed container in the Containers section.

  6. In the Container Details page, verify that the Liveness probe - HTTP Get 10.129.4.65:8080/ has been added to the container, in addition to the earlier existing probes.

Monitoring health check failures using the Developer perspective

If an application health check fails, you can use the Topology view to monitor the failure.

Prerequisites:

  • You have switched to the Developer perspective in the web console.

  • You have created and deployed an application on OKD using the Developer perspective.

  • You have added health checks to your application.

Procedure

  1. In the Topology view, click on the application node to see the side panel.

  2. Click the Observe tab to see the health check failures in the Events (Warning) section.

  3. Click the down arrow adjoining Events (Warning) to see the details of the health check failure.

Additional resources