Configure Health Probes

Overview

Implementing health probes in your application allows Kubernetes to automatically detect failing containers and take corrective action, improving availability in the event of an error.

Because OSM reconfigures application Pods to redirect all incoming and outgoing network traffic through the proxy sidecar, httpGet and tcpSocket health probes invoked by the kubelet will fail: the kubelet cannot provide the mTLS context the proxy requires.

For httpGet health probes to continue to work as expected from within the mesh, OSM adds configuration to expose the probe endpoint via the proxy and rewrites the probe definitions for new Pods to refer to the proxy-exposed endpoint. All of the functionality of the original probe is preserved; OSM simply fronts it with the proxy so the kubelet can communicate with it.

Special configuration is required to support tcpSocket health probes in the mesh. Since OSM redirects all network traffic through Envoy, all ports appear open in the Pod. This causes all TCP connections routed to Pods injected with an Envoy sidecar to appear successful. For tcpSocket health probes to work as expected in the mesh, OSM rewrites the probes to be httpGet probes and adds an iptables command to bypass the Envoy proxy at the osm-healthcheck exposed endpoint. The osm-healthcheck container is added to the Pod and handles the HTTP health probe requests from the kubelet. The handler gets the original TCP port from the request's Original-Tcp-Port header and attempts to open a socket on the specified port. The response status code for the httpGet probe reflects whether the TCP connection was successful.
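
The health-check logic described above can be sketched in a few lines of Python. This is an illustrative model, not OSM's actual implementation (which is written in Go); the class and function names here are assumptions, but the logic mirrors the description: read the Original-Tcp-Port header, attempt a TCP connection to that port, and report the result as an HTTP status code.

```python
import socket
from http.server import BaseHTTPRequestHandler


def probe_tcp_port(port, timeout=1.0):
    """Attempt a TCP connection to localhost:port and return an HTTP status.

    200 if the connection succeeds, 503 if it is refused or times out --
    modeling how osm-healthcheck reports TCP reachability over HTTP.
    """
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return 200
    except OSError:
        return 503


class HealthcheckHandler(BaseHTTPRequestHandler):
    """Illustrative stand-in for the osm-healthcheck HTTP handler."""

    def do_GET(self):
        raw_port = self.headers.get("Original-Tcp-Port")
        if raw_port is None:
            self.send_response(400)  # the caller must supply the original port
        else:
            self.send_response(probe_tcp_port(int(raw_port)))
        self.end_headers()
```

The key design point is that the TCP check happens inside the Pod's network namespace, so it sees the application's real listening sockets rather than Envoy's catch-all redirection.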

Probe         Path                    Port
Liveness      /osm-liveness-probe     15901
Readiness     /osm-readiness-probe    15902
Startup       /osm-startup-probe      15903
Healthcheck   /osm-healthcheck        15904

For HTTP and tcpSocket probes, the port and path are modified. For HTTPS probes, the port is modified, but the path is left unchanged.

Only predefined httpGet and tcpSocket probes are modified. If a probe is undefined, one will not be added in its place. exec probes (including those using grpc_health_probe) are never modified and will continue to function as expected as long as the command does not require network access outside of localhost.
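
For example, an exec probe like the following (the command path and port are illustrative) is left untouched by OSM, since the command runs inside the container and needs no network access outside of localhost:

```yaml
livenessProbe:
  exec:
    command:
    - /bin/grpc_health_probe
    - -addr=localhost:14001
```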

Examples

The following examples show how OSM handles health probes for Pods in a mesh.

HTTP

Consider a Pod spec defining a container with the following livenessProbe:

    livenessProbe:
      httpGet:
        path: /liveness
        port: 14001
        scheme: HTTP

When the Pod is created, OSM will modify the probe to be the following:

    livenessProbe:
      httpGet:
        path: /osm-liveness-probe
        port: 15901
        scheme: HTTP

The Pod’s proxy will contain the following Envoy configuration.

An Envoy cluster which maps to the original probe port 14001:

    {
      "cluster": {
        "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
        "name": "liveness_cluster",
        "type": "STATIC",
        "connect_timeout": "1s",
        "load_assignment": {
          "cluster_name": "liveness_cluster",
          "endpoints": [
            {
              "lb_endpoints": [
                {
                  "endpoint": {
                    "address": {
                      "socket_address": {
                        "address": "0.0.0.0",
                        "port_value": 14001
                      }
                    }
                  }
                }
              ]
            }
          ]
        }
      },
      "last_updated": "2021-03-29T21:02:59.086Z"
    }

A listener for the new proxy-exposed HTTP endpoint at /osm-liveness-probe on port 15901 mapping to the cluster above:

    {
      "listener": {
        "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
        "name": "liveness_listener",
        "address": {
          "socket_address": {
            "address": "0.0.0.0",
            "port_value": 15901
          }
        },
        "filter_chains": [
          {
            "filters": [
              {
                "name": "envoy.filters.network.http_connection_manager",
                "typed_config": {
                  "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
                  "stat_prefix": "health_probes_http",
                  "route_config": {
                    "name": "local_route",
                    "virtual_hosts": [
                      {
                        "name": "local_service",
                        "domains": [
                          "*"
                        ],
                        "routes": [
                          {
                            "match": {
                              "prefix": "/osm-liveness-probe"
                            },
                            "route": {
                              "cluster": "liveness_cluster",
                              "prefix_rewrite": "/liveness"
                            }
                          }
                        ]
                      }
                    ]
                  },
                  "http_filters": [...],
                  "access_log": [...]
                }
              }
            ]
          }
        ]
      },
      "last_updated": "2021-03-29T21:02:59.092Z"
    }

HTTPS

Consider a Pod spec defining a container with the following livenessProbe:

    livenessProbe:
      httpGet:
        path: /liveness
        port: 14001
        scheme: HTTPS

When the Pod is created, OSM will modify the probe to be the following:

    livenessProbe:
      httpGet:
        path: /liveness
        port: 15901
        scheme: HTTPS

The Pod’s proxy will contain the following Envoy configuration.

An Envoy cluster which maps to the original probe port 14001:

    {
      "cluster": {
        "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
        "name": "liveness_cluster",
        "type": "STATIC",
        "connect_timeout": "1s",
        "load_assignment": {
          "cluster_name": "liveness_cluster",
          "endpoints": [
            {
              "lb_endpoints": [
                {
                  "endpoint": {
                    "address": {
                      "socket_address": {
                        "address": "0.0.0.0",
                        "port_value": 14001
                      }
                    }
                  }
                }
              ]
            }
          ]
        }
      },
      "last_updated": "2021-03-29T21:02:59.086Z"
    }

A listener for the new proxy-exposed TCP endpoint on port 15901 mapping to the cluster above:

    {
      "listener": {
        "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
        "name": "liveness_listener",
        "address": {
          "socket_address": {
            "address": "0.0.0.0",
            "port_value": 15901
          }
        },
        "filter_chains": [
          {
            "filters": [
              {
                "name": "envoy.filters.network.tcp_proxy",
                "typed_config": {
                  "@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
                  "stat_prefix": "health_probes",
                  "cluster": "liveness_cluster",
                  "access_log": [...]
                }
              }
            ]
          }
        ]
      },
      "last_updated": "2021-04-07T15:09:22.704Z"
    }

tcpSocket

Consider a Pod spec defining a container with the following livenessProbe:

    livenessProbe:
      tcpSocket:
        port: 14001

When the Pod is created, OSM will modify the probe to be the following:

    livenessProbe:
      httpGet:
        httpHeaders:
        - name: Original-Tcp-Port
          value: "14001"
        path: /osm-healthcheck
        port: 15904
        scheme: HTTP

Requests to port 15904 bypass the Envoy proxy and are directed to the osm-healthcheck endpoint.

How to Verify Health of Pods in the Mesh

Kubernetes will automatically poll the health endpoints of Pods configured with startup, liveness, and readiness probes.

When a startup probe fails, Kubernetes will generate an Event (visible with kubectl describe pod <pod name>) and restart the container. The kubectl describe output may look like this:

    ...
    Events:
      Type     Reason     Age                From               Message
      ----     ------     ----               ----               -------
      Normal   Scheduled  17s                default-scheduler  Successfully assigned bookstore/bookstore-v1-699c79b9dc-5g8zn to osm-control-plane
      Normal   Pulled     16s                kubelet            Successfully pulled image "openservicemesh/init:v0.8.0" in 26.5835ms
      Normal   Created    16s                kubelet            Created container osm-init
      Normal   Started    16s                kubelet            Started container osm-init
      Normal   Pulling    16s                kubelet            Pulling image "openservicemesh/init:v0.8.0"
      Normal   Pulling    15s                kubelet            Pulling image "envoyproxy/envoy-alpine:v1.17.2"
      Normal   Pulling    15s                kubelet            Pulling image "openservicemesh/bookstore:v0.8.0"
      Normal   Pulled     15s                kubelet            Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 319.9863ms
      Normal   Started    15s                kubelet            Started container bookstore-v1
      Normal   Created    15s                kubelet            Created container bookstore-v1
      Normal   Pulled     14s                kubelet            Successfully pulled image "envoyproxy/envoy-alpine:v1.17.2" in 755.2666ms
      Normal   Created    14s                kubelet            Created container envoy
      Normal   Started    14s                kubelet            Started container envoy
      Warning  Unhealthy  13s                kubelet            Startup probe failed: Get "http://10.244.0.23:15903/osm-startup-probe": dial tcp 10.244.0.23:15903: connect: connection refused
      Warning  Unhealthy  3s (x2 over 8s)    kubelet            Startup probe failed: HTTP probe failed with statuscode: 503

When a liveness probe fails, Kubernetes will generate an Event (visible with kubectl describe pod <pod name>) and restart the container. The kubectl describe output may look like this:

    ...
    Events:
      Type     Reason     Age                From               Message
      ----     ------     ----               ----               -------
      Normal   Scheduled  59s                default-scheduler  Successfully assigned bookstore/bookstore-v1-746977967c-jqjt4 to osm-control-plane
      Normal   Pulling    58s                kubelet            Pulling image "openservicemesh/init:v0.8.0"
      Normal   Created    58s                kubelet            Created container osm-init
      Normal   Started    58s                kubelet            Started container osm-init
      Normal   Pulled     58s                kubelet            Successfully pulled image "openservicemesh/init:v0.8.0" in 23.415ms
      Normal   Pulled     57s                kubelet            Successfully pulled image "envoyproxy/envoy-alpine:v1.17.2" in 678.1391ms
      Normal   Pulled     57s                kubelet            Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 230.3681ms
      Normal   Created    57s                kubelet            Created container envoy
      Normal   Pulling    57s                kubelet            Pulling image "envoyproxy/envoy-alpine:v1.17.2"
      Normal   Started    56s                kubelet            Started container envoy
      Normal   Pulled     44s                kubelet            Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 20.6731ms
      Normal   Created    44s (x2 over 57s)  kubelet            Created container bookstore-v1
      Normal   Started    43s (x2 over 57s)  kubelet            Started container bookstore-v1
      Normal   Pulling    32s (x3 over 58s)  kubelet            Pulling image "openservicemesh/bookstore:v0.8.0"
      Warning  Unhealthy  32s (x6 over 50s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 503
      Normal   Killing    32s (x2 over 44s)  kubelet            Container bookstore-v1 failed liveness probe, will be restarted

When a readiness probe fails, Kubernetes will generate an Event (visible with kubectl describe pod <pod name>) and ensure no traffic destined for Services the Pod may be backing is routed to the unhealthy Pod. The kubectl describe output for a Pod with a failing readiness probe may look like this:

    ...
    Events:
      Type     Reason     Age                From               Message
      ----     ------     ----               ----               -------
      Normal   Scheduled  32s                default-scheduler  Successfully assigned bookstore/bookstore-v1-5848999cb6-hp6qg to osm-control-plane
      Normal   Pulling    31s                kubelet            Pulling image "openservicemesh/init:v0.8.0"
      Normal   Pulled     31s                kubelet            Successfully pulled image "openservicemesh/init:v0.8.0" in 19.8726ms
      Normal   Created    31s                kubelet            Created container osm-init
      Normal   Started    31s                kubelet            Started container osm-init
      Normal   Created    30s                kubelet            Created container bookstore-v1
      Normal   Pulled     30s                kubelet            Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 314.3628ms
      Normal   Pulling    30s                kubelet            Pulling image "openservicemesh/bookstore:v0.8.0"
      Normal   Started    30s                kubelet            Started container bookstore-v1
      Normal   Pulling    30s                kubelet            Pulling image "envoyproxy/envoy-alpine:v1.17.2"
      Normal   Pulled     29s                kubelet            Successfully pulled image "envoyproxy/envoy-alpine:v1.17.2" in 739.3931ms
      Normal   Created    29s                kubelet            Created container envoy
      Normal   Started    29s                kubelet            Started container envoy
      Warning  Unhealthy  0s (x3 over 20s)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

The Pod’s status will also indicate that it is not ready, as shown in its kubectl get pod output. For example:

    NAME                            READY   STATUS    RESTARTS   AGE
    bookstore-v1-5848999cb6-hp6qg   1/2     Running   0          85s

A Pod’s health probes may also be invoked manually by forwarding the probe port and using curl or any other HTTP client to issue requests. For example, to verify the liveness probe for the bookstore-v1 demo Pod, forward port 15901:

    kubectl port-forward -n bookstore deployment/bookstore-v1 15901

Then, in a separate terminal instance, curl may be used to check the endpoint. The following example shows a healthy bookstore-v1:

    $ curl -i localhost:15901/osm-liveness-probe
    HTTP/1.1 200 OK
    date: Wed, 31 Mar 2021 16:00:01 GMT
    content-length: 1396
    content-type: text/html; charset=utf-8
    x-envoy-upstream-service-time: 1
    server: envoy

    <!doctype html>
    <html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
    ...
    </html>

Troubleshooting

If any health probes are consistently failing, perform the following steps to identify the root cause:

  1. Verify httpGet and tcpSocket probes on Pods in the mesh have been modified.

    Startup, liveness, and readiness httpGet probes must be modified by OSM in order to continue to function while in a mesh. Ports must be modified to 15901, 15902, and 15903 for liveness, readiness, and startup httpGet probes, respectively. Only HTTP (not HTTPS) probes will additionally have their paths modified to /osm-liveness-probe, /osm-readiness-probe, or /osm-startup-probe.

    Also, verify the Pod’s Envoy configuration contains a listener for the modified endpoint.

    For tcpSocket probes to function in the mesh, they must be rewritten to httpGet probes. The port must be modified to 15904 for liveness, readiness, and startup probes, the path must be set to /osm-healthcheck, and an HTTP header, Original-Tcp-Port, must be set to the original port specified in the tcpSocket probe definition. Also, verify that the osm-healthcheck container is running. Inspect the osm-healthcheck container’s logs for more information.

    See the examples above for more details.

  2. Determine if Kubernetes encountered any other errors while scheduling or starting the Pod.

    Look for any errors that may have recently occurred by running kubectl describe on the unhealthy Pod. Resolve any errors and verify the Pod’s health again.

  3. Determine if the Pod encountered a runtime error.

    Look for any errors that may have occurred after the container started by inspecting its logs with kubectl logs. Resolve any errors and verify the Pod’s health again.