Health Check

Description

This article describes the health check feature of Apache APISIX. When an upstream node fails or is migrated, health checks let APISIX proxy requests only to healthy nodes, which minimizes service unavailability. The feature is implemented with lua-resty-healthcheck and comes in two forms: active checks and passive checks.

Active check

With active health checks, APISIX itself probes the upstream nodes, using a preset probe type, to determine whether they are alive. APISIX supports three probe types: HTTP, HTTPS, and TCP.
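To make the idea of a probe concrete, here is a minimal sketch of a TCP-type probe in Python. It is only an illustration of the concept, not the code APISIX or lua-resty-healthcheck runs; the function name tcp_probe and its parameters are hypothetical.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # Opening and immediately closing the connection is the whole probe.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

An HTTP or HTTPS probe would additionally send a request and compare the response status with the configured status lists.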

When N consecutive probes sent to a healthy node A fail, the node is marked as unhealthy. The APISIX load balancer ignores unhealthy nodes, so they receive no requests. When M consecutive probes sent to an unhealthy node succeed, the node is marked as healthy again and can receive proxied requests.

Passive check

A passive health check judges whether an upstream node is healthy from the responses to the requests that APISIX forwards to it. Unlike active checks, passive checks send no additional probes, but they cannot detect a failing node in advance, so some requests may fail before the node is marked as unhealthy.

If N consecutive requests to a healthy node A fail, the node will be marked as unhealthy.

Note

Since unhealthy nodes cannot receive requests, nodes cannot be re-marked as healthy using the passive health check strategy alone, so combining the active health check strategy is usually necessary.
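The passive judgment can be sketched as a lookup of the response status in two lists. The snippet below uses the default passive status lists given in the Configuration instructions section of this article; the function name classify_response is hypothetical, and treating a status found in neither list as "no effect" is an assumption of this sketch.

```python
from typing import Optional

# Default passive status lists, copied from the configuration table in this article.
PASSIVE_HEALTHY_STATUSES = frozenset(
    [200, 201, 202, 203, 204, 205, 206, 207, 208, 226,
     300, 301, 302, 303, 304, 305, 306, 307, 308]
)
PASSIVE_UNHEALTHY_STATUSES = frozenset([429, 500, 503])


def classify_response(
    status: int,
    healthy=PASSIVE_HEALTHY_STATUSES,
    unhealthy=PASSIVE_UNHEALTHY_STATUSES,
) -> Optional[str]:
    """Map an upstream response status to the counter it would affect."""
    if status in healthy:
        return "success"
    if status in unhealthy:
        return "http_failure"
    # Assumption: a status in neither list leaves the counters unchanged.
    return None
```

Because an unhealthy node receives no traffic, this function is never called for it again, which is why passive checks alone cannot bring a node back.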

Tip
  • The health check starts only when the upstream is hit by a request. An upstream that is configured but not in use is not checked.
  • If no healthy node can be chosen, APISIX continues to access the upstream.
  • The health check is not started when the upstream has only one node, because that node is accessed whether it is healthy or not.
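The last two points can be sketched as a small node-selection helper. This is an illustrative model only: the function name candidate_nodes is hypothetical, and falling back to all nodes is this sketch's reading of "continue to access the upstream".

```python
def candidate_nodes(nodes, status):
    """Return the nodes the load balancer may choose from.

    nodes:  list of "host:port" strings configured on the upstream
    status: dict mapping a node to "healthy" or "unhealthy"
            (nodes missing from the dict are treated as healthy)
    """
    if len(nodes) <= 1:
        # A single node is used regardless of its health, so no check is needed.
        return list(nodes)
    healthy = [n for n in nodes if status.get(n, "healthy") == "healthy"]
    # Assumption: with no healthy node left, fall back to every configured node.
    return healthy or list(nodes)
```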

Configuration instructions

Name | Configuration type | Value type | Valid values | Default | Description
--- | --- | --- | --- | --- | ---
upstream.checks.active.type | Active check | string | http https tcp | http | The type of active check.
upstream.checks.active.timeout | Active check | integer | | 1 | Timeout of the active check, in seconds.
upstream.checks.active.concurrency | Active check | integer | | 10 | Number of targets checked at the same time during the active check.
upstream.checks.active.http_path | Active check | string | | / | HTTP request path used by the active check.
upstream.checks.active.host | Active check | string | | ${upstream.node.host} | Hostname used in the HTTP request of the active check.
upstream.checks.active.port | Active check | integer | 1 to 65535 | ${upstream.node.port} | Host port used in the HTTP request of the active check.
upstream.checks.active.https_verify_certificate | Active check | boolean | | true | Whether to verify the SSL certificate of the remote host when the check type is HTTPS.
upstream.checks.active.req_headers | Active check | array | | [] | Additional request headers sent when the check type is HTTP or HTTPS.
upstream.checks.active.healthy.interval | Active check (healthy node) | integer | >= 1 | 1 | Interval between checks of healthy nodes, in seconds.
upstream.checks.active.healthy.http_statuses | Active check (healthy node) | array | 200 to 599 | [200, 302] | HTTP status codes that indicate a healthy node (HTTP or HTTPS checks).
upstream.checks.active.healthy.successes | Active check (healthy node) | integer | 1 to 254 | 2 | Number of successes after which a node is considered healthy.
upstream.checks.active.unhealthy.interval | Active check (unhealthy node) | integer | >= 1 | 1 | Interval between checks of unhealthy nodes, in seconds.
upstream.checks.active.unhealthy.http_statuses | Active check (unhealthy node) | array | 200 to 599 | [429, 404, 500, 501, 502, 503, 504, 505] | HTTP status codes that indicate an unhealthy node (HTTP or HTTPS checks).
upstream.checks.active.unhealthy.http_failures | Active check (unhealthy node) | integer | 1 to 254 | 5 | Number of HTTP failures after which a node is considered unhealthy (HTTP or HTTPS checks).
upstream.checks.active.unhealthy.tcp_failures | Active check (unhealthy node) | integer | 1 to 254 | 2 | Number of TCP failures after which a node is considered unhealthy (TCP checks).
upstream.checks.active.unhealthy.timeouts | Active check (unhealthy node) | integer | 1 to 254 | 3 | Number of timeouts after which a node is considered unhealthy.
upstream.checks.passive.type | Passive check | string | http https tcp | http | The type of passive check.
upstream.checks.passive.healthy.http_statuses | Passive check (healthy node) | array | 200 to 599 | [200, 201, 202, 203, 204, 205, 206, 207, 208, 226, 300, 301, 302, 303, 304, 305, 306, 307, 308] | HTTP status codes that indicate a healthy node (HTTP or HTTPS checks).
upstream.checks.passive.healthy.successes | Passive check (healthy node) | integer | 0 to 254 | 5 | Number of successes after which a node is considered healthy.
upstream.checks.passive.unhealthy.http_statuses | Passive check (unhealthy node) | array | 200 to 599 | [429, 500, 503] | HTTP status codes that indicate an unhealthy node (HTTP or HTTPS checks).
upstream.checks.passive.unhealthy.tcp_failures | Passive check (unhealthy node) | integer | 0 to 254 | 2 | Number of TCP failures after which a node is considered unhealthy (TCP checks).
upstream.checks.passive.unhealthy.timeouts | Passive check (unhealthy node) | integer | 0 to 254 | 7 | Number of timeouts after which a node is considered unhealthy.
upstream.checks.passive.unhealthy.http_failures | Passive check (unhealthy node) | integer | 0 to 254 | 5 | Number of HTTP failures after which a node is considered unhealthy (HTTP or HTTPS checks).
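The numeric ranges listed above can be checked before a configuration is submitted. The following is a rough client-side sketch, assuming a checks object shaped like the upstream.checks field; the names RANGES, lookup, and validate_checks are hypothetical, and APISIX performs its own validation independently of this.

```python
# (path under "checks", minimum, maximum or None), taken from the table in this article.
RANGES = [
    ("active.port", 1, 65535),
    ("active.healthy.interval", 1, None),
    ("active.healthy.successes", 1, 254),
    ("active.unhealthy.interval", 1, None),
    ("active.unhealthy.http_failures", 1, 254),
    ("active.unhealthy.tcp_failures", 1, 254),
    ("active.unhealthy.timeouts", 1, 254),
    ("passive.healthy.successes", 0, 254),
    ("passive.unhealthy.tcp_failures", 0, 254),
    ("passive.unhealthy.timeouts", 0, 254),
    ("passive.unhealthy.http_failures", 0, 254),
]


def lookup(checks, dotted):
    """Follow a dotted path through nested dicts; return None if any key is absent."""
    node = checks
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate_checks(checks):
    """Return a list of human-readable range violations (empty if none)."""
    errors = []
    for path, lo, hi in RANGES:
        value = lookup(checks, path)
        if value is None:
            continue  # omitted fields fall back to their defaults
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or value < lo or (hi is not None and value > hi):
            bound = f">= {lo}" if hi is None else f"{lo} to {hi}"
            errors.append(f"checks.{path}={value!r} is outside {bound}")
    return errors
```

Note the asymmetry the table encodes: the passive counters accept 0, while the active counters start at 1.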

Configuration example

You can enable health checks in routes via the Admin API:

Note

You can fetch the admin_key from config.yaml and save it to an environment variable with the following command:

    admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g')

    curl http://127.0.0.1:9180/apisix/admin/routes/1 -H "X-API-KEY: $admin_key" -X PUT -d '
    {
        "uri": "/index.html",
        "plugins": {
            "limit-count": {
                "count": 2,
                "time_window": 60,
                "rejected_code": 503,
                "key": "remote_addr"
            }
        },
        "upstream": {
            "nodes": {
                "127.0.0.1:1980": 1,
                "127.0.0.1:1970": 1
            },
            "type": "roundrobin",
            "retries": 2,
            "checks": {
                "active": {
                    "timeout": 5,
                    "http_path": "/status",
                    "host": "foo.com",
                    "healthy": {
                        "interval": 2,
                        "successes": 1
                    },
                    "unhealthy": {
                        "interval": 1,
                        "http_failures": 2
                    },
                    "req_headers": ["User-Agent: curl/7.29.0"]
                },
                "passive": {
                    "healthy": {
                        "http_statuses": [200, 201],
                        "successes": 3
                    },
                    "unhealthy": {
                        "http_statuses": [500],
                        "http_failures": 3,
                        "tcp_failures": 3
                    }
                }
            }
        }
    }'
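For readers who script the Admin API rather than call curl by hand, the same request can be built with the Python standard library. This is a sketch under assumptions: the helper names build_route_request and apply_route are hypothetical, the limit-count plugin from the curl example is left out for brevity, and the Admin API is assumed to listen on 127.0.0.1:9180 as in the example.

```python
import json
import urllib.request

# Health check settings copied from the curl example in this article.
CHECKS = {
    "active": {
        "timeout": 5,
        "http_path": "/status",
        "host": "foo.com",
        "healthy": {"interval": 2, "successes": 1},
        "unhealthy": {"interval": 1, "http_failures": 2},
        "req_headers": ["User-Agent: curl/7.29.0"],
    },
    "passive": {
        "healthy": {"http_statuses": [200, 201], "successes": 3},
        "unhealthy": {"http_statuses": [500], "http_failures": 3, "tcp_failures": 3},
    },
}

ROUTE = {
    "uri": "/index.html",
    "upstream": {
        "nodes": {"127.0.0.1:1980": 1, "127.0.0.1:1970": 1},
        "type": "roundrobin",
        "retries": 2,
        "checks": CHECKS,
    },
}


def build_route_request(admin_key, route=ROUTE,
                        base_url="http://127.0.0.1:9180", route_id="1"):
    """Build (but do not send) the PUT request that creates or replaces the route."""
    return urllib.request.Request(
        url=f"{base_url}/apisix/admin/routes/{route_id}",
        data=json.dumps(route).encode("utf-8"),
        headers={"X-API-KEY": admin_key},
        method="PUT",
    )


def apply_route(admin_key):
    """Send the request and return the HTTP status; needs a running Admin API."""
    with urllib.request.urlopen(build_route_request(admin_key)) as resp:
        return resp.status
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a running APISIX instance.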

If APISIX detects an unhealthy node, it writes log entries like the following to the error log:

    enabled healthcheck passive while logging request
    failed to receive status line from 'nil (127.0.0.1:1980)': closed
    unhealthy TCP increment (1/2) for '(127.0.0.1:1980)'
    failed to receive status line from 'nil (127.0.0.1:1980)': closed
    unhealthy TCP increment (2/2) for '(127.0.0.1:1980)'

Tip

To observe the above log information, you need to adjust the error log level to info.
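When watching these logs in bulk, the "increment" lines are the ones that show how close a node is to changing state. The sketch below extracts their fields with a regular expression. The pattern is derived only from the sample lines shown in this article, so the exact log format in other versions is an assumption, and parse_increment is a hypothetical helper name.

```python
import re

# Matches lines such as: unhealthy TCP increment (1/2) for '(127.0.0.1:1980)'
INCREMENT = re.compile(
    r"\b(?P<report>healthy|unhealthy) (?P<counter>\w+) increment "
    r"\((?P<count>\d+)/(?P<limit>\d+)\) for '[^(']*\((?P<addr>[^)']+)\)?'"
)


def parse_increment(line):
    """Return the fields of an increment log line, or None for any other line."""
    match = INCREMENT.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["count"] = int(fields["count"])
    fields["limit"] = int(fields["limit"])
    return fields
```

For the third sample line, the parsed count is 1 and the limit is 2; when count equals limit, the node has just been marked with the reported state.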

The health check status can be fetched via GET /v1/healthcheck in the Control API.

    curl http://127.0.0.1:9090/v1/healthcheck/upstreams/healthycheck -s | jq .

Health Check Status

APISIX reports detailed health check information; the status and counter fields are the most useful for monitoring. A node is in one of four states: healthy, unhealthy, mostly_unhealthy, and mostly_healthy.

The mostly_healthy status means that the node is currently considered healthy, but its recent health checks have not all succeeded. The mostly_unhealthy status means that the node is currently considered unhealthy, but its recent health checks have not all failed.

The transition of a node's state depends on whether the current health check succeeds or fails, and on four metrics recorded in the counter: tcp_failure, http_failure, success, and timeout_failure.

To retrieve health check information, you can use the following curl command:

    curl -i http://127.0.0.1:9090/v1/healthcheck

Response Example:

    [
      {
        "nodes": {},
        "name": "/apisix/routes/1",
        "type": "http"
      },
      {
        "nodes": [
          {
            "port": 1970,
            "hostname": "127.0.0.1",
            "status": "healthy",
            "ip": "127.0.0.1",
            "counter": {
              "tcp_failure": 0,
              "http_failure": 0,
              "success": 0,
              "timeout_failure": 0
            }
          },
          {
            "port": 1980,
            "hostname": "127.0.0.1",
            "status": "healthy",
            "ip": "127.0.0.1",
            "counter": {
              "tcp_failure": 0,
              "http_failure": 0,
              "success": 0,
              "timeout_failure": 0
            }
          }
        ],
        "name": "/apisix/routes/example-hc-route",
        "type": "http"
      }
    ]
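A monitoring script usually only cares about nodes that are not fully healthy. The sketch below filters a response of the shape shown above; the function name nodes_not_healthy is hypothetical, and the handling of "nodes": {} reflects only what the sample response shows.

```python
import json


def nodes_not_healthy(payload):
    """List (checker name, "ip:port", status) for every node whose status is not "healthy"."""
    result = []
    for checker in json.loads(payload):
        nodes = checker.get("nodes") or []
        # The sample response serializes an empty node set as {} rather than [].
        if isinstance(nodes, dict):
            nodes = list(nodes.values())
        for node in nodes:
            if node.get("status") != "healthy":
                address = f'{node["ip"]}:{node["port"]}'
                result.append((checker["name"], address, node["status"]))
    return result
```

The payload argument is the response body as text, for example the output of the curl command above without the -i flag.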

State Transition Diagram

[Figure: state transition diagram of the node health status]

Note that all nodes start with the healthy status without any initial probes, and the counter only resets and updates with a state change. Hence, when nodes are healthy and all subsequent checks are successful, the success counter is not updated and remains zero.

Counter Information

In the event of a health check failure, the success count in the counter will be reset to zero. Upon a successful health check, the tcp_failure, http_failure, and timeout_failure data will be reset to zero.

Name | Description | Purpose
--- | --- | ---
success | Number of successful health checks | When success reaches the configured healthy.successes value, the node transitions to the healthy state.
tcp_failure | Number of TCP health check failures | When tcp_failure reaches the configured unhealthy.tcp_failures value, the node transitions to the unhealthy state.
http_failure | Number of HTTP health check failures | When http_failure reaches the configured unhealthy.http_failures value, the node transitions to the unhealthy state.
timeout_failure | Number of health check timeouts | When timeout_failure reaches the configured unhealthy.timeouts value, the node transitions to the unhealthy state.
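The counter rules can be put together in a small model. This is a simplified sketch of the behaviour described in this section, not the lua-resty-healthcheck implementation: the names Thresholds, Node, and report are hypothetical, the default thresholds are the active-check defaults from the configuration table, and a transition is taken when a counter reaches its threshold, as in the "(2/2)" log line shown earlier.

```python
from dataclasses import dataclass, field

FAILURE_KINDS = ("tcp_failure", "http_failure", "timeout_failure")


@dataclass
class Thresholds:
    successes: int = 2        # healthy.successes
    tcp_failure: int = 2      # unhealthy.tcp_failures
    http_failure: int = 5     # unhealthy.http_failures
    timeout_failure: int = 3  # unhealthy.timeouts


@dataclass
class Node:
    status: str = "healthy"   # all nodes start as healthy
    counter: dict = field(default_factory=lambda: {
        "success": 0, "tcp_failure": 0, "http_failure": 0, "timeout_failure": 0})


def report(node, kind, limits=Thresholds()):
    """Apply one check result ("success" or a failure kind) and return the new status."""
    if kind == "success":
        if node.status == "healthy":
            return node.status  # fully healthy: the success counter is not updated
        for name in FAILURE_KINDS:
            node.counter[name] = 0  # a success resets the failure counters
        node.counter["success"] += 1
        if node.counter["success"] >= limits.successes:
            node.status = "healthy"
        elif node.status == "unhealthy":
            node.status = "mostly_unhealthy"
    else:
        if node.status == "unhealthy":
            return node.status  # fully unhealthy: failures are not counted
        node.counter["success"] = 0  # a failure resets the success counter
        node.counter[kind] += 1
        if node.counter[kind] >= getattr(limits, kind):
            node.status = "unhealthy"
        elif node.status == "healthy":
            node.status = "mostly_healthy"
    return node.status
```

With the default thresholds, one TCP failure moves a healthy node to mostly_healthy, a second moves it to unhealthy, and two successes in a row are then needed to return it to healthy.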