Outlier detection
Outlier detection and ejection is the process of dynamically determining whether some number of hosts in an upstream cluster are performing unlike the others and removing them from the healthy load balancing set. Performance can be measured along different axes such as consecutive failures, temporal success rate, temporal latency, etc. Outlier detection is a form of passive health checking. Envoy also supports active health checking. Passive and active health checking can be enabled together or independently, and form the basis for an overall upstream health checking solution. Outlier detection is part of the cluster configuration and relies on filters to report errors, timeouts, and resets. Currently the following filters support outlier detection: http router, tcp proxy and redis proxy.
Detected errors fall into two categories: externally and locally originated errors. Externally generated errors are transaction specific and occur on the upstream server in response to the received request, for example an HTTP server returning error code 500 or a redis server returning a payload which cannot be decoded. Those errors are generated on the upstream host after Envoy has successfully connected to it. Locally originated errors are generated by Envoy in response to an event which interrupted or prevented communication with the upstream host. Examples of locally originated errors are timeout, TCP reset, inability to connect to a specified port, etc.
The type of detected errors depends on the filter type. The http router filter, for example, detects locally originated errors (timeouts and resets, that is, errors related to the connection to the upstream host) and, because it also understands the HTTP protocol, it reports errors returned by the HTTP server (externally generated errors). In such a scenario, even when the connection to the upstream HTTP server is successful, the transaction with the server may fail. In contrast, the tcp proxy filter does not understand any protocol above the TCP layer and reports only locally originated errors.
In the default configuration (outlier_detection.split_external_local_origin_errors is false) locally originated errors are not distinguished from externally generated (transaction) errors. All errors end up in the same bucket and are compared against the outlier_detection.consecutive_5xx, outlier_detection.consecutive_gateway_failure and outlier_detection.success_rate_stdev_factor configuration items. For example, if a connection to an upstream HTTP server fails twice because of a timeout and then, after a successful connection, the server returns error code 500, the total error count will be 3.
Outlier detection may also be configured to distinguish locally originated errors from externally originated (transaction) errors. This is done via the outlier_detection.split_external_local_origin_errors configuration item. In that mode locally originated errors are tracked by counters separate from those used for externally originated (transaction) errors, and the outlier detector may be configured to react to locally originated errors and ignore externally originated errors, or vice versa.
It is important to understand that a cluster may be shared among several filter chains. If one filter chain ejects a host based on its outlier detection type, other filter chains will also be affected even though their outlier detection type would not eject that host.
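The difference between the two modes can be illustrated with a small Python sketch. The ErrorCounters class and the replay helper are hypothetical names invented for this illustration and do not reflect Envoy's internal data structures. The sketch only shows how the same three errors (two timeouts followed by an HTTP 500) might be bucketed in each mode.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounters:
    """Hypothetical per-host error counters (illustration only)."""
    split_external_local_origin_errors: bool = False
    combined: int = 0       # used only in default mode
    external: int = 0       # used only in split mode
    local_origin: int = 0   # used only in split mode

    def record(self, origin: str) -> None:
        """Record one error; origin must be 'local' or 'external'."""
        if origin not in ("local", "external"):
            raise ValueError(f"unknown error origin: {origin!r}")
        if not self.split_external_local_origin_errors:
            # Default mode: every error lands in the same bucket.
            self.combined += 1
        elif origin == "local":
            self.local_origin += 1
        else:
            self.external += 1


def replay(events, split):
    """Feed a sequence of error origins into a fresh set of counters."""
    counters = ErrorCounters(split_external_local_origin_errors=split)
    for origin in events:
        counters.record(origin)
    return counters


# Two connection timeouts, then an HTTP 500 after a successful connection.
events = ["local", "local", "external"]
default_mode = replay(events, split=False)
split_mode = replay(events, split=True)
print(default_mode.combined)                         # 3
print(split_mode.local_origin, split_mode.external)  # 2 1
```

In default mode the total matches the count of 3 from the example above, while in split mode the two kinds of error can be compared against different thresholds.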
Ejection algorithm
Depending on the type of outlier detection, ejection either runs inline (for example in the case of consecutive 5xx) or at a specified interval (for example in the case of periodic success rate). The ejection algorithm works as follows:
1. A host is determined to be an outlier.
2. If no hosts have been ejected, Envoy will eject the host immediately. Otherwise, it checks to make sure the number of ejected hosts is below the allowed threshold (specified via the outlier_detection.max_ejection_percent setting). If the number of ejected hosts is above the threshold, the host is not ejected.
3. The host is ejected for some number of milliseconds. Ejection means that the host is marked unhealthy and will not be used during load balancing unless the load balancer is in a panic scenario. The number of milliseconds is equal to the outlier_detection.base_ejection_time_ms value multiplied by the number of times the host has been ejected. This causes hosts to get ejected for longer and longer periods if they continue to fail.
4. An ejected host will automatically be brought back into service after the ejection time has been satisfied. Generally, outlier detection is used alongside active health checking for a comprehensive health checking solution.
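The ejection steps described above can be sketched in Python as follows. This is a simplified model, not Envoy's implementation: the Host and Ejector classes are hypothetical, time is passed in explicitly as milliseconds, and the sketch assumes that a host is refused ejection whenever the ejected percentage is not strictly below max_ejection_percent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical host record (illustration only)."""
    name: str
    times_ejected: int = 0
    ejected_until_ms: Optional[int] = None  # None means the host is in rotation


@dataclass
class Ejector:
    """Hypothetical model of the ejection steps for one cluster."""
    hosts: List[Host]
    max_ejection_percent: int
    base_ejection_time_ms: int

    def ejected_count(self) -> int:
        return sum(1 for h in self.hosts if h.ejected_until_ms is not None)

    def try_eject(self, host: Host, now_ms: int) -> bool:
        """Eject an outlier host if the ejection threshold allows it."""
        if host.ejected_until_ms is not None:
            return False  # already ejected
        ejected = self.ejected_count()
        if ejected > 0:
            # Assumption: ejection requires the percentage to be strictly
            # below the threshold.
            percent = 100.0 * ejected / len(self.hosts)
            if percent >= self.max_ejection_percent:
                return False
        host.times_ejected += 1
        # Ejection time grows with the number of times the host was ejected.
        duration_ms = self.base_ejection_time_ms * host.times_ejected
        host.ejected_until_ms = now_ms + duration_ms
        return True

    def uneject_expired(self, now_ms: int) -> None:
        """Bring hosts back into service once their ejection time is over."""
        for h in self.hosts:
            if h.ejected_until_ms is not None and now_ms >= h.ejected_until_ms:
                h.ejected_until_ms = None


a, b, c, d = (Host(n) for n in "abcd")
ejector = Ejector([a, b, c, d], max_ejection_percent=50,
                  base_ejection_time_ms=30000)
first = ejector.try_eject(a, now_ms=0)    # no hosts ejected yet: immediate
second = ejector.try_eject(b, now_ms=0)   # 25% ejected, below 50%
third = ejector.try_eject(c, now_ms=0)    # 50% ejected, not below 50%
ejector.uneject_expired(now_ms=30000)     # a and b return to service
again = ejector.try_eject(a, now_ms=30000)
print(first, second, third, again)        # True True False True
print(a.ejected_until_ms)                 # 90000
```

The second ejection of host a lasts twice the base time (60000 ms), which is why it stays out of rotation until 90000 ms in this example.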
Detection types
Envoy supports the following outlier detection types:
Consecutive 5xx
In default mode (outlier_detection.split_external_local_origin_errors is false) this detection type takes into account all generated errors: locally originated and externally originated (transaction) errors. Errors generated by non-HTTP filters, like tcp proxy or redis proxy, are internally mapped to HTTP 5xx codes and treated as such.
In split mode (outlier_detection.split_external_local_origin_errors is true) this detection type takes into account only externally originated (transaction) errors, ignoring locally originated errors. If the upstream host is an HTTP server, only 5xx errors are taken into account (see Consecutive Gateway Failure for exceptions). For redis servers served via redis proxy, only malformed responses from the server are taken into account. Properly formatted responses, even when they carry an operational error (like index not found, access denied), are not taken into account.
If an upstream host returns some number of errors which are treated as consecutive 5xx type errors, it will be ejected. The number of consecutive 5xx required for ejection is controlled by the outlier_detection.consecutive_5xx value.
Consecutive Gateway Failure
This detection type takes into account a subset of 5xx errors, called “gateway errors” (502, 503 or 504 status code), and is supported only by http router.
If an upstream host returns some number of consecutive “gateway errors” (502, 503 or 504 status code), it will be ejected. The number of consecutive gateway failures required for ejection is controlled by the outlier_detection.consecutive_gateway_failure value.
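A minimal Python sketch of the two consecutive-error detectors is shown below. The ConsecutiveDetector class is hypothetical and only models the counting. It assumes that any non-5xx response resets both runs, that a 5xx response which is not a gateway error resets the gateway run, and that the gateway check is evaluated first. These details are assumptions made for the illustration.

```python
from typing import Optional

GATEWAY_CODES = {502, 503, 504}


class ConsecutiveDetector:
    """Hypothetical per-host tracker for consecutive 5xx and gateway errors."""

    def __init__(self, consecutive_5xx: int, consecutive_gateway_failure: int):
        self.consecutive_5xx = consecutive_5xx
        self.consecutive_gateway_failure = consecutive_gateway_failure
        self.run_5xx = 0
        self.run_gateway = 0

    def on_response(self, status: int) -> Optional[str]:
        """Record one response; return the ejection reason, if any."""
        # A run continues only while every response matches the error class.
        self.run_5xx = self.run_5xx + 1 if 500 <= status <= 599 else 0
        self.run_gateway = self.run_gateway + 1 if status in GATEWAY_CODES else 0

        reason = None
        if self.run_gateway >= self.consecutive_gateway_failure:
            reason = "consecutive_gateway_failure"
        elif self.run_5xx >= self.consecutive_5xx:
            reason = "consecutive_5xx"
        if reason is not None:
            # Assumption: counting starts over once an ejection is reported.
            self.run_5xx = 0
            self.run_gateway = 0
        return reason


detector = ConsecutiveDetector(consecutive_5xx=5, consecutive_gateway_failure=3)
mixed = [detector.on_response(s) for s in (500, 503, 503, 200, 502, 503, 504)]
print(mixed[-1])    # consecutive_gateway_failure

detector = ConsecutiveDetector(consecutive_5xx=5, consecutive_gateway_failure=3)
plain = [detector.on_response(500) for _ in range(5)]
print(plain[-1])    # consecutive_5xx
```

In the first sequence the 200 response breaks both runs, so the host is reported only after three gateway errors in a row.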
Consecutive Local Origin Failure
This detection type is enabled only when outlier_detection.split_external_local_origin_errors is true and takes into account only locally originated errors (timeout, reset, etc). If Envoy repeatedly cannot connect to an upstream host or communication with the upstream host is repeatedly interrupted, it will be ejected. Various locally originated problems are detected: timeout, TCP reset, ICMP errors, etc. The number of consecutive locally originated failures required for ejection is controlled by the outlier_detection.consecutive_local_origin_failure value. This detection type is supported by http router, tcp proxy and redis proxy.
Success Rate
Success Rate based outlier ejection aggregates success rate data from every host in a cluster. Then, at given intervals, it ejects hosts based on statistical outlier detection. Success Rate outlier ejection will not be calculated for a host if its request volume over the aggregation interval is less than the outlier_detection.success_rate_request_volume value. Moreover, detection will not be performed for a cluster if the number of hosts with the minimum required request volume in an interval is less than the outlier_detection.success_rate_minimum_hosts value.
In default configuration mode (outlier_detection.split_external_local_origin_errors is false) this detection type takes into account all types of errors: locally and externally originated. The outlier_detection.enforcing_local_origin_success_rate config item is ignored.
In split mode (outlier_detection.split_external_local_origin_errors is true), locally originated errors and externally originated (transaction) errors are counted and treated separately. Most configuration items, namely outlier_detection.success_rate_minimum_hosts, outlier_detection.success_rate_request_volume, outlier_detection.success_rate_stdev_factor apply to both types of errors, but outlier_detection.enforcing_success_rate applies to externally originated errors only and outlier_detection.enforcing_local_origin_success_rate applies to locally originated errors only.
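The following Python sketch illustrates one way the success rate check could work for a single aggregation interval. The exact statistic is not spelled out in this section, so the sketch assumes that a host is an outlier when its success rate falls below the cluster mean minus stdev_factor times the population standard deviation, with stdev_factor given as a plain multiplier. The function name and the shape of the stats argument are hypothetical.

```python
from statistics import mean, pstdev


def success_rate_outliers(stats, request_volume, minimum_hosts, stdev_factor):
    """Return hosts whose success rate is a statistical outlier.

    stats maps host name -> (successful_requests, total_requests) for one
    aggregation interval. Illustration only, not Envoy's implementation.
    """
    # Hosts below the minimum request volume are not evaluated at all.
    rates = {
        host: 100.0 * ok / total
        for host, (ok, total) in stats.items()
        if total > 0 and total >= request_volume
    }
    # Too few hosts with enough traffic: skip detection for the cluster.
    if len(rates) < minimum_hosts:
        return []
    values = list(rates.values())
    # Assumed threshold: mean minus a multiple of the standard deviation.
    threshold = mean(values) - stdev_factor * pstdev(values)
    return sorted(host for host, rate in rates.items() if rate < threshold)


stats = {
    "a": (100, 100),
    "b": (99, 100),
    "c": (100, 100),
    "d": (98, 100),
    "e": (60, 100),
    "f": (5, 5),      # too little traffic to be evaluated
}
outliers = success_rate_outliers(stats, request_volume=50,
                                 minimum_hosts=5, stdev_factor=1.5)
skipped = success_rate_outliers(stats, request_volume=50,
                                minimum_hosts=6, stdev_factor=1.5)
print(outliers)  # ['e']
print(skipped)   # []
```

With five qualifying hosts the mean success rate is 91.4 and host e at 60 falls well below the assumed threshold. Raising minimum_hosts to 6 disables detection for the interval because host f does not meet the request volume.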
Failure Percentage
Failure Percentage based outlier ejection functions similarly to the success rate detection type, in that it relies on success rate data from each host in a cluster. However, rather than comparing those values to the mean success rate of the cluster as a whole, it compares them to a flat user-configured threshold. This threshold is configured via the outlier_detection.failure_percentage_threshold field.
The other configuration fields for failure percentage based ejection are similar to the fields for success rate ejection. Failure percentage based ejection also obeys outlier_detection.split_external_local_origin_errors; the enforcement percentages for externally- and locally-originated errors are controlled by outlier_detection.enforcing_failure_percentage and outlier_detection.enforcing_failure_percentage_local_origin, respectively. As with success rate detection, detection will not be performed for a host if its request volume over the aggregation interval is less than the outlier_detection.failure_percentage_request_volume value. Detection also will not be performed for a cluster if the number of hosts with the minimum required request volume in an interval is less than the outlier_detection.failure_percentage_minimum_hosts value.
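The failure percentage check can be sketched in Python in a similar way. The function below is hypothetical. It assumes that a host is flagged when its failure percentage is greater than or equal to the threshold, and it leaves out the enforcement percentages entirely.

```python
def failure_percentage_outliers(stats, request_volume, minimum_hosts, threshold):
    """Return hosts whose failure percentage reaches a flat threshold.

    stats maps host name -> (successful_requests, total_requests) for one
    aggregation interval. Illustration only, not Envoy's implementation.
    """
    # Hosts below the minimum request volume are not evaluated at all.
    failure_pct = {
        host: 100.0 * (total - ok) / total
        for host, (ok, total) in stats.items()
        if total > 0 and total >= request_volume
    }
    # Too few hosts with enough traffic: skip detection for the cluster.
    if len(failure_pct) < minimum_hosts:
        return []
    # Assumption: reaching the threshold exactly counts as an outlier.
    return sorted(h for h, pct in failure_pct.items() if pct >= threshold)


stats = {
    "a": (95, 100),   # 5% failures
    "b": (80, 100),   # 20% failures
    "c": (10, 100),   # 90% failures
    "d": (0, 3),      # too little traffic to be evaluated
}
strict = failure_percentage_outliers(stats, request_volume=50,
                                     minimum_hosts=3, threshold=85)
loose = failure_percentage_outliers(stats, request_volume=50,
                                    minimum_hosts=3, threshold=20)
print(strict)  # ['c']
print(loose)   # ['b', 'c']
```

Unlike the success rate check, the result for each host here does not depend on how the other hosts in the cluster are performing, only on the configured threshold.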
gRPC
For gRPC requests, the outlier detection will use the HTTP status mapped from the grpc-status response header. This behavior is guarded by the runtime feature envoy.reloadable_features.outlier_detection_support_for_grpc_status which defaults to true.
Ejection event logging
A log of outlier ejection events can optionally be produced by Envoy. This is extremely useful during daily operations since global stats do not provide enough information on which hosts are being ejected and for what reasons. The log is structured as protobuf-based dumps of OutlierDetectionEvent messages. Ejection event logging is configured in the Cluster manager outlier detection configuration.
Configuration reference
Cluster manager global configuration
Per cluster configuration
Runtime settings
Statistics reference