Statistics

General

The cluster manager has a statistics tree rooted at cluster_manager. with the following statistics. Any : character in the stats name is replaced with _. Stats include all clusters managed by the cluster manager, including both clusters used for data plane upstreams and control plane xDS clusters.

| Name | Type | Description |
|---|---|---|
| cluster_added | Counter | Total clusters added (either via static config or CDS) |
| cluster_modified | Counter | Total clusters modified (via CDS) |
| cluster_removed | Counter | Total clusters removed (via CDS) |
| cluster_updated | Counter | Total cluster updates |
| cluster_updated_via_merge | Counter | Total cluster updates applied as merged updates |
| update_merge_cancelled | Counter | Total merged updates that got cancelled and delivered early |
| update_out_of_merge_window | Counter | Total updates which arrived out of a merge window |
| active_clusters | Gauge | Number of currently active (warmed) clusters |
| warming_clusters | Gauge | Number of currently warming (not active) clusters |
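
The naming rule stated above, that any : character in a stat name is replaced with _, can be sketched in a few lines of Python. The helper name sanitize_stat_name and the sample name containing backend:8080 are hypothetical and chosen only for illustration; this is not Envoy's implementation.

```python
def sanitize_stat_name(name: str) -> str:
    """Replace every ':' with '_' (illustrative helper, not an Envoy API)."""
    return name.replace(":", "_")


# Hypothetical stat name that contains a ':' character.
raw = "cluster.backend:8080.upstream_cx_total"
sanitized = sanitize_stat_name(raw)
print(sanitized)
```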

Every cluster has a statistics tree rooted at cluster.<name>. with the following statistics:

| Name | Type | Description |
|---|---|---|
| upstream_cx_total | Counter | Total connections |
| upstream_cx_active | Gauge | Total active connections |
| upstream_cx_http1_total | Counter | Total HTTP/1.1 connections |
| upstream_cx_http2_total | Counter | Total HTTP/2 connections |
| upstream_cx_connect_fail | Counter | Total connection failures |
| upstream_cx_connect_timeout | Counter | Total connection connect timeouts |
| upstream_cx_idle_timeout | Counter | Total connection idle timeouts |
| upstream_cx_connect_attempts_exceeded | Counter | Total consecutive connection failures exceeding configured connection attempts |
| upstream_cx_overflow | Counter | Total times that the cluster’s connection circuit breaker overflowed |
| upstream_cx_connect_ms | Histogram | Connection establishment milliseconds |
| upstream_cx_length_ms | Histogram | Connection length milliseconds |
| upstream_cx_destroy | Counter | Total destroyed connections |
| upstream_cx_destroy_local | Counter | Total connections destroyed locally |
| upstream_cx_destroy_remote | Counter | Total connections destroyed remotely |
| upstream_cx_destroy_with_active_rq | Counter | Total connections destroyed with 1+ active request |
| upstream_cx_destroy_local_with_active_rq | Counter | Total connections destroyed locally with 1+ active request |
| upstream_cx_destroy_remote_with_active_rq | Counter | Total connections destroyed remotely with 1+ active request |
| upstream_cx_close_notify | Counter | Total connections closed via HTTP/1.1 connection close header or HTTP/2 GOAWAY |
| upstream_cx_rx_bytes_total | Counter | Total received connection bytes |
| upstream_cx_rx_bytes_buffered | Gauge | Received connection bytes currently buffered |
| upstream_cx_tx_bytes_total | Counter | Total sent connection bytes |
| upstream_cx_tx_bytes_buffered | Gauge | Sent connection bytes currently buffered |
| upstream_cx_protocol_error | Counter | Total connection protocol errors |
| upstream_cx_max_requests | Counter | Total connections closed due to maximum requests |
| upstream_cx_none_healthy | Counter | Total times connection not established due to no healthy hosts |
| upstream_rq_total | Counter | Total requests |
| upstream_rq_active | Gauge | Total active requests |
| upstream_rq_pending_total | Counter | Total requests pending a connection pool connection |
| upstream_rq_pending_overflow | Counter | Total requests that overflowed connection pool circuit breaking and were failed |
| upstream_rq_pending_failure_eject | Counter | Total requests that were failed due to a connection pool connection failure |
| upstream_rq_pending_active | Gauge | Total active requests pending a connection pool connection |
| upstream_rq_cancelled | Counter | Total requests cancelled before obtaining a connection pool connection |
| upstream_rq_maintenance_mode | Counter | Total requests that resulted in an immediate 503 due to maintenance mode |
| upstream_rq_timeout | Counter | Total requests that timed out waiting for a response |
| upstream_rq_per_try_timeout | Counter | Total requests that hit the per try timeout |
| upstream_rq_rx_reset | Counter | Total requests that were reset remotely |
| upstream_rq_tx_reset | Counter | Total requests that were reset locally |
| upstream_rq_retry | Counter | Total request retries |
| upstream_rq_retry_success | Counter | Total request retry successes |
| upstream_rq_retry_overflow | Counter | Total requests not retried due to circuit breaking |
| upstream_flow_control_paused_reading_total | Counter | Total number of times flow control paused reading from upstream |
| upstream_flow_control_resumed_reading_total | Counter | Total number of times flow control resumed reading from upstream |
| upstream_flow_control_backed_up_total | Counter | Total number of times the upstream connection backed up and paused reads from downstream |
| upstream_flow_control_drained_total | Counter | Total number of times the upstream connection drained and resumed reads from downstream |
| upstream_internal_redirect_failed_total | Counter | Total number of times failed internal redirects resulted in redirects being passed downstream |
| upstream_internal_redirect_succeed_total | Counter | Total number of times internal redirects resulted in a second upstream request |
| membership_change | Counter | Total cluster membership changes |
| membership_healthy | Gauge | Current cluster healthy total (inclusive of both health checking and outlier detection) |
| membership_degraded | Gauge | Current cluster degraded total |
| membership_total | Gauge | Current cluster membership total |
| retry_or_shadow_abandoned | Counter | Total number of times shadowing or retry buffering was canceled due to buffer limits |
| config_reload | Counter | Total API fetches that resulted in a config reload due to a different config |
| update_attempt | Counter | Total cluster membership update attempts |
| update_success | Counter | Total cluster membership update successes |
| update_failure | Counter | Total cluster membership update failures |
| update_empty | Counter | Total cluster membership updates ending with empty cluster load assignment and continuing with previous config |
| update_no_rebuild | Counter | Total successful cluster membership updates that didn’t result in any cluster load balancing structure rebuilds |
| version | Gauge | Hash of the contents from the last successful API fetch |
| max_host_weight | Gauge | Maximum weight of any host in the cluster |
| bind_errors | Counter | Total errors binding the socket to the configured source address |
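
As a rough illustration of how the per-cluster gauges above might be consumed, the following Python sketch parses a text dump of stats and computes the fraction of healthy hosts from membership_healthy and membership_total. It assumes the dump is a sequence of "name: value" lines; the cluster name backend, the sample values, and the helper names parse_stats and healthy_fraction are all hypothetical.

```python
def parse_stats(text: str) -> dict:
    """Parse 'name: value' lines into a dict, skipping non-integer values."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(": ")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def healthy_fraction(stats: dict, cluster: str) -> float:
    """Return membership_healthy / membership_total for one cluster."""
    prefix = f"cluster.{cluster}."
    total = stats.get(prefix + "membership_total", 0)
    healthy = stats.get(prefix + "membership_healthy", 0)
    return healthy / total if total else 0.0


# Made-up sample data for a hypothetical cluster named "backend".
SAMPLE = """\
cluster.backend.membership_total: 8
cluster.backend.membership_healthy: 6
cluster.backend.upstream_cx_active: 12
"""

stats = parse_stats(SAMPLE)
fraction = healthy_fraction(stats, "backend")
print(f"healthy fraction: {fraction:.2f}")
```

Lines whose value is not a plain integer are skipped rather than guessed at, so the parser ignores anything it does not understand.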

Health check statistics

If health checking is configured, the cluster has an additional statistics tree rooted at cluster.<name>.health_check. with the following statistics:

| Name | Type | Description |
|---|---|---|
| attempt | Counter | Number of health checks |
| success | Counter | Number of successful health checks |
| failure | Counter | Number of immediately failed health checks (e.g. HTTP 503) as well as network failures |
| passive_failure | Counter | Number of health check failures due to passive events (e.g. x-envoy-immediate-health-check-fail) |
| network_failure | Counter | Number of health check failures due to network error |
| verify_cluster | Counter | Number of health checks that attempted cluster name verification |
| healthy | Gauge | Number of healthy members |
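
One way to read these counters together is sketched below in Python. It relies on the description of failure above, which says that counter also includes network failures, so subtracting network_failure gives the failures from other causes. The function name health_check_summary and the sample values are hypothetical.

```python
def health_check_summary(counters: dict) -> dict:
    """Summarize health check counters (illustrative helper only)."""
    attempts = counters.get("attempt", 0)
    failures = counters.get("failure", 0)
    network = counters.get("network_failure", 0)
    success_rate = counters.get("success", 0) / attempts if attempts else 0.0
    return {
        "success_rate": success_rate,
        "network_failures": network,
        # Assumes network_failure is a subset of failure, per the table above.
        "other_failures": failures - network,
    }


# Made-up sample values.
summary = health_check_summary(
    {"attempt": 100, "success": 90, "failure": 10, "network_failure": 4}
)
print(summary)
```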

Outlier detection statistics

If outlier detection is configured for a cluster, statistics will be rooted at cluster.<name>.outlier_detection. and contain the following:

| Name | Type | Description |
|---|---|---|
| ejections_enforced_total | Counter | Number of enforced ejections due to any outlier type |
| ejections_active | Gauge | Number of currently ejected hosts |
| ejections_overflow | Counter | Number of ejections aborted due to the max ejection % |
| ejections_enforced_consecutive_5xx | Counter | Number of enforced consecutive 5xx ejections |
| ejections_detected_consecutive_5xx | Counter | Number of detected consecutive 5xx ejections (even if unenforced) |
| ejections_enforced_success_rate | Counter | Number of enforced success rate outlier ejections |
| ejections_detected_success_rate | Counter | Number of detected success rate outlier ejections (even if unenforced) |
| ejections_enforced_consecutive_gateway_failure | Counter | Number of enforced consecutive gateway failure ejections |
| ejections_detected_consecutive_gateway_failure | Counter | Number of detected consecutive gateway failure ejections (even if unenforced) |
| ejections_total | Counter | Deprecated. Number of ejections due to any outlier type (even if unenforced) |
| ejections_consecutive_5xx | Counter | Deprecated. Number of consecutive 5xx ejections (even if unenforced) |
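
Because each outlier type has both a detected and an enforced counter, the difference between the two gives the number of ejections that were detected but not enforced. The Python sketch below computes that per type; the function name unenforced_ejections and the sample values are hypothetical.

```python
# Outlier types that have both a detected and an enforced counter above.
OUTLIER_TYPES = ("consecutive_5xx", "success_rate", "consecutive_gateway_failure")


def unenforced_ejections(counters: dict) -> dict:
    """Return detected-minus-enforced ejection counts per outlier type."""
    result = {}
    for kind in OUTLIER_TYPES:
        detected = counters.get(f"ejections_detected_{kind}", 0)
        enforced = counters.get(f"ejections_enforced_{kind}", 0)
        result[kind] = detected - enforced
    return result


# Made-up sample values.
sample = {
    "ejections_detected_consecutive_5xx": 7,
    "ejections_enforced_consecutive_5xx": 5,
    "ejections_detected_success_rate": 3,
    "ejections_enforced_success_rate": 3,
}
gaps = unenforced_ejections(sample)
print(gaps)
```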

Circuit breakers statistics

Circuit breakers statistics will be rooted at cluster.<name>.circuit_breakers.<priority>. and contain the following:

| Name | Type | Description |
|---|---|---|
| cx_open | Gauge | Whether the connection circuit breaker is closed (0) or open (1) |
| cx_pool_open | Gauge | Whether the connection pool circuit breaker is closed (0) or open (1) |
| rq_pending_open | Gauge | Whether the pending requests circuit breaker is closed (0) or open (1) |
| rq_open | Gauge | Whether the requests circuit breaker is closed (0) or open (1) |
| rq_retry_open | Gauge | Whether the retry circuit breaker is closed (0) or open (1) |
| remaining_cx | Gauge | Number of remaining connections until the circuit breaker opens |
| remaining_pending | Gauge | Number of remaining pending requests until the circuit breaker opens |
| remaining_rq | Gauge | Number of remaining requests until the circuit breaker opens |
| remaining_retries | Gauge | Number of remaining retries until the circuit breaker opens |
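
Since the *_open gauges take the value 0 (closed) or 1 (open), a monitoring script could list the breakers that are currently open as in the Python sketch below. It assumes the gauges for one cluster have already been collected into a dict keyed by the short stat name; the function name open_breakers and the sample values are hypothetical.

```python
# The 0/1 gauges listed in the table above.
BREAKER_GAUGES = ("cx_open", "cx_pool_open", "rq_pending_open", "rq_open", "rq_retry_open")


def open_breakers(gauges: dict) -> list:
    """Return the names of circuit breaker gauges that currently read 1."""
    return [name for name in BREAKER_GAUGES if gauges.get(name, 0) == 1]


# Made-up sample values.
sample = {
    "cx_open": 0,
    "rq_pending_open": 1,
    "rq_open": 0,
    "rq_retry_open": 1,
    "remaining_pending": 0,
}
tripped = open_breakers(sample)
print(tripped)
```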

Dynamic HTTP statistics

If HTTP is used, dynamic HTTP response code statistics are also available. These are emitted by various internal systems as well as some filters such as the router filter and rate limit filter. They are rooted at cluster.<name>. and contain the following statistics:

| Name | Type | Description |
|---|---|---|
| upstream_rq_completed | Counter | Total upstream requests completed |
| upstream_rq_<*xx> | Counter | Aggregate HTTP response codes (e.g., 2xx, 3xx, etc.) |
| upstream_rq_<*> | Counter | Specific HTTP response codes (e.g., 201, 302, etc.) |
| upstream_rq_time | Histogram | Request time milliseconds |
| canary.upstream_rq_completed | Counter | Total upstream canary requests completed |
| canary.upstream_rq_<*xx> | Counter | Upstream canary aggregate HTTP response codes |
| canary.upstream_rq_<*> | Counter | Upstream canary specific HTTP response codes |
| canary.upstream_rq_time | Histogram | Upstream canary request time milliseconds |
| internal.upstream_rq_completed | Counter | Total internal origin requests completed |
| internal.upstream_rq_<*xx> | Counter | Internal origin aggregate HTTP response codes |
| internal.upstream_rq_<*> | Counter | Internal origin specific HTTP response codes |
| internal.upstream_rq_time | Histogram | Internal origin request time milliseconds |
| external.upstream_rq_completed | Counter | Total external origin requests completed |
| external.upstream_rq_<*xx> | Counter | External origin aggregate HTTP response codes |
| external.upstream_rq_<*> | Counter | External origin specific HTTP response codes |
| external.upstream_rq_time | Histogram | External origin request time milliseconds |
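
To make the naming pattern concrete, the Python sketch below lists the counter names that a single response code would map to: the completed counter, the aggregate class (e.g., 2xx), and the specific code, each repeated under the canary. and internal. or external. prefixes. It assumes the counters share an upstream_rq_ prefix and that a request is classified as exactly one of internal or external origin; the function name response_code_stats is hypothetical, and this is an illustration of the naming, not Envoy's implementation.

```python
def response_code_stats(code: int, canary: bool = False, internal: bool = False) -> list:
    """List counter names for one response code (illustrative only)."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    base = [
        "upstream_rq_completed",
        f"upstream_rq_{code // 100}xx",  # aggregate class, e.g. 2xx
        f"upstream_rq_{code}",           # specific code, e.g. 201
    ]
    names = list(base)
    if canary:
        names += ["canary." + name for name in base]
    # Assumption: every request is either internal or external origin.
    origin = "internal." if internal else "external."
    names += [origin + name for name in base]
    return names


names = response_code_stats(201)
print(names)
```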

Alternate tree dynamic HTTP statistics

If alternate tree statistics are configured, they will be present in the cluster.<cluster name>.<alt name>. namespace. The statistics produced are the same as documented in the dynamic HTTP statistics section above.

Per service zone dynamic HTTP statistics

If the service zone is available for the local service (via --service-zone) and the upstream cluster, Envoy will track the following statistics in the cluster.<name>.zone.<from_zone>.<to_zone>. namespace.

| Name | Type | Description |
|---|---|---|
| upstream_rq_<*xx> | Counter | Aggregate HTTP response codes (e.g., 2xx, 3xx, etc.) |
| upstream_rq_<*> | Counter | Specific HTTP response codes (e.g., 201, 302, etc.) |
| upstream_rq_time | Histogram | Request time milliseconds |

Load balancer statistics

Statistics for monitoring load balancer decisions. Stats are rooted at cluster.<name>. and contain the following statistics:

| Name | Type | Description |
|---|---|---|
| lb_recalculate_zone_structures | Counter | The number of times locality aware routing structures are regenerated for fast decisions on upstream locality selection |
| lb_healthy_panic | Counter | Total requests load balanced with the load balancer in panic mode |
| lb_zone_cluster_too_small | Counter | No zone aware routing because of small upstream cluster size |
| lb_zone_routing_all_directly | Counter | Sending all requests directly to the same zone |
| lb_zone_routing_sampled | Counter | Sending some requests to the same zone |
| lb_zone_routing_cross_zone | Counter | Zone aware routing mode but have to send cross zone |
| lb_local_cluster_not_ok | Counter | Local host set is not set or the local cluster is in panic mode |
| lb_zone_number_differs | Counter | Number of zones in the local and upstream clusters differs |
| lb_zone_no_capacity_left | Counter | Total number of times zone selection ended with a random zone due to rounding error |
| original_dst_host_invalid | Counter | Total number of invalid hosts passed to original destination load balancer |

Load balancer subset statistics

Statistics for monitoring load balancer subset decisions. Stats are rooted at cluster.<name>. and contain the following statistics:

| Name | Type | Description |
|---|---|---|
| lb_subsets_active | Gauge | Number of currently available subsets |
| lb_subsets_created | Counter | Number of subsets created |
| lb_subsets_removed | Counter | Number of subsets removed due to no hosts |
| lb_subsets_selected | Counter | Number of times any subset was selected for load balancing |
| lb_subsets_fallback | Counter | Number of times the fallback policy was invoked |
| lb_subsets_fallback_panic | Counter | Number of times the subset panic mode triggered |

Ring hash load balancer statistics

Statistics for monitoring the size and effective distribution of hashes when using the ring hash load balancer. Stats are rooted at cluster.<name>.ring_hash_lb. and contain the following statistics:

| Name | Type | Description |
|---|---|---|
| size | Gauge | Total number of host hashes on the ring |
| min_hashes_per_host | Gauge | Minimum number of hashes for a single host |
| max_hashes_per_host | Gauge | Maximum number of hashes for a single host |
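
A simple way to gauge how evenly hashes are spread across hosts is the ratio of max_hashes_per_host to min_hashes_per_host, where a value near 1 suggests an even distribution. The Python sketch below computes it; the function name ring_imbalance, the sample values, and the choice of this ratio as a metric are assumptions for illustration.

```python
def ring_imbalance(gauges: dict) -> float:
    """Return max_hashes_per_host / min_hashes_per_host (1.0 means even)."""
    lowest = gauges.get("min_hashes_per_host", 0)
    highest = gauges.get("max_hashes_per_host", 0)
    if lowest <= 0:
        # No meaningful ratio: infinite if some host has hashes, else even.
        return float("inf") if highest > 0 else 1.0
    return highest / lowest


# Made-up sample values.
ratio = ring_imbalance(
    {"size": 1024, "min_hashes_per_host": 100, "max_hashes_per_host": 150}
)
print(f"imbalance ratio: {ratio:.2f}")
```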

Maglev load balancer statistics

Statistics for monitoring effective host weights when using the Maglev load balancer. Stats are rooted at cluster.<name>.maglev_lb. and contain the following statistics:

| Name | Type | Description |
|---|---|---|
| min_entries_per_host | Gauge | Minimum number of entries for a single host |
| max_entries_per_host | Gauge | Maximum number of entries for a single host |