Management Server
Management Server Unreachability
When an Envoy instance loses connectivity with the management server, Envoy will latch on to the previous configuration while actively retrying in the background to reestablish the connection with the management server.
It is important that Envoy detects when a connection to a management server is unhealthy so that it can try to establish a new connection. Configuring either TCP keep-alives or HTTP/2 keepalives in the cluster that connects to the management server is recommended.
Envoy debug logs the fact that it is not able to establish a connection with the management server every time it attempts a connection.
connected_state statistic provides a signal for monitoring this behavior.
Statistics
Management Server has a statistics tree rooted at control_plane. with the following statistics:
Name | Type | Description |
---|---|---|
connected_state | Gauge | A boolean (1 for connected and 0 for disconnected) that indicates the current connection state with management server |
rate_limit_enforced | Counter | Total number of times rate limit was enforced for management server requests |
pending_requests | Gauge | Total number of pending requests when the rate limit was enforced |
identifier | TextReadout | The identifier of the control plane instance that sent the last discovery response |
xDS subscription statistics
Envoy discovers its various dynamic resources via discovery services referred to as xDS. Resources are requested via subscriptions, by specifying a filesystem path to watch, initiating gRPC streams or polling a REST-JSON URL.
The following statistics are generated for all subscriptions.
Name | Type | Description |
---|---|---|
config_reload | Counter | Total API fetches that resulted in a config reload due to a different config |
config_reload_time_ms | Gauge | Timestamp of the last config reload as milliseconds since the epoch |
init_fetch_timeout | Counter | Total initial fetch timeouts |
update_attempt | Counter | Total API fetches attempted |
update_success | Counter | Total API fetches completed successfully |
update_failure | Counter | Total API fetches that failed because of network errors |
update_rejected | Counter | Total API fetches that failed because of schema/validation errors |
update_time | Gauge | Timestamp of the last successful API fetch attempt as milliseconds since the epoch. Refreshed even after a trivial configuration reload that contained no configuration changes. |
version | Gauge | Hash of the contents from the last successful API fetch |
version_text | TextReadout | The version text from the last successful API fetch |
control_plane.connected_state | Gauge | A boolean (1 for connected and 0 for disconnected) that indicates the current connection state with management server |