Monitoring & Metrics

Cilium and Hubble can both be configured to serve Prometheus metrics. Prometheus is a pluggable metrics collection and storage system and can act as a data source for Grafana, a metrics visualization frontend. Unlike push-based collectors such as statsd, Prometheus pulls metrics from each source itself.

Cilium and Hubble metrics can be enabled independently of each other.

Cilium Metrics

Cilium metrics provide insights into the state of Cilium itself, namely of the cilium-agent, cilium-envoy, and cilium-operator processes. To run Cilium with Prometheus metrics enabled, deploy it with the prometheus.enabled=true Helm value set.

Cilium metrics are exported under the cilium_ Prometheus namespace. Envoy metrics are exported under the envoy_ Prometheus namespace, of which the Cilium-defined metrics are exported under the envoy_cilium_ namespace. When running and collecting in Kubernetes they will be tagged with a pod name and namespace.

Installation

You can enable metrics for cilium-agent (including Envoy) with the Helm value prometheus.enabled=true. To enable metrics for cilium-operator, use operator.prometheus.enabled=true.

```shell
helm install cilium cilium/cilium --version 1.11.7 \
  --namespace kube-system \
  --set prometheus.enabled=true \
  --set operator.prometheus.enabled=true
```

The ports can be configured via prometheus.port, proxy.prometheus.port, or operator.prometheus.port respectively.
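For example, the ports could be pinned explicitly in a Helm values file. This is a sketch only; the port numbers shown are the chart defaults referenced elsewhere on this page (9090 for the agent, 9095 for Envoy, 6942 for the operator):

```yaml
# Hypothetical values.yaml pinning the metrics ports to the chart defaults.
prometheus:
  enabled: true
  port: 9090        # cilium-agent metrics
proxy:
  prometheus:
    port: 9095      # Envoy metrics
operator:
  prometheus:
    enabled: true
    port: 6942      # cilium-operator metrics
```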

When metrics are enabled, all Cilium components will have the following annotations. They can be used to signal Prometheus whether to scrape metrics:

```yaml
prometheus.io/scrape: true
prometheus.io/port: 9090
```

To collect Envoy metrics the Cilium chart will create a Kubernetes headless service named cilium-agent with the prometheus.io/scrape:'true' annotation set:

```yaml
prometheus.io/scrape: true
prometheus.io/port: 9095
```

This additional headless service is needed because each component can only carry one set of Prometheus scrape and port annotations.

Prometheus will pick up the Cilium and Envoy metrics automatically if the following scrape configuration is set in the scrape_configs section:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: (.+):(?:\d+);(\d+)
        replacement: ${1}:${2}
        target_label: __address__
```
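The address-rewriting rule can be illustrated outside Prometheus: Prometheus joins the source_labels with `;` and applies the anchored regex, rewriting `__address__` so the port from the pod annotation is used. A small sketch of that behavior (the helper name and the sample address are illustrative only):

```python
import re

def relabel_address(address, annotated_port):
    """Mimic the relabel rule: join source labels with ';' and apply the
    anchored regex to swap the discovered port for the annotated one."""
    joined = f"{address};{annotated_port}"
    m = re.fullmatch(r"(.+):(?:\d+);(\d+)", joined)
    return f"{m.group(1)}:{m.group(2)}" if m else address

# The discovered pod address is rewritten to use the metrics port.
print(relabel_address("10.0.1.5:443", "9090"))  # -> 10.0.1.5:9090
# An address without a port does not match; __address__ is left unchanged.
print(relabel_address("10.0.1.5", "9090"))      # -> 10.0.1.5
```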

Hubble Metrics

While Cilium metrics allow you to monitor the state of Cilium itself, Hubble metrics allow you to monitor the network behavior of your Cilium-managed Kubernetes pods with respect to connectivity and security.

Installation

To deploy Cilium with Hubble metrics enabled, you need to enable Hubble with hubble.enabled=true and provide a set of Hubble metrics you want to enable via hubble.metrics.enabled.

Some of the metrics can also be configured with additional options. See the Hubble exported metrics section for the full list of available metrics and their options.

```shell
helm install cilium cilium/cilium --version 1.11.7 \
  --namespace kube-system \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"
```

The port of the Hubble metrics can be configured with the hubble.metrics.port Helm value.

Note

L7 metrics, such as the HTTP metrics, are only emitted for pods that enable Layer 7 Protocol Visibility.

When deployed with a non-empty hubble.metrics.enabled Helm value, the Cilium chart will create a Kubernetes headless service named hubble-metrics with the prometheus.io/scrape:'true' annotation set:

```yaml
prometheus.io/scrape: true
prometheus.io/port: 9091
```

Set the following options in the scrape_configs section of Prometheus to have it scrape all Hubble metrics from the endpoints automatically:

```yaml
scrape_configs:
  - job_name: 'kubernetes-endpoints'
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
```

Example Prometheus & Grafana Deployment

If you don’t have an existing Prometheus and Grafana stack running, you can deploy a stack with:

```shell
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.11/examples/kubernetes/addons/prometheus/monitoring-example.yaml
```

It will run Prometheus and Grafana in the cilium-monitoring namespace. If you have enabled either Cilium or Hubble metrics, they will automatically be scraped by Prometheus. You can then expose Grafana and access it via your browser.

```shell
kubectl -n cilium-monitoring port-forward service/grafana --address 0.0.0.0 --address :: 3000:3000
```

Open your browser and access http://localhost:3000/

Metrics Reference

cilium-agent

Configuration

To expose any metrics, invoke cilium-agent with the --prometheus-serve-addr option. This option takes an IP:Port pair; passing an empty IP (e.g. :9090) binds the server to all available interfaces (there is usually only one in a container).
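The metrics are served in the Prometheus text exposition format. As a rough illustration, a scrape of the metrics endpoint yields lines like the hypothetical sample below (the metric name and labels follow the cilium_ namespace and the endpoint_state metric documented here; the exact values are invented), which can be parsed as:

```python
# Hypothetical sample of the text exposition format served on :9090/metrics.
sample = """\
# HELP cilium_endpoint_state Count of all endpoints
# TYPE cilium_endpoint_state gauge
cilium_endpoint_state{state="ready"} 42
cilium_endpoint_state{state="waiting-for-identity"} 1
"""

def parse_exposition(text):
    """Minimal parser: returns {(name, labels): value}, skipping comments."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        head, _, value = line.rpartition(" ")
        name, _, labels = head.partition("{")
        metrics[(name, labels.rstrip("}"))] = float(value)
    return metrics

parsed = parse_exposition(sample)
print(parsed[("cilium_endpoint_state", 'state="ready"')])  # -> 42.0
```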

Exported Metrics

Endpoint

Name | Labels | Description
---- | ------ | -----------
endpoint | | Number of endpoints managed by this agent
endpoint_regenerations_total | outcome | Count of all endpoint regenerations that have completed
endpoint_regeneration_time_stats_seconds | scope | Endpoint regeneration time stats
endpoint_state | state | Count of all endpoints

Services

Name | Labels | Description
---- | ------ | -----------
services_events_total | | Number of service events, labeled by action type

Cluster health

Name | Labels | Description
---- | ------ | -----------
unreachable_nodes | | Number of nodes that cannot be reached
unreachable_health_endpoints | | Number of health endpoints that cannot be reached
controllers_failing | | Number of failing controllers

Node Connectivity

Name | Labels | Description
---- | ------ | -----------
node_connectivity_status | source_cluster, source_node_name, target_cluster, target_node_name, target_node_type, type | The last observed status of both ICMP and HTTP connectivity between the current Cilium agent and other Cilium nodes
node_connectivity_latency_seconds | address_type, protocol, source_cluster, source_node_name, target_cluster, target_node_ip, target_node_name, target_node_type, type | The last observed latency between the current Cilium agent and other Cilium nodes in seconds

Clustermesh

Name | Labels | Description
---- | ------ | -----------
clustermesh_global_services | source_cluster, source_node_name | The total number of global services in the cluster mesh
clustermesh_remote_clusters | source_cluster, source_node_name | The total number of remote clusters meshed with the local cluster
clustermesh_remote_cluster_failures | source_cluster, source_node_name, target_cluster | The total number of failures related to the remote cluster
clustermesh_remote_cluster_nodes | source_cluster, source_node_name, target_cluster | The total number of nodes in the remote cluster
clustermesh_remote_cluster_last_failure_ts | source_cluster, source_node_name, target_cluster | The timestamp of the last failure of the remote cluster
clustermesh_remote_cluster_readiness_status | source_cluster, source_node_name, target_cluster | The readiness status of the remote cluster
Datapath

Name | Labels | Description
---- | ------ | -----------
datapath_conntrack_dump_resets_total | area, name, family | Number of conntrack dump resets. Happens when a BPF entry gets removed while dumping the map is in progress.
datapath_conntrack_gc_runs_total | status | Number of times that the conntrack garbage collector process was run
datapath_conntrack_gc_key_fallbacks_total | family | Number of times a key fallback was needed while iterating over conntrack entries during a garbage collector run
datapath_conntrack_gc_entries | family | The number of alive and deleted conntrack entries at the end of a garbage collector run
datapath_conntrack_gc_duration_seconds | status | Duration in seconds of the garbage collector process
IPSec

Name | Labels | Description
---- | ------ | -----------
ipsec_xfrm_error | error, type | Total number of xfrm errors

eBPF

Name | Labels | Description
---- | ------ | -----------
bpf_syscall_duration_seconds | operation, outcome | Duration of eBPF system call performed
bpf_map_ops_total | mapName (deprecated), map_name, operation, outcome | Number of eBPF map operations performed. mapName is deprecated and will be removed in 1.10. Use map_name instead.
bpf_map_pressure | map_name | Map pressure defined as fill-up ratio of the map. Policy maps are exceptionally reported only when the ratio is over 0.1.
bpf_maps_virtual_memory_max_bytes | | Max memory used by eBPF maps installed in the system
bpf_progs_virtual_memory_max_bytes | | Max memory used by eBPF programs installed in the system

Both bpf_maps_virtual_memory_max_bytes and bpf_progs_virtual_memory_max_bytes currently report the system-wide eBPF memory usage, whether or not it is directly managed by Cilium. This might change in the future to report only the eBPF memory usage directly managed by Cilium.

Drops/Forwards (L3/L4)

Name | Labels | Description
---- | ------ | -----------
drop_count_total | reason, direction | Total dropped packets
drop_bytes_total | reason, direction | Total dropped bytes
forward_count_total | direction | Total forwarded packets
forward_bytes_total | direction | Total forwarded bytes

Policy

Name | Labels | Description
---- | ------ | -----------
policy | | Number of policies currently loaded
policy_count | | Number of policies currently loaded (deprecated, use policy)
policy_regeneration_total | | Total number of policies regenerated successfully
policy_regeneration_time_stats_seconds | scope | Policy regeneration time stats labeled by the scope
policy_max_revision | | Highest policy revision number in the agent
policy_import_errors_total | | Number of times a policy import has failed
policy_endpoint_enforcement_status | | Number of endpoints labeled by policy enforcement status

Policy L7 (HTTP/Kafka)

Name | Labels | Description
---- | ------ | -----------
proxy_redirects | protocol | Number of redirects installed for endpoints
proxy_upstream_reply_seconds | | Seconds waited for the upstream server to reply to a request
proxy_datapath_update_timeout_total | | Number of total datapath update timeouts due to FQDN IP updates
policy_l7_total | type | Number of total L7 requests/responses

Identity

Name | Labels | Description
---- | ------ | -----------
identity | type | Number of identities currently allocated

Events external to Cilium

Name | Labels | Description
---- | ------ | -----------
event_ts | source | Last timestamp when an event was received

Controllers

Name | Labels | Description
---- | ------ | -----------
controllers_runs_total | status | Number of times that a controller process was run
controllers_runs_duration_seconds | status | Duration in seconds of the controller process

SubProcess

Name | Labels | Description
---- | ------ | -----------
subprocess_start_total | subsystem | Number of times that Cilium has started a subprocess

Kubernetes

Name | Labels | Description
---- | ------ | -----------
kubernetes_events_received_total | scope, action, validity, equal | Number of Kubernetes events received
kubernetes_events_total | scope, action, outcome | Number of Kubernetes events processed
k8s_cnp_status_completion_seconds | attempts, outcome | Duration in seconds of how long it took to complete a CNP status update

IPAM

Name | Labels | Description
---- | ------ | -----------
ipam_events_total | | Number of IPAM events received, labeled by action and datapath family type
ip_addresses | family | Number of allocated IP addresses

KVstore

Name | Labels | Description
---- | ------ | -----------
kvstore_operations_duration_seconds | action, kind, outcome, scope | Duration of kvstore operation
kvstore_events_queue_seconds | action, scope | Duration in seconds that a received event was blocked before it could be queued
kvstore_quorum_errors_total | error | Number of quorum errors

Agent

Name | Labels | Description
---- | ------ | -----------
agent_bootstrap_seconds | scope, outcome | Duration of various bootstrap phases
api_process_time_seconds | | Processing time of all the API calls made to the cilium-agent, labeled by API method, API path, and returned HTTP code

FQDN

Name | Labels | Description
---- | ------ | -----------
fqdn_gc_deletions_total | | Number of FQDNs that have been cleaned by the FQDN garbage collector job

API Rate Limiting

Name | Labels | Description
---- | ------ | -----------
cilium_api_limiter_adjustment_factor | api_call | Most recent adjustment factor for automatic adjustment
cilium_api_limiter_processed_requests_total | api_call, outcome | Total number of API requests processed
cilium_api_limiter_processing_duration_seconds | api_call, value | Mean and estimated processing duration in seconds
cilium_api_limiter_rate_limit | api_call, value | Current rate limiting configuration (limit and burst)
cilium_api_limiter_requests_in_flight | api_call, value | Current and maximum allowed number of requests in flight
cilium_api_limiter_wait_duration_seconds | api_call, value | Mean, min, and max wait duration
cilium_api_limiter_wait_history_duration_seconds | api_call | Histogram of wait duration per API call processed

cilium-operator

Configuration

cilium-operator can be configured to serve metrics by running with the option --enable-metrics. By default, the operator exposes metrics on port 6942; the port can be changed with the option --operator-prometheus-serve-addr.

Exported Metrics

All metrics are exported under the cilium_operator_ Prometheus namespace.

IPAM

Name | Labels | Description
---- | ------ | -----------
ipam_ips | type | Number of IPs allocated
ipam_allocation_ops | subnet_id | Number of IP allocation operations
ipam_interface_creation_ops | subnet_id, status | Number of interface creation operations
ipam_available | | Number of interfaces with addresses available
ipam_nodes_at_capacity | | Number of nodes unable to allocate more addresses
ipam_resync_total | | Number of synchronization operations with the external IPAM API
ipam_api_duration_seconds | operation, response_code | Duration of interactions with the external IPAM API
ipam_api_rate_limit_duration_seconds | operation | Duration of rate limiting while accessing the external IPAM API

Hubble

Configuration

Hubble metrics are served by a Hubble instance running inside cilium-agent. The command-line options to configure them are --enable-hubble, --hubble-metrics-server, and --hubble-metrics. --hubble-metrics-server takes an IP:Port pair, but passing an empty IP (e.g. :9091) will bind the server to all available interfaces. --hubble-metrics takes a comma-separated list of metrics.

Some metrics can take additional semicolon-separated options per metric, e.g. --hubble-metrics="dns:query;ignoreAAAA,http:destinationContext=pod-short" will enable the dns metric with the query and ignoreAAAA options, and the http metric with the destinationContext=pod-short option.
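The grammar of that flag (comma-separated metrics, `:` before a metric's option list, `;` between options, `=` for valued options) can be sketched with a small parser. This is an illustration of the flag's format only, not Cilium's actual implementation:

```python
def parse_hubble_metrics(spec):
    """Parse a --hubble-metrics value into {metric: {option: value or None}}."""
    parsed = {}
    for entry in spec.split(","):
        name, _, optstr = entry.partition(":")
        opts = {}
        for opt in filter(None, optstr.split(";")):
            key, _, val = opt.partition("=")
            opts[key] = val or None  # flag-style options carry no value
        parsed[name] = opts
    return parsed

print(parse_hubble_metrics("dns:query;ignoreAAAA,http:destinationContext=pod-short"))
# -> {'dns': {'query': None, 'ignoreAAAA': None},
#     'http': {'destinationContext': 'pod-short'}}
```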

Context Options

Most Hubble metrics can be configured to add the source and/or destination context as a label. The options are called sourceContext and destinationContext. The possible values are:

Option Value | Description
------------ | -----------
identity | All Cilium security identity labels
namespace | Kubernetes namespace name
pod | Kubernetes pod name
pod-short | Short version of the Kubernetes pod name. Typically the deployment/replicaset name.
dns | All known DNS names of the source or destination (comma-separated)
ip | The IPv4 or IPv6 address

When specifying the source and/or destination context, multiple contexts can be specified by separating them via the | symbol. When multiple are specified, then the first non-empty value is added to the metric as a label. For example, a metric configuration of flow:destinationContext=dns|ip will first try to use the DNS name of the target for the label. If no DNS name is known for the target, it will fall back and use the IP address of the target instead.
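The first-non-empty fallback described above can be sketched as follows (the function name and the sample values are illustrative only):

```python
def resolve_context(spec, known):
    """Return the label value for a context spec such as "dns|ip".

    Candidates are tried in order; the first non-empty value wins.
    """
    for candidate in spec.split("|"):
        value = known.get(candidate, "")
        if value:
            return value
    return ""

# No DNS name is known for this peer, so the label falls back to the IP.
print(resolve_context("dns|ip", {"dns": "", "ip": "10.0.0.7"}))  # -> 10.0.0.7
```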

Exported Metrics

Hubble metrics are exported under the hubble_ Prometheus namespace.

dns

Name | Labels | Description
---- | ------ | -----------
dns_queries_total | rcode, qtypes, ips_returned | Number of DNS queries observed
dns_responses_total | rcode, qtypes, ips_returned | Number of DNS responses observed
dns_response_types_total | type, qtypes | Number of DNS response types

Options

Option Key | Option Value | Description
---------- | ------------ | -----------
query | N/A | Include the query as label "query"
ignoreAAAA | N/A | Ignore any AAAA requests/responses

This metric supports Context Options.

drop

Name | Labels | Description
---- | ------ | -----------
drop_total | reason, protocol | Number of drops

Options

This metric supports Context Options.

flow

Name | Labels | Description
---- | ------ | -----------
flows_processed_total | type, subtype, verdict | Total number of flows processed

Options

This metric supports Context Options.

http

Name | Labels | Description
---- | ------ | -----------
http_requests_total | method, protocol | Count of HTTP requests
http_responses_total | method, status | Count of HTTP responses
http_request_duration_seconds | method | Quantiles of HTTP request duration in seconds

Options

This metric supports Context Options.

icmp

Name | Labels | Description
---- | ------ | -----------
icmp_total | family, type | Number of ICMP messages

Options

This metric supports Context Options.

port-distribution

Name | Labels | Description
---- | ------ | -----------
port_distribution_total | protocol, port | Number of packets distributed by destination port

Options

This metric supports Context Options.

tcp

Name | Labels | Description
---- | ------ | -----------
tcp_flags_total | flag, family | TCP flag occurrences

Options

This metric supports Context Options.