Monitoring & Metrics
Cilium and Hubble can both be configured to serve Prometheus metrics. Prometheus is a pluggable metrics collection and storage system and can act as a data source for Grafana, a metrics visualization frontend. Unlike push-based collectors such as statsd, Prometheus pulls metrics from each source rather than having the sources push them.
Cilium and Hubble metrics can be enabled independently of each other.
Cilium Metrics
Cilium metrics provide insights into the state of Cilium itself, namely of the cilium-agent, cilium-envoy, and cilium-operator processes. To run Cilium with Prometheus metrics enabled, deploy it with the prometheus.enabled=true Helm value set.
Cilium metrics are exported under the cilium_ Prometheus namespace. Envoy metrics are exported under the envoy_ Prometheus namespace, with the Cilium-defined Envoy metrics under the envoy_cilium_ namespace. When running and collecting in Kubernetes, the metrics are tagged with the pod name and namespace.
Installation
You can enable metrics for cilium-agent (including Envoy) with the Helm value prometheus.enabled=true. To enable metrics for cilium-operator, use operator.prometheus.enabled=true.
helm install cilium cilium/cilium --version 1.11.7 \
--namespace kube-system \
--set prometheus.enabled=true \
--set operator.prometheus.enabled=true
The ports can be configured via the prometheus.port, proxy.prometheus.port, and operator.prometheus.port Helm values, respectively.
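For example, a sketch of an install that overrides all three metrics ports (the port numbers below are arbitrary illustrations, not chart defaults):

helm install cilium cilium/cilium --version 1.11.7 \
  --namespace kube-system \
  --set prometheus.enabled=true \
  --set prometheus.port=9962 \
  --set proxy.prometheus.port=9964 \
  --set operator.prometheus.enabled=true \
  --set operator.prometheus.port=9963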
When metrics are enabled, all Cilium components will have the following annotations. They can be used to signal Prometheus whether to scrape metrics:
prometheus.io/scrape: true
prometheus.io/port: 9090
To collect Envoy metrics, the Cilium chart will create a Kubernetes headless service named cilium-agent with the prometheus.io/scrape: 'true' annotation set:
prometheus.io/scrape: true
prometheus.io/port: 9095
This headless service is needed in addition to the other Cilium components because each component can only have one Prometheus scrape and port annotation.
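For reference, the generated service is roughly of the following shape (a simplified sketch; field values such as the port name and selector are illustrative and may differ from the chart's exact output):

apiVersion: v1
kind: Service
metadata:
  name: cilium-agent
  namespace: kube-system
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9095'
spec:
  clusterIP: None          # headless: Prometheus discovers the individual agent pods
  selector:
    k8s-app: cilium
  ports:
    - name: envoy-metrics
      port: 9095
      protocol: TCP
      targetPort: 9095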
Prometheus will pick up the Cilium and Envoy metrics automatically if the following option is set in the scrape_configs section:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: (.+):(?:\d+);(\d+)
        replacement: ${1}:${2}
        target_label: __address__
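If you maintain the Prometheus configuration file yourself, you can validate it before reloading Prometheus with promtool, which ships with Prometheus (the file name below is a placeholder):

promtool check config prometheus.yml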
Hubble Metrics
While Cilium metrics allow you to monitor the state of Cilium itself, Hubble metrics allow you to monitor the network behavior of your Cilium-managed Kubernetes pods with respect to connectivity and security.
Installation
To deploy Cilium with Hubble metrics enabled, you need to enable Hubble with hubble.enabled=true and provide a set of Hubble metrics you want to enable via hubble.metrics.enabled.
Some of the metrics can also be configured with additional options. See the Hubble exported metrics section for the full list of available metrics and their options.
helm install cilium cilium/cilium --version 1.11.7 \
--namespace kube-system \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"
The port of the Hubble metrics can be configured with the hubble.metrics.port Helm value.
Note
L7 metrics, such as HTTP, are only emitted for pods that enable Layer 7 Protocol Visibility.
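For example, assuming the policy.cilium.io/proxy-visibility pod annotation described in the Layer 7 Protocol Visibility documentation, DNS and HTTP visibility for a pod could be enabled with something like the following (pod and namespace names are placeholders):

kubectl annotate pod my-pod -n my-namespace \
  policy.cilium.io/proxy-visibility="<Egress/53/UDP/DNS>,<Ingress/80/TCP/HTTP>"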
When deployed with a non-empty hubble.metrics.enabled Helm value, the Cilium chart will create a Kubernetes headless service named hubble-metrics with the prometheus.io/scrape: 'true' annotation set:
prometheus.io/scrape: true
prometheus.io/port: 9091
Set the following options in the scrape_configs section of Prometheus to have it scrape all Hubble metrics from the endpoints automatically:
scrape_configs:
  - job_name: 'kubernetes-endpoints'
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
Example Prometheus & Grafana Deployment
If you don’t have an existing Prometheus and Grafana stack running, you can deploy a stack with:
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.11/examples/kubernetes/addons/prometheus/monitoring-example.yaml
It will run Prometheus and Grafana in the cilium-monitoring namespace. If you have enabled either Cilium or Hubble metrics, they will automatically be scraped by Prometheus. You can then expose Grafana to access it via your browser.
kubectl -n cilium-monitoring port-forward service/grafana --address 0.0.0.0 --address :: 3000:3000
Open your browser and access http://localhost:3000/
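The Prometheus UI can be exposed the same way (this assumes the example manifest creates a service named prometheus in the cilium-monitoring namespace):

kubectl -n cilium-monitoring port-forward service/prometheus --address 0.0.0.0 --address :: 9090:9090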
Metrics Reference
cilium-agent
Configuration
To expose any metrics, invoke cilium-agent with the --prometheus-serve-addr option. This option takes an IP:Port pair; passing an empty IP (e.g. :9090) binds the server to all available interfaces (there is usually only one in a container).
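When installing via Helm, the prometheus.enabled=true value configures this for you. If you run the agent yourself, a minimal sketch of the flag usage would be:

cilium-agent --prometheus-serve-addr=":9090"   # remaining agent options omitted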
Exported Metrics
Endpoint
Name | Labels | Description |
---|---|---|
endpoint | | Number of endpoints managed by this agent |
endpoint_regenerations_total | outcome | Count of all endpoint regenerations that have completed |
endpoint_regeneration_time_stats_seconds | scope | Endpoint regeneration time stats |
endpoint_state | state | Count of all endpoints |
Services
Name | Labels | Description |
---|---|---|
services_events_total | | Number of services events labeled by action type |
Cluster health
Name | Labels | Description |
---|---|---|
unreachable_nodes | | Number of nodes that cannot be reached |
unreachable_health_endpoints | | Number of health endpoints that cannot be reached |
controllers_failing | | Number of failing controllers |
Node Connectivity
Name | Labels | Description |
---|---|---|
node_connectivity_status | source_cluster , source_node_name , target_cluster , target_node_name , target_node_type , type | The last observed status of both ICMP and HTTP connectivity between the current Cilium agent and other Cilium nodes |
node_connectivity_latency_seconds | address_type , protocol , source_cluster , source_node_name , target_cluster , target_node_ip , target_node_name , target_node_type , type | The last observed latency between the current Cilium agent and other Cilium nodes in seconds |
Clustermesh
Name | Labels | Description |
---|---|---|
clustermesh_global_services | source_cluster , source_node_name | The total number of global services in the cluster mesh |
clustermesh_remote_clusters | source_cluster , source_node_name | The total number of remote clusters meshed with the local cluster |
clustermesh_remote_cluster_failures | source_cluster , source_node_name , target_cluster | The total number of failures related to the remote cluster |
clustermesh_remote_cluster_nodes | source_cluster , source_node_name , target_cluster | The total number of nodes in the remote cluster |
clustermesh_remote_cluster_last_failure_ts | source_cluster , source_node_name , target_cluster | The timestamp of the last failure of the remote cluster |
clustermesh_remote_cluster_readiness_status | source_cluster , source_node_name , target_cluster | The readiness status of the remote cluster |
Datapath
Name | Labels | Description |
---|---|---|
datapath_conntrack_dump_resets_total | area , name , family | Number of conntrack dump resets. Happens when a BPF entry gets removed while dumping the map is in progress. |
datapath_conntrack_gc_runs_total | status | Number of times that the conntrack garbage collector process was run |
datapath_conntrack_gc_key_fallbacks_total | | The number of alive and deleted conntrack entries at the end of a garbage collector run labeled by datapath family |
datapath_conntrack_gc_entries | family | The number of alive and deleted conntrack entries at the end of a garbage collector run |
datapath_conntrack_gc_duration_seconds | status | Duration in seconds of the garbage collector process |
IPSec
Name | Labels | Description |
---|---|---|
ipsec_xfrm_error | error , type | Total number of xfrm errors. |
eBPF
Name | Labels | Description |
---|---|---|
bpf_syscall_duration_seconds | operation , outcome | Duration of eBPF system call performed |
bpf_map_ops_total | mapName (deprecated), map_name , operation , outcome | Number of eBPF map operations performed. mapName is deprecated and will be removed in 1.10. Use map_name instead. |
bpf_map_pressure | map_name | Map pressure defined as fill-up ratio of the map. Policy maps are exceptionally reported only when ratio is over 0.1. |
bpf_maps_virtual_memory_max_bytes | | Max memory used by eBPF maps installed in the system |
bpf_progs_virtual_memory_max_bytes | | Max memory used by eBPF programs installed in the system |
Both bpf_maps_virtual_memory_max_bytes and bpf_progs_virtual_memory_max_bytes currently report the system-wide memory usage of eBPF maps and programs, whether or not they are directly managed by Cilium. This might change in the future so that only the eBPF memory usage directly managed by Cilium is reported.
Drops/Forwards (L3/L4)
Name | Labels | Description |
---|---|---|
drop_count_total | reason , direction | Total dropped packets |
drop_bytes_total | reason , direction | Total dropped bytes |
forward_count_total | direction | Total forwarded packets |
forward_bytes_total | direction | Total forwarded bytes |
Policy
Name | Labels | Description |
---|---|---|
policy | | Number of policies currently loaded |
policy_count | | Number of policies currently loaded (deprecated, use policy) |
policy_regeneration_total | | Total number of policies regenerated successfully |
policy_regeneration_time_stats_seconds | scope | Policy regeneration time stats labeled by the scope |
policy_max_revision | | Highest policy revision number in the agent |
policy_import_errors_total | | Number of times a policy import has failed |
policy_endpoint_enforcement_status | | Number of endpoints labeled by policy enforcement status |
Policy L7 (HTTP/Kafka)
Name | Labels | Description |
---|---|---|
proxy_redirects | protocol | Number of redirects installed for endpoints |
proxy_upstream_reply_seconds | | Seconds waited for upstream server to reply to a request |
proxy_datapath_update_timeout_total | | Number of total datapath update timeouts due to FQDN IP updates |
policy_l7_total | type | Number of total L7 requests/responses |
Identity
Name | Labels | Description |
---|---|---|
identity | type | Number of identities currently allocated |
Events external to Cilium
Name | Labels | Description |
---|---|---|
event_ts | source | Last timestamp when we received an event |
Controllers
Name | Labels | Description |
---|---|---|
controllers_runs_total | status | Number of times that a controller process was run |
controllers_runs_duration_seconds | status | Duration in seconds of the controller process |
SubProcess
Name | Labels | Description |
---|---|---|
subprocess_start_total | subsystem | Number of times that Cilium has started a subprocess |
Kubernetes
Name | Labels | Description |
---|---|---|
kubernetes_events_received_total | scope , action , validity , equal | Number of Kubernetes events received |
kubernetes_events_total | scope , action , outcome | Number of Kubernetes events processed |
k8s_cnp_status_completion_seconds | attempts , outcome | Duration in seconds in how long it took to complete a CNP status update |
IPAM
Name | Labels | Description |
---|---|---|
ipam_events_total | | Number of IPAM events received labeled by action and datapath family type |
ip_addresses | family | Number of allocated IP addresses |
KVstore
Name | Labels | Description |
---|---|---|
kvstore_operations_duration_seconds | action , kind , outcome , scope | Duration of kvstore operation |
kvstore_events_queue_seconds | action , scope | Duration of seconds of time received event was blocked before it could be queued |
kvstore_quorum_errors_total | error | Number of quorum errors |
Agent
Name | Labels | Description |
---|---|---|
agent_bootstrap_seconds | scope , outcome | Duration of various bootstrap phases |
api_process_time_seconds | | Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code |
FQDN
Name | Labels | Description |
---|---|---|
fqdn_gc_deletions_total | | Number of FQDNs that have been cleaned by the FQDN garbage collector job |
API Rate Limiting
Name | Labels | Description |
---|---|---|
cilium_api_limiter_adjustment_factor | api_call | Most recent adjustment factor for automatic adjustment |
cilium_api_limiter_processed_requests_total | api_call , outcome | Total number of API requests processed |
cilium_api_limiter_processing_duration_seconds | api_call , value | Mean and estimated processing duration in seconds |
cilium_api_limiter_rate_limit | api_call , value | Current rate limiting configuration (limit and burst) |
cilium_api_limiter_requests_in_flight | api_call, value | Current and maximum allowed number of requests in flight |
cilium_api_limiter_wait_duration_seconds | api_call , value | Mean, min, and max wait duration |
cilium_api_limiter_wait_history_duration_seconds | api_call | Histogram of wait duration per API call processed |
cilium-operator
Configuration
cilium-operator can be configured to serve metrics by running with the option --enable-metrics. By default, the operator will expose metrics on port 6942; the port can be changed with the option --operator-prometheus-serve-addr.
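A minimal sketch of the corresponding invocation (the address shown simply restates the default port):

cilium-operator --enable-metrics --operator-prometheus-serve-addr=":6942"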
Exported Metrics
All metrics are exported under the cilium_operator_ Prometheus namespace.
IPAM
Name | Labels | Description |
---|---|---|
ipam_ips | type | Number of IPs allocated |
ipam_allocation_ops | subnet_id | Number of IP allocation operations. |
ipam_interface_creation_ops | subnet_id, status | Number of interface creation operations. |
ipam_available | | Number of interfaces with addresses available |
ipam_nodes_at_capacity | | Number of nodes unable to allocate more addresses |
ipam_resync_total | | Number of synchronization operations with external IPAM API |
ipam_api_duration_seconds | operation , response_code | Duration of interactions with external IPAM API. |
ipam_api_rate_limit_duration_seconds | operation | Duration of rate limiting while accessing external IPAM API |
Hubble
Configuration
Hubble metrics are served by a Hubble instance running inside cilium-agent. The command-line options to configure them are --enable-hubble, --hubble-metrics-server, and --hubble-metrics. --hubble-metrics-server takes an IP:Port pair, but passing an empty IP (e.g. :9091) will bind the server to all available interfaces. --hubble-metrics takes a comma-separated list of metrics.
Some metrics can take additional semicolon-separated options per metric, e.g. --hubble-metrics="dns:query;ignoreAAAA,http:destinationContext=pod-short" will enable the dns metric with the query and ignoreAAAA options, and the http metric with the destinationContext=pod-short option.
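When deploying via Helm, the same per-metric option strings can be expressed in the hubble.metrics.enabled value (a sketch, assuming the chart passes the entries through unchanged to --hubble-metrics):

helm install cilium cilium/cilium --version 1.11.7 \
  --namespace kube-system \
  --set hubble.metrics.enabled="{dns:query;ignoreAAAA,drop,tcp,flow,icmp,http}"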
Context Options
Most Hubble metrics can be configured to add the source and/or destination context as a label. The options are called sourceContext and destinationContext. The possible values are:
Option Value | Description |
---|---|
identity | All Cilium security identity labels |
namespace | Kubernetes namespace name |
pod | Kubernetes pod name |
pod-short | Short version of the Kubernetes pod name. Typically the deployment/replicaset name. |
dns | All known DNS names of the source or destination (comma-separated) |
ip | The IPv4 or IPv6 address |
When specifying the source and/or destination context, multiple contexts can be specified by separating them with the | symbol. When multiple are specified, the first non-empty value is added to the metric as a label. For example, a metric configuration of flow:destinationContext=dns|ip will first try to use the DNS name of the target for the label. If no DNS name is known for the target, it will fall back and use the IP address of the target instead.
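For instance, a hypothetical combination of the options documented above could label the source by namespace while the destination label falls back from DNS name to IP address:

--hubble-metrics="flow:sourceContext=namespace;destinationContext=dns|ip"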
Exported Metrics
Hubble metrics are exported under the hubble_ Prometheus namespace.
dns
Name | Labels | Description |
---|---|---|
dns_queries_total | rcode , qtypes , ips_returned | Number of DNS queries observed |
dns_responses_total | rcode , qtypes , ips_returned | Number of DNS responses observed |
dns_response_types_total | type , qtypes | Number of DNS response types |
Options
Option Key | Option Value | Description |
---|---|---|
query | N/A | Include the query as label “query” |
ignoreAAAA | N/A | Ignore any AAAA requests/responses |
This metric supports Context Options.
drop
Name | Labels | Description |
---|---|---|
drop_total | reason , protocol | Number of drops |
Options
This metric supports Context Options.
flow
Name | Labels | Description |
---|---|---|
flows_processed_total | type , subtype , verdict | Total number of flows processed |
Options
This metric supports Context Options.
http
Name | Labels | Description |
---|---|---|
http_requests_total | method , protocol | Count of HTTP requests |
http_responses_total | method , status | Count of HTTP responses |
http_request_duration_seconds | method | Quantiles of HTTP request duration in seconds |
Options
This metric supports Context Options.
icmp
Name | Labels | Description |
---|---|---|
icmp_total | family , type | Number of ICMP messages |
Options
This metric supports Context Options.
port-distribution
Name | Labels | Description |
---|---|---|
port_distribution_total | protocol , port | Numbers of packets distributed by destination port |
Options
This metric supports Context Options.
tcp
Name | Labels | Description |
---|---|---|
tcp_flags_total | flag , family | TCP flag occurrences |
Options
This metric supports Context Options.