Monitoring

Available as of v1.2.0

The monitoring feature is now implemented as an addon and is disabled by default in new installations.

Users can enable or disable the rancher-monitoring addon from the Harvester WebUI after installation.

Users can also enable/disable the rancher-monitoring addon in their Harvester installation by customizing the harvester-configuration file.

For Harvester clusters upgraded from version v1.1.x, the monitoring feature is converted to an addon automatically and kept enabled as before.

Dashboard Metrics

Harvester provides a built-in monitoring integration using Prometheus. In versions earlier than v1.2.0, monitoring is enabled automatically during installation; as of v1.2.0, the rancher-monitoring addon must be enabled first.

From the Dashboard page, users can view the cluster metrics and the metrics of the 10 most heavily used VMs. Users can also click the Grafana dashboard link to view more dashboards in the Grafana UI.

Note

Only admin users are able to view the cluster dashboard metrics.

Additionally, Grafana is provided by rancher-monitoring, so the default admin password is: prom-operator

Reference: values.yaml

VM Detail Metrics

To view the metrics of a VM, open the VM details page and select VM Metrics.

Note

The current Memory Usage is calculated based on (1 - free/total) * 100%, not (used/total) * 100%.

For example, on a Linux OS, the free -h command outputs the current memory statistics as follows:

    $ free -h
                  total        used        free      shared  buff/cache   available
    Mem:          7.7Gi       166Mi       4.6Gi       1.0Mi       2.9Gi       7.2Gi
    Swap:            0B          0B          0B

The corresponding Memory Usage is (1 - 4.6/7.7) * 100%, roughly 40%.
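The calculation can be reproduced with a small shell helper. This is a minimal sketch and not part of Harvester: the function name mem_usage_pct is hypothetical, and it assumes that awk is available and that both values are given in the same unit.

```shell
#!/bin/sh
# Hypothetical helper: memory usage as (1 - free/total) * 100.
# $1 = total memory, $2 = free memory (same unit, for example Gi)
mem_usage_pct() {
    awk -v total="$1" -v free="$2" \
        'BEGIN { printf "%.1f\n", (1 - free / total) * 100 }'
}

# Values taken from the free -h output shown above.
mem_usage_pct 7.7 4.6
```

With the sample values, the helper prints 40.3, which matches the "roughly 40%" stated above.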

How to Configure Monitoring Settings

Available as of v1.0.2

Monitoring has several components that help to collect and aggregate metric data from all Nodes/Pods/VMs. The resources required for monitoring depend on your workloads and hardware resources. Harvester sets defaults based on general use cases, and you can change them accordingly.

Currently, resource settings can be configured for the following components:

  • Prometheus
  • Prometheus Node Exporter

From UI

On the Advanced page, you can view and change the resource settings as follows:

  1. Go to the Advanced > Addons page and select the rancher-monitoring page.
  2. From the Prometheus tab, change the resource requests and limits.
  3. Select Save when you finish configuring the settings for the rancher-monitoring addon. The Monitoring deployments restart within a few seconds. Be aware that reloading previous data after the restart can take some time.

Note

The UI configuration is only visible when the rancher-monitoring addon is enabled.

The most frequently used option is the memory setting:

  • The Requested Memory is the minimum memory required by the Monitoring resource. The recommended value is about 5% to 10% of the system memory of one single management node. A value less than 500Mi will be denied.

  • The Memory Limit is the maximum memory that can be allocated to a Monitoring resource. The recommended value is about 30% of the system’s memory for one single management node. When the Monitoring reaches this threshold, it will automatically restart.

Depending on the available hardware resources and system loads, you may change the above settings accordingly.
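As a rough illustration of the percentages above, the following sketch derives starting values from the memory of one management node. The function name monitoring_memory_hint and the output labels are hypothetical, the input is assumed to be in Mi, and the results are only a starting point to adjust for your workloads. If your management nodes differ in size, pass the memory of the smaller one.

```shell
#!/bin/sh
# Hypothetical helper: suggest memory settings from the memory (in Mi)
# of one management node, using the percentages described above.
monitoring_memory_hint() {
    node_mi="$1"
    req_low=$(( node_mi * 5 / 100 ))    # about 5% of node memory
    req_high=$(( node_mi * 10 / 100 ))  # about 10% of node memory
    limit=$(( node_mi * 30 / 100 ))     # about 30% of node memory

    # Requests below 500Mi are denied, so raise them to that floor.
    if [ "$req_low" -lt 500 ]; then req_low=500; fi
    if [ "$req_high" -lt 500 ]; then req_high=500; fi

    echo "requests.memory: ${req_low}Mi to ${req_high}Mi"
    echo "limits.memory: ${limit}Mi"
}

# Example: a management node with 32Gi (32768Mi) of memory.
monitoring_memory_hint 32768
```

The sketch uses integer arithmetic, so the results are rounded down to whole Mi.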

Note

If you have multiple management nodes with different hardware resources, set the Prometheus values based on the node with the fewest resources.

Caution

When an increasing number of VMs get deployed on one node, the prometheus-node-exporter pod might get killed due to OOM (out of memory). In that case, you should increase the value of limits.memory.

From CLI

You can use the following kubectl command to change the resource configuration of the rancher-monitoring addon:

    kubectl edit addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring

The resource path and default values are as follows:

    apiVersion: harvesterhci.io/v1beta1
    kind: Addon
    metadata:
      name: rancher-monitoring
      namespace: cattle-monitoring-system
    spec:
      valuesContent: |
        prometheus:
          prometheusSpec:
            resources:
              limits:
                cpu: 1000m
                memory: 2500Mi
              requests:
                cpu: 850m
                memory: 1750Mi

Note

You can still make configuration adjustments when the addon is disabled. However, these changes only take effect when you re-enable the addon.

Alertmanager

Harvester uses Alertmanager to collect and manage all alerts in the cluster, both those that have already fired and those that are currently firing.

Alertmanager Config

Enable/Disable Alertmanager

Alertmanager is enabled by default. You may disable it from the following config path.

Monitoring - Figure 10

Change Resource Setting

You can also change the resource settings of Alertmanager as shown in the picture above.

Configure AlertmanagerConfig from WebUI

To send alerts to third-party servers, you need to configure an AlertmanagerConfig.

On the WebUI, navigate to Monitoring & Logging > Monitoring > Alertmanager Configs.

On the Alertmanager Config: Create page, click Namespace to select the target namespace from the drop-down list and set the Name. After this, click Create in the lower right corner.

Click the Alertmanager Config you just created to continue the configuration.

Click Add Receiver.

Set the Name for the receiver. After this, select the receiver type, for example, Webhook, and click Add Webhook.

Fill in the required parameters and click Create.

Configure AlertmanagerConfig from CLI

You can also add AlertmanagerConfig from the CLI.

Example: a Webhook receiver in the default namespace.

    cat << EOF > a-single-receiver.yaml
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: amc-example
      # namespace: your value
      labels:
        alertmanagerConfig: example
    spec:
      route:
        continue: true
        groupBy:
        - cluster
        - alertname
        receiver: "amc-webhook-receiver"
      receivers:
      - name: "amc-webhook-receiver"
        webhookConfigs:
        - sendResolved: true
          url: "http://192.168.122.159:8090/"
    EOF

    # kubectl apply -f a-single-receiver.yaml
    alertmanagerconfig.monitoring.coreos.com/amc-example created

    # kubectl get alertmanagerconfig -A
    NAMESPACE   NAME          AGE
    default     amc-example   27s

Example of an Alert Received by Webhook

Alerts sent to the webhook server will be in the following format:

    {
      'receiver': 'longhorn-system-amc-example-amc-webhook-receiver',
      'status': 'firing',
      'alerts': [],
      'groupLabels': {},
      'commonLabels': {'alertname': 'LonghornVolumeStatusWarning', 'container': 'longhorn-manager', 'endpoint': 'manager', 'instance': '10.52.0.83:9500', 'issue': 'Longhorn volume is Degraded.',
        'job': 'longhorn-backend', 'namespace': 'longhorn-system', 'node': 'harv2', 'pod': 'longhorn-manager-r5bgm', 'prometheus': 'cattle-monitoring-system/rancher-monitoring-prometheus',
        'service': 'longhorn-backend', 'severity': 'warning'},
      'commonAnnotations': {'description': 'Longhorn volume is Degraded for more than 5 minutes.', 'runbook_url': 'https://longhorn.io/docs/1.3.0/monitoring/metrics/',
        'summary': 'Longhorn volume is Degraded'},
      'externalURL': 'https://192.168.122.200/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-alertmanager:9093/proxy',
      'version': '4',
      'groupKey': '{}/{namespace="longhorn-system"}:{}',
      'truncatedAlerts': 0
    }

Note

Different receivers may present the alerts in different formats. For details, please refer to the related documents.

Known Limitation

AlertmanagerConfig is enforced per namespace. A global-level AlertmanagerConfig without a namespace is not supported.

We have already created a GitHub issue to track upstream changes. Once the feature is available, Harvester will adopt it.

View and Manage Alerts

From Alertmanager Dashboard

You can open the original Alertmanager dashboard from the link below. Replace the-cluster-vip with the actual VIP of your cluster:

https://the-cluster-vip/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-alertmanager:9093/proxy/#/alerts

The overall view of the Alertmanager dashboard is as follows.

Monitoring - Figure 17

You can view the details of an alert:

Monitoring - Figure 18

From Prometheus Dashboard

You can open the original Prometheus dashboard from the link below. Replace the-cluster-vip with the actual VIP of your cluster:

https://the-cluster-vip/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-prometheus:9090/proxy/
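The Alertmanager and Prometheus links follow the same pattern and differ only in the service name and port. The following sketch builds them from a cluster VIP. The function name monitoring_proxy_url is hypothetical, and the address 192.168.122.200 is only a sample value.

```shell
#!/bin/sh
# Hypothetical helper: build the proxy URL of a monitoring service.
# $1 = cluster VIP, $2 = service name, $3 = service port
monitoring_proxy_url() {
    printf 'https://%s/api/v1/namespaces/cattle-monitoring-system/services/http:%s:%s/proxy/\n' \
        "$1" "$2" "$3"
}

# 192.168.122.200 is only a sample VIP; replace it with your own.
monitoring_proxy_url 192.168.122.200 rancher-monitoring-alertmanager 9093
monitoring_proxy_url 192.168.122.200 rancher-monitoring-prometheus 9090
```

For Alertmanager, append #/alerts to the generated URL to open the alerts view directly.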

The Alerts menu in the top navigation bar shows all defined rules in Prometheus. You can use the filters Inactive, Pending, and Firing to quickly find the information that you need.

Troubleshooting

For Monitoring support and troubleshooting, please refer to the troubleshooting page.