Involve Prometheus

This article describes how to enable Prometheus to monitor all running Linkis services.

Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

In a microservice context, Prometheus provides service discovery, enabling it to find scrape targets dynamically from a service registry such as Eureka or Consul and to pull metrics from the instances' API endpoints over HTTP.

This diagram illustrates the architecture of Prometheus and some of its ecosystem components:

Figure 1: Architecture of Prometheus and some of its ecosystem components

Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API consumers can be used to visualize the collected data.


In the context of Linkis, we use Eureka service discovery (SD) in Prometheus to retrieve scrape targets through the Eureka REST API. Prometheus periodically checks the REST endpoint and creates a target for every application instance.
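If you want to see exactly what Prometheus will discover, you can query the Eureka REST API directly. A minimal sketch, assuming Eureka is reachable on port 20303 (the same endpoint used in the prometheus.yml example later in this article) and that linkishost is replaced with your actual host:

```bash
# List all application instances currently registered in Eureka; these are the
# candidates Prometheus will turn into scrape targets via eureka_sd_configs.
curl -s -H 'Accept: application/json' http://linkishost:20303/eureka/apps
```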

Modify the configuration item PROMETHEUS_ENABLE in the linkis-env.sh of Linkis:

```bash
export PROMETHEUS_ENABLE=true
```

After running install.sh, you should see that the Prometheus-related configuration has been appended to the following files:

```yaml
## application-linkis.yml ##
eureka:
  instance:
    metadata-map:
      prometheus.path: ${prometheus.path:${prometheus.endpoint}}
...
management:
  endpoints:
    web:
      exposure:
        include: refresh,info,health,metrics,prometheus
```
```yaml
## application-eureka.yml ##
eureka:
  instance:
    metadata-map:
      prometheus.path: ${prometheus.path:/actuator/prometheus}
...
management:
  endpoints:
    web:
      exposure:
        include: refresh,info,health,metrics,prometheus
```
```properties
## linkis.properties ##
...
wds.linkis.prometheus.enable=true
wds.linkis.server.user.restful.uri.pass.auth=/api/rest_j/v1/actuator/prometheus,
...
```
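A quick way to confirm that install.sh actually appended these settings is to grep the generated files. This is only a sanity check; it assumes the configuration files live under ${LINKIS_HOME}/conf, as referenced later in this article:

```bash
# Confirm the Prometheus-related settings were appended by install.sh
grep -n "prometheus" ${LINKIS_HOME}/conf/application-linkis.yml
grep -n "prometheus" ${LINKIS_HOME}/conf/application-eureka.yml
grep -n "wds.linkis.prometheus.enable" ${LINKIS_HOME}/conf/linkis.properties
```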

Then, for each computation engine, such as Spark, Flink, or Hive, the same configuration needs to be added manually:

```properties
## linkis-engineconn.properties ##
...
wds.linkis.prometheus.enable=true
wds.linkis.server.user.restful.uri.pass.auth=/api/rest_j/v1/actuator/prometheus,
...
```

Alternatively, to enable Prometheus manually after installation, modify ${LINKIS_HOME}/conf/application-linkis.yml and add prometheus to the exposed endpoints:

```yaml
## application-linkis.yml ##
management:
  endpoints:
    web:
      exposure:
        # Add prometheus
        include: refresh,info,health,metrics,prometheus
```

Modify ${LINKIS_HOME}/conf/application-eureka.yml and add prometheus to the exposed endpoints:

```yaml
## application-eureka.yml ##
management:
  endpoints:
    web:
      exposure:
        # Add prometheus
        include: refresh,info,health,metrics,prometheus
```

Modify ${LINKIS_HOME}/conf/linkis.properties and remove the comment marker # before wds.linkis.prometheus.enable:

```properties
## linkis.properties ##
...
wds.linkis.prometheus.enable=true
...
```

Finally, restart all the Linkis services:

```bash
$ bash linkis-start-all.sh
```

After starting the services, the prometheus endpoint of each Linkis microservice should be accessible, for example http://linkishost:9103/api/rest_j/v1/actuator/prometheus.

Note: the prometheus endpoints of the gateway and eureka services do not include the prefix api/rest_j/v1; their complete endpoint is http://linkishost:9001/actuator/prometheus.
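As a quick check, you can curl both kinds of endpoints; the host and ports below are the ones from the examples above and may differ in your deployment:

```bash
# A regular Linkis microservice: metrics are served under the RESTful prefix
curl -s http://linkishost:9103/api/rest_j/v1/actuator/prometheus | head -n 20

# Gateway/Eureka: metrics are served directly under /actuator
curl -s http://linkishost:9001/actuator/prometheus | head -n 20
```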

Usually the monitoring setup for a cloud-native application is deployed on Kubernetes with service discovery and high availability (e.g. using a Kubernetes operator like the Prometheus Operator). To quickly prototype dashboards and experiment with different metric types (e.g. histogram vs. gauge), you may need a similar setup locally. This section explains how to set up a Prometheus/Alertmanager and Grafana monitoring stack locally with Docker Compose.

First, let's define the general components of the stack as follows:

• An Alertmanager container that exposes its UI on port 9093 and reads its configuration from alertmanager.yml

• A Prometheus container that exposes its UI on port 9090, reads its configuration from prometheus.yml, and reads its list of alert rules from alertrule.yml

• A Grafana container that exposes its UI on port 3000, with its list of metric sources defined in grafana_datasources.yml and its configuration in grafana_config.ini

The following docker-compose.yml file summarizes the configuration of all those components:

```yaml
## docker-compose.yml ##
version: "3"
networks:
  default:
    external: true
    name: my-network
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/alertrule.yml:/etc/prometheus/alertrule.yml
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=123456
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources
      - ./grafana/grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
```

Second, to define some alerts based on metrics in Prometheus, you can group them into an alertrule.yml file, so that you can validate that those alerts are properly triggered in your local setup before configuring them in the production instance. As an example, the following configuration covers the usual metrics used to monitor Linkis services:

• a. Down instance
• b. High CPU usage for each JVM instance (>80%)
• c. High heap memory usage for each JVM instance (>80%)
• d. High non-heap memory usage for each JVM instance (>80%)
• e. High number of waiting threads for each JVM instance (>100)
```yaml
## alertrule.yml ##
groups:
  - name: LinkisAlert
    rules:
      - alert: LinkisNodeDown
        expr: last_over_time(up{job="linkis", application=~"LINKIS.*", application!="LINKIS-CG-ENGINECONN"}[1m]) == 0
        for: 15s
        labels:
          severity: critical
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} down"
          description: "Linkis instance(s) is/are down in last 1m"
          value: "{{ $value }}"
      - alert: LinkisNodeCpuHigh
        expr: system_cpu_usage{job="linkis", application=~"LINKIS.*"} >= 0.8
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} cpu overload"
          description: "CPU usage is over 80% for over 1min"
          value: "{{ $value }}"
      - alert: LinkisNodeHeapMemoryHigh
        expr: sum(jvm_memory_used_bytes{job="linkis", application=~"LINKIS.*", area="heap"}) by(instance) * 100 / sum(jvm_memory_max_bytes{job="linkis", application=~"LINKIS.*", area="heap"}) by(instance) >= 50
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} memory(heap) overload"
          description: "Memory usage(heap) is over 80% for over 1min"
          value: "{{ $value }}"
      - alert: LinkisNodeNonHeapMemoryHigh
        expr: sum(jvm_memory_used_bytes{job="linkis", application=~"LINKIS.*", area="nonheap"}) by(instance) * 100 / sum(jvm_memory_max_bytes{job="linkis", application=~"LINKIS.*", area="nonheap"}) by(instance) >= 60
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} memory(nonheap) overload"
          description: "Memory usage(nonheap) is over 80% for over 1min"
          value: "{{ $value }}"
      - alert: LinkisWaitingThreadHigh
        expr: jvm_threads_states_threads{job="linkis", application=~"LINKIS.*", state="waiting"} >= 100
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} waiting threads is high"
          description: "waiting threads is over 100 for over 1min"
          value: "{{ $value }}"
```

Note: once a service instance is shut down, it is no longer one of the targets of the Prometheus Eureka SD, so its up metric stops returning data after a short time. Therefore we check whether up was 0 at any point during the last minute to determine whether the service is alive.
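Once the stack described below is running, you can evaluate the same expression by hand against the Prometheus HTTP query API to confirm this logic; a hedged example, assuming Prometheus is exposed on localhost:9090 as in the docker-compose.yml above:

```bash
# Evaluate the LinkisNodeDown expression directly; an empty result means no
# instance was seen with up == 0 during the last minute.
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=last_over_time(up{job="linkis", application=~"LINKIS.*", application!="LINKIS-CG-ENGINECONN"}[1m]) == 0'
```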

Third, and most importantly, define the Prometheus configuration in the prometheus.yml file. It defines:

• the global settings, such as the scrape interval and the rule evaluation interval
• the connection information for reaching Alertmanager and the rules to be evaluated
• the connection information for the application metrics endpoints

This is an example configuration file for Linkis:
```yaml
## prometheus.yml ##
# my global config
global:
  scrape_interval: 30s     # Scrape targets every 30 seconds.
  evaluation_interval: 30s # Evaluate rules every 30 seconds.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alertrule.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: linkis
    eureka_sd_configs:
      # the endpoint of your eureka instance
      - server: {{linkis-host}}:20303/eureka
    relabel_configs:
      - source_labels: [__meta_eureka_app_name]
        target_label: application
      - source_labels: [__meta_eureka_app_instance_metadata_prometheus_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```
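Before starting the stack, it can be useful to validate the configuration and the referenced rule file with promtool, which ships inside the prom/prometheus image. A minimal sketch, assuming the files sit in ./config and that the {{linkis-host}} placeholder has already been replaced with a real URL (e.g. http://your-host:20303/eureka):

```bash
# Validate prometheus.yml (and the rule_files it references) without installing promtool locally
docker run --rm --entrypoint promtool \
  -v "$(pwd)/config:/config" prom/prometheus:latest \
  check config /config/prometheus.yml
```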

Fourth, the following configuration defines how alerts will be sent to an external webhook.

```yaml
## alertmanager.yml ##
global:
  resolve_timeout: 5m
route:
  receiver: 'webhook'
  group_by: ['alertname']
  # How long to wait to buffer alerts of the same group before sending a notification initially.
  group_wait: 1m
  # How long to wait before sending an alert that has been added to a group for which there has already been a notification.
  group_interval: 5m
  # How long to wait before re-sending a given alert that has already been sent in a notification.
  repeat_interval: 12h
receivers:
  - name: 'webhook'
    webhook_configs:
      - send_resolved: true
        url: {{your-webhook-url}}
```

Finally, after defining all the configuration files as well as the Docker Compose file, we can start the monitoring stack with docker-compose up.

On the Prometheus targets page, you should see all the Linkis service instances, as shown below:

Figure 4: Linkis service instances listed as Prometheus targets
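The same information is available programmatically from the Prometheus targets API, which is handy for scripting the check (jq is optional and only used here for readability):

```bash
# List the targets discovered via Eureka and their scrape health
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'
```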

Once Grafana is accessible, add Prometheus as a data source in Grafana and import the dashboard template with ID 11378, which is commonly used for Spring Boot services (2.1+). You can then view a live dashboard of Linkis there.

Figure 5: Grafana dashboard for Linkis
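If you prefer to script the data source setup instead of clicking through the UI, the Grafana HTTP API can be used. A hedged sketch, assuming the admin password 123456 from the docker-compose.yml above and that Grafana reaches Prometheus via the compose service name prometheus:

```bash
# Register Prometheus as a Grafana data source via the HTTP API
curl -s -X POST http://admin:123456@localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://prometheus:9090", "access": "proxy", "isDefault": true}'
```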

You can also try to integrate Prometheus Alertmanager with your own webhook, where you can check whether the alert messages are fired.
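To verify the webhook wiring without waiting for a real rule to fire, you can post a synthetic alert straight to the Alertmanager v2 API; the label values below are only examples:

```bash
# Send a synthetic alert to Alertmanager so it gets routed to the configured webhook
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "LinkisNodeDown", "severity": "critical", "service": "Linkis"},
        "annotations": {"summary": "synthetic test alert"}}]'
```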