Prometheus Installation

Prerequisites

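  1. Install Docker

    curl -sSL https://get.daocloud.io/docker > docker.sh
    bash docker.sh
  2. Configure a registry mirror in China to speed up image pulls

    touch /etc/default/docker && echo "DOCKER_OPTS=\"--registry-mirror=https://registry.docker-cn.com\"" > /etc/default/docker

    Note: on systemd-based hosts such as CentOS 7, the Docker daemon typically does not read /etc/default/docker. If the mirror does not take effect, set "registry-mirrors" in /etc/docker/daemon.json and restart Docker.
  3. Stop the host firewall

    systemctl stop firewalld.service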

Obtaining Resources

  1. Pull the standard container images; retry a few times if a pull times out

    docker pull prom/prometheus
    docker pull prom/alertmanager
    docker pull consul
  2. Prepare the test hosts for the deployment

    OS: CentOS-7.5.1804
    host1: 127.0.0.1
    host2: 127.0.0.2
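The "retry a few times" advice for image pulls can be automated with a small wrapper. The following is a minimal sketch, not part of the original procedure: the function name `retry` and the attempt count and delay in the usage comment are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (needs a running Docker daemon):
#   for image in prom/prometheus prom/alertmanager consul; do
#     retry 5 3 docker pull "$image"
#   done
```

The wrapper returns a non-zero status once the attempts are exhausted, so a calling script can stop instead of continuing with a missing image.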

Single-Host Deployment:

The architecture is shown below:

deploy_single

  1. Run Consul
    Start the instance:

    docker volume create consul-data
    docker run --name consul01 --volume consul-data:/consul/data -d -p 8300:8300 -p 8400:8400 -p 8500:8500 -p 8600:8600 consul
  2. Run Alertmanager

    Configuration file: /app/docker/alertmanager/alertmanager.yml
    Example:

    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 2s
      group_interval: 10s
      repeat_interval: 5m
      receiver: 'web.hook'
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:8088/wecube-monitor/api/v1/alarm/webhook'
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']

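    The url in the configuration file is the alert callback endpoint of monitor. Alertmanager sends each alert to monitor first; monitor then processes it further, shows it in the web UI, and forwards it to the recipients configured for it.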
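Once Alertmanager is running, one way to exercise the webhook path is to post a hand-made alert to Alertmanager's v2 API and check that monitor receives the callback. This is a sketch under assumptions: the helper names, the alert name `WebhookSmokeTest`, and the label and annotation values are made up for illustration; only the `POST /api/v2/alerts` endpoint and the port come from Alertmanager and the configuration above.

```shell
#!/usr/bin/env bash
# build_test_alert ALERTNAME SEVERITY
# Prints a one-element JSON array in the shape accepted by POST /api/v2/alerts.
# The instance label and the summary text are illustrative values.
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"127.0.0.1:9100"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2"
}

# send_test_alert ALERTMANAGER_URL ALERTNAME SEVERITY
# Posts the test alert to a running Alertmanager.
send_test_alert() {
  build_test_alert "$2" "$3" |
    curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- "$1/api/v2/alerts"
}

# Usage (needs Alertmanager listening on 9093):
#   send_test_alert http://127.0.0.1:9093 WebhookSmokeTest warning
```

With the example configuration, the webhook call should follow after roughly the `group_wait` of 2s, provided the alert is not inhibited.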

    Start the instance:

    docker volume create alertmanager-data
    docker run --name alertmanager01 --volume alertmanager-data:/alertmanager --volume /app/docker/alertmanager:/etc/alertmanager -d -p 9093:9093 -p 9094:9094 prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --web.listen-address=":9093" --cluster.listen-address=":9094"
  3. Run Prometheus

    Configuration file: /app/docker/prometheus/prometheus.yml
    Example:

    # my global config
    global:
      scrape_interval: 10s
      evaluation_interval: 10s
      # scrape_timeout is set to the global default (10s).
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 127.0.0.1:9093
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - /etc/prometheus/rules/*.yml
      # - "first_rules.yml"
      # - "second_rules.yml"
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
        static_configs:
        - targets: ['127.0.0.1:9090']
      - job_name: 'consul'
        scheme: http
        consul_sd_configs:
        - server: 127.0.0.1:8500
          scheme: http
          services: []

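    Configuration notes:
    global -> scrape_interval: the default scrape interval
    alerting -> targets: the address of Alertmanager
    rule_files: the path pattern of the alerting rule files
    scrape_configs -> job 'consul': discovers the scrape targets from the services registered in Consul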
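The `rule_files` pattern above only matters once a rule file exists under /etc/prometheus/rules/ inside the container, which the volume mount used in this guide maps to /app/docker/prometheus/rules/ on the host. The sketch below writes one sample rule file. The helper name, the file name, the alert name `InstanceDown`, and the thresholds are hypothetical and not part of the original setup.

```shell
#!/usr/bin/env bash
# write_sample_rule RULES_DIR
# Creates RULES_DIR if needed and writes one illustrative alerting rule file.
write_sample_rule() {
  local rules_dir=$1
  mkdir -p "$rules_dir" || return 1
  # Quoted heredoc delimiter: $labels is written literally, not expanded by the shell.
  cat > "$rules_dir/instance_down.yml" <<'EOF'
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
EOF
}

# Usage on the Prometheus host:
#   write_sample_rule /app/docker/prometheus/rules
```

The `severity` label is what the `inhibit_rules` section of the Alertmanager example matches on, so rules that should take part in inhibition need to set it.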

    Start the instance:

    docker volume create prometheus-tsdb
    docker run --name prometheus01 --volume prometheus-tsdb:/prometheus --volume /app/docker/prometheus:/etc/prometheus -d -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle

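    Endpoint for hot-reloading the configuration (available because the container is started with --web.enable-lifecycle):

    curl -X POST http://127.0.0.1:9090/-/reload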
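It can be worth validating a changed configuration before triggering the reload. The sketch below assumes the container is named prometheus01 as in this guide and uses the `promtool` binary shipped in the prom/prometheus image; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Validate the configuration (and the rule files it references) inside the container.
check_config() {
  docker exec prometheus01 promtool check config /etc/prometheus/prometheus.yml
}

# Ask the running Prometheus to reload its configuration.
reload_config() {
  curl -fsS -X POST http://127.0.0.1:9090/-/reload
}

# Reload only when the configuration check passes.
safe_reload() {
  if ! check_config; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  reload_config
}

# Usage (needs the prometheus01 container to be running):
#   safe_reload
```

Keeping the check and the reload in separate functions makes it easy to swap either one, for example when Prometheus listens on a different address.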
  4. Register an exporter

    curl -X PUT -d '{"id": "node31","name": "node31","address": "127.0.0.1","port": 9100,"tags": ["host"],"checks": [{"http": "http://127.0.0.1:9100/","interval": "10s"}]}' http://127.0.0.1:8500/v1/agent/service/register
  5. Deregister an exporter (the ID at the end of the path must match the id used at registration)

    curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/node31
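When several exporters need to be registered, the two curl calls above can be wrapped in helpers that take the service ID, address, and port as arguments. This is a sketch only: the function names and the `CONSUL_URL` variable are assumptions, and the payload mirrors the fields of the registration example (service name equal to the ID, tag `host`, HTTP check every 10s).

```shell
#!/usr/bin/env bash
# consul_service_payload ID ADDRESS PORT
# Prints the JSON registration body for one exporter.
consul_service_payload() {
  printf '{"id":"%s","name":"%s","address":"%s","port":%s,"tags":["host"],"checks":[{"http":"http://%s:%s/","interval":"10s"}]}\n' \
    "$1" "$1" "$2" "$3" "$2" "$3"
}

# register_exporter ID ADDRESS PORT
register_exporter() {
  consul_service_payload "$1" "$2" "$3" |
    curl -fsS -X PUT --data-binary @- "${CONSUL_URL:-http://127.0.0.1:8500}/v1/agent/service/register"
}

# deregister_exporter ID
deregister_exporter() {
  curl -fsS -X PUT "${CONSUL_URL:-http://127.0.0.1:8500}/v1/agent/service/deregister/$1"
}

# Usage (needs a running Consul agent):
#   register_exporter node31 127.0.0.1 9100
#   deregister_exporter node31
```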

Primary/Standby Deployment

deploy_as_1

deploy_as_1

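  alive_check can be deployed on host2 to monitor the state of prometheus01:

    if prometheus01 is down
      check the state of host1
      if host1 is up
        try a limited number of times to restart prometheus01; if it recovers -> return
      modify the prometheus02 configuration and reload it to activate the standby node
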
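The alive_check logic can be sketched in shell as follows. This is an illustration under several assumptions that are not in the original: host2 can reach host1 by ping and passwordless ssh, Prometheus health is probed through its `/-/healthy` endpoint, and a prepared file named prometheus.active.yml (a hypothetical name) holds the configuration that turns prometheus02 into the active node. The variable names and defaults are likewise made up.

```shell
#!/usr/bin/env bash
PRIMARY_HOST=${PRIMARY_HOST:-127.0.0.1}   # host1 in this guide
MAX_RESTARTS=${MAX_RESTARTS:-3}           # restart attempts before failing over
RESTART_WAIT=${RESTART_WAIT:-10}          # seconds to wait after each restart

# Is prometheus01 answering its health endpoint?
prometheus_healthy() {
  curl -fsS --max-time 5 "http://$PRIMARY_HOST:9090/-/healthy" >/dev/null 2>&1
}

# Is host1 itself reachable?
host_up() {
  ping -c 1 -W 2 "$PRIMARY_HOST" >/dev/null 2>&1
}

# Try to bring prometheus01 back (assumes passwordless ssh to host1).
restart_primary() {
  ssh "$PRIMARY_HOST" docker restart prometheus01 >/dev/null 2>&1
}

# Switch prometheus02 to the active configuration and reload it.
# prometheus.active.yml is a hypothetical, pre-written file.
activate_standby() {
  cp /app/docker/prometheus/prometheus.active.yml /app/docker/prometheus/prometheus.yml &&
    curl -fsS -X POST http://127.0.0.1:9090/-/reload
}

alive_check() {
  prometheus_healthy && return 0
  if host_up; then
    local i=1
    while [ "$i" -le "$MAX_RESTARTS" ]; do
      restart_primary
      sleep "$RESTART_WAIT"
      prometheus_healthy && return 0   # recovered, nothing else to do
      i=$((i + 1))
    done
  fi
  activate_standby
}

# Usage, for example from cron on host2:
#   alive_check
```

When host1 is unreachable the restart attempts are skipped and the standby is activated directly, matching the steps above.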
  1. Run Consul on host1 and host2

    docker run --name consul01 -d -p 8300:8300 -p 8400:8400 -p 8500:8500 -p 8600:8600 consul
  2. Run Alertmanager on host1

    docker volume create alertmanager-data
    docker run --name alertmanager01 --volume alertmanager-data:/alertmanager --volume /app/docker/alertmanager:/etc/alertmanager -d -p 9093:9093 -p 9094:9094 prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --web.listen-address=":9093" --cluster.listen-address=":9094"
  3. Run Alertmanager on host2

    Configure a different group_wait in alertmanager.yml on host2 so that the standby node waits a little before sending. This guards against the edge case where the standby node fires an alert before it has received the corresponding alert information from the primary node.

    docker volume create alertmanager-data
    docker run --name alertmanager02 --volume alertmanager-data:/alertmanager --volume /app/docker/alertmanager:/etc/alertmanager -d -p 9093:9093 -p 9094:9094 prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --web.listen-address=":9093" --cluster.listen-address=":9094" --cluster.peer="127.0.0.1:9094"
  4. Run Prometheus on host1

    docker volume create prometheus-tsdb
    docker run --name prometheus01 --volume prometheus-tsdb:/prometheus --volume /app/docker/prometheus:/etc/prometheus -d -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
  5. Run Prometheus on host2

    Modify prometheus.yml: replace the consul scrape job used on 01 with a job that pulls data from the Prometheus on node 01 (targets: '127.0.0.1:9090'), so that 02 stays in sync with the data of 01.

    - job_name: 'federate'
      scrape_interval: 10s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job="prometheus"}'
          - '{__name__=~"job:.*"}'
          - '{__name__=~"node.*"}'
      static_configs:
        - targets:
          - '127.0.0.1:9090'

    Start the instance:

    docker run --name prometheus02 --volume prometheus-tsdb:/prometheus --volume /app/docker/prometheus:/etc/prometheus -d -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
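To check that the federation job has something to pull, the /federate endpoint of prometheus01 can be queried by hand with one of the `match[]` selectors from the job above. This is a rough sketch; the helper names are illustrative, and the count is simply the number of non-comment, non-empty lines in the text exposition output.

```shell
#!/usr/bin/env bash
# fetch_federate PROMETHEUS_URL SELECTOR
# Requests the series matching SELECTOR from the /federate endpoint.
fetch_federate() {
  curl -fsS -G "$1/federate" --data-urlencode "match[]=$2"
}

# count_samples
# Reads exposition-format text on stdin and prints the number of sample lines.
count_samples() {
  awk 'NF && $0 !~ /^#/ { n++ } END { print n + 0 }'
}

# Usage (needs prometheus01 reachable on port 9090):
#   fetch_federate http://127.0.0.1:9090 '{__name__=~"node.*"}' | count_samples
```

A count of 0 for a selector suggests that prometheus01 has no matching series yet, for example because no exporter has been registered in Consul.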