Alerting

Overview

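IoTDB alerting is expected to support two modes:

  • Write-triggered: the user writes raw data to the original time series, and every inserted data point triggers the Trigger's evaluation logic. If the alert condition is met, the Trigger sends an alert to a downstream data sink, which forwards it to external terminals. This mode:

    • suits scenarios that need to monitor every data point in real time.
    • suits scenarios that are not sensitive to raw-data write performance, because the computation in the trigger slows down writes.
  • Continuous query: the user writes raw data to the original time series, and a ContinuousQuery periodically queries that series and writes the results to a new time series. Each of those writes triggers the Trigger's evaluation logic. If the alert condition is met, the Trigger sends an alert to a downstream data sink, which forwards it to external terminals. This mode:

    • suits scenarios that need to check periodically how the data behaved over a time window.
    • suits scenarios that need to downsample the raw data and persist the result.
    • suits scenarios that are sensitive to raw-data write performance, because the periodic query has almost no effect on writes to the original time series.

With the introduction of the Trigger module, write-triggered alerting can now be implemented.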

Deploying AlertManager

Installation and running

Binaries

Precompiled binaries are available for download.

To run:

    ./alertmanager --config.file=<your_file>

Docker image

Images are available on Quay.io or Docker Hub.

To run:

    docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager

Configuration

The example below covers most of the configuration rules. For the full set of rules, see the AlertManager configuration documentation.

Example:

    # alertmanager.yml
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.org'
    # The root route on which each incoming alert enters.
    route:
      # The root route must not have any matchers as it is the entry point for
      # all alerts. It needs to have a receiver configured so alerts that do not
      # match any of the sub-routes are sent to someone.
      receiver: 'team-X-mails'
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster']
      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This ensures that multiple alerts for the same group that start
      # firing shortly after one another are batched together on the first
      # notification.
      group_wait: 30s
      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m
      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend it.
      repeat_interval: 3h
      # All the above attributes are inherited by all child routes and can
      # be overwritten on each.
      # The child route trees.
      routes:
        # This route performs a regular expression match on alert labels to
        # catch alerts that are related to a list of services.
        - match_re:
            service: ^(foo1|foo2|baz)$
          receiver: team-X-mails
          # The service has a sub-route for critical alerts; any alerts
          # that do not match, i.e. severity != critical, fall back to the
          # parent node and are sent to 'team-X-mails'.
          routes:
            - match:
                severity: critical
              receiver: team-X-pager
        - match:
            service: files
          receiver: team-Y-mails
          routes:
            - match:
                severity: critical
              receiver: team-Y-pager
        # This route handles all alerts coming from a database service. If there's
        # no team to handle it, it defaults to the DB team.
        - match:
            service: database
          receiver: team-DB-pager
          # Also group alerts by affected database.
          group_by: [alertname, cluster, database]
          routes:
            - match:
                owner: team-X
              receiver: team-X-pager
            - match:
                owner: team-Y
              receiver: team-Y-pager
    # Inhibition rules allow muting a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        # Apply inhibition if the alertname is the same.
        # CAUTION:
        #   If all label names listed in `equal` are missing
        #   from both the source and target alerts,
        #   the inhibition rule will apply!
        equal: ['alertname']
    receivers:
      - name: 'team-X-mails'
        email_configs:
          - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'
      - name: 'team-X-pager'
        email_configs:
          - to: 'team-X+alerts-critical@example.org'
        pagerduty_configs:
          - routing_key: <team-X-key>
      - name: 'team-Y-mails'
        email_configs:
          - to: 'team-Y+alerts@example.org'
      - name: 'team-Y-pager'
        pagerduty_configs:
          - routing_key: <team-Y-key>
      - name: 'team-DB-pager'
        pagerduty_configs:
          - routing_key: <team-DB-key>
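
To make the routing tree in the example configuration easier to follow, here is a minimal Java sketch of how an alert's labels might be resolved to a receiver. It is a simplified model and not AlertManager's implementation: it assumes each route stops at the first matching child (no `continue`), treats a missing label as an empty string, and ignores grouping and timing. The class `RouteTreeSketch` and its methods are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Simplified model of the routing tree: each node has matchers, a receiver, and children.
final class RouteTreeSketch {
    private final Map<String, String> matchers = new LinkedHashMap<>(); // label -> regex
    private final String receiver;
    private final List<RouteTreeSketch> routes = new ArrayList<>();

    RouteTreeSketch(String receiver) {
        this.receiver = receiver;
    }

    // Models `match`: the label must equal the value exactly.
    RouteTreeSketch matchExact(String label, String value) {
        matchers.put(label, Pattern.quote(value));
        return this;
    }

    // Models `match_re`: the whole label value must match the regex.
    RouteTreeSketch matchRegex(String label, String regex) {
        matchers.put(label, regex);
        return this;
    }

    RouteTreeSketch child(RouteTreeSketch route) {
        routes.add(route);
        return this;
    }

    private boolean matches(Map<String, String> labels) {
        for (Map.Entry<String, String> m : matchers.entrySet()) {
            String value = labels.getOrDefault(m.getKey(), "");
            if (!value.matches(m.getValue())) {
                return false;
            }
        }
        return true;
    }

    // Descend into the first matching child; fall back to this node's receiver.
    String resolve(Map<String, String> labels) {
        for (RouteTreeSketch route : routes) {
            if (route.matches(labels)) {
                return route.resolve(labels);
            }
        }
        return receiver;
    }

    // Builds a label map from alternating key/value arguments.
    static Map<String, String> labels(String... keyValues) {
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            map.put(keyValues[i], keyValues[i + 1]);
        }
        return map;
    }

    // Mirrors the `route` section of the example configuration.
    static RouteTreeSketch exampleTree() {
        RouteTreeSketch root = new RouteTreeSketch("team-X-mails");
        root.child(new RouteTreeSketch("team-X-mails")
                .matchRegex("service", "^(foo1|foo2|baz)$")
                .child(new RouteTreeSketch("team-X-pager").matchExact("severity", "critical")));
        root.child(new RouteTreeSketch("team-Y-mails")
                .matchExact("service", "files")
                .child(new RouteTreeSketch("team-Y-pager").matchExact("severity", "critical")));
        root.child(new RouteTreeSketch("team-DB-pager")
                .matchExact("service", "database")
                .child(new RouteTreeSketch("team-X-pager").matchExact("owner", "team-X"))
                .child(new RouteTreeSketch("team-Y-pager").matchExact("owner", "team-Y")));
        return root;
    }
}
```

Under these assumptions, an alert labelled service=foo1 and severity=critical resolves to team-X-pager, the same alert with severity=warning falls back to team-X-mails, and an alert whose service matches no child route ends up at the root receiver.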

The examples that follow use this configuration:

    # alertmanager.yml
    global:
      smtp_smarthost: ''
      smtp_from: ''
      smtp_auth_username: ''
      smtp_auth_password: ''
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 1m
      group_interval: 10m
      repeat_interval: 10h
      receiver: 'email'
    receivers:
      - name: 'email'
        email_configs:
          - to: ''
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname']

API

The AlertManager API has two versions, v1 and v2. The current version is v2 (see api/v2/openapi.yaml for the specification).

By default, the API prefixes are /api/v1 and /api/v2, and the endpoints for sending alerts are /api/v1/alerts and /api/v2/alerts. If the user sets --web.route-prefix, for example --web.route-prefix=/alertmanager/, the prefixes become /alertmanager/api/v1 and /alertmanager/api/v2, and the endpoints for sending alerts become /alertmanager/api/v1/alerts and /alertmanager/api/v2/alerts.
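
The prefix rule can be sketched as a small Java helper that joins the AlertManager base address, an optional route prefix, and the API version into the alert endpoint. The class `AlertEndpointSketch` and the method `alertsEndpoint` are hypothetical names used only for illustration; they are not part of IoTDB or AlertManager.

```java
import java.net.URI;

final class AlertEndpointSketch {

    // Removes leading and trailing '/' characters so the parts can be joined with single slashes.
    private static String trimSlashes(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) == '/') {
            start++;
        }
        while (end > start && s.charAt(end - 1) == '/') {
            end--;
        }
        return s.substring(start, end);
    }

    // baseUrl:     e.g. "http://127.0.0.1:9093/"
    // routePrefix: the value passed to --web.route-prefix, or "" / null if unset
    // apiVersion:  "v1" or "v2"
    static URI alertsEndpoint(String baseUrl, String routePrefix, String apiVersion) {
        StringBuilder url = new StringBuilder(trimSlashes(baseUrl));
        String prefix = trimSlashes(routePrefix == null ? "" : routePrefix);
        if (!prefix.isEmpty()) {
            url.append('/').append(prefix);
        }
        url.append("/api/").append(apiVersion).append("/alerts");
        return URI.create(url.toString());
    }
}
```

For example, `alertsEndpoint("http://127.0.0.1:9093/", "/alertmanager/", "v2")` yields http://127.0.0.1:9093/alertmanager/api/v2/alerts, and with an empty prefix it yields http://127.0.0.1:9093/api/v2/alerts.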

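Creating a trigger

Writing the trigger class

To define a trigger, the user writes a Java class and implements the logic in its hooks. For the full procedure, see Triggers.

The example below creates the class org.apache.iotdb.trigger.ClusterAlertingExample. Its alertManagerHandler member sends alerts to the AlertManager instance at http://127.0.0.1:9093/.

When value > 100.0, it sends an alert with severity set to critical; when 50.0 < value <= 100.0, it sends an alert with severity set to warning.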

    package org.apache.iotdb.trigger;

    import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerConfiguration;
    import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerEvent;
    import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerHandler;
    import org.apache.iotdb.trigger.api.Trigger;
    import org.apache.iotdb.trigger.api.TriggerAttributes;
    import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;
    import org.apache.iotdb.tsfile.write.record.Tablet;
    import org.apache.iotdb.tsfile.write.schema.MeasurementSchema;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.List;

    public class ClusterAlertingExample implements Trigger {
      private static final Logger LOGGER = LoggerFactory.getLogger(ClusterAlertingExample.class);

      private final AlertManagerHandler alertManagerHandler = new AlertManagerHandler();

      // Endpoint of the AlertManager v2 API that receives alerts.
      private final AlertManagerConfiguration alertManagerConfiguration =
          new AlertManagerConfiguration("http://127.0.0.1:9093/api/v2/alerts");

      private String alertname;

      private final HashMap<String, String> labels = new HashMap<>();

      private final HashMap<String, String> annotations = new HashMap<>();

      @Override
      public void onCreate(TriggerAttributes attributes) throws Exception {
        alertname = "alert_test";

        // "value" and "severity" are filled in when an alert fires.
        labels.put("series", "root.ln.wf01.wt01.temperature");
        labels.put("value", "");
        labels.put("severity", "");

        annotations.put("summary", "high temperature");
        annotations.put("description", "{{.alertname}}: {{.series}} is {{.value}}");

        alertManagerHandler.open(alertManagerConfiguration);
      }

      @Override
      public void onDrop() throws IOException {
        alertManagerHandler.close();
      }

      @Override
      public boolean fire(Tablet tablet) throws Exception {
        List<MeasurementSchema> measurementSchemaList = tablet.getSchemas();
        for (int i = 0, n = measurementSchemaList.size(); i < n; i++) {
          if (measurementSchemaList.get(i).getType().equals(TSDataType.DOUBLE)) {
            // for example, we only deal with the columns of Double type
            double[] values = (double[]) tablet.values[i];
            for (double value : values) {
              if (value > 100.0) {
                LOGGER.info("trigger value > 100");
                labels.put("value", String.valueOf(value));
                labels.put("severity", "critical");
                AlertManagerEvent alertManagerEvent =
                    new AlertManagerEvent(alertname, labels, annotations);
                alertManagerHandler.onEvent(alertManagerEvent);
              } else if (value > 50.0) {
                LOGGER.info("trigger value > 50");
                labels.put("value", String.valueOf(value));
                labels.put("severity", "warning");
                AlertManagerEvent alertManagerEvent =
                    new AlertManagerEvent(alertname, labels, annotations);
                alertManagerHandler.onEvent(alertManagerEvent);
              }
            }
          }
        }
        return true;
      }
    }
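
For readers who want to see what reaches AlertManager, the following Java sketch builds a request body in the shape the v2 alerts endpoint accepts (a JSON array of alerts, each with `labels` and `annotations`) and posts it with the JDK HTTP client. It is an illustration and not the code of AlertManagerHandler: the class `AlertPayloadSketch` is hypothetical, the expansion of `{{.name}}` placeholders from label values is an assumption about how the annotations in the example are filled in, and the JSON escaping covers only quotes and backslashes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.TreeMap;

final class AlertPayloadSketch {

    // Assumption: each "{{.label}}" placeholder is replaced by the value of that label.
    static String expand(String template, Map<String, String> labels) {
        String result = template;
        for (Map.Entry<String, String> label : labels.entrySet()) {
            result = result.replace("{{." + label.getKey() + "}}", label.getValue());
        }
        return result;
    }

    // Builds a one-element JSON array; the alert name travels as the "alertname" label.
    static String toJson(String alertname, Map<String, String> labels,
                         Map<String, String> annotations) {
        Map<String, String> allLabels = new TreeMap<>(labels); // sorted for stable output
        allLabels.put("alertname", alertname);
        Map<String, String> expanded = new TreeMap<>();
        for (Map.Entry<String, String> annotation : annotations.entrySet()) {
            expanded.put(annotation.getKey(), expand(annotation.getValue(), allLabels));
        }
        return "[{\"labels\":" + jsonObject(allLabels)
                + ",\"annotations\":" + jsonObject(expanded) + "}]";
    }

    private static String jsonObject(Map<String, String> map) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (!first) {
                json.append(',');
            }
            first = false;
            json.append('"').append(escape(entry.getKey())).append("\":\"")
                .append(escape(entry.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    // Minimal escaping only; control characters are not handled in this sketch.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Posts the body to an alerts endpoint such as http://127.0.0.1:9093/api/v2/alerts
    // and returns the HTTP status code.
    static int post(String endpoint, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Calling `toJson` needs no running AlertManager, so it can be used to inspect the body before `post` sends it to a live instance.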

Creating the trigger

The following SQL statement registers a trigger named root-ln-wf01-wt01-alert on the time series root.ln.wf01.wt01.temperature. Its logic is defined by the class org.apache.iotdb.trigger.ClusterAlertingExample.

    CREATE STATELESS TRIGGER `root-ln-wf01-wt01-alert`
    AFTER INSERT
    ON root.ln.wf01.wt01.temperature
    AS "org.apache.iotdb.trigger.ClusterAlertingExample"
    USING URI 'http://jar/ClusterAlertingExample.jar'

Writing data

Once AlertManager is deployed and running and the trigger has been created, we can test alerting by writing data to the time series.

    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (1, 0);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (2, 30);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (3, 60);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (4, 90);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (5, 120);

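After these statements run, an alert email arrives. Because our AlertManager configuration makes alerts with severity critical inhibit alerts with severity warning, the email contains only the alert triggered by writing (5, 120).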
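
The outcome can be traced with a short Java walkthrough. It applies the trigger's thresholds to the five written values and then a deliberately simplified model of the inhibit rule: if any critical alert is present, warnings with the same alertname are dropped. The class `SeverityWalkthrough` is hypothetical, and the model ignores AlertManager's grouping and timing (group_wait and so on) and assumes all alerts share one alertname, as they do in the trigger example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SeverityWalkthrough {

    // Same thresholds as the trigger: > 100.0 is critical, > 50.0 is warning.
    static String severityFor(double value) {
        if (value > 100.0) {
            return "critical";
        }
        if (value > 50.0) {
            return "warning";
        }
        return "none"; // no alert is sent
    }

    // Simplified inhibition: a critical alert mutes every warning with the same alertname.
    static List<String> afterInhibition(List<String> severities) {
        boolean hasCritical = severities.contains("critical");
        List<String> notified = new ArrayList<>();
        for (String severity : severities) {
            if (severity.equals("none")) {
                continue;
            }
            if (hasCritical && severity.equals("warning")) {
                continue;
            }
            notified.add(severity);
        }
        return notified;
    }

    public static void main(String[] args) {
        List<Double> written = Arrays.asList(0.0, 30.0, 60.0, 90.0, 120.0);
        List<String> fired = new ArrayList<>();
        for (double value : written) {
            fired.add(severityFor(value));
        }
        System.out.println("fired:    " + fired);
        System.out.println("notified: " + afterInhibition(fired));
    }
}
```

In this model the values 60 and 90 fire warning alerts and 120 fires a critical alert, so only the critical alert remains after inhibition.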
