Alerting

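Overview

IoTDB alerting is expected to support two modes:

Write-triggered: the user writes raw data to the original time series, and every inserted data point runs the Trigger's evaluation logic. If the alert condition is met, the Trigger sends an alert to a downstream data sink, which forwards it to external endpoints. This mode:
- suits scenarios that need to monitor every data point immediately.
- suits scenarios that are not sensitive to raw-data write performance, because the computation inside the trigger slows down writes.
Continuous query: the user writes raw data to the original time series, and a ContinuousQuery periodically queries that series and writes the result to a new time series. Each of those writes runs the Trigger's evaluation logic. If the alert condition is met, the Trigger sends an alert to a downstream data sink, which forwards it to external endpoints. This mode:
- suits scenarios that need to check periodically how the data behaved over a time window.
- suits scenarios that need to downsample raw data and persist the result.
- suits scenarios that are sensitive to raw-data write performance, because periodic queries have almost no effect on writes to the original time series.

With the introduction of the Trigger module, write-triggered alerting can now be implemented.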

Deploying AlertManager

Installation and running

Binary

Precompiled binaries are available from the Prometheus download page.

To run:

./alertmanager --config.file=<your_file>

Docker image

Images are available on Quay.io and Docker Hub.

To run:

docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager

Configuration

The example below covers most of the configuration rules. For the full details, see the AlertManager configuration documentation.

Example:

# alertmanager.yml
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-X-mails'
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster']
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h
  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.
  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails
    # The service has a sub-route for critical alerts, any alerts
    # that do not match, i.e. severity != critical, fall-back to the
    # parent node and are sent to 'team-X-mails'
    routes:
    - match:
        severity: critical
      receiver: team-X-pager
  - match:
      service: files
    receiver: team-Y-mails
    routes:
    - match:
        severity: critical
      receiver: team-Y-pager
  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - match:
      service: database
    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]
    routes:
    - match:
        owner: team-X
      receiver: team-X-pager
    - match:
        owner: team-Y
      receiver: team-Y-pager
# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  # CAUTION: 
  #   If all label names listed in `equal` are missing 
  #   from both the source and target alerts,
  #   the inhibition rule will apply!
  equal: ['alertname']
receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'
- name: 'team-X-pager'
  email_configs:
  - to: 'team-X+alerts-critical@example.org'
  pagerduty_configs:
  - routing_key: <team-X-key>
- name: 'team-Y-mails'
  email_configs:
  - to: 'team-Y+alerts@example.org'
- name: 'team-Y-pager'
  pagerduty_configs:
  - routing_key: <team-Y-key>
- name: 'team-DB-pager'
  pagerduty_configs:
  - routing_key: <team-DB-key>

The examples that follow use this configuration:

# alertmanager.yml
global: 
  smtp_smarthost: ''
  smtp_from: '' 
  smtp_auth_username: '' 
  smtp_auth_password: '' 
  smtp_require_tls: false
route:
  group_by: ['alertname'] 
  group_wait: 1m
  group_interval: 10m
  repeat_interval: 10h 
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs: 
    - to: '' 
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

API

The AlertManager API has two versions, v1 and v2. The current version is v2 (its specification is in api/v2/openapi.yaml).

By default the API prefix is /api/v1 or /api/v2, and the endpoint for sending alerts is /api/v1/alerts or /api/v2/alerts. If the user sets --web.route-prefix, for example --web.route-prefix=/alertmanager/, the prefix becomes /alertmanager/api/v1 or /alertmanager/api/v2, and the endpoint for sending alerts becomes /alertmanager/api/v1/alerts or /alertmanager/api/v2/alerts.
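
To make the prefix rule concrete, here is a minimal sketch that derives the alert endpoint from a base URL and an optional route prefix, builds a v2-style payload (a JSON array of alerts, each with `labels` and `annotations`), and posts it with the JDK 11+ HTTP client. The class `AlertApiSketch` and its methods are hypothetical names chosen for illustration; this is not how IoTDB's AlertManagerHandler is implemented, and the JSON escaping only covers the simple values used here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Hypothetical helper, for illustration only. */
final class AlertApiSketch {
  private AlertApiSketch() {}

  /** Joins base URL, optional --web.route-prefix value, and API version into the alerts endpoint. */
  static String alertsEndpoint(String baseUrl, String routePrefix, String apiVersion) {
    // Drop trailing slashes, e.g. "http://127.0.0.1:9093/" -> "http://127.0.0.1:9093".
    String base = baseUrl.replaceAll("/+$", "");
    // "" and "/" both mean "no prefix"; "/alertmanager/" is normalised to "/alertmanager".
    String prefix = routePrefix == null ? "" : routePrefix.replaceAll("^/+|/+$", "");
    if (!prefix.isEmpty()) {
      prefix = "/" + prefix;
    }
    return base + prefix + "/api/" + apiVersion + "/alerts";
  }

  /** Quotes a string for JSON; handles only backslashes and double quotes. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  static String toJsonObject(Map<String, String> map) {
    StringJoiner joiner = new StringJoiner(",", "{", "}");
    // Sort keys so the generated JSON is reproducible.
    for (Map.Entry<String, String> e : new TreeMap<>(map).entrySet()) {
      joiner.add(quote(e.getKey()) + ":" + quote(e.getValue()));
    }
    return joiner.toString();
  }

  /** Builds a one-element alert array carrying labels and annotations. */
  static String payload(Map<String, String> labels, Map<String, String> annotations) {
    return "[{\"labels\":" + toJsonObject(labels)
        + ",\"annotations\":" + toJsonObject(annotations) + "}]";
  }

  /** Posts the payload and returns the HTTP status code; needs a reachable AlertManager. */
  static int post(String endpoint, String json) throws Exception {
    HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    return response.statusCode();
  }

  public static void main(String[] args) throws Exception {
    String endpoint = alertsEndpoint("http://127.0.0.1:9093", "/alertmanager/", "v2");
    String json = payload(
        Map.of("alertname", "alert_test", "severity", "critical"),
        Map.of("summary", "high temperature"));
    System.out.println(endpoint);
    System.out.println(json);
    // Uncomment once an AlertManager instance is listening at the endpoint:
    // System.out.println(post(endpoint, json));
  }
}
```

Keeping the endpoint construction separate from the HTTP call makes it easy to check the resulting URL against the value passed to --web.route-prefix before any alert is sent.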

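Creating a trigger

Writing the trigger class

Users define a trigger by creating a Java class and implementing the logic in its hooks. See Triggers for the full procedure.

The example below creates the class org.apache.iotdb.trigger.ClusterAlertingExample, whose alertManagerHandler field sends alerts to the AlertManager instance at http://127.0.0.1:9093/.

When value > 100.0, it sends an alert with severity critical; when 50.0 < value <= 100.0, it sends an alert with severity warning.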
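
The threshold rule can be checked in isolation before it is wired into a trigger. The sketch below restates it as a pure function; `SeverityRule` and `severityFor` are hypothetical names introduced here and do not appear in IoTDB.

```java
import java.util.Optional;

/** Hypothetical helper that mirrors the threshold rule described above. */
final class SeverityRule {
  private SeverityRule() {}

  /** Returns the severity to report, or empty when the value should not raise an alert. */
  static Optional<String> severityFor(double value) {
    if (value > 100.0) {
      return Optional.of("critical");
    }
    if (value > 50.0) {
      // Reached only when 50.0 < value <= 100.0.
      return Optional.of("warning");
    }
    // Values at or below 50.0 (and NaN) raise no alert.
    return Optional.empty();
  }

  public static void main(String[] args) {
    for (double v : new double[] {0, 30, 60, 90, 120}) {
      System.out.println(v + " -> " + severityFor(v).orElse("no alert"));
    }
  }
}
```

Note the boundaries: exactly 100.0 is a warning and exactly 50.0 raises nothing, which matches the order of the comparisons in the trigger class.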

package org.apache.iotdb.trigger;
import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerConfiguration;
import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerEvent;
import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerHandler;
import org.apache.iotdb.trigger.api.Trigger;
import org.apache.iotdb.trigger.api.TriggerAttributes;
import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;
import org.apache.iotdb.tsfile.write.record.Tablet;
import org.apache.iotdb.tsfile.write.schema.MeasurementSchema;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
public class ClusterAlertingExample implements Trigger {
  private static final Logger LOGGER = LoggerFactory.getLogger(ClusterAlertingExample.class);
  private final AlertManagerHandler alertManagerHandler = new AlertManagerHandler();
  private final AlertManagerConfiguration alertManagerConfiguration =
      new AlertManagerConfiguration("http://127.0.0.1:9093/api/v2/alerts");
  private String alertname;
  private final HashMap<String, String> labels = new HashMap<>();
  private final HashMap<String, String> annotations = new HashMap<>();
  @Override
  public void onCreate(TriggerAttributes attributes) throws Exception {
    alertname = "alert_test";
    labels.put("series", "root.ln.wf01.wt01.temperature");
    labels.put("value", "");
    labels.put("severity", "");
    annotations.put("summary", "high temperature");
    annotations.put("description", "{{.alertname}}: {{.series}} is {{.value}}");
    alertManagerHandler.open(alertManagerConfiguration);
  }
  @Override
  public void onDrop() throws IOException {
    alertManagerHandler.close();
  }
  @Override
  public boolean fire(Tablet tablet) throws Exception {
    List<MeasurementSchema> measurementSchemaList = tablet.getSchemas();
    for (int i = 0, n = measurementSchemaList.size(); i < n; i++) {
      if (measurementSchemaList.get(i).getType().equals(TSDataType.DOUBLE)) {
        // for example, we only deal with the columns of Double type
        double[] values = (double[]) tablet.values[i];
        for (double value : values) {
          if (value > 100.0) {
            LOGGER.info("trigger value > 100");
            labels.put("value", String.valueOf(value));
            labels.put("severity", "critical");
            AlertManagerEvent alertManagerEvent =
                new AlertManagerEvent(alertname, labels, annotations);
            alertManagerHandler.onEvent(alertManagerEvent);
          } else if (value > 50.0) {
            LOGGER.info("trigger value > 50");
            labels.put("value", String.valueOf(value));
            labels.put("severity", "warning");
            AlertManagerEvent alertManagerEvent =
                new AlertManagerEvent(alertname, labels, annotations);
            alertManagerHandler.onEvent(alertManagerEvent);
          }
        }
      }
    }
    return true;
  }
}

Creating the trigger

The following SQL statement registers a trigger named root-ln-wf01-wt01-alert on the time series root.ln.wf01.wt01.temperature. Its logic is defined by the class org.apache.iotdb.trigger.ClusterAlertingExample.

  CREATE STATELESS TRIGGER `root-ln-wf01-wt01-alert`
  AFTER INSERT
  ON root.ln.wf01.wt01.temperature
  AS "org.apache.iotdb.trigger.ClusterAlertingExample"
  USING URI 'http://jar/ClusterAlertingExample.jar'

Writing data

Once AlertManager is deployed and running and the trigger has been created, the alerting function can be tested by writing data to the time series.

INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (1, 0);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (2, 30);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (3, 60);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (4, 90);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (5, 120);

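After these statements run, an alert email arrives. Because the AlertManager configuration makes alerts with severity critical inhibit alerts with severity warning, the email contains only the alert triggered by writing (5, 120).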
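
The outcome can be reasoned through with a small simulation. The sketch below applies the trigger's thresholds to the five written values and then applies a simplified version of the inhibit rule (a critical alert mutes every warning alert with the same alertname). `InhibitionSketch` and its nested `Alert` class are hypothetical. The model ignores timing, on the assumption that all five writes land within the configured group_wait of 1m, so the critical alert is already firing when the first notification is sent.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the trigger plus the inhibit rule. */
final class InhibitionSketch {
  private InhibitionSketch() {}

  /** Only the fields that matter for this rule. */
  static final class Alert {
    final String alertname;
    final String severity;
    final double value;

    Alert(String alertname, String severity, double value) {
      this.alertname = alertname;
      this.severity = severity;
      this.value = value;
    }
  }

  /** Applies the trigger's thresholds to each written value. */
  static List<Alert> fire(double[] values) {
    List<Alert> alerts = new ArrayList<>();
    for (double v : values) {
      if (v > 100.0) {
        alerts.add(new Alert("alert_test", "critical", v));
      } else if (v > 50.0) {
        alerts.add(new Alert("alert_test", "warning", v));
      }
    }
    return alerts;
  }

  /** Drops warning alerts when a critical alert with the same alertname exists. */
  static List<Alert> inhibit(List<Alert> alerts) {
    List<Alert> kept = new ArrayList<>();
    for (Alert target : alerts) {
      boolean muted = false;
      if (target.severity.equals("warning")) {
        for (Alert source : alerts) {
          if (source.severity.equals("critical")
              && source.alertname.equals(target.alertname)) {
            muted = true;
            break;
          }
        }
      }
      if (!muted) {
        kept.add(target);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    double[] written = {0, 30, 60, 90, 120};
    for (Alert a : inhibit(fire(written))) {
      System.out.println(a.severity + " " + a.value); // prints "critical 120.0"
    }
  }
}
```

The values 60 and 90 each raise a warning, but both are muted by the critical alert raised for 120, which leaves a single alert in the notification.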
