Understanding groups

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

To preserve flexibility, Rollup Jobs are defined based on how future queries may need to use the data. Traditionally, systems force the admin to make decisions about what metrics to roll up and on what interval, for example the average of cpu_time on an hourly basis. This is limiting: if, in the future, the admin wishes to see the average of cpu_time on an hourly basis and partitioned by host_name, they are out of luck.

Of course, the admin can decide to roll up the [hour, host] tuple on an hourly basis, but as the number of grouping keys grows, so does the number of tuples the admin needs to configure. Furthermore, these [hour, host] tuples are only useful for hourly rollups; daily, weekly, or monthly rollups all require new configurations.
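To make that growth concrete, here is a minimal sketch that counts how many separate tuple configurations would be needed under a simple model: one configuration per interval for every non-empty combination of grouping keys. The function name, the "rack" key, and the counting model itself are assumptions for illustration; nothing here is computed by Elasticsearch.

```python
from itertools import combinations


def count_tuple_configs(grouping_keys, intervals):
    """Count one config per (interval, non-empty subset of grouping keys)."""
    subsets = 0
    for size in range(1, len(grouping_keys) + 1):
        subsets += sum(1 for _ in combinations(grouping_keys, size))
    return subsets * len(intervals)


intervals = ["hourly", "daily", "weekly", "monthly"]

# "host" comes from the text; "datacenter" and "rack" are hypothetical extra keys.
one_key = count_tuple_configs(["host"], intervals)
three_keys = count_tuple_configs(["host", "datacenter", "rack"], intervals)

print(one_key, three_keys)  # 4 28
```

Under this model, going from one grouping key to three raises the count from 4 to 28 configurations, which is the burden that group-based configuration is meant to avoid.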

Rather than force the admin to decide ahead of time which individual tuples should be rolled up, Elasticsearch’s Rollup jobs are configured based on which groups are potentially useful to future queries. For example, this configuration:

  "groups" : {
    "date_histogram": {
      "field": "timestamp",
      "fixed_interval": "1h",
      "delay": "7d"
    },
    "terms": {
      "fields": ["hostname", "datacenter"]
    },
    "histogram": {
      "fields": ["load", "net_in", "net_out"],
      "interval": 5
    }
  }

allows date_histogram to be used on the "timestamp" field, terms aggregations to be used on the "hostname" and "datacenter" fields, and histograms to be used on any of the "load", "net_in", and "net_out" fields.

Importantly, these aggs/fields can be used in any combination. This aggregation:

  "aggs" : {
    "hourly": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "host_names": {
          "terms": {
            "field": "hostname"
          }
        }
      }
    }
  }

is just as valid as this aggregation:

  "aggs" : {
    "hourly": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "data_center": {
          "terms": {
            "field": "datacenter"
          },
          "aggs": {
            "host_names": {
              "terms": {
                "field": "hostname"
              },
              "aggs": {
                "load_values": {
                  "histogram": {
                    "field": "load",
                    "interval": 5
                  }
                }
              }
            }
          }
        }
      }
    }
  }

You’ll notice that the second aggregation is not only substantially larger, it also swapped the position of the terms aggregation on "hostname", illustrating how the order of aggregations does not matter to rollups. Similarly, while the date_histogram is required for rolling up data, it isn’t required while querying (although often used). For example, this is a valid aggregation for Rollup Search to execute:

  "aggs" : {
    "host_names": {
      "terms": {
        "field": "hostname"
      }
    }
  }

Ultimately, when configuring groups for a job, think in terms of how you might wish to partition data in a query at a future date, then include those fields in the config. Because Rollup Search allows any order or combination of the grouped fields, you just need to decide whether a field is useful for aggregating later, and how you might wish to use it (terms, histogram, etc.).
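The "any order or combination" rule can be sketched as a simple coverage check: walk a query's aggregation tree and confirm that every aggregation type and field pair appears in the job's groups. This is a simplified illustration, not the validation Rollup Search actually performs; it ignores metrics and interval compatibility. The helper names and the "username" field are hypothetical.

```python
def collect_aggs(aggs):
    """Yield (agg_type, field) for every aggregation in a nested "aggs" body."""
    for body in aggs.values():
        for key, value in body.items():
            if key in ("aggs", "aggregations"):
                # Sub-aggregations: recurse, nesting order is irrelevant here.
                yield from collect_aggs(value)
            else:
                yield key, value["field"]


def allowed_pairs(groups):
    """Build the set of (agg_type, field) pairs that a "groups" config allows."""
    pairs = set()
    for agg_type, conf in groups.items():
        # date_histogram uses a single "field"; terms and histogram use "fields".
        fields = conf["fields"] if "fields" in conf else [conf["field"]]
        pairs.update((agg_type, field) for field in fields)
    return pairs


def is_covered(groups, aggs):
    """True if every aggregation uses a type/field pair listed in groups."""
    allowed = allowed_pairs(groups)
    return all(pair in allowed for pair in collect_aggs(aggs))


groups = {
    "date_histogram": {"field": "timestamp", "fixed_interval": "1h", "delay": "7d"},
    "terms": {"fields": ["hostname", "datacenter"]},
    "histogram": {"fields": ["load", "net_in", "net_out"], "interval": 5},
}

terms_only = {"host_names": {"terms": {"field": "hostname"}}}
nested = {
    "hourly": {
        "date_histogram": {"field": "timestamp", "fixed_interval": "1h"},
        "aggs": {"data_center": {"terms": {"field": "datacenter"}}},
    }
}
# "username" is a hypothetical field that the job does not group on.
not_grouped = {"by_user": {"terms": {"field": "username"}}}

print(is_covered(groups, terms_only))   # True
print(is_covered(groups, nested))       # True
print(is_covered(groups, not_grouped))  # False
```

Because the check compares unordered pairs, it returns the same answer no matter how the aggregations are nested.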

Calendar vs fixed time intervals

Each rollup job must have a date histogram group with a defined interval. Elasticsearch understands both calendar and fixed time intervals. Fixed time intervals are fairly easy to understand; 60s means sixty seconds. But what does 1M mean? One month of time depends on which month we are talking about, because some months are longer or shorter than others. This is an example of calendar time, where the duration of the unit depends on context. Calendar units are also affected by leap seconds, leap years, etc.

This is important because the buckets generated by rollup are in either calendar or fixed intervals, and this limits how you can query them later. See "Requests must be multiples of the config".

We recommend sticking with fixed time intervals, since they are easier to understand and more flexible at query time. Fixed intervals will introduce some drift in your data during leap events, and you will have to think about months as a fixed quantity (30 days) instead of the actual calendar length. However, this is often easier than dealing with calendar units at query time.
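As a rough illustration of that drift, the sketch below lines up twelve consecutive fixed 30-day buckets against one calendar year. The start date is an arbitrary non-leap year chosen for the example, and the variable names are assumptions.

```python
from datetime import date, timedelta

FIXED_MONTH = timedelta(days=30)  # the "30 days" approximation from the text

start = date(2023, 1, 1)  # arbitrary example start, non-leap year

# Start of the bucket that follows twelve consecutive fixed 30-day buckets.
after_twelve = start + 12 * FIXED_MONTH

# Gap between that point and the start of the next calendar year.
drift_days = (date(2024, 1, 1) - after_twelve).days

print(after_twelve, drift_days)  # 2023-12-27 5
```

Twelve fixed "months" cover 360 days, so the thirteenth bucket starts five days before the calendar year ends.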

Multiples of units are always “fixed”. For example, 2h is always the fixed quantity 7200 seconds. Single units can be fixed or calendar depending on the unit:

Unit          Calendar   Fixed
millisecond   NA         1ms, 10ms, etc
second        NA         1s, 10s, etc
minute        1m         2m, 10m, etc
hour          1h         2h, 10h, etc
day           1d         2d, 10d, etc
week          1w         NA
month         1M         NA
quarter       1q         NA
year          1y         NA

For some units where there are both fixed and calendar, you may need to express the quantity in terms of the next smaller unit. For example, if you want a fixed day (not a calendar day), you should specify 24h instead of 1d. Similarly, if you want fixed hours, specify 60m instead of 1h. This is because the single quantity entails calendar time, and limits you to querying by calendar time in the future.
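The rules in the table above can be expressed as a small classifier. This is a sketch that encodes only that table; it does not model the separate fixed_interval and calendar_interval parameters, and the function and constant names are assumptions. Note that "M" (month) and "m" (minute) differ only by case.

```python
import re

FIXED_ONLY = {"ms", "s"}              # no calendar form
BOTH = {"m", "h", "d"}                # single unit is calendar, multiples are fixed
CALENDAR_ONLY = {"w", "M", "q", "y"}  # single unit only, no fixed form

# The two fixed substitutes given in the text.
FIXED_EQUIVALENT = {"1d": "24h", "1h": "60m"}

_INTERVAL = re.compile(r"^(\d+)(ms|s|m|h|d|w|M|q|y)$")


def classify_interval(interval):
    """Return "calendar" or "fixed" for an interval string such as "1h"."""
    match = _INTERVAL.match(interval)
    if match is None:
        raise ValueError(f"unrecognized interval: {interval!r}")
    quantity, unit = int(match.group(1)), match.group(2)
    if quantity < 1:
        raise ValueError(f"quantity must be positive: {interval!r}")
    if unit in FIXED_ONLY:
        return "fixed"
    if quantity == 1:
        return "calendar"
    if unit in BOTH:
        return "fixed"
    raise ValueError(f"multiples of {unit!r} are not available: {interval!r}")


for text in ("1h", "60m", "2h", "1M", "10ms"):
    print(text, classify_interval(text))
```

For example, classify_interval("1d") reports "calendar", while its substitute FIXED_EQUIVALENT["1d"], which is "24h", reports "fixed".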

Grouping limitations with heterogeneous indices

There was previously a limitation in how Rollup could handle indices that had heterogeneous mappings (multiple, unrelated/non-overlapping mappings). The recommendation at the time was to configure a separate job per data “type”. For example, you might configure a separate job for each Beats module that you had enabled (one for process, another for filesystem, etc).

This recommendation was driven by internal implementation details that caused document counts to be potentially incorrect if a single “merged” job was used.

This limitation has since been alleviated. As of 6.4.0, it is now considered best practice to combine all rollup configurations into a single job.

As an example, if your index has two types of documents:

  {
    "timestamp": 1516729294000,
    "temperature": 200,
    "voltage": 5.2,
    "node": "a"
  }

and

  {
    "timestamp": 1516729294000,
    "price": 123,
    "title": "Foo"
  }

the best practice is to combine them into a single rollup job which covers both of these document types, like this:

  PUT _rollup/job/combined
  {
    "index_pattern": "data-*",
    "rollup_index": "data_rollup",
    "cron": "*/30 * * * * ?",
    "page_size": 1000,
    "groups": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": [ "node", "title" ]
      }
    },
    "metrics": [
      {
        "field": "temperature",
        "metrics": [ "min", "max", "sum" ]
      },
      {
        "field": "price",
        "metrics": [ "avg" ]
      }
    ]
  }
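When combining document types into one job, it can help to check which fields of each document the job actually references. The sketch below does that for the combined job and the two sample documents; the helper name and the "sensor" and "product" labels are assumptions made for this example, and the check only inspects the config rather than calling Elasticsearch.

```python
def rolled_up_fields(job):
    """Collect every field a rollup job config references in groups or metrics."""
    fields = set()
    for conf in job["groups"].values():
        # date_histogram uses a single "field"; terms uses "fields".
        fields.update(conf["fields"] if "fields" in conf else [conf["field"]])
    fields.update(metric["field"] for metric in job["metrics"])
    return fields


job = {
    "groups": {
        "date_histogram": {"field": "timestamp", "fixed_interval": "1h", "delay": "7d"},
        "terms": {"fields": ["node", "title"]},
    },
    "metrics": [
        {"field": "temperature", "metrics": ["min", "max", "sum"]},
        {"field": "price", "metrics": ["avg"]},
    ],
}

# Labels are hypothetical; the documents are the two samples from the text.
sensor_doc = {"timestamp": 1516729294000, "temperature": 200, "voltage": 5.2, "node": "a"}
product_doc = {"timestamp": 1516729294000, "price": 123, "title": "Foo"}

covered = rolled_up_fields(job)
for name, doc in (("sensor", sensor_doc), ("product", product_doc)):
    # Fields present in the document but not referenced by the job.
    print(name, sorted(set(doc) - covered))
# sensor ['voltage']
# product []
```

The output shows that the example job references every field of the second document, while "voltage" from the first document appears in neither groups nor metrics.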

Doc counts and overlapping jobs

There was previously an issue with document counts on “overlapping” job configurations, driven by the same internal implementation detail. If there were two Rollup jobs saving to the same index, where one job is a “subset” of another job, it was possible that document counts could be incorrect for certain aggregation arrangements.

This issue has also since been eliminated in 6.4.0.