Sampler aggregation

A filtering aggregation that limits the processing of any sub-aggregations to a sample of the top-scoring documents.

Example use cases:

  • Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
  • Reducing the running cost of aggregations that can produce useful results using only samples, e.g. significant_terms

Example:

A query on StackOverflow data for the popular term javascript OR the rarer term kibana will match many documents, most of them missing the word Kibana. To focus the significant_terms aggregation on top-scoring documents that are more likely to match the most interesting parts of our query, we use a sampler aggregation.

    POST /stackoverflow/_search?size=0
    {
      "query": {
        "query_string": {
          "query": "tags:kibana OR tags:javascript"
        }
      },
      "aggs": {
        "sample": {
          "sampler": {
            "shard_size": 200
          },
          "aggs": {
            "keywords": {
              "significant_terms": {
                "field": "tags",
                "exclude": [ "kibana", "javascript" ]
              }
            }
          }
        }
      }
    }

Response:

    {
      ...
      "aggregations": {
        "sample": {
          "doc_count": 200,
          "keywords": {
            "doc_count": 200,
            "bg_count": 650,
            "buckets": [
              {
                "key": "elasticsearch",
                "doc_count": 150,
                "score": 1.078125,
                "bg_count": 200
              },
              {
                "key": "logstash",
                "doc_count": 50,
                "score": 0.5625,
                "bg_count": 50
              }
            ]
          }
        }
      }
    }

200 documents were sampled in total. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.

Without the sampler aggregation, the request considers the full “long tail” of low-quality matches and therefore identifies less significant terms such as jquery and angular rather than focusing on the more insightful Kibana-related terms.

    POST /stackoverflow/_search?size=0
    {
      "query": {
        "query_string": {
          "query": "tags:kibana OR tags:javascript"
        }
      },
      "aggs": {
        "low_quality_keywords": {
          "significant_terms": {
            "field": "tags",
            "size": 3,
            "exclude": [ "kibana", "javascript" ]
          }
        }
      }
    }

Response:

    {
      ...
      "aggregations": {
        "low_quality_keywords": {
          "doc_count": 600,
          "bg_count": 650,
          "buckets": [
            {
              "key": "angular",
              "doc_count": 200,
              "score": 0.02777,
              "bg_count": 200
            },
            {
              "key": "jquery",
              "doc_count": 200,
              "score": 0.02777,
              "bg_count": 200
            },
            {
              "key": "logstash",
              "doc_count": 50,
              "score": 0.0069,
              "bg_count": 50
            }
          ]
        }
      }
    }
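
The score values in the two responses above can be reproduced by hand, which helps show why sampling changes the ranking. The sketch below assumes the default significant_terms heuristic (JLH), which multiplies the absolute change in a term's frequency between the foreground and background sets by the relative change. The function name jlh_score is illustrative only and is not part of any Elasticsearch client.

```python
def jlh_score(fg_term, fg_total, bg_term, bg_total):
    """Assumed JLH heuristic: (fg% - bg%) * (fg% / bg%)."""
    fg_pct = fg_term / fg_total   # term frequency in the foreground set
    bg_pct = bg_term / bg_total   # term frequency in the background set
    if bg_pct == 0 or fg_pct <= bg_pct:
        # Simplification: terms that are not more frequent in the
        # foreground are treated as not significant.
        return 0.0
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)

# Counts copied from the responses above: (doc_count, set size, bg_count, 650).
sampled_elasticsearch = jlh_score(150, 200, 200, 650)
sampled_logstash = jlh_score(50, 200, 50, 650)
unsampled_angular = jlh_score(200, 600, 200, 650)
unsampled_logstash = jlh_score(50, 600, 50, 650)

for label, score in [
    ("sampled elasticsearch", sampled_elasticsearch),
    ("sampled logstash", sampled_logstash),
    ("unsampled angular", unsampled_angular),
    ("unsampled logstash", unsampled_logstash),
]:
    print(f"{label}: {score:.6f}")
```

Under that assumption the results agree with the responses: 1.078125 and 0.5625 inside the 200-document sample, against roughly 0.0278 and 0.0069 over all 600 matches. The same 50 logstash documents score far higher in the sample because they make up a quarter of it, rather than a twelfth of the full result set.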

shard_size

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
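
As a rough mental model of this per-shard limit, the following sketch keeps the highest-scoring documents from each shard independently. It is a conceptual illustration, not Elasticsearch's implementation: the functions sample_shard and sample_index, and the (doc_id, score) tuples, are hypothetical.

```python
import heapq

def sample_shard(scored_docs, shard_size=100):
    """Keep the shard_size highest-scoring (doc_id, score) pairs of one shard."""
    return heapq.nlargest(shard_size, scored_docs, key=lambda pair: pair[1])

def sample_index(shards, shard_size=100):
    """Sample every shard independently and concatenate the results."""
    return [doc for shard in shards for doc in sample_shard(shard, shard_size)]

# Two hypothetical shards holding the documents matched by the query.
shards = [
    [("a1", 9.1), ("a2", 0.4), ("a3", 5.0)],
    [("b1", 2.2), ("b2", 7.7)],
]

sample = sample_index(shards, shard_size=2)
print([doc_id for doc_id, _ in sample])
```

Because the limit applies per shard, the combined sample can hold at most shard_size multiplied by the number of shards. In this model the low-scoring document a2 never reaches the sub-aggregations.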

Limitations

Cannot be nested under breadth_first aggregations

Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation whose collect_mode has been switched from the default depth_first mode to breadth_first, as this discards scores. In this situation an error will be thrown.
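
To catch this nesting before a request is sent, a client could walk the aggs section of the request body. The helper below is a hypothetical client-side sketch, not an Elasticsearch API, and it assumes the restriction applies to any terms ancestor in breadth_first mode. The server-side check remains authoritative.

```python
def find_invalid_samplers(aggs, under_breadth_first=False, path=()):
    """Return paths of sampler aggregations nested under a breadth_first terms aggregation."""
    invalid = []
    for name, body in aggs.items():
        here = path + (name,)
        if "sampler" in body and under_breadth_first:
            invalid.append("/".join(here))
        # A terms aggregation in breadth_first mode discards scores for
        # everything nested beneath it.
        terms = body.get("terms", {})
        breadth = under_breadth_first or terms.get("collect_mode") == "breadth_first"
        children = body.get("aggs") or body.get("aggregations") or {}
        invalid.extend(find_invalid_samplers(children, breadth, here))
    return invalid

# Hypothetical request fragment that nests a sampler the wrong way round.
bad_aggs = {
    "by_tag": {
        "terms": {"field": "tags", "collect_mode": "breadth_first"},
        "aggs": {
            "sample": {
                "sampler": {"shard_size": 200},
                "aggs": {"keywords": {"significant_terms": {"field": "tags"}}},
            }
        },
    }
}

problems = find_invalid_samplers(bad_aggs)
print(problems)
```

Placing the sampler above the terms aggregation, or leaving collect_mode at depth_first, makes the check return an empty list.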