Star-tree index

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, join the discussion on the OpenSearch forum.

A star-tree index is a multi-field index that improves the performance of aggregations.

OpenSearch automatically uses a star-tree index to optimize aggregations if the queried fields are star-tree dimension fields and the aggregated fields are star-tree metric fields. No changes are required in the query syntax or the request parameters.

When to use a star-tree index

A star-tree index can be used to perform faster aggregations. Consider the following criteria and features when deciding to use a star-tree index:

  • Star-tree indexes natively support multi-field aggregations.
  • Star-tree indexes are created in real time as part of the indexing process, so the data in a star-tree will always be up to date.
  • A star-tree index consolidates data, increasing index paging efficiency and using less I/O for search queries.

Limitations

Star-tree indexes have the following limitations:

  • A star-tree index should only be enabled on indexes whose data is not updated or deleted because updates and deletions are not accounted for in a star-tree index.
  • A star-tree index can be used for aggregation queries only if the queried fields are a subset of the star-tree’s dimensions and the aggregated fields are a subset of the star-tree’s metrics.
  • A star-tree index cannot be disabled in place after it is enabled. To remove it, you must reindex the data into an index without the star-tree mapping. Changing a star-tree configuration also requires a reindex operation.
  • Multi-valued (array) fields are not supported.
  • Only limited queries and aggregations are supported. Support for more features will be added in future versions.
  • Dimensions should not have very high cardinality (for example, _id fields). Higher cardinality leads to increased storage usage and query latency.

Star-tree index structure

The following image illustrates a standard star-tree index structure.

Figure: A star-tree index containing two dimensions and two metrics

Sorted and aggregated star-tree documents are backed by doc_values in an index. The columnar data found in doc_values is stored using the following properties:

  • The values are sorted based on the fields set in the ordered_dimensions setting. In the preceding image, the values are sorted by status and then by port within each status.
  • For each unique dimension/value combination, the aggregated values for all the metrics, such as avg(size) and count(requests), are precomputed during ingestion.
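The precomputation described above can be sketched in a few lines of Python. This is only an illustrative model, not how OpenSearch implements it: the sample documents, the `build_star_tree_docs` helper, and the in-memory layout are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw log documents; the field names follow this page's examples.
raw_docs = [
    {"status": 200, "port": 8443, "size": 100},
    {"status": 200, "port": 8443, "size": 300},
    {"status": 200, "port": 5600, "size": 50},
    {"status": 500, "port": 5600, "size": 10},
]

ORDERED_DIMENSIONS = ["status", "port"]


def build_star_tree_docs(docs, dimensions, metric_field):
    """Collapse raw docs into one pre-aggregated row per dimension combination."""
    groups = defaultdict(lambda: {"sum": 0, "count": 0})
    for doc in docs:
        key = tuple(doc[d] for d in dimensions)
        groups[key]["sum"] += doc[metric_field]
        groups[key]["count"] += 1
    # Sort rows by the dimension order: by status, then by port within a status.
    return [(key, groups[key]) for key in sorted(groups)]


star_tree_docs = build_star_tree_docs(raw_docs, ORDERED_DIMENSIONS, "size")
for key, agg in star_tree_docs:
    print(key, agg, "avg =", agg["sum"] / agg["count"])
```

Storing the sum and the count, rather than the average itself, keeps the rows mergeable: an average over several rows can be derived later from the combined sum and count.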

Leaf nodes

Each node in a star-tree index points to a range of star-tree documents. Nodes can be further split into child nodes based on the max_leaf_docs configuration. The number of documents that a leaf node points to is less than or equal to the value set in max_leaf_docs. This ensures that deriving an aggregated value requires scanning at most max_leaf_docs star-tree documents, which provides predictable latency.
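As a rough illustration of the splitting rule, the following Python sketch splits a sorted range of star-tree documents on the next dimension until each leaf holds at most max_leaf_docs documents. The `build_node` and `leaf_sizes` helpers and the sample rows are hypothetical; the actual on-disk structure in OpenSearch differs.

```python
# Hypothetical pre-aggregated star-tree documents, already sorted by status, then port.
star_tree_docs = [
    {"status": 200, "port": 5600, "sum_size": 50},
    {"status": 200, "port": 8443, "sum_size": 400},
    {"status": 404, "port": 8443, "sum_size": 20},
    {"status": 500, "port": 5600, "sum_size": 10},
]
DIMENSIONS = ["status", "port"]


def build_node(docs, dimensions, depth, max_leaf_docs):
    """Split a range of docs on the next dimension until it is small enough."""
    # Stop when the range fits in a leaf or no dimensions remain. Dimension
    # combinations are unique, so a fully split range holds a single document.
    if len(docs) <= max_leaf_docs or depth == len(dimensions):
        return {"docs": docs, "children": {}}
    groups = {}
    for doc in docs:
        groups.setdefault(doc[dimensions[depth]], []).append(doc)
    children = {
        value: build_node(group, dimensions, depth + 1, max_leaf_docs)
        for value, group in groups.items()
    }
    return {"docs": docs, "children": children}


def leaf_sizes(node):
    """Return the number of documents that each leaf node points to."""
    if not node["children"]:
        return [len(node["docs"])]
    sizes = []
    for child in node["children"].values():
        sizes.extend(leaf_sizes(child))
    return sizes


tree = build_node(star_tree_docs, DIMENSIONS, 0, max_leaf_docs=2)
print(leaf_sizes(tree))
```

With max_leaf_docs set to 2, the status 200 range is not split further because it already contains only two documents. A smaller value produces a deeper tree with smaller leaves.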

Star nodes

A star node contains the aggregated data of all the other nodes for a particular dimension, acting as a “catch-all” node. When a star node is found in a dimension, that dimension is skipped during aggregation. This groups together all values of that dimension and allows a query to skip non-competitive nodes when fetching the aggregated value of a particular field.

The star-tree index structure diagram contains the following three examples demonstrating how a query behaves when retrieving aggregations from nodes in the star-tree:

  • Blue: In a terms query that searches for the average request size aggregation, the port equals 8443 and the status equals 200. Because the query contains values in both the status and port dimensions, the query traverses status node 200 and returns the aggregations from child node 8443.
  • Green: In a term query that searches for the count of requests aggregation, the status equals 200. Because the query only contains a value from the status dimension, the query traverses the 200 node’s child star node, which contains the aggregated value of all the port child nodes.
  • Red: In a term query that searches for the average request size aggregation, the port equals 5600. Because the query does not contain a value from the status dimension, the query traverses a star node and returns the aggregated result from the 5600 child node.
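The three traversals above can be mimicked with a toy nested dictionary in which the key "*" plays the role of a star node. The row values, the `build_tree` and `lookup` helpers, and the choice to materialize every star node are assumptions for illustration only; they are not taken from the diagram or from the OpenSearch implementation.

```python
STAR = "*"

# Hypothetical rows: (status, port, sum of request sizes, number of requests).
ROWS = [
    (200, 8443, 400, 2),
    (200, 5600, 50, 1),
    (500, 5600, 10, 1),
]


def build_tree(rows):
    """Build tree[status][port] = [size_sum, request_count], including star nodes."""
    tree = {}
    for status, port, size_sum, count in rows:
        # Each row contributes to its exact node and to every star-node variant.
        for s in (status, STAR):
            for p in (port, STAR):
                node = tree.setdefault(s, {}).setdefault(p, [0, 0])
                node[0] += size_sum
                node[1] += count
    return tree


def lookup(tree, status=None, port=None):
    """Return (size_sum, request_count); an unfiltered dimension follows the star node."""
    s = STAR if status is None else status
    p = STAR if port is None else port
    return tuple(tree[s][p])


tree = build_tree(ROWS)

blue_sum, blue_count = lookup(tree, status=200, port=8443)  # both dimensions filtered
green_sum, green_count = lookup(tree, status=200)           # port star node under status 200
red_sum, red_count = lookup(tree, port=5600)                # status star node, then port 5600

print("blue avg size:", blue_sum / blue_count)
print("green request count:", green_count)
print("red avg size:", red_sum / red_count)
```

In each case the answer comes from a single precomputed node, so no per-document scan is needed.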

Support for the Terms query will be added in a future version. For more information, see GitHub issue #15257.

Enabling a star-tree index

To use a star-tree index, modify the following settings:

  • Set the feature flag opensearch.experimental.feature.composite_index.star_tree.enabled to true. For more information about enabling and disabling feature flags, see Enabling experimental features.
  • Set the indices.composite_index.star_tree.enabled setting to true. For instructions on how to configure OpenSearch, see Configuring settings.
  • Set the index.composite_index index setting to true during index creation.
  • Ensure that the doc_values parameter is enabled for the dimensions and metrics fields used in your star-tree mapping.

Example mapping

In the following example, index mappings define the star-tree configuration. The star-tree index precomputes aggregations in the logs index. The aggregations are calculated on the size and latency fields for all the combinations of values indexed in the port and status fields:

PUT logs
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index.composite_index": true
  },
  "mappings": {
    "composite": {
      "request_aggs": {
        "type": "star_tree",
        "config": {
          "ordered_dimensions": [
            {
              "name": "status"
            },
            {
              "name": "port"
            }
          ],
          "metrics": [
            {
              "name": "size",
              "stats": [
                "sum"
              ]
            },
            {
              "name": "latency",
              "stats": [
                "avg"
              ]
            }
          ]
        }
      }
    },
    "properties": {
      "status": {
        "type": "integer"
      },
      "port": {
        "type": "integer"
      },
      "size": {
        "type": "integer"
      },
      "latency": {
        "type": "scaled_float",
        "scaling_factor": 10
      }
    }
  }
}

For detailed information about star-tree index mappings and parameters, see Star-tree field type.

Supported queries and aggregations

Star-tree indexes can be used to optimize queries and aggregations.

Supported queries

The following queries are supported as of OpenSearch 2.18:

To use a query with a star-tree index, the query’s fields must be present in the ordered_dimensions section of the star-tree configuration. Queries must also be paired with a supported aggregation.

Supported aggregations

The following metric aggregations are supported as of OpenSearch 2.18:

To use aggregations:

  • The fields must be present in the metrics section of the star-tree configuration.
  • The metric aggregation type must be part of the stats parameter.
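The eligibility rules for queries and aggregations can be summarized as a small check. The following Python sketch is a simplified model of those rules only: the `can_use_star_tree` function and the dictionary layout are hypothetical, the configuration mirrors the example mapping on this page, and the check does not model which query types are supported.

```python
# Simplified view of the example mapping: dimensions, and the stats stored per metric field.
STAR_TREE_CONFIG = {
    "ordered_dimensions": ["status", "port"],
    "metrics": {"size": {"sum"}, "latency": {"avg"}},
}


def can_use_star_tree(config, query_fields, aggregations):
    """Check the documented rules for a request.

    query_fields: names of the fields used in the query.
    aggregations: list of (stat, field) pairs, for example ("sum", "size").
    """
    # Every queried field must be one of the star-tree dimensions.
    if not set(query_fields) <= set(config["ordered_dimensions"]):
        return False
    # Queries must be paired with at least one aggregation.
    if not aggregations:
        return False
    # Each aggregated field must be a metric, and its stat must be configured.
    return all(
        stat in config["metrics"].get(field, set())
        for stat, field in aggregations
    )


print(can_use_star_tree(STAR_TREE_CONFIG, ["status"], [("sum", "size")]))
print(can_use_star_tree(STAR_TREE_CONFIG, ["host"], [("sum", "size")]))
print(can_use_star_tree(STAR_TREE_CONFIG, ["status"], [("avg", "size")]))
```

The last call fails the check because the example mapping stores only the sum stat for the size field.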

Aggregation example

The following example gets the sum of all the values in the size field for all error logs with status=500, using the example mapping:

POST /logs/_search
{
  "query": {
    "term": {
      "status": "500"
    }
  },
  "aggs": {
    "sum_size": {
      "sum": {
        "field": "size"
      }
    }
  }
}

With a star-tree index, the query traverses the status=500 node and retrieves the result from a single aggregated document instead of scanning all of the matching documents. This results in lower query latency.

Using queries without a star-tree index

Set the indices.composite_index.star_tree.enabled setting to false to run queries without using a star-tree index.
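One possible way to apply this setting is through the cluster settings API. The following Python sketch only builds the request and does not send it. It assumes that the setting can be updated dynamically as a persistent cluster setting and that the cluster is reachable at http://localhost:9200; both are assumptions rather than statements from this page, and `build_disable_request` is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_disable_request(base_url="http://localhost:9200"):
    """Build (but do not send) a cluster settings request that turns off star-tree search."""
    # Assumption: the setting is accepted as a persistent cluster setting.
    body = {
        "persistent": {
            "indices.composite_index.star_tree.enabled": False
        }
    }
    return Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_disable_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To send the request, pass it to urllib.request.urlopen and add whatever authentication your cluster requires.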