Grouping top N queries

Introduced 2.17

Monitoring the top N queries can help you to identify the most resource-intensive queries based on latency, CPU, and memory usage in a specified time window. However, if a single computationally expensive query is executed multiple times, it can occupy all top N query slots, potentially preventing other expensive queries from appearing in the list. To address this issue, you can group similar queries, gaining insight into various high-impact query groups.
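
The crowding effect described above can be illustrated with a small Python sketch. The record layout (a shape label plus a latency in milliseconds) and both function names are hypothetical, and ranking groups by average latency is an assumption of this sketch. It does not reflect how OpenSearch stores or ranks query records internally.

```python
import heapq
from collections import defaultdict

# Hypothetical records: (query shape label, latency in ms).
records = [
    ("match_all", 90), ("match_all", 85), ("match_all", 80),
    ("term", 40), ("range", 30),
]

def top_n(records, n):
    """Return the n individual records with the highest latency."""
    return heapq.nlargest(n, records, key=lambda record: record[1])

def top_n_groups(records, n):
    """Group records by shape and return the n groups with the highest average latency."""
    totals = defaultdict(lambda: [0, 0])  # shape -> [total latency, query count]
    for shape, latency in records:
        totals[shape][0] += latency
        totals[shape][1] += 1
    averages = {shape: total / count for shape, (total, count) in totals.items()}
    return heapq.nlargest(n, averages.items(), key=lambda item: item[1])

print(top_n(records, 3))         # every slot is taken by the repeated match_all query
print(top_n_groups(records, 3))  # one slot per distinct query shape
```

Without grouping, the repeated query fills all three slots. With grouping, the same three slots surface three different query shapes.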

Starting with OpenSearch 2.17, the top N queries can be grouped by similarity. Additional grouping options are planned for future releases.

Grouping queries by similarity

Grouping queries by similarity organizes them based on the query structure, removing everything except the core query operations.

For example, consider the following query:

  {
    "query": {
      "bool": {
        "must": [
          { "exists": { "field": "field1" } },
          { "query_string": { "query": "search query" } }
        ]
      }
    }
  }

This query has the following corresponding query structure:

  bool
    must
      exists
      query_string

When queries share the same query structure, they are grouped together, ensuring that all similar queries belong to the same group.
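
The idea of reducing a query to its structure can be sketched in Python as follows. This is a simplified illustration, not the algorithm OpenSearch uses: the function names are hypothetical, only the bool compound query is expanded, and every other query type is treated as a leaf whose body is ignored.

```python
BOOL_CLAUSES = ("must", "filter", "should", "must_not")

def query_structure(query, depth=0):
    """Return indented lines naming each query operation; values are ignored."""
    lines = []
    for operation, body in query.items():
        lines.append("  " * depth + operation)
        if operation != "bool":
            continue  # leaf query: keep only the operation name
        for clause in BOOL_CLAUSES:
            if clause not in body:
                continue
            lines.append("  " * (depth + 1) + clause)
            children = body[clause]
            if isinstance(children, dict):  # a clause may hold a single query object
                children = [children]
            for child in children:
                lines.extend(query_structure(child, depth + 2))
    return lines

def same_group(query_a, query_b):
    """Two queries fall into the same group when their structures match."""
    return query_structure(query_a) == query_structure(query_b)

a = {"bool": {"must": [{"term": {"user": "alice"}}]}}
b = {"bool": {"must": [{"term": {"user": "bob"}}]}}
c = {"bool": {"filter": [{"term": {"user": "alice"}}]}}

print("\n".join(query_structure(a)))
print(same_group(a, b))  # True: only the search value differs
print(same_group(a, c))  # False: the bool clause differs
```

Queries a and b differ only in the value being searched, so they reduce to the same structure. Query c uses a different clause and therefore lands in a separate group.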

Configuring the query structure

The preceding example query shows a simplified query structure. By default, the query structure also includes field names and field data types.

For example, consider an index index1 with the following field mapping:

  "mappings": {
    "properties": {
      "field1": {
        "type": "keyword"
      },
      "field2": {
        "type": "text"
      },
      "field3": {
        "type": "text"
      },
      "field4": {
        "type": "long"
      }
    }
  }

If you run the following query on this index:

  {
    "query": {
      "bool": {
        "must": [
          {
            "term": {
              "field1": "example_value"
            }
          }
        ],
        "filter": [
          {
            "match": {
              "field2": "search_text"
            }
          },
          {
            "range": {
              "field4": {
                "gte": 1,
                "lte": 100
              }
            }
          }
        ],
        "should": [
          {
            "regexp": {
              "field3": ".*"
            }
          }
        ]
      }
    }
  }

Then the query has the following corresponding query structure:

  bool []
    must:
      term [field1, keyword]
    filter:
      match [field2, text]
      range [field4, long]
    should:
      regexp [field3, text]

To exclude field names and field data types from the query structure, configure the following settings:

  PUT _cluster/settings
  {
    "persistent" : {
      "search.insights.top_queries.grouping.attributes.field_name" : false,
      "search.insights.top_queries.grouping.attributes.field_type" : false
    }
  }
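
To make the effect of these two settings concrete, the following Python sketch labels a single leaf query with and without its field name and field type. The describe_leaf function and its field_name and field_type parameters are hypothetical names that mirror the settings above. The sketch also assumes that the leaf query body is keyed by the field name, which holds for term, match, range, and regexp but not for every query type.

```python
def describe_leaf(operation, body, mappings, field_name=True, field_type=True):
    """Describe a leaf query, optionally adding its field name and mapped type."""
    field = next(iter(body))  # assumption: the query body is keyed by the field name
    attributes = []
    if field_name:
        attributes.append(field)
    if field_type:
        attributes.append(mappings["properties"][field]["type"])
    return f"{operation} [{', '.join(attributes)}]" if attributes else operation

mappings = {
    "properties": {
        "field1": {"type": "keyword"},
        "field4": {"type": "long"},
    }
}

# Default behavior: field name and field type are part of the structure.
print(describe_leaf("term", {"field1": "example_value"}, mappings))
# prints: term [field1, keyword]

# Both attributes excluded: only the operation name remains.
print(describe_leaf("range", {"field4": {"gte": 1, "lte": 100}}, mappings,
                    field_name=False, field_type=False))
# prints: range
```

With both attributes excluded, queries that use the same operations on different fields collapse into the same group, so fewer and broader groups are produced.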


Aggregate metrics per group

In addition to retrieving latency, CPU, and memory metrics for individual top N queries, you can obtain aggregate statistics for the top N query groups. For each query group, the response includes the following statistics:

  • The total latency, CPU usage, or memory usage (depending on the configured metric type)
  • The total query count

Using these statistics, you can calculate the average latency, CPU usage, or memory usage for each query group. The response also includes one example query from the query group.
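
As a minimal sketch of that calculation, the following Python function divides the total by the query count. It assumes that the number field holds the group total and the count field holds the number of queries, as described in the response body fields on this page. The function name is hypothetical.

```python
def average_measurement(measurement):
    """Return total / count for one aggregate measurement, or None if the count is zero."""
    count = measurement["count"]
    if count == 0:
        return None
    return measurement["number"] / count

latency = {"number": 160, "count": 10, "aggregationType": "AVERAGE"}
print(average_measurement(latency))  # 16.0
```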

Configuring query grouping

Before you enable query grouping, you must enable top N query monitoring for a metric type of your choice. For more information, see Configuring top N query monitoring.

To configure grouping for top N queries, use the following steps.

Step 1: Enable top N query monitoring

Ensure that top N query monitoring is enabled for at least one of the following metrics: latency, CPU, or memory.

For example, to enable top N query monitoring by latency with the default settings, send the following request:

  PUT _cluster/settings
  {
    "persistent" : {
      "search.insights.top_queries.latency.enabled" : true
    }
  }


Step 2: Configure query grouping

Set the desired grouping method by updating the following cluster setting:

  PUT _cluster/settings
  {
    "persistent" : {
      "search.insights.top_queries.group_by" : "similarity"
    }
  }


The default value for the group_by setting is none, which disables grouping. As of OpenSearch 2.17, the supported values for group_by are similarity and none.

Step 3 (Optional): Limit the number of monitored query groups

Optionally, you can limit the number of monitored query groups. Queries already included in the top N query list (the most resource-intensive queries) do not count toward this limit. The maximum applies only to the remaining query groups, which are tracked separately from the top N queries. Use this setting to keep query group tracking manageable for your workload and query window size.

To limit tracking to 100 query groups, send the following request:

  PUT _cluster/settings
  {
    "persistent" : {
      "search.insights.top_queries.max_groups_excluding_topn" : 100
    }
  }


The default value for max_groups_excluding_topn is 100, and you can set it to any value between 0 and 10,000, inclusive.
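
If you generate these requests from a script, it can help to validate the values before sending them. The following Python sketch builds the request body for the two grouping settings and rejects values outside the ranges stated on this page. The grouping_settings function and its parameter names are hypothetical, and the sketch only builds the body. Sending it to the cluster is left to the HTTP client of your choice.

```python
import json

SUPPORTED_GROUP_BY = {"similarity", "none"}  # values listed for OpenSearch 2.17
MAX_GROUPS_RANGE = (0, 10_000)               # inclusive bounds stated above

def grouping_settings(group_by="similarity", max_groups=100):
    """Build a PUT _cluster/settings body after validating both values."""
    if group_by not in SUPPORTED_GROUP_BY:
        raise ValueError(f"unsupported group_by value: {group_by!r}")
    low, high = MAX_GROUPS_RANGE
    if not low <= max_groups <= high:
        raise ValueError(f"max_groups must be between {low} and {high}")
    return {
        "persistent": {
            "search.insights.top_queries.group_by": group_by,
            "search.insights.top_queries.max_groups_excluding_topn": max_groups,
        }
    }

print(json.dumps(grouping_settings(max_groups=500), indent=2))
```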

Monitoring query groups

To view the top N query groups, send the following request:

  GET /_insights/top_queries


The response contains the top N query groups:

Response

  {
    "top_queries": [
      {
        "timestamp": 1725495127359,
        "source": {
          "query": {
            "match_all": {
              "boost": 1.0
            }
          }
        },
        "phase_latency_map": {
          "expand": 0,
          "query": 55,
          "fetch": 3
        },
        "total_shards": 1,
        "node_id": "ZbINz1KFS1OPeFmN-n5rdg",
        "query_hashcode": "b4c4f69290df756021ca6276be5cbb75",
        "task_resource_usages": [
          {
            "action": "indices:data/read/search[phase/query]",
            "taskId": 30,
            "parentTaskId": 29,
            "nodeId": "ZbINz1KFS1OPeFmN-n5rdg",
            "taskResourceUsage": {
              "cpu_time_in_nanos": 33249000,
              "memory_in_bytes": 2896848
            }
          },
          {
            "action": "indices:data/read/search",
            "taskId": 29,
            "parentTaskId": -1,
            "nodeId": "ZbINz1KFS1OPeFmN-n5rdg",
            "taskResourceUsage": {
              "cpu_time_in_nanos": 3151000,
              "memory_in_bytes": 133936
            }
          }
        ],
        "indices": [
          "my_index"
        ],
        "labels": {},
        "search_type": "query_then_fetch",
        "measurements": {
          "latency": {
            "number": 160,
            "count": 10,
            "aggregationType": "AVERAGE"
          }
        }
      },
      {
        "timestamp": 1725495135160,
        "source": {
          "query": {
            "term": {
              "content": {
                "value": "first",
                "boost": 1.0
              }
            }
          }
        },
        "phase_latency_map": {
          "expand": 0,
          "query": 18,
          "fetch": 0
        },
        "total_shards": 1,
        "node_id": "ZbINz1KFS1OPeFmN-n5rdg",
        "query_hashcode": "c3620cc3d4df30fb3f95aeb2167289a4",
        "task_resource_usages": [
          {
            "action": "indices:data/read/search[phase/query]",
            "taskId": 50,
            "parentTaskId": 49,
            "nodeId": "ZbINz1KFS1OPeFmN-n5rdg",
            "taskResourceUsage": {
              "cpu_time_in_nanos": 10188000,
              "memory_in_bytes": 288136
            }
          },
          {
            "action": "indices:data/read/search",
            "taskId": 49,
            "parentTaskId": -1,
            "nodeId": "ZbINz1KFS1OPeFmN-n5rdg",
            "taskResourceUsage": {
              "cpu_time_in_nanos": 262000,
              "memory_in_bytes": 3216
            }
          }
        ],
        "indices": [
          "my_index"
        ],
        "labels": {},
        "search_type": "query_then_fetch",
        "measurements": {
          "latency": {
            "number": 109,
            "count": 7,
            "aggregationType": "AVERAGE"
          }
        }
      },
      {
        "timestamp": 1725495139766,
        "source": {
          "query": {
            "match": {
              "content": {
                "query": "first",
                "operator": "OR",
                "prefix_length": 0,
                "max_expansions": 50,
                "fuzzy_transpositions": true,
                "lenient": false,
                "zero_terms_query": "NONE",
                "auto_generate_synonyms_phrase_query": true,
                "boost": 1.0
              }
            }
          }
        },
        "phase_latency_map": {
          "expand": 0,
          "query": 15,
          "fetch": 0
        },
        "total_shards": 1,
        "node_id": "ZbINz1KFS1OPeFmN-n5rdg",
        "query_hashcode": "484eaabecd13db65216b9e2ff5eee999",
        "task_resource_usages": [
          {
            "action": "indices:data/read/search[phase/query]",
            "taskId": 64,
            "parentTaskId": 63,
            "nodeId": "ZbINz1KFS1OPeFmN-n5rdg",
            "taskResourceUsage": {
              "cpu_time_in_nanos": 12161000,
              "memory_in_bytes": 473456
            }
          },
          {
            "action": "indices:data/read/search",
            "taskId": 63,
            "parentTaskId": -1,
            "nodeId": "ZbINz1KFS1OPeFmN-n5rdg",
            "taskResourceUsage": {
              "cpu_time_in_nanos": 293000,
              "memory_in_bytes": 3216
            }
          }
        ],
        "indices": [
          "my_index"
        ],
        "labels": {},
        "search_type": "query_then_fetch",
        "measurements": {
          "latency": {
            "number": 43,
            "count": 3,
            "aggregationType": "AVERAGE"
          }
        }
      }
    ]
  }

Response body fields

The response includes the following fields.

| Field | Data type | Description |
| --- | --- | --- |
| top_queries | Array | The list of top query groups. |
| top_queries.timestamp | Integer | The execution timestamp for the first query in the query group. |
| top_queries.source | Object | The first query in the query group. |
| top_queries.phase_latency_map | Object | The phase latency map for the first query in the query group. The map includes the amount of time, in milliseconds, that the query spent in the expand, query, and fetch phases. |
| top_queries.total_shards | Integer | The number of shards on which the first query was executed. |
| top_queries.node_id | String | The node ID of the node that coordinated the execution of the first query in the query group. |
| top_queries.query_hashcode | String | The hash code that uniquely identifies the query group. This is essentially the hash of the query structure. |
| top_queries.task_resource_usages | Array of objects | The resource usage breakdown for the various tasks belonging to the first query in the query group. |
| top_queries.indices | Array | The indexes searched by the first query in the query group. |
| top_queries.labels | Object | Used to label the top query. |
| top_queries.search_type | String | The search request execution type (query_then_fetch or dfs_query_then_fetch). For more information, see the search_type parameter in the Search API documentation. |
| top_queries.measurements | Object | The aggregate measurements for the query group. |
| top_queries.measurements.latency | Object | The aggregate latency measurements for the query group. |
| top_queries.measurements.latency.number | Integer | The total latency for the query group. |
| top_queries.measurements.latency.count | Integer | The number of queries in the query group. |
| top_queries.measurements.latency.aggregationType | String | The aggregation type for the current entry. If grouping by similarity is enabled, then aggregationType is AVERAGE. If it is not enabled, then aggregationType is NONE. |
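
The following Python sketch shows one way to walk these fields and produce a compact summary per query group. It operates on a trimmed-down stand-in for the response, reusing values from the example above. The summarize_groups function is a hypothetical helper, and treating number as the group total follows the field descriptions in the table. Note that task_resource_usages describes only the group's example query, so the CPU figure is not a group aggregate.

```python
def summarize_groups(response, metric="latency"):
    """Return one summary row per query group, highest average first."""
    rows = []
    for group in response["top_queries"]:
        measurement = group["measurements"][metric]
        # CPU time is reported per task for the group's example query only.
        example_cpu = sum(
            task["taskResourceUsage"]["cpu_time_in_nanos"]
            for task in group.get("task_resource_usages", [])
        )
        rows.append({
            "query_hashcode": group["query_hashcode"],
            "count": measurement["count"],
            "average": measurement["number"] / measurement["count"],
            "example_cpu_nanos": example_cpu,
        })
    return sorted(rows, key=lambda row: row["average"], reverse=True)

# Trimmed-down stand-in for a GET /_insights/top_queries response.
response = {
    "top_queries": [
        {
            "query_hashcode": "484eaabecd13db65216b9e2ff5eee999",
            "task_resource_usages": [
                {"taskResourceUsage": {"cpu_time_in_nanos": 12161000, "memory_in_bytes": 473456}},
                {"taskResourceUsage": {"cpu_time_in_nanos": 293000, "memory_in_bytes": 3216}},
            ],
            "measurements": {"latency": {"number": 43, "count": 3, "aggregationType": "AVERAGE"}},
        },
        {
            "query_hashcode": "b4c4f69290df756021ca6276be5cbb75",
            "task_resource_usages": [
                {"taskResourceUsage": {"cpu_time_in_nanos": 33249000, "memory_in_bytes": 2896848}},
                {"taskResourceUsage": {"cpu_time_in_nanos": 3151000, "memory_in_bytes": 133936}},
            ],
            "measurements": {"latency": {"number": 160, "count": 10, "aggregationType": "AVERAGE"}},
        },
    ]
}

for row in summarize_groups(response):
    print(row)
```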