Introduced 2.18

Workload management

Workload management allows you to group search traffic and isolate network resources, preventing the overuse of network resources by specific requests. It offers the following benefits:

  • Tenant-level admission control and reactive query management. When resource usage exceeds configured limits, it automatically identifies and cancels demanding queries, ensuring fair resource distribution.

  • Tenant-level isolation within the cluster for search workloads, operating at the node level.

Installing workload management

To install workload management, use the following command:

  ./bin/opensearch-plugin install workload-management


Query groups

A query group is a logical grouping of tasks with defined resource limits. System administrators can dynamically manage query groups using the Workload Management APIs. These query groups can be used to create search requests with resource limits.

Permissions

Only users with administrator-level permissions can create and update query groups using the Workload Management APIs.

Operating modes

The following operating modes determine the operating level for a query group:

  • Disabled mode: Workload management is disabled.

  • Enabled mode: Workload management is enabled and will cancel and reject queries once the query group’s configured thresholds are reached.

  • Monitor_only mode (Default): Workload management will monitor tasks but will not cancel or reject any queries.
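
As a minimal sketch, the operating mode could be switched through the wlm.query_group.mode setting (listed under Workload management settings) using the _cluster/settings API. The lowercase mode strings and the use of the persistent scope are assumptions, and mode_settings_body is a hypothetical helper; confirm the accepted values for your OpenSearch version before sending the request.

```python
import json

# Assumed setting values, derived from the mode names listed above.
ALLOWED_MODES = ("disabled", "enabled", "monitor_only")


def mode_settings_body(mode: str) -> str:
    """Return a JSON body for PUT _cluster/settings that sets the operating mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {ALLOWED_MODES}")
    # The "persistent" scope is an assumption; "transient" is the other standard scope.
    return json.dumps({"persistent": {"wlm.query_group.mode": mode}}, indent=2)


# Print the body that would be sent with PUT _cluster/settings.
print(mode_settings_body("enabled"))
```

Rejecting unknown mode names before the request is built keeps a typo from reaching the cluster.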

Example request

The following example request adds a query group named analytics:

  PUT _wlm/query_group
  {
    "name": "analytics",
    "resiliency_mode": "enforced",
    "resource_limits": {
      "cpu": 0.4,
      "memory": 0.2
    }
  }


When creating a query group, make sure that the sum of the limits for a single resource, such as cpu or memory, across all query groups does not exceed 1.
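
The check described above can be sketched client-side before a new query group is submitted. This is only an illustration: it assumes that limits are summed across all existing query groups and that 1.0 represents the full capacity for each resource. The names remaining_capacity and fits are hypothetical helpers, not part of any OpenSearch API.

```python
from typing import Dict, Iterable

RESOURCES = ("cpu", "memory")


def remaining_capacity(existing_limits: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Return the unallocated share of each resource (1.0 means nothing is allocated)."""
    remaining = {name: 1.0 for name in RESOURCES}
    for limits in existing_limits:
        for name in RESOURCES:
            remaining[name] -= limits.get(name, 0.0)
    return remaining


def fits(existing_limits: Iterable[Dict[str, float]], new_limits: Dict[str, float]) -> bool:
    """Return True if adding new_limits keeps every per-resource sum at or below 1."""
    remaining = remaining_capacity(existing_limits)
    # A small tolerance absorbs floating-point rounding in the subtraction.
    return all(new_limits.get(name, 0.0) <= remaining[name] + 1e-9 for name in RESOURCES)


# The analytics group from the example request is already defined.
existing = [{"cpu": 0.4, "memory": 0.2}]
print(fits(existing, {"cpu": 0.5, "memory": 0.3}))  # True
print(fits(existing, {"cpu": 0.7}))                 # False
```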

Example response

OpenSearch responds with the set resource limits and the _id for the query group:

  {
    "_id": "preXpc67RbKKeCyka72_Gw",
    "name": "analytics",
    "resiliency_mode": "enforced",
    "resource_limits": {
      "cpu": 0.4,
      "memory": 0.2
    },
    "updated_at": 1726270184642
  }

Using queryGroupId

You can associate a search request with a queryGroupId so that the request runs within the resource limits defined by that query group. Requests that carry this ID are routed and tracked as part of the query group, which keeps its resource quotas and task limits enforced.

The following example query uses the queryGroupId to ensure that the query does not exceed that query group’s resource limits:

  GET testindex/_search
  Host: localhost:9200
  Content-Type: application/json
  queryGroupId: preXpc67RbKKeCyka72_Gw
  {
    "query": {
      "match": {
        "field_name": "value"
      }
    }
  }
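
The same request could be built from a client. The following sketch uses only the Python standard library and constructs the request without sending it. The http scheme and the helper name build_search_request are assumptions; the host, index, header name, and query come from the example above.

```python
import json
import urllib.request


def build_search_request(base_url: str, index: str, query_group_id: str,
                         query: dict) -> urllib.request.Request:
    """Build a search request that carries the query group ID as an HTTP header."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        method="GET",
        headers={
            "Content-Type": "application/json",
            # urllib stores header names capitalized ("Querygroupid");
            # HTTP header names are case-insensitive.
            "queryGroupId": query_group_id,
        },
    )


request = build_search_request(
    "http://localhost:9200",  # scheme is an assumption
    "testindex",
    "preXpc67RbKKeCyka72_Gw",
    {"match": {"field_name": "value"}},
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs without a cluster.
print(request.get_method(), request.full_url)
```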


Workload management settings

The following settings can be used to customize workload management using the _cluster/settings API.

Setting name | Description
wlm.query_group.duress_streak | Determines the node duress threshold. Once the threshold is reached, the node is marked as in duress.
wlm.query_group.enforcement_interval | Defines the monitoring interval.
wlm.query_group.mode | Defines the operating mode.
wlm.query_group.node.memory_rejection_threshold | Defines the query group level memory threshold. When the threshold is reached, the request is rejected.
wlm.query_group.node.cpu_rejection_threshold | Defines the query group level cpu threshold. When the threshold is reached, the request is rejected.
wlm.query_group.node.memory_cancellation_threshold | Controls whether the node is considered to be in duress when the memory threshold is reached. Requests routed to nodes in duress are canceled.
wlm.query_group.node.cpu_cancellation_threshold | Controls whether the node is considered to be in duress when the cpu threshold is reached. Requests routed to nodes in duress are canceled.

When setting rejection and cancellation thresholds, remember that the rejection threshold for a resource should always be lower than the cancellation threshold.
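
The ordering rule above can be checked before settings are applied. The following sketch is a hypothetical client-side helper (threshold_errors is not an OpenSearch API), and the threshold values are illustrative only.

```python
from typing import Dict, List

PREFIX = "wlm.query_group.node."


def threshold_errors(settings: Dict[str, float]) -> List[str]:
    """Return one message per resource whose rejection threshold is not below its cancellation threshold."""
    errors = []
    for resource in ("cpu", "memory"):
        rejection = settings.get(f"{PREFIX}{resource}_rejection_threshold")
        cancellation = settings.get(f"{PREFIX}{resource}_cancellation_threshold")
        if rejection is None or cancellation is None:
            continue  # Only compare when both thresholds are being set.
        if rejection >= cancellation:
            errors.append(
                f"{resource}: rejection threshold {rejection} must be lower than "
                f"cancellation threshold {cancellation}"
            )
    return errors


# Illustrative values, not recommendations.
proposed = {
    PREFIX + "cpu_rejection_threshold": 0.8,
    PREFIX + "cpu_cancellation_threshold": 0.9,
    PREFIX + "memory_rejection_threshold": 0.95,
    PREFIX + "memory_cancellation_threshold": 0.9,
}
for message in threshold_errors(proposed):
    print(message)
```

With these values, only the memory pair is reported because its rejection threshold is higher than its cancellation threshold.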

Workload Management Stats API

The Workload Management Stats API returns workload management metrics for a query group, using the following method:

  GET _wlm/stats


Example response

  {
    "_nodes": {
      "total": 1,
      "successful": 1,
      "failed": 0
    },
    "cluster_name": "XXXXXXYYYYYYYY",
    "A3L9EfBIQf2anrrUhh_goA": {
      "query_groups": {
        "16YGxFlPRdqIO7K4EACJlw": {
          "total_completions": 33570,
          "total_rejections": 0,
          "total_cancellations": 0,
          "cpu": {
            "current_usage": 0.03319935314357281,
            "cancellations": 0,
            "rejections": 0
          },
          "memory": {
            "current_usage": 0.002306486276211217,
            "cancellations": 0,
            "rejections": 0
          }
        },
        "DEFAULT_QUERY_GROUP": {
          "total_completions": 42572,
          "total_rejections": 0,
          "total_cancellations": 0,
          "cpu": {
            "current_usage": 0,
            "cancellations": 0,
            "rejections": 0
          },
          "memory": {
            "current_usage": 0,
            "cancellations": 0,
            "rejections": 0
          }
        }
      }
    }
  }


Response body fields

Field name | Description
total_completions | The total number of request completions in the query_group at the given node. This includes all shard-level and coordinator-level requests.
total_rejections | The total number of request rejections in the query_group at the given node. This includes all shard-level and coordinator-level requests.
total_cancellations | The total number of cancellations in the query_group at the given node. This includes all shard-level and coordinator-level requests.
cpu | The cpu resource type statistics for the query_group.
memory | The memory resource type statistics for the query_group.

Resource type statistics

Field name | Description
current_usage | The resource usage for the query_group at the given node based on the last run of the monitoring thread. This value is updated based on the wlm.query_group.enforcement_interval.
cancellations | The number of cancellations resulting from the cancellation threshold being reached.
rejections | The number of rejections resulting from the rejection threshold being reached.
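
The fields above can be read programmatically from a stats response. The following sketch assumes the response shape shown in the example response, in which node IDs appear as top-level keys next to _nodes and cluster_name. The function name query_group_rows is hypothetical, and the sample is trimmed from the example.

```python
from typing import Any, Dict, List, Tuple


def query_group_rows(stats: Dict[str, Any]) -> List[Tuple[str, str, int, int, int]]:
    """Flatten a stats response into (node, group, completions, rejections, cancellations) rows."""
    rows = []
    for node_id, node in stats.items():
        # Top-level keys such as "_nodes" and "cluster_name" are not node entries.
        if not isinstance(node, dict) or "query_groups" not in node:
            continue
        for group_id, group in node["query_groups"].items():
            rows.append((
                node_id,
                group_id,
                group["total_completions"],
                group["total_rejections"],
                group["total_cancellations"],
            ))
    return rows


# Sample trimmed from the example response; the cpu and memory objects are omitted.
sample = {
    "_nodes": {"total": 1, "successful": 1, "failed": 0},
    "cluster_name": "XXXXXXYYYYYYYY",
    "A3L9EfBIQf2anrrUhh_goA": {
        "query_groups": {
            "16YGxFlPRdqIO7K4EACJlw": {
                "total_completions": 33570,
                "total_rejections": 0,
                "total_cancellations": 0,
            },
            "DEFAULT_QUERY_GROUP": {
                "total_completions": 42572,
                "total_rejections": 0,
                "total_cancellations": 0,
            },
        }
    },
}
for row in query_group_rows(sample):
    print(row)
```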