Hot shard identification

The hot shard identification root cause analysis (RCA) lets you identify hot shards within an index. A hot shard is an outlier that consumes more resources than other shards and can degrade indexing and search performance. The hot shard identification RCA monitors the following metrics:

  • CPU utilization
  • Heap allocation rate

Shards can become hot because of the nature of your workload. For example, when you use a _routing parameter or custom document IDs, one or several shards within the cluster may receive frequent updates and therefore consume more CPU and heap resources than other shards.

The hot shard identification RCA compares the CPU utilization and heap allocation rate of each shard against the corresponding threshold values. If either metric exceeds its threshold, the shard is considered hot.
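
The comparison can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the Performance Analyzer implementation: the function name `is_hot_shard` and its parameters are hypothetical, and the sample numbers are taken from the example response on this page.

```python
def is_hot_shard(cpu_usage, cpu_threshold, heap_alloc_rate, heap_threshold):
    """Return True if either metric exceeds its threshold (hypothetical helper)."""
    return cpu_usage > cpu_threshold or heap_alloc_rate > heap_threshold


# Values reported for shard 4 of index9 in the example response.
print(is_hot_shard(
    cpu_usage=0.034449630200405396,
    cpu_threshold=0.027397981341796683,
    heap_alloc_rate=10872119.748328414,
    heap_threshold=7605441.367010161,
))
```

Because the rule uses "or", a shard that stays under its CPU threshold is still flagged when its heap allocation rate alone is above the limit.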

For more information about the hot shard identification RCA implementation, see Hot Shard RCA.

Example request

The following query requests hot shard identification:

GET _plugins/_performanceanalyzer/rca?name=HotShardClusterRca
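
The same request can also be sent from a script. The following is a minimal sketch that uses only the Python standard library. It assumes that the RCA endpoint is reachable over plain HTTP at `localhost:9600`, which may not match your deployment, and the helper names `build_rca_url` and `fetch_hot_shard_rca` are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host, port, and scheme depend on your deployment.
DEFAULT_BASE_URL = "http://localhost:9600"


def build_rca_url(base_url=DEFAULT_BASE_URL, rca_name="HotShardClusterRca"):
    """Build the RCA query URL for the given RCA name."""
    query = urlencode({"name": rca_name})
    return f"{base_url}/_plugins/_performanceanalyzer/rca?{query}"


def fetch_hot_shard_rca(base_url=DEFAULT_BASE_URL, timeout=10):
    """Send the GET request and return the decoded JSON response."""
    with urlopen(build_rca_url(base_url), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_hot_shard_rca() against a live cluster.
    print(build_rca_url())
```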

Example response

The response contains a list of unhealthy shards:

{
  "HotShardClusterRca": [{
    "rca_name": "HotShardClusterRca",
    "timestamp": 1680721367563,
    "state": "unhealthy",
    "HotClusterSummary": [
      {
        "number_of_nodes": 3,
        "number_of_unhealthy_nodes": 1,
        "HotNodeSummary": [
          {
            "node_id": "7kosAbpASsqBoHmHkVXxmw",
            "host_address": "192.168.80.4",
            "HotResourceSummary": [
              {
                "resource_type": "cpu usage",
                "resource_metric": "cpu usage(num of cores)",
                "threshold": 0.027397981341796683,
                "value": 0.034449630200405396,
                "time_period_seconds": 60,
                "meta_data": "ssZw1WRUSHS5DZCW73BOJQ index9 4"
              },
              {
                "resource_type": "heap",
                "resource_metric": "heap alloc rate(heap alloc rate in bytes per second)",
                "threshold": 7605441.367010161,
                "value": 10872119.748328414,
                "time_period_seconds": 60,
                "meta_data": "ssZw1WRUSHS5DZCW73BOJQ index9 4"
              },
              {
                "resource_type": "heap",
                "resource_metric": "heap alloc rate(heap alloc rate in bytes per second)",
                "threshold": 7605441.367010161,
                "value": 8019622.354388569,
                "time_period_seconds": 60,
                "meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0"
              }
            ]
          }
        ]
      }
    ]
  }]
}
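
The nested summaries in the response can be flattened into one row per hot resource. The following sketch assumes that the response has been parsed into a Python dictionary with a top-level `HotShardClusterRca` key. The function name `list_hot_resources` is hypothetical, and the sample data is an abbreviated copy of the example response.

```python
def list_hot_resources(response):
    """Flatten the nested summaries into one row per hot resource."""
    rows = []
    for rca in response.get("HotShardClusterRca", []):
        for cluster in rca.get("HotClusterSummary", []):
            for node in cluster.get("HotNodeSummary", []):
                for resource in node.get("HotResourceSummary", []):
                    rows.append({
                        "node_id": node.get("node_id"),
                        "resource_type": resource.get("resource_type"),
                        "value": resource.get("value"),
                        "threshold": resource.get("threshold"),
                        "meta_data": resource.get("meta_data"),
                    })
    return rows


# Abbreviated sample based on the example response.
sample = {
    "HotShardClusterRca": [{
        "rca_name": "HotShardClusterRca",
        "state": "unhealthy",
        "HotClusterSummary": [{
            "number_of_nodes": 3,
            "number_of_unhealthy_nodes": 1,
            "HotNodeSummary": [{
                "node_id": "7kosAbpASsqBoHmHkVXxmw",
                "HotResourceSummary": [
                    {"resource_type": "cpu usage",
                     "threshold": 0.027397981341796683,
                     "value": 0.034449630200405396,
                     "meta_data": "ssZw1WRUSHS5DZCW73BOJQ index9 4"},
                    {"resource_type": "heap",
                     "threshold": 7605441.367010161,
                     "value": 8019622.354388569,
                     "meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0"},
                ],
            }],
        }],
    }]
}

for row in list_hot_resources(sample):
    print(row["meta_data"], row["resource_type"])
```

Using `.get()` with an empty-list default keeps the loop from failing when a summary level is absent, for example when the cluster is healthy.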

Response fields

The following table lists the response fields.

| Field | Type | Description |
| --- | --- | --- |
| rca_name | String | The name of the RCA. In this case, “HotShardClusterRca”. |
| timestamp | Integer | The timestamp of the RCA, in milliseconds since the Unix epoch. |
| state | String | The state of the cluster determined by the RCA. The state can be healthy, unhealthy, or unknown. |
| HotClusterSummary.number_of_nodes | Integer | The number of nodes in the cluster. |
| HotClusterSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an unhealthy state. |
| HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | String | The type of resource causing the unhealthy state, either “cpu usage” or “heap”. |
| HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either “cpu usage(num of cores)” or “heap alloc rate(heap alloc rate in bytes per second)”. |
| HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines whether a resource is contended. |
| HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource. |
| HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Integer | The amount of time, in seconds, that a shard was monitored before its state was declared to be healthy or unhealthy. |
| HotClusterSummary.HotNodeSummary.HotResourceSummary.meta_data | String | The metadata associated with the resource_type. |

In the preceding example response, the meta_data value of the last HotResourceSummary entry is QRF4rBM7SNCDr1g3KU6HyA index9 0. The meta_data string consists of three space-separated fields:

  • Node name: QRF4rBM7SNCDr1g3KU6HyA
  • Index name: index9
  • Shard ID: 0

This means that shard 0 of index index9 on node QRF4rBM7SNCDr1g3KU6HyA is hot.
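
Splitting the string into these three fields can be sketched as follows. The `ShardRef` type and the `parse_meta_data` function are hypothetical names introduced for illustration, and the sketch assumes that the fields are always separated by whitespace, as in the example.

```python
from typing import NamedTuple


class ShardRef(NamedTuple):
    """The three fields carried by a meta_data string (illustrative type)."""
    node: str
    index: str
    shard_id: int


def parse_meta_data(meta_data):
    """Split a meta_data string into node, index name, and shard ID."""
    parts = meta_data.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {meta_data!r}")
    node, index, shard = parts
    return ShardRef(node, index, int(shard))


ref = parse_meta_data("QRF4rBM7SNCDr1g3KU6HyA index9 0")
print(f"Shard {ref.shard_id} of index {ref.index} on node {ref.node} is hot.")
```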