Health API

New API reference

For the most up-to-date API details, refer to Cluster health APIs.

An API that reports the health status of an Elasticsearch cluster.

Request

GET /_health_report

GET /_health_report/<indicator>

Prerequisites

  • If the Elasticsearch security features are enabled, you must have the monitor or manage cluster privilege to use this API.

Description

The health API returns a report with the health status of an Elasticsearch cluster. The report contains a list of indicators that compose Elasticsearch functionality.

Each indicator has a health status of: green, unknown, yellow or red. The indicator will provide an explanation and metadata describing the reason for its current health status.

The cluster’s status is determined by the worst indicator status.
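The rule that the worst indicator status decides the cluster status can be sketched as follows. This is a minimal illustration, not the server's implementation: the ranking green < unknown < yellow < red is an assumption taken from the order in which this page lists the statuses, and `worst_status` is a hypothetical helper name.

```python
# Assumed severity ranking (least to most severe), mirroring the order
# in which this page lists the statuses.
SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


def worst_status(indicators):
    """Return the most severe status among the indicator results."""
    statuses = [result["status"] for result in indicators.values()]
    # An empty report has nothing to aggregate; treat it as unknown.
    if not statuses:
        return "unknown"
    return max(statuses, key=SEVERITY.__getitem__)


# Made-up indicator results for illustration only.
example = {
    "master_is_stable": {"status": "green"},
    "disk": {"status": "yellow"},
    "slm": {"status": "green"},
}
print(worst_status(example))  # yellow
```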

If an indicator’s status is non-green, the indicator result may include a list of impacts that detail the functionality negatively affected by the health issue. Each impact carries a severity level, the area of the system that is affected, and a short description of the impact on the system.

Some health indicators can determine the root cause of a health problem and prescribe a set of steps that can be performed in order to improve the health of the system. The root cause and remediation steps are encapsulated in a diagnosis. A diagnosis contains a cause detailing a root cause analysis, an action containing a brief description of the steps to take to fix the problem, the list of affected resources (if applicable), and a detailed step-by-step troubleshooting guide to fix the diagnosed problem.

The health indicators perform root cause analysis of non-green health statuses. This can be computationally expensive when called frequently. When setting up automated polling of the API for health status, set verbose to false to disable the more expensive analysis logic.
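A polling loop along these lines might look like the sketch below. The helper names `poll_health` and `fake_fetch` are hypothetical, and the sketch assumes the response can be read like a dictionary. In real use, the fetch callable would wrap something like `client.health_report(verbose=False)`.

```python
import time


def poll_health(fetch_report, interval_seconds=30.0, max_polls=3, sleep=time.sleep):
    """Call fetch_report repeatedly and collect the top-level statuses."""
    seen = []
    for attempt in range(max_polls):
        report = fetch_report()
        # The top-level status can be absent, so fall back to "unknown".
        seen.append(report.get("status", "unknown"))
        # Wait between polls, but not after the last one.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    return seen


# Stand-in for: lambda: client.health_report(verbose=False)
def fake_fetch():
    return {"cluster_name": "demo", "status": "green", "indicators": {}}


print(poll_health(fake_fetch, interval_seconds=0, max_polls=2))
```

Passing the fetch function in as an argument keeps the loop independent of any particular client and makes it easy to exercise without a running cluster.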

Path parameters

<indicator>

(Optional, string) Limit the information returned to a specific indicator. Supported indicators are:

  • master_is_stable

    Reports health issues regarding the stability of the node that is seen as the master by the node handling the health request. If enough master changes are observed in a short period of time, this indicator attempts to diagnose and report useful information about the cluster formation issues it detects.

    shards_availability

    Reports health issues regarding shard assignments.

    disk

    Reports health issues caused by lack of disk space.

    ilm

    Reports health issues related to Index Lifecycle Management.

    repository_integrity

    Tracks repository integrity and reports health issues that arise if repositories become corrupted, unknown, or invalid.

    slm

    Reports health issues related to Snapshot Lifecycle Management.

    shards_capacity

    Reports health issues related to the shards capacity of the cluster.

Query parameters

verbose

(Optional, Boolean) If true, the response includes additional details that help explain the status of each non-green indicator. These details include additional troubleshooting metrics and sometimes a root cause analysis of a health status. Defaults to true.

size

(Optional, integer) The maximum number of affected resources to return. Because a diagnosis can return multiple types of affected resources, this parameter limits the number of resources returned for each type to the configured value (for example, a diagnosis could return 1000 affected indices and 1000 affected nodes). Defaults to 1000.
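To show how the path and query parameters combine, here is a small sketch that only builds the request path and sends nothing. `health_report_path` is a hypothetical helper, not part of any client library.

```python
from urllib.parse import urlencode


def health_report_path(indicator=None, verbose=None, size=None):
    """Build the request path for the health API from optional arguments."""
    path = "/_health_report"
    if indicator:
        path += "/" + indicator
    params = {}
    # Only include parameters the caller set, so server defaults apply otherwise.
    if verbose is not None:
        params["verbose"] = "true" if verbose else "false"
    if size is not None:
        params["size"] = str(size)
    return path + ("?" + urlencode(params) if params else "")


print(health_report_path())
print(health_report_path("disk", verbose=False, size=10))
```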

Response body

cluster_name

(string) The name of the cluster.

status

(Optional, string) Health status of the cluster, based on the aggregated status of all indicators in the cluster. If the health of a specific indicator is being requested, this top level status will be omitted. Statuses are:

  • green

    The cluster is healthy.

    unknown

    The health of the cluster could not be determined.

    yellow

    The functionality of a cluster is in a degraded state and may need remediation to avoid the health becoming red.

    red

    The cluster is experiencing an outage or certain features are unavailable for use.

indicators

(object) Information about the health of the cluster indicators.

Properties of indicators

  • <indicator>

    (object) Contains health results for an indicator.

    Properties of <indicator>

    • status

      (string) Health status of the indicator. Statuses are:

      green

      The indicator is healthy.

      unknown

      The health of the indicator could not be determined.

      yellow

      The functionality of an indicator is in a degraded state and may need remediation to avoid the health becoming red.

      red

      The indicator is experiencing an outage or certain features are unavailable for use.

      symptom

      (string) A message providing information about the current health status.

      details

      (Optional, object) An object that contains additional information about the cluster that has led to the current health status result. This data is unstructured, and each indicator returns a unique set of details. Details will not be calculated if the verbose property is set to false.

      impacts

      (Optional, array) If a non-healthy status is returned, indicators may include a list of impacts that this health status will have on the cluster.

      Properties of impacts

      severity

      (integer) How important this impact is to the functionality of the cluster. A value of 1 is the highest severity, with larger values indicating lower severity.

      description

      (string) A description of the impact on the cluster.

      impact_areas

      (array of strings) The areas of cluster functionality that this impact affects. Possible values are:

      • search
      • ingest
      • backup
      • deployment_management

      diagnosis

      (Optional, array) If a non-healthy status is returned, indicators may include a list of diagnoses that encapsulate the cause of the health issue and an action to take in order to remediate the problem. The diagnosis will not be calculated if the verbose property is false.

      Properties of diagnosis

      cause

      (string) A description of a root cause of this health problem.

      action

      (string) A brief description of the steps that should be taken to remediate the problem. A more detailed step-by-step guide to remediate the problem is provided by the help_url field.

      affected_resources

      (Optional, object) An object where the keys represent resource types (for example, indices, shards), and the values are lists of the specific resources affected by the issue.

      help_url

      (string) A link to the troubleshooting guide for fixing the health problem.
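As a rough illustration of how the response fields described above fit together, the following sketch walks a report and summarizes each non-green indicator. `summarize_problems` is a hypothetical helper, and the sample report is made-up data shaped like the documented fields, not real output.

```python
def summarize_problems(report):
    """Return one summary dict per non-green indicator."""
    problems = []
    for name, result in report.get("indicators", {}).items():
        if result.get("status") == "green":
            continue
        # impacts and diagnosis are optional, so default to empty lists.
        # Severity 1 is the highest, so an ascending sort puts it first.
        impacts = sorted(result.get("impacts", []), key=lambda i: i["severity"])
        problems.append({
            "indicator": name,
            "status": result["status"],
            "symptom": result.get("symptom"),
            "top_impact": impacts[0]["description"] if impacts else None,
            "actions": [d["action"] for d in result.get("diagnosis", [])],
        })
    return problems


# Made-up report for illustration only.
sample_report = {
    "cluster_name": "demo",
    "status": "red",
    "indicators": {
        "master_is_stable": {"status": "green", "symptom": "Stable master."},
        "shards_availability": {
            "status": "red",
            "symptom": "One primary shard is unavailable.",
            "impacts": [
                {"severity": 2, "description": "Searches may be incomplete.",
                 "impact_areas": ["search"]},
                {"severity": 1, "description": "Cannot add data to one index.",
                 "impact_areas": ["ingest"]},
            ],
            "diagnosis": [
                {"cause": "A node left the cluster.",
                 "action": "Check the allocation status.",
                 "help_url": "https://example.invalid/guide"},
            ],
        },
    },
}

for problem in summarize_problems(sample_report):
    print(problem["indicator"], problem["status"], problem["top_impact"])
```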

Indicator Details

Each health indicator in the health API returns a set of details that further explains the state of the system. The details have contents and a structure that is unique to each indicator.

master_is_stable

current_master

(object) Information about the currently elected master.

Properties of current_master

  • node_id

    (string) The node id of the currently elected master, or null if no master is elected.

    name

    (string) The node name of the currently elected master, or null if no master is elected.

recent_masters

(Optional, array) A list of nodes that have been elected or replaced as master in a recent time window. This field is present if the master is changing rapidly enough to cause problems, and also present as additional information when the indicator is green. This array includes only elected masters, and does not include empty entries for periods when there was no elected master.

Properties of recent_masters

  • node_id

    (string) The node id of a recently active master node.

    name

    (string) The node name of a recently active master node.

exception_fetching_history

(Optional, object) If the node being queried sees that the elected master has stepped down repeatedly, the master history is requested from the most recently elected master node for diagnosis purposes. If fetching this remote history fails, the exception information is returned in this detail field.

Properties of exception_fetching_history

  • message

    (string) The exception message for the failed history fetch operation.

    stack_trace

    (string) The stack trace for the failed history fetch operation.

cluster_formation

(Optional, array) If there has been no elected master node recently, the node being queried attempts to gather information about why the cluster has been unable to form, or why the node being queried has been unable to join the cluster if it has formed. This array could contain an entry for each master-eligible node’s view of cluster formation.

Properties of cluster_formation

  • node_id

    (string) The node id of a master-eligible node.

    name

    (Optional, string) The node name of a master-eligible node.

    cluster_formation_message

    (string) A detailed description explaining what went wrong with cluster formation, or why this node was unable to join the cluster if it has formed.

shards_availability

unassigned_primaries

(int) The number of primary shards that are unassigned for reasons other than initialization or relocation.

initializing_primaries

(int) The number of primary shards that are initializing or recovering.

creating_primaries

(int) The number of primary shards that are unassigned because they have been very recently created.

creating_replicas

(int) The number of replica shards that are unassigned because they have been very recently created.

restarting_primaries

(int) The number of primary shards that are relocating because of a node shutdown operation.

started_primaries

(int) The number of primary shards that are active and available on the system.

unassigned_replicas

(int) The number of replica shards that are unassigned for reasons other than initialization or relocation.

initializing_replicas

(int) The number of replica shards that are initializing or recovering.

restarting_replicas

(int) The number of replica shards that are relocating because of a node shutdown operation.

started_replicas

(int) The number of replica shards that are active and available on the system.
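One way to read these counters is to pick out the non-zero ones that describe shards that are not yet started. The sketch below does that. `shards_not_started` is a hypothetical helper, and the sample values are invented.

```python
def shards_not_started(details):
    """Return the non-zero counters that describe shards not yet started."""
    return {
        key: count
        for key, count in details.items()
        if not key.startswith("started_") and count > 0
    }


# Made-up shards_availability details for illustration only.
sample_details = {
    "unassigned_primaries": 1,
    "initializing_primaries": 0,
    "creating_primaries": 0,
    "restarting_primaries": 0,
    "started_primaries": 12,
    "unassigned_replicas": 2,
    "initializing_replicas": 0,
    "creating_replicas": 0,
    "restarting_replicas": 0,
    "started_replicas": 10,
}
print(shards_not_started(sample_details))
```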

disk

indices_with_readonly_block

(int) The number of indices on which the system has enforced a read-only index block (index.blocks.read_only_allow_delete) because the cluster is running out of space.

nodes_with_enough_disk_space

(int) The number of nodes that have enough available disk space to function.

nodes_over_high_watermark

(int) The number of nodes that are running low on disk space and are likely to run out. Their disk usage has tripped the high watermark threshold.

nodes_over_flood_stage_watermark

(int) The number of nodes that have run out of disk space. Their disk usage has tripped the flood stage watermark threshold.

unknown_nodes

(int) The number of nodes for which it was not possible to determine their disk health.

repository_integrity

total_repositories

(Optional, int) The number of currently configured repositories on the system. If there are no repositories configured then this detail is omitted.

corrupted_repositories

(Optional, int) The number of repositories on the system that have been determined to be corrupted. If there are no corrupted repositories detected, this detail is omitted.

corrupted

(Optional, array of strings) If corrupted repositories have been detected in the system, the names of up to ten of them are displayed in this field. If no corrupted repositories are found, this detail is omitted.

unknown_repositories

(Optional, int) The number of repositories that have been determined to be unknown by at least one node. If there are no unknown repositories detected, this detail is omitted.

invalid_repositories

(Optional, int) The number of repositories that have been determined to be invalid by at least one node. If there are no invalid repositories detected, this detail is omitted.
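Because each of these counters is omitted when it has nothing to report, a consumer has to handle missing keys. A minimal sketch, with the hypothetical helper `repository_counts` treating an omitted counter as zero:

```python
def repository_counts(details):
    """Read the repository_integrity counters, treating omitted ones as 0."""
    keys = (
        "total_repositories",
        "corrupted_repositories",
        "unknown_repositories",
        "invalid_repositories",
    )
    return {key: details.get(key, 0) for key in keys}


# Made-up details for illustration only.
sample = {"total_repositories": 3, "corrupted_repositories": 1, "corrupted": ["repo-a"]}
print(repository_counts(sample))
```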

ilm

ilm_status

(string) The current status of the Index Lifecycle Management feature. Either STOPPED, STOPPING, or RUNNING.

policies

(int) The number of index lifecycle policies that the system is managing.

stagnating_indices

(int) The number of indices managed by index lifecycle management that have been stagnant longer than expected.

stagnating_indices_per_action

(Optional, map) Summary of the number of indices, grouped by action, that have been stagnant longer than expected.

Properties of stagnating_indices_per_action

  • downsample

    (int) The number of stagnant indices in the downsample action.

    allocate

    (int) The number of stagnant indices in the allocate action.

    shrink

    (int) The number of stagnant indices in the shrink action.

    searchable_snapshot

    (int) The number of stagnant indices in the searchable_snapshot action.

    rollover

    (int) The number of stagnant indices in the rollover action.

    forcemerge

    (int) The number of stagnant indices in the forcemerge action.

    delete

    (int) The number of stagnant indices in the delete action.

    migrate

    (int) The number of stagnant indices in the migrate action.
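The per-action summary lends itself to a quick ranking of where indices are stuck. The sketch below shows one way to do that. `stagnant_actions` is a hypothetical helper, and the sample details are invented.

```python
def stagnant_actions(details):
    """List (action, count) pairs with stagnant indices, largest count first."""
    # The map is optional, so fall back to an empty one.
    per_action = details.get("stagnating_indices_per_action", {})
    ranked = sorted(per_action.items(), key=lambda item: item[1], reverse=True)
    return [(action, count) for action, count in ranked if count > 0]


# Made-up ilm details for illustration only.
sample = {
    "ilm_status": "RUNNING",
    "policies": 4,
    "stagnating_indices": 3,
    "stagnating_indices_per_action": {"rollover": 1, "shrink": 2, "delete": 0},
}
print(stagnant_actions(sample))  # [('shrink', 2), ('rollover', 1)]
```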

slm

slm_status

(string) The current status of the Snapshot Lifecycle Management feature. Either STOPPED, STOPPING, or RUNNING.

policies

(int) The number of snapshot policies that the system is managing.

unhealthy_policies

(map) A detailed view of the policies that are considered unhealthy due to having several consecutive unsuccessful invocations. The count key represents the number of unhealthy policies (int). The invocations_since_last_success key reports a map where the unhealthy policy name is the key and its corresponding number of failed invocations is the value.
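To make the shape of unhealthy_policies concrete, the following sketch lists the failing policies ordered by their number of failed invocations. `failing_policies` is a hypothetical helper, and the policy names and counts are invented.

```python
def failing_policies(details):
    """Return (policy, failed_invocations) pairs, most failures first."""
    # Use empty defaults in case the map or its nested key is absent.
    unhealthy = details.get("unhealthy_policies", {})
    failures = unhealthy.get("invocations_since_last_success", {})
    return sorted(failures.items(), key=lambda item: item[1], reverse=True)


# Made-up slm details for illustration only.
sample = {
    "slm_status": "RUNNING",
    "policies": 3,
    "unhealthy_policies": {
        "count": 2,
        "invocations_since_last_success": {"nightly": 5, "weekly": 7},
    },
}
print(failing_policies(sample))  # [('weekly', 7), ('nightly', 5)]
```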

shards_capacity

data

(map) A view with information about the current capacity of shards for data nodes that do not belong to the frozen tier.

Properties of data

  • max_shards_in_cluster

    (int) Indicates the maximum number of shards that the cluster can hold.

    current_used_shards

    (Optional, int) The total number of shards held by the cluster. Only displayed if the indicator’s status is red or yellow.

frozen

(map) A view with information about the current capacity of shards for data nodes that belong to the frozen tier.

Properties of frozen

  • max_shards_in_cluster

    (int) Indicates the maximum number of shards the cluster can hold for the partially mounted indices.

    current_used_shards

    (Optional, int) The total number of shards the partially mounted indices have in the cluster. Only displayed if the indicator’s status is red or yellow.
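The two counters in each view can be combined into a usage ratio. The sketch below is one possible way to do so. `shard_usage` is a hypothetical helper, and it returns None when current_used_shards is absent, which this page says happens unless the status is red or yellow.

```python
def shard_usage(tier):
    """Return used/max shards for a data or frozen view, or None if unknown."""
    used = tier.get("current_used_shards")
    limit = tier["max_shards_in_cluster"]
    # current_used_shards is optional; also guard against a zero limit.
    if used is None or limit == 0:
        return None
    return used / limit


# Made-up values for illustration only.
print(shard_usage({"max_shards_in_cluster": 1000, "current_used_shards": 500}))  # 0.5
print(shard_usage({"max_shards_in_cluster": 1000}))  # None
```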

Examples

Python:
    resp = client.health_report()
    print(resp)

Ruby:
    response = client.health_report
    puts response

JavaScript:
    const response = await client.healthReport();
    console.log(response);

Console:
    GET _health_report

The API returns a response with all the indicators regardless of current status.

Python:
    resp = client.health_report(
        feature="shards_availability",
    )
    print(resp)

Ruby:
    response = client.health_report(
      feature: 'shards_availability'
    )
    puts response

JavaScript:
    const response = await client.healthReport({
      feature: "shards_availability",
    });
    console.log(response);

Console:
    GET _health_report/shards_availability

The API returns a response for just the shard availability indicator.

Python:
    resp = client.health_report(
        verbose=False,
    )
    print(resp)

Ruby:
    response = client.health_report(
      verbose: false
    )
    puts response

JavaScript:
    const response = await client.healthReport({
      verbose: false,
    });
    console.log(response);

Console:
    GET _health_report?verbose=false

The API returns a response with all health indicators but does not calculate details or root cause analysis for the response. This is helpful if you want to monitor the health API without the overhead of calculating additional troubleshooting details on each call.