Troubleshooting for Critical Alerts
- Elasticsearch Cluster Health is Red
- Elasticsearch Cluster Health is Yellow
- Elasticsearch Node Disk Low Watermark Reached
- Elasticsearch Node Disk High Watermark Reached
- Elasticsearch Node Disk Flood Watermark Reached
- Elasticsearch JVM Heap Use is High
- Aggregated Logging System CPU is High
- Elasticsearch Process CPU is High
- Elasticsearch Disk Space is Running Low
- Elasticsearch FileDescriptor Usage is high
Elasticsearch Cluster Health is Red
At least one primary shard and its replicas are not allocated to a node.
Troubleshooting
Check the Elasticsearch cluster health and verify that the cluster status is red.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health
List the nodes that have joined the cluster.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/nodes?v
List the Elasticsearch pods and compare them with the nodes in the command output from the previous step.
oc -n openshift-logging get pods -l component=elasticsearch
If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.
Confirm that Elasticsearch has an elected master node.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/master?v
Review the pod logs of the elected master node for issues.
oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
Review the logs of nodes that have not joined the cluster for issues.
oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
If all the nodes have joined the cluster, check whether the cluster is in the process of recovering.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/recovery?active_only=true
If there is no command output, the recovery process might be delayed or stalled by pending tasks.
Check if there are pending tasks.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health |grep number_of_pending_tasks
If there are pending tasks, monitor their status.
If their status changes and indicates that the cluster is recovering, continue waiting. The recovery time varies according to the size of the cluster and other factors.
If the status of the pending tasks does not change, the recovery has stalled.
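To monitor the pending tasks in more detail while you wait, you can list them with the cluster pending tasks API. This is a sketch that reuses the es_util helper from the previous steps:
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/pending_tasks?pretty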
If the recovery appears to have stalled, check whether cluster.routing.allocation.enable is set to none.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty
If cluster.routing.allocation.enable is set to none, set it to all.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
Check which indices are still red.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
If any indices are still red, try to clear them by performing the following steps.
Clear the cache.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
Increase the max allocation retries.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.allocation.max_retries":10}'
Delete all the scroll items.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_search/scroll/_all -X DELETE
Increase the timeout.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.unassigned.node_left.delayed_timeout":"10m"}'
If the preceding steps do not clear the red indices, delete the indices individually.
Identify the red index name.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
Delete the red index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_red_index_name> -X DELETE
If there are no red indices and the cluster status is red, check for a continuous heavy processing load on a data node.
Check if the Elasticsearch JVM Heap usage is high.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_nodes/stats?pretty
In the command output, review the node_name.jvm.mem.heap_used_percent field to determine the JVM heap usage.
Check for high CPU utilization.
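For a quick per-node view of both heap and CPU pressure, you can request specific columns from the _cat/nodes API. This is a sketch; name, heap.percent, cpu, and load_1m are standard _cat/nodes column names:
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query="_cat/nodes?v&h=name,heap.percent,cpu,load_1m"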
Additional resources
- Search for “Free up or increase disk space” in the Elasticsearch topic, Fix a red or yellow cluster status.
Elasticsearch Cluster Health is Yellow
Replica shards for at least one primary shard are not allocated to nodes.
Troubleshooting
- Increase the node count by adjusting nodeCount in the ClusterLogging CR.
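For example, you could patch the CR as sketched below. The resource name instance is only the conventional name, and the node count of 4 is illustrative; adjust both to match your deployment:
oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"logStore":{"elasticsearch":{"nodeCount":4}}}}'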
Additional resources
- Search for “Free up or increase disk space” in the Elasticsearch topic, Fix a red or yellow cluster status.
Elasticsearch Node Disk Low Watermark Reached
Elasticsearch does not allocate shards to nodes that reach the low watermark.
Troubleshooting
Identify the node on which Elasticsearch is deployed.
oc -n openshift-logging get po -o wide
Check if there are unassigned shards.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep unassigned_shards
If there are unassigned shards, check the disk space on each node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
Check the nodes.node_name.fs field to determine the free disk space on that node.
If the used disk percentage is above 85%, the node has exceeded the low watermark, and shards can no longer be allocated to this node.
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging CR, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
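One way to apply the change is to patch the ClusterLogging CR, as sketched below. The resource name instance is only the conventional name; adjust it to match your deployment:
oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"logStore":{"elasticsearch":{"redundancyPolicy":"SingleRedundancy"}}}}'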
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
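To make older indices easier to spot, you can ask _cat/indices to sort by creation date. This is a sketch; creation.date.string and the s sort parameter are standard _cat options:
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query="_cat/indices?v&h=index,health,creation.date.string,store.size&s=creation.date"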
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Additional resources
- Search for “redundancyPolicy” in the “Sample ClusterLogging custom resource (CR)” in About the Cluster Logging custom resource.
Elasticsearch Node Disk High Watermark Reached
Elasticsearch attempts to relocate shards away from a node that has reached the high watermark.
Troubleshooting
Identify the node on which Elasticsearch is deployed.
oc -n openshift-logging get po -o wide
Check the disk space on each node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
Check if the cluster is rebalancing.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep relocating_shards
If the command output shows relocating shards, the high watermark has been exceeded. The default value of the high watermark is 90%.
The shards relocate to a node with low disk usage that has not crossed any watermark threshold limits.
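To confirm the watermark thresholds that are actually in effect on your cluster, you can query the cluster settings with defaults included. This is a sketch using the same es_util helper:
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query="_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark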
To allocate shards to a particular node, free up some space.
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging CR, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Additional resources
- Search for “redundancyPolicy” in the “Sample ClusterLogging custom resource (CR)” in About the Cluster Logging custom resource.
Elasticsearch Node Disk Flood Watermark Reached
Elasticsearch enforces a read-only index block on every index that has both of these conditions:
- One or more shards are allocated to the node.
- One or more disks exceed the flood stage.
Troubleshooting
Check the disk space of the Elasticsearch node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
Check the nodes.node_name.fs field to determine the free disk space on that node.
If the used disk percentage is above 95%, the node has crossed the flood watermark. Writing is blocked for shards allocated on this node.
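To see which indices currently carry the read-only block, you can read the block setting across all indices. This is a sketch that filters the index settings by the same setting name used in the unblock step later in this procedure:
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query="_all/_settings/index.blocks.read_only_allow_delete?pretty"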
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging CR, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Continue freeing up and monitoring the disk space. After the used disk space drops below 90%, unblock writes to this node.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings?pretty -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
Additional resources
- Search for “redundancyPolicy” in the “Sample ClusterLogging custom resource (CR)” in About the Cluster Logging custom resource.
Elasticsearch JVM Heap Use is High
The Elasticsearch node JVM Heap memory used is above 75%.
Troubleshooting
Consider increasing the heap size.
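In OpenShift Logging, the Elasticsearch heap size is typically derived from the memory resources set in the ClusterLogging CR, so increasing the heap usually means increasing spec.logStore.elasticsearch.resources. The sketch below assumes the conventional CR name instance and an illustrative 32Gi value; adjust both to your deployment and verify how your operator version sizes the heap:
oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"requests":{"memory":"32Gi"},"limits":{"memory":"32Gi"}}}}}}'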
Aggregated Logging System CPU is High
System CPU usage on the node is high.
Troubleshooting
Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
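If cluster metrics are available, node-level CPU usage can be checked with oc adm top:
oc adm top node <node_name>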
Elasticsearch Process CPU is High
Elasticsearch process CPU usage on the node is high.
Troubleshooting
Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
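If cluster metrics are available, per-pod CPU usage for Elasticsearch can be checked with oc adm top:
oc adm top pods -n openshift-logging -l component=elasticsearch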
Elasticsearch Disk Space is Running Low
The Elasticsearch cluster is predicted to run out of disk space within the next 6 hours, based on current disk usage.
Troubleshooting
Get the disk space of the Elasticsearch node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
In the command output, check the nodes.node_name.fs field to determine the free disk space on that node.
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging CR, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Additional resources
- Search for “redundancyPolicy” in the “Sample ClusterLogging custom resource (CR)” in About the Cluster Logging custom resource.
- Search for “ElasticsearchDiskSpaceRunningLow” in About Elasticsearch alerting rules.
- Search for “Free up or increase disk space” in the Elasticsearch topic, Fix a red or yellow cluster status.
Elasticsearch FileDescriptor Usage is high
Based on current usage trends, the predicted number of file descriptors on the node is insufficient.
Troubleshooting
Check and, if needed, configure the value of max_file_descriptors for each node, as described in the Elasticsearch File descriptors topic.
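To check the current and maximum file descriptor counts that Elasticsearch reports per node, the nodes stats API exposes open_file_descriptors and max_file_descriptors under the process section. A sketch using the same es_util helper:
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_nodes/stats/process?pretty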
Additional resources
- Search for “ElasticsearchHighFileDescriptorUsage” in About Elasticsearch alerting rules.
- Search for “File Descriptors In Use” in OpenShift Logging dashboards.