Troubleshooting

While Kubernetes and the ArangoDB Kubernetes operator automatically resolve many issues, there are always cases where human attention is needed.

This chapter gives you tips and tricks to help you troubleshoot deployments.

Where to look

In Kubernetes, all resources can be inspected with kubectl, using either the get or the describe command.

To get all details of a resource (both specification and status), run the following command:

kubectl get <resource-type> <resource-name> -n <namespace> -o yaml

For example, to get the entire specification and status of an ArangoDeployment resource named my-arango in the default namespace, run:

kubectl get ArangoDeployment my-arango -n default -o yaml
# or shorter
kubectl get arango my-arango -o yaml

Several types of resources (including all ArangoDB custom resources) support events. These events show what happened to the resource over time.

To show the events (and the most important resource data) of a resource, run the following command:

kubectl describe <resource-type> <resource-name> -n <namespace>
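
As a sketch of how this looks for an ArangoDeployment, the helper below describes the resource and then lists the events that refer to it, oldest first. The function name inspect_deployment is made up for this example and the fallback to the default namespace is an assumption; the kubectl flags are standard ones.

```shell
# Hypothetical helper: describe an ArangoDeployment and list its events.
# Usage: inspect_deployment <deployment-name> [namespace]
inspect_deployment() {
  name="$1"
  namespace="${2:-default}"   # assumption: fall back to the default namespace

  # Specification summary, status and recent events of the resource.
  kubectl describe ArangoDeployment "$name" -n "$namespace"

  # Events whose involved object has the same name, sorted by creation time.
  # Note: this matches any kind of object with that name, not only ArangoDeployments.
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=$name" \
    --sort-by=.metadata.creationTimestamp
}
```

For the example resource used earlier, the call would be inspect_deployment my-arango default.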

Getting logs

Another invaluable source of information is the log of the containers being run in Kubernetes. These logs are accessible through the Pods that group these containers.

To fetch the logs of the default container running in a Pod, run:

kubectl logs <pod-name> -n <namespace>
# or with follow option to keep inspecting logs while they are written
kubectl logs <pod-name> -n <namespace> -f

To inspect the logs of a specific container in a Pod, add -c <container-name>. You can find the names of the containers in the Pod using kubectl describe pod ….
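
If only the container names are needed, a JSONPath query is a shorter alternative to reading the describe output. The sketch below is one way to do it; the function names container_names and container_logs are invented for this example, and init containers (listed under .spec.initContainers) are not included.

```shell
# Hypothetical helper: print the names of the regular containers in a Pod.
# Usage: container_names <pod-name> [namespace]
container_names() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o 'jsonpath={.spec.containers[*].name}'
}

# Hypothetical helper: fetch the logs of one named container in a Pod.
# Usage: container_logs <pod-name> <container-name> [namespace]
container_logs() {
  kubectl logs "$1" -n "${3:-default}" -c "$2"
}
```

Run container_names first, then pass one of the printed names to container_logs.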

Note that the ArangoDB operators are themselves deployed as a Kubernetes Deployment with 2 replicas. This means that you have to fetch the logs of the 2 Pods running those replicas.
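
Fetching the logs of both replicas can be scripted. The following is a minimal sketch, assuming the operator Pods can be selected with a label selector; which labels they carry depends on how the operator was installed, so look them up first with kubectl get pods --show-labels. The function name operator_logs is hypothetical.

```shell
# Hypothetical helper: print the logs of every Pod matching a label selector.
# Usage: operator_logs <label-selector> [namespace]
operator_logs() {
  selector="$1"               # assumption: a selector that matches the operator Pods
  namespace="${2:-default}"

  # "-o name" prints one "pod/<name>" entry per line, which kubectl logs accepts.
  for pod in $(kubectl get pods -n "$namespace" -l "$selector" -o name); do
    echo "=== $pod ==="
    kubectl logs "$pod" -n "$namespace"
  done
}
```

The separator lines make it easier to tell which replica wrote which part of the output.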

What if

The Pods of a deployment stay in Pending state

There are two common causes for this.

1) The Pods cannot be scheduled because there are not enough nodes available. This is usually only the case with a spec.environment setting that has a value of Production.

Solution: Add more nodes.

2) There are no PersistentVolumes available to be bound to the PersistentVolumeClaims created by the operator.

Solution: Use `kubectl get persistentvolumes` to inspect the available `PersistentVolumes` and if needed, use the [`ArangoLocalStorage` operator](deployment-kubernetes-storage-resource.html) to provision `PersistentVolumes`.
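
To tell the two causes apart, it can help to look at the Pending Pods, the claims, the volumes and the nodes in one go. The helper below is a sketch of such a check; the name pending_report is made up for this example and it only reads cluster state.

```shell
# Hypothetical helper: collect the information needed to diagnose Pending Pods.
# Usage: pending_report [namespace]
pending_report() {
  namespace="${1:-default}"

  # Pods that have not been scheduled or started yet.
  kubectl get pods -n "$namespace" --field-selector=status.phase=Pending

  # Claims created for the deployment; unbound claims are shown as Pending.
  kubectl get persistentvolumeclaims -n "$namespace"

  # PersistentVolumes are cluster-scoped, so no namespace is given.
  kubectl get persistentvolumes

  # Nodes available for scheduling.
  kubectl get nodes
}
```

Running kubectl describe pod on one of the Pending Pods afterwards shows the scheduler's events for that Pod, which usually state why it could not be placed.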

When restarting a Node, the Pods scheduled on that node remain in Terminating state

When a Node no longer makes regular calls to the Kubernetes API server, it is marked as not available. Depending on specific settings in your Pods, Kubernetes will at some point decide to terminate the Pod. As long as the Node is not completely removed from the Kubernetes API server, Kubernetes will try to use the Node itself to terminate the Pod.

The ArangoDeployment operator recognizes this condition and will try to replace those Pods with Pods on different nodes. The exact behavior differs per type of server.
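
To see which Pods are affected, list the Pods that are scheduled on the unavailable Node. The sketch below assumes you already know the Node name (kubectl get nodes shows unavailable Nodes with a NotReady status); the function name pods_on_node is hypothetical.

```shell
# Hypothetical helper: list the Pods scheduled on a given Node.
# Usage: pods_on_node <node-name> [namespace]
pods_on_node() {
  node="$1"
  namespace="${2:-default}"

  # "-o wide" adds the NODE column; the field selector keeps only Pods on that Node.
  kubectl get pods -n "$namespace" -o wide \
    --field-selector "spec.nodeName=$node"
}
```

Pods stuck on the Node appear with status Terminating, while their replacements show up on other Nodes once the operator has created them.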

What happens when a Node with local data is broken

When a Node with PersistentVolumes hosted on that Node is broken and cannot be repaired, the data in those PersistentVolumes is lost.

If an ArangoDeployment of type Single was using one of those PersistentVolumes, the database is lost and must be restored from a backup.

If an ArangoDeployment of type ActiveFailover or Cluster was using one of those PersistentVolumes, it depends on the type of server that was using the volume.

- If an Agent was using the volume, it can be repaired as long as 2 other Agents are still healthy.
- If a DBServer was using the volume, and the replication factor of all database collections is 2 or higher, and the remaining DBServers are still healthy, the cluster will duplicate the remaining replicas to bring the number of replicas back to the original number.
- If a DBServer was using the volume, and a database collection with a replication factor of 1 happens to be stored on that DBServer, the data of that collection is lost.
- If a single server of an ActiveFailover deployment was using the volume, and the other single server is still healthy, the other single server will become leader. After replacing the failed single server, the new follower will synchronize with the leader.