Strategy

Tutorial - Applying troubleshooting strategies

IMPORTANT: Tutorials are intended to give you hands-on experience working with a limited set of DC/OS features with no implied or explicit warranty of any kind. None of the information provided—including sample scripts, commands, or applications—is officially supported by Mesosphere. You should not use this information in a production environment without independent testing and validation.

General Strategy: Debugging Application Deployment on DC/OS

Now that we have defined a toolset for debugging applications on DC/OS, let us consider a general, step-by-step troubleshooting strategy for applying these tools in an application debugging scenario. Once we have gone over this general strategy, the practice section walks through a few concrete scenarios that apply it.

Beyond considering any information special to your scenario, a reasonable approach to debugging an application deployment issue is to apply our debugging tools in the following order:

Step 1: Check the web interfaces

Start by examining the DC/OS web interface (or use the CLI) to check the status of the task. If the task has an associated health check, it is also a good idea to check the task’s health status.

Where relevant, also check the Mesos web interface or the Exhibitor/ZooKeeper web interface for additional debugging information.
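From the CLI, the status and health check described in this step can be sketched roughly as follows. The helper name `show_app_status` and the app ID `/my-app` are hypothetical, and the available subcommands and output columns vary between DC/OS CLI versions, so treat this as a starting point rather than a reference.

```shell
# Hypothetical helper: print the deployment state and per-task health of one app.
# Assumes the DC/OS CLI is installed and attached to a cluster.
show_app_status() {
  # $1 is the Marathon app ID, for example /my-app (placeholder).
  dcos marathon app list        # overview of all apps, including health and deployment state
  dcos marathon task list "$1"  # tasks of this app, including their health status
}
```

Calling `show_app_status /my-app` prints both listings one after the other, which is usually enough to tell whether the app is deploying, running, or unhealthy.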

Step 2: Check the Task Logs

If the web interfaces do not provide sufficient information, next check the task logs using the DC/OS web interface or the CLI. The logs give you a better understanding of what might have happened to the application. If the issue is that the app is not deploying (for example, the task status stays in a waiting state indefinitely), look at the ‘Debug’ page, which can help you understand which resources Mesos is offering.
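A minimal CLI sketch of this step, assuming a CLI version that provides `dcos task log` and the `dcos marathon debug` subcommands. The helper names and default line count are hypothetical, and the task and app identifiers are placeholders.

```shell
# Hypothetical helper: show the tail of a task's stdout and stderr.
show_task_logs() {
  # $1 is a task ID or name (placeholder); $2 is an optional line count (default 50).
  dcos task log --lines="${2:-50}" "$1"         # stdout is the default file
  dcos task log --lines="${2:-50}" "$1" stderr  # stderr must be requested explicitly
}

# Hypothetical helper: offer and placement information for an app that is
# waiting to deploy, similar in spirit to the 'Debug' page.
show_offer_debug() {
  # $1 is the Marathon app ID (placeholder).
  dcos marathon debug summary "$1"
  dcos marathon debug details "$1"
}
```

For a task that has already terminated, `dcos task log` typically needs the `--completed` flag to find it.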

Step 3: Check the Scheduler Logs

Next, when there is a deployment problem and the task logs do not provide enough information to fix the issue, it can be helpful to double-check the app definition. After confirming the app definition, check the Marathon log or web interface to better understand how the app was scheduled, or why it was not scheduled.
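The checks in this step could be scripted along these lines. The helper name `inspect_scheduling` is hypothetical, and the sketch assumes the app is managed by the native Marathon instance, whose service name is `marathon`.

```shell
# Hypothetical helper: review the app definition and the scheduler's view of it.
inspect_scheduling() {
  # $1 is the Marathon app ID (placeholder).
  dcos marathon app show "$1"         # app definition as stored by Marathon
  dcos marathon deployment list "$1"  # deployments currently in progress for this app
  dcos service log marathon           # recent Marathon scheduler log lines
}
```

Comparing the output of `dcos marathon app show` with the definition you intended to deploy is a quick way to spot typos in resource requests or constraints.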

Step 4: Check the Agent Logs

The Mesos Agent logs provide information regarding how the task and that task’s environment are being started. Recall that increasing the log level can be helpful in some cases to obtain more information with which to work.
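As a rough sketch, the agent log can be pulled through the CLI once you know the ID of the agent that ran the task. The helper name `show_agent_log` and the default line count are hypothetical, and the agent ID is a placeholder taken from the CLI's node listing.

```shell
# Hypothetical helper: print recent log lines from one Mesos agent.
show_agent_log() {
  # $1 is the Mesos agent ID (placeholder); $2 is an optional line count (default 100).
  dcos node log --mesos-id="$1" --lines="${2:-100}"
}
```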

Step 5: Test the Task Interactively

The next step is to interactively look at the task running inside the container. If the task is still running, dcos task exec or docker exec can be helpful to start an interactive debugging session. If the application is based on a Docker container image, manually starting it using docker run followed by docker exec can also get you started in the right direction.
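Both approaches can be sketched as small helpers. The function names, the container name, and the choice of `/bin/sh` are assumptions: the image must actually contain that shell, and the `docker run` variant only works if the image's default command keeps the container running long enough to attach.

```shell
# Hypothetical helper: open a shell inside a task that is still running on the cluster.
debug_running_task() {
  # $1 is the task ID or name (placeholder).
  dcos task exec --interactive --tty "$1" /bin/sh
}

# Hypothetical helper: start the image locally, then open a shell inside it.
debug_image_locally() {
  # $1 is the Docker image, $2 is a name for the debug container (both placeholders).
  docker run --detach --name "$2" "$1"
  docker exec --interactive --tty "$2" /bin/sh
}
```

Remember to remove the local debug container (for example with `docker rm -f`) once you are done with it.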

Step 6: Check the Master Logs

If you want to understand why a particular scheduler has received certain resources or a particular status, the master logs can be very helpful. Recall that the master forwards all status updates between the agents and the scheduler, so its logs can help even when the agent node itself is not reachable (for example, because of a network partition or node failure).
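One way to narrow down the master log is to search the leading master's recent output for a specific task or framework ID. This is only a sketch: the helper name `grep_master_log` and the default line count are hypothetical, and the search is a plain fixed-string match on whatever ID you pass in.

```shell
# Hypothetical helper: search the leading master's recent log lines for an ID.
grep_master_log() {
  # $1 is a task ID or framework ID (placeholder); $2 is an optional line count (default 1000).
  # grep -F treats the ID as a literal string, so dots in IDs are not regex wildcards.
  dcos node log --leader --lines="${2:-1000}" | grep -F -- "$1"
}
```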

Step 7: Ask the Community

As mentioned above, the community can be very helpful in debugging further. You can reach it through the DC/OS Slack or the mailing list.