Orca: Zombie Executions

A zombie Execution is one whose status in the database is RUNNING, but there are no messages for it in Orca’s work queue or unacked set: the pipeline or task is not doing anything.

Aliases: orphaned execution

Diagnosis

Orca’s QueueProcessor class regularly emits logs for Executions that are currently running; they look similar to the example below. If no such logs have been emitted for a RUNNING Execution in over 10 minutes, it is very likely a zombie.

  Received message RunTask(executionType=pipeline, executionId=01CT1ST3MBJ9ECPH5JM5HVJARE, application=myapplication, stageId=01CT1ST4P79Y3MPW6FC4H38N3A, taskId=8, taskType=class com.netflix.spinnaker.orca.clouddriver.tasks.instance.WaitForUpInstancesTask)

Metrics & Alerting

Orca can be configured to detect and emit metrics for zombie Executions. This setting is expensive with the RedisExecutionRepository and is disabled by default. You can enable this detection by setting queue.zombieCheck.enabled: true in your Orca configuration.
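
As a minimal sketch, the setting in YAML form looks like the following (placing it in an Orca custom profile such as orca-local.yml is an assumption about your setup; any Orca configuration source works):

  # Assumed location: an Orca custom profile such as orca-local.yml
  queue:
    zombieCheck:
      enabled: true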

When enabled, any discovered zombies will be logged, as well as reported via a metric:

  Found zombie executionType=pipeline application=myapplication executionName=myexample executionId=01CS076X85RX6MWBTQ0VGBF8VX

If you’ve enabled the zombie check, set an alert on the queue.zombies metric, triggering whenever the count is greater than 0.
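
As a sketch, if your metrics pipeline exports Orca’s Spectator metrics to Prometheus, the alert might look like the following (the exported metric name, assumed here to be queue_zombies, depends on how your installation publishes metrics):

  # Prometheus alerting rule sketch; the metric name is an assumption
  # that depends on how Orca's Spectator metrics are exported.
  groups:
    - name: orca
      rules:
        - alert: OrcaZombieExecutions
          expr: queue_zombies > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Orca has detected zombie Executions"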

Remediation

Rehydrate the Queue

If the Execution is a zombie, there are no messages on the work queue for that Execution. You can attempt to re-hydrate the queue (reissue messages onto the work queue based on the last stored state) using an admin API in Orca, which must be called directly because it is not exposed through Gate. The command can operate on either a single Execution or all Executions within a time range, and it dry-runs by default. To actually rehydrate the queue, pass the query parameter dryRun=false.

  $ curl -XPOST \
    'https://localhost:8083/admin/queue/hydrate?executionId=01CS076X85RX6MWBTQ0VGBF8VX&dryRun=false'
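
Because the endpoint dry-runs by default, you can preview what would be re-queued by omitting dryRun=false (same endpoint as above; the localhost host and port 8083 are assumptions about your deployment):

  $ curl -XPOST \
    'https://localhost:8083/admin/queue/hydrate?executionId=01CS076X85RX6MWBTQ0VGBF8VX'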

This command is best effort and may not be able to rehydrate the Execution, especially if the Execution was zombied while running a non-retryable task.

An example response from the endpoint:

  {
    "dryRun": false,
    "executions": {
      "01CS076X85RX6MWBTQ0VGBF8VX": {
        "startTime": 1538679600852,
        "actions": [
          {
            "description": "Task is running and is retryable",
            "message": {
              "kind": "runTask",
              "executionType": "PIPELINE",
              "executionId": "01CS076X85RX6MWBTQ0VGBF8VX",
              "application": "myapplication",
              "stageId": "01CS076X8501MNAD2ZTJ4ST2TM",
              "taskId": "1",
              "taskType": "com.netflix.spinnaker.orca.echo.pipeline.ManualJudgmentStage$WaitForManualJudgmentTask",
              "attributes": [],
              "ackTimeoutMs": 600000
            },
            "context": {
              "stageId": "01CS076X8501MNAD2ZTJ4ST2TM",
              "stageType": "manualJudgment",
              "stageStartTime": 1538682406227,
              "taskId": "1",
              "taskType": "waitForJudgment",
              "taskStartTime": 1538682406242
            }
          },
          {
            "description": "Task is running but is not retryable",
            "context": {
              "stageId": "01CS076X85ECXHF3FRWZBTQ359",
              "stageType": "createProperty",
              "stageStartTime": 1538681485559,
              "taskId": "3",
              "taskType": "monitorProperties",
              "taskStartTime": 1538681546116
            }
          }
        ],
        "canApply": false
      }
    }
  }

For each Execution, a final summary field, canApply, indicates whether the re-hydration can be applied. If any part of an Execution cannot be re-hydrated, the entire Execution will be skipped.

Cancel the Execution

If the Execution cannot be rehydrated, it will need to be canceled. You can cancel the Execution via the UI or force cancellation via an Orca admin API:

  PUT /admin/forceCancelExecution?executionId=01CS076X85RX6MWBTQ0VGBF8VX&executionType=PIPELINE
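
For example, calling the admin endpoint directly with curl (the localhost host and port 8083 follow the earlier example and are assumptions about your deployment):

  $ curl -XPUT \
    'https://localhost:8083/admin/forceCancelExecution?executionId=01CS076X85RX6MWBTQ0VGBF8VX&executionType=PIPELINE'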

Known Causes

Zombie Executions can occur when the Redis instance backing the Orca queue is lost, or during prolonged network unreliability. If you’re using the RedisExecutionRepository, losing the Redis that backs the queue most likely means losing all running Executions as well. With the SQL backend for Orca, however, losing Redis means you lose only the in-flight work queue state, not the state of the pipelines. In that scenario, once Redis has been restored it will have no messages to process, and existing RUNNING Executions will sit unprocessed.
