General Upgrade Process

Introduction

This document describes some best practices that you should follow when upgrading Consul. Some versions also have steps that are specific to that version, so make sure you also review the upgrade instructions for the version you are on.

Download the New Version

First, download the binary for the new version you want.

All current and past versions of the OSS and Enterprise releases are available here:

Docker containers are available at these locations:

If you are using Kubernetes, then please review our documentation for Upgrading Consul on Kubernetes.

Prepare for the Upgrade

1. Take a snapshot:

  1. consul snapshot save backup.snap

You can inspect the snapshot to ensure if was successful with:

  1. consul snapshot inspect backup.snap

Example output:

  1. ID 2-1182-1542056499724
  2. Size 4115
  3. Index 1182
  4. Term 2
  5. Version 1

This will ensure you have a safe fallback option in case something goes wrong. Store this snapshot somewhere safe. More documentation on snapshot usage is available here:

2. Temporarily modify your Consul configuration so that its log_level is set to debug. After doing this, issue the following command on your servers to reload the configuration:

  1. consul reload

This change will give you more information to work with in the event something goes wrong.

Perform the Upgrade

1. Issue the following command to discover which server is currently the leader:

  1. consul operator raft list-peers

You should receive output similar to this (exact formatting and content may differ based on version):

  1. Node ID Address State Voter RaftProtocol
  2. dc1-node1 ae15858f-7f5f-4dcb-b7d5-710fdcdd2745 10.11.0.2:8300 leader true 3
  3. dc1-node2 20e6be1b-f1cb-4aab-929f-f7d2d43d9a96 10.11.0.3:8300 follower true 3
  4. dc1-node3 658c343b-8769-431f-a71a-236f9dbb17b3 10.11.0.4:8300 follower true 3

Take note of which agent is the leader.

2. Copy the new consul binary onto your servers and replace the existing binary with the new one.

3. The following steps must be done in order on the server agents, leaving the leader agent for last. First, use a service management system (e.g., systemd, upstart, etc.) to restart the Consul service. If you are not using a service management system, you must restart the agent manually.

To validate that the agent has rejoined the cluster and is in sync with the leader, issue the following command:

  1. consul info

Check whether the commit_index and last_log_index fields have the same value. If done properly, this should avoid an unexpected leadership election due to loss of quorum.

4. Double-check that all servers are showing up in the cluster as expected and are on the correct version by issuing:

  1. consul members

You should receive output similar to this:

  1. Node Address Status Type Build Protocol DC
  2. dc1-node1 10.11.0.2:8301 alive server 1.8.3 2 dc1
  3. dc1-node2 10.11.0.3:8301 alive server 1.8.3 2 dc1
  4. dc1-node3 10.11.0.4:8301 alive server 1.8.3 2 dc1

Also double-check the raft state to make sure there is a leader and sufficient voters:

  1. consul operator raft list-peers

You should receive output similar to this:

  1. Node ID Address State Voter RaftProtocol
  2. dc1-node1 ae15858f-7f5f-4dcb-b7d5-710fdcdd2745 10.11.0.2:8300 leader true 3
  3. dc1-node2 20e6be1b-f1cb-4aab-929f-f7d2d43d9a96 10.11.0.3:8300 follower true 3
  4. dc1-node3 658c343b-8769-431f-a71a-236f9dbb17b3 10.11.0.4:8300 follower true 3

5. Set your log_level back to its original value and issue the following command on your servers to reload the configuration:

  1. consul reload

Troubleshooting

Most problems with upgrading occur due to either failing to upgrade the leader agent last, or failing to wait for a follower agent to fully rejoin a cluster before moving on to another server. This can cause a loss of quorum and occasionally can result in all of your servers attempting to kick off leadership elections endlessly without ever reaching a quorum and electing a leader.

Most of these problems can be solved by following the steps outlined in our Disaster recovery for Consul clusters document. If you are still having trouble after trying the recovery steps outlined there, then the following options for further assistance are available:

When contacting Hashicorp Support, please include the following information in your ticket:

  • Consul version you were upgrading FROM and TO.
  • Debug level logs from all servers in the cluster that you are having trouble with. These should include logs from prior to the upgrade attempt up through the current time. If your logs were not set at debug level prior to the upgrade, please include those logs as well. Also, update your config to use debug logs, and include logs from after that was done.
  • Your Consul config files (please redact any secrets).
  • Output from consul members -detailed and consul operator raft list-peers from each server in your cluster.