Rolling Upgrade

Rolling upgrades, sometimes referred to as “node replacement upgrades,” can be performed on running clusters with virtually no downtime. Nodes are individually stopped and upgraded in place. Alternatively, nodes can be stopped and replaced, one at a time, by hosts running the new version. During this process you can continue to index and query data in your cluster.

This document serves as a high-level, platform-agnostic overview of the rolling upgrade procedure. For specific examples of commands, scripts, and configuration files, refer to the Appendix.

Preparing to upgrade

Review Upgrading OpenSearch for recommendations about backing up your configuration files and creating a snapshot of the cluster state and indexes before you make any changes to your OpenSearch cluster.

Important: OpenSearch nodes cannot be downgraded. If you need to revert the upgrade, then you will need to perform a fresh installation of OpenSearch and restore the cluster from a snapshot. Take a snapshot and store it in a remote repository before beginning the upgrade procedure.

Performing the upgrade

  1. Verify the health of your OpenSearch cluster before you begin. You should resolve any index or shard allocation issues prior to upgrading to ensure that your data is preserved. A status of green indicates that all primary and replica shards are allocated. See Cluster health for more information. The following command queries the _cluster/health API endpoint:

    GET "/_cluster/health?pretty"

    The response should look similar to the following example:

    {
      "cluster_name":"opensearch-dev-cluster",
      "status":"green",
      "timed_out":false,
      "number_of_nodes":4,
      "number_of_data_nodes":4,
      "active_primary_shards":1,
      "active_shards":4,
      "relocating_shards":0,
      "initializing_shards":0,
      "unassigned_shards":0,
      "delayed_unassigned_shards":0,
      "number_of_pending_tasks":0,
      "number_of_in_flight_fetch":0,
      "task_max_waiting_in_queue_millis":0,
      "active_shards_percent_as_number":100.0
    }
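
If you script the upgrade, the health check in step 1 can be automated. The following is a minimal Python sketch, not an official client: the address in BASE_URL and the helper names fetch_health and ready_for_upgrade are assumptions made for illustration, and TLS and authentication are omitted.

```python
import json
import urllib.request

# Assumption: address of any reachable node; adjust for TLS and authentication.
BASE_URL = "http://localhost:9200"


def fetch_health(base_url: str = BASE_URL) -> dict:
    """Query the _cluster/health endpoint and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health") as resp:
        return json.load(resp)


def ready_for_upgrade(health: dict) -> bool:
    """Return True when the cluster is green and no shards are unassigned or moving."""
    return (
        health.get("status") == "green"
        and not health.get("timed_out", False)
        and health.get("unassigned_shards", 0) == 0
        and health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
    )
```

Calling ready_for_upgrade(fetch_health()) before each node is stopped gives a simple gate. Any status other than green should be investigated before you continue.
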
  2. Disable shard replication to prevent shard replicas from being created while nodes are being taken offline. This stops the movement of Lucene index segments on nodes in your cluster. You can disable shard replication by querying the _cluster/settings API endpoint:

    PUT "/_cluster/settings?pretty"
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

    The response should look similar to the following example:

    {
      "acknowledged" : true,
      "persistent" : {
        "cluster" : {
          "routing" : {
            "allocation" : {
              "enable" : "primaries"
            }
          }
        }
      },
      "transient" : { }
    }
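
The settings change in step 2, and its reversal in step 10, differ only in the value assigned to cluster.routing.allocation.enable, so one helper can serve both. This is a hedged sketch: BASE_URL, allocation_request, and set_allocation are illustrative names rather than part of any OpenSearch client, and the request assumes an unsecured HTTP endpoint.

```python
import json
import urllib.request

# Assumption: address of any reachable node; adjust for TLS and authentication.
BASE_URL = "http://localhost:9200"

# Values accepted by the cluster.routing.allocation.enable setting.
VALID_MODES = {"all", "primaries", "new_primaries", "none"}


def allocation_request(mode: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the PUT request that sets shard allocation to the given mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported allocation mode: {mode!r}")
    body = json.dumps(
        {"persistent": {"cluster.routing.allocation.enable": mode}}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def set_allocation(mode: str, base_url: str = BASE_URL) -> bool:
    """Send the request and report whether the cluster acknowledged it."""
    with urllib.request.urlopen(allocation_request(mode, base_url)) as resp:
        return json.load(resp).get("acknowledged") is True
```

Under these assumptions, set_allocation("primaries") corresponds to step 2 and set_allocation("all") to step 10.
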
  3. Perform a flush operation on the cluster to commit transaction log entries to the Lucene index:

    POST "/_flush?pretty"

    The response should look similar to the following example:

    {
      "_shards" : {
        "total" : 4,
        "successful" : 4,
        "failed" : 0
      }
    }
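
A script can inspect the flush response from step 3 before moving on. The sketch below only examines the _shards counters shown in the example response; the function name flush_succeeded is an assumption made for illustration.

```python
def flush_succeeded(response: dict) -> bool:
    """Return True when the flush response reports at least one success and no failures."""
    shards = response.get("_shards", {})
    return shards.get("failed", 0) == 0 and shards.get("successful", 0) > 0
```

If this returns False, it is probably safer to retry the flush and investigate than to stop a node.
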
  4. Review your cluster and identify the first node to upgrade. Upgrade nodes that are eligible to be cluster manager last: OpenSearch nodes can join a cluster whose cluster manager nodes are running an older version, but they cannot join a cluster in which all cluster manager nodes are running a newer version.

  5. Query the _cat/nodes endpoint to identify which node was promoted to cluster manager. The following command includes additional query parameters that request only the name, version, node.role, and master headers. Note that OpenSearch 1.x versions use the term “master,” which has been deprecated and replaced by “cluster_manager” in OpenSearch 2.x and later.

    GET "/_cat/nodes?v&h=name,version,node.role,master" | column -t

    The response should look similar to the following example:

    name        version  node.role  master
    os-node-01  7.10.2   dimr       -
    os-node-04  7.10.2   dimr       -
    os-node-03  7.10.2   dimr       -
    os-node-02  7.10.2   dimr       *
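
The ordering rule from steps 4 and 5 can be derived from the _cat/nodes output. The following sketch makes several assumptions: node names contain no spaces, the letter m in node.role marks a cluster manager-eligible node, and the currently elected cluster manager (marked with *) goes last among the eligible nodes. That final tie-break is a choice made here, not something the procedure requires. The function names are illustrative.

```python
def parse_cat_nodes(text: str) -> list[dict]:
    """Parse whitespace-separated _cat/nodes output (with a header row) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def upgrade_order(nodes: list[dict]) -> list[str]:
    """Order node names: ineligible nodes first, the elected cluster manager last."""

    def rank(node: dict) -> tuple:
        eligible = "m" in node["node.role"]  # assumption: "m" marks eligibility
        elected = node["master"] == "*"
        return (eligible, elected)

    # sorted() is stable, so nodes with equal rank keep their listed order.
    return [node["name"] for node in sorted(nodes, key=rank)]
```

With the example output above, every node is eligible, so only os-node-02 moves: it is placed last because it is the elected cluster manager.
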
  6. Stop the node you are upgrading. If the node runs in a container, do not delete the volume associated with the container when you delete the container. The new OpenSearch container will use the existing volume, and deleting the volume will result in data loss.

  7. Confirm that the associated node has been dismissed from the cluster by querying the _cat/nodes API endpoint:

    GET "/_cat/nodes?v&h=name,version,node.role,master" | column -t

    The response should look similar to the following example:

    name        version  node.role  master
    os-node-02  7.10.2   dimr       *
    os-node-04  7.10.2   dimr       -
    os-node-03  7.10.2   dimr       -

    os-node-01 is no longer listed because the container has been stopped and deleted.
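
A node may take a few seconds to leave the cluster, or to rejoin it in step 9, so an automated run usually polls rather than checking once. The helper below is a sketch under stated assumptions: list_node_names is a hypothetical callable that you supply, returning the set of node names currently reported by _cat/nodes, and the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Callable, Set


def wait_for_membership(
    list_node_names: Callable[[], Set[str]],
    node: str,
    present: bool,
    attempts: int = 30,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll until the node is present (or absent) in the cluster; False on give-up."""
    for attempt in range(attempts):
        if (node in list_node_names()) == present:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # pause before asking the cluster again
    return False
```

Pass present=False after stopping a node and present=True after deploying its replacement.
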

  8. Deploy a new container running the desired version of OpenSearch and mapped to the same volume as the container you deleted.

  9. Query the _cat/nodes endpoint after OpenSearch is running on the new node to confirm that it has joined the cluster:

    GET "/_cat/nodes?v&h=name,version,node.role,master" | column -t

    The response should look similar to the following example:

    name        version  node.role  master
    os-node-02  7.10.2   dimr       *
    os-node-04  7.10.2   dimr       -
    os-node-01  7.10.2   dimr       -
    os-node-03  7.10.2   dimr       -

    In the example output, the new OpenSearch node reports a running version of 7.10.2 to the cluster. This is the result of compatibility.override_main_response_version, which is used when connecting to a cluster with legacy clients that check for a version. You can manually confirm the version of the node by calling the /_nodes API endpoint, as in the following command. Replace <nodeName> with the name of your node. See Nodes API to learn more.

    GET "/_nodes/<nodeName>?pretty=true" | jq -r '.nodes | .[] | "\(.name) v\(.version)"'

    The response should look similar to the following example:

    os-node-01 v1.3.7
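
If jq is not available, the same extraction can be done on the parsed /_nodes response. This sketch assumes the response has already been decoded into a dictionary; node_versions is an illustrative name.

```python
def node_versions(nodes_response: dict) -> list[str]:
    """Mirror the jq filter: one "<name> v<version>" string per node in the response."""
    return [
        f"{info['name']} v{info['version']}"
        for info in nodes_response.get("nodes", {}).values()
    ]
```

The nodes object is keyed by node ID, which is why the function iterates over its values rather than its keys.
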
  10. Reenable shard replication:

    PUT "/_cluster/settings?pretty"
    {
      "persistent": {
        "cluster.routing.allocation.enable": "all"
      }
    }

    The response should look similar to the following example:

    {
      "acknowledged" : true,
      "persistent" : {
        "cluster" : {
          "routing" : {
            "allocation" : {
              "enable" : "all"
            }
          }
        }
      },
      "transient" : { }
    }
  11. Confirm that the cluster is healthy:

    GET "/_cluster/health?pretty"

    The response should look similar to the following example:

    {
      "cluster_name" : "opensearch-dev-cluster",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 4,
      "number_of_data_nodes" : 4,
      "discovered_master" : true,
      "active_primary_shards" : 1,
      "active_shards" : 4,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 100.0
    }
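
After replication is reenabled, the cluster may need some time to return to green. Rather than polling, the health request in step 11 can ask the cluster to wait by using the wait_for_status and timeout query parameters of the cluster health API. The sketch below only builds the URL; the base address and the 60s default are assumptions, and health_wait_url is an illustrative name.

```python
from urllib.parse import urlencode


def health_wait_url(base_url: str, status: str = "green", timeout: str = "60s") -> str:
    """Build a _cluster/health URL that blocks until the status is reached or the timeout expires."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url}/_cluster/health?{query}"
```

The timed_out field shown in the example response indicates whether the wait expired before the requested status was reached.
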
  12. Repeat steps 2 through 11 for each node in your cluster. Remember to upgrade an eligible cluster manager node last. After replacing the last node, query the _cat/nodes endpoint to confirm that all nodes have joined the cluster. The cluster is now bootstrapped to the new version of OpenSearch. You can verify the cluster version by querying the _cat/nodes API endpoint:

    GET "/_cat/nodes?v&h=name,version,node.role,master" | column -t

    The response should look similar to the following example:

    name        version  node.role  master
    os-node-04  1.3.7    dimr       -
    os-node-02  1.3.7    dimr       *
    os-node-01  1.3.7    dimr       -
    os-node-03  1.3.7    dimr       -
  13. The upgrade is now complete, and you can begin enjoying the latest features and fixes!
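
The final version check from step 12 can also be expressed as a small function. This sketch assumes the node list was retrieved as JSON, for example by adding format=json to the _cat/nodes request, and parsed into a list of dictionaries; nodes_not_on_version is an illustrative name.

```python
def nodes_not_on_version(nodes: list[dict], expected: str) -> list[str]:
    """Return the sorted names of nodes whose reported version differs from expected."""
    return sorted(node["name"] for node in nodes if node.get("version") != expected)
```

An empty list means that every node reports the expected version. A node that has already been upgraded can still report 7.10.2 here when compatibility.override_main_response_version is in effect, as described in step 9, so a non-empty result is worth cross-checking against the /_nodes API.
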