Upgrade from v1.1.0/v1.1.1 to v1.1.2

General information

Once an upgradable version is available, the Harvester GUI Dashboard page shows an upgrade button. For more details, please refer to Start an upgrade.

For an air-gapped environment upgrade, please refer to Prepare an air-gapped upgrade.
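If you prefer to check for an upgradable version from the command line, the sketch below lists the version and upgrade objects Harvester tracks. It assumes the versions.harvesterhci.io and upgrades.harvesterhci.io resources live in the harvester-system namespace; adjust the names to your setup:

  # List the upgradable versions Harvester currently knows about (assumed resource/namespace)
  kubectl get versions.harvesterhci.io -n harvester-system

  # List upgrade objects and their current state (assumed resource/namespace)
  kubectl get upgrades.harvesterhci.io -n harvester-system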

Known Issues


1. An upgrade is stuck when pre-draining a node

Starting from v1.1.0, Harvester waits for all volumes to become healthy (when the node count is >= 3) before upgrading a node. If an upgrade is stuck in the “pre-draining” state, start by checking the volumes’ health.

Visit “Access Embedded Longhorn” to see how to access the embedded Longhorn GUI.

You can also check the pre-drain job logs. Please refer to Phase 4: Upgrade nodes in the troubleshooting guide.
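If you prefer the command line over the Longhorn GUI, the following sketch prints the robustness of every attached volume; it mirrors the command used for issue 6 below and assumes yq is available on the node:

  # All attached volumes should report "healthy" before pre-draining can finish
  kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached") | .status.robustness'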


2. An upgrade is stuck when pre-draining a node (case 2)

An upgrade is stuck, as shown in the screenshot below:

[Figure 1: screenshot of an upgrade stuck when pre-draining a node]

You can also observe that multiple nodes have the SchedulingDisabled status:

  $ kubectl get nodes
  NAME    STATUS                     ROLES                       AGE   VERSION
  node1   Ready                      control-plane,etcd,master   20d   v1.24.7+rke2r1
  node2   Ready,SchedulingDisabled   control-plane,etcd,master   20d   v1.24.7+rke2r1
  node3   Ready,SchedulingDisabled   control-plane,etcd,master   20d   v1.24.7+rke2r1
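To see what might be blocking the drain on a cordoned node, you can list the pods still scheduled on it. This is a general diagnostic sketch rather than an official workaround; replace node2 with the name of the stuck node:

  # List every pod still running on the cordoned node (node name is an example)
  kubectl get pods -A --field-selector spec.nodeName=node2 -o wide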

3. An upgrade is stuck in upgrading the first node: Job was active longer than the specified deadline

An upgrade fails, as shown in the screenshot below:

[Figure 2: screenshot of the failed upgrade]
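This message is the standard reason Kubernetes reports when a Job runs longer than its activeDeadlineSeconds. As a hedged starting point for diagnosis (the namespace below is an assumption and the upgrade job names differ per upgrade), you can look for failed jobs and read their logs:

  # Look for failed upgrade-related jobs (namespace is an assumption)
  kubectl get jobs -n harvester-system

  # Read the logs of a suspicious job; replace <job-name> with the actual name
  kubectl logs job/<job-name> -n harvester-system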


4. After an upgrade, a fleet bundle’s status is ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]

There is a chance that a fleet-managed bundle's status will be ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress] after an upgrade. To check if this has happened, run the following command:

  kubectl get bundles -A

If you see the following output, it’s possible that your cluster has hit the issue:

  NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
  fleet-local   fleet-agent-local                             0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
  fleet-local   local-managed-system-agent                    1/1
  fleet-local   mcc-harvester                                 1/1
  fleet-local   mcc-harvester-crd                             1/1
  fleet-local   mcc-local-managed-system-upgrade-controller   1/1
  fleet-local   mcc-rancher-logging                           1/1
  fleet-local   mcc-rancher-logging-crd                       1/1
  fleet-local   mcc-rancher-monitoring                        1/1
  fleet-local   mcc-rancher-monitoring-crd                    1/1
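To see more detail about why the bundle is stuck, you can inspect its status conditions. The sketch below reuses the bundle name from the example output above; it is a diagnostic aid, not the documented fix:

  # Show the full status of the stuck bundle, including its conditions
  kubectl get bundle fleet-agent-local -n fleet-local -o yaml

  # Or just the human-readable events and condition messages
  kubectl describe bundle fleet-agent-local -n fleet-local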

5. An upgrade stops because it can’t retrieve the harvester-release.yaml file

An upgrade is stopped with the Get "http://upgrade-repo-hvst-upgrade-mldzx.harvester-system/harvester-iso/harvester-release.yaml": context deadline exceeded (Client.Timeout exceeded while awaiting headers) message:

[Figure 3: screenshot of the error message]

This issue is fixed in v1.1.2. For v1.1.0 and v1.1.1 users, the workaround is to start over the upgrade. Please refer to Start over an upgrade.
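Before starting over, you can confirm that the upgrade repository service referenced in the error is actually unavailable. This is a hedged diagnostic sketch; the upgrade-repo names are taken from the example error message and will differ for your upgrade:

  # Check whether the upgrade repo pod and service exist and are ready
  kubectl get pods,svc -n harvester-system | grep upgrade-repo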


6. An upgrade is stuck in the Pre-drained state

You might see an upgrade is stuck in the “pre-drained” state:

[Figure 4: screenshot of an upgrade stuck in the pre-drained state]

This could be caused by a misconfigured PodDisruptionBudget (PDB). To check if that’s the case, perform the following steps:

  1. Assume the stuck node is harvester-node-1.

  2. Check the instance-manager-e or instance-manager-r pod names on the stuck node:

    $ kubectl get pods -n longhorn-system --field-selector spec.nodeName=harvester-node-1 | grep instance-manager
    instance-manager-r-d4ed2788   1/1   Running   0   3d8h

    The output above shows that the instance-manager-r-d4ed2788 pod is on the node.

  3. Check Rancher logs and verify that the instance-manager-e or instance-manager-r pod can’t be drained:

    $ kubectl logs deployment/rancher -n cattle-system
    ...
    2023-03-28T17:10:52.199575910Z 2023/03/28 17:10:52 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-4f8cb698b24a,custom-a0f714579def
    2023-03-28T17:10:55.034453029Z evicting pod longhorn-system/instance-manager-r-d4ed2788
    2023-03-28T17:10:55.080933607Z error when evicting pods/"instance-manager-r-d4ed2788" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

  4. Run the command to check if there is a PDB associated with the stuck node:

    $ kubectl get pdb -n longhorn-system -o yaml | yq '.items[] | select(.spec.selector.matchLabels."longhorn.io/node"=="harvester-node-1") | .metadata.name'
    instance-manager-r-466e3c7f

  5. Check which node owns the instance manager associated with this PDB:

    $ kubectl get instancemanager instance-manager-r-466e3c7f -n longhorn-system -o yaml | yq -e '.spec.nodeID'
    harvester-node-2

    If the output doesn’t match the stuck node (in this example output, harvester-node-2 doesn’t match the stuck node harvester-node-1), then the cluster has hit this issue.

  6. Before applying the workaround, check if all volumes are healthy:

    kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached") | .status.robustness'

    All reported values should be healthy. If this is not the case, you might want to uncordon nodes to make the volumes healthy again (see the sketch after these steps).

  7. Remove the misconfigured PDB:

    kubectl delete pdb instance-manager-r-466e3c7f -n longhorn-system
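The sketch below uses the same example names to show how you might uncordon a node while waiting for volumes to recover (step 6) and how to confirm the misconfigured PDB is gone after step 7. These are generic kubectl commands, not an officially documented procedure:

  # Uncordon a node so Longhorn can rebuild its replicas; replace <node-name> accordingly
  kubectl uncordon <node-name>

  # Confirm the misconfigured PDB was removed (example PDB name from above)
  kubectl get pdb -n longhorn-system | grep instance-manager-r-466e3c7f

  # Watch node status; the stuck node should finish draining and the upgrade should continue
  kubectl get nodes -w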