从 v1.1.0/v1.1.1 升级到 v1.1.2
- 通用信息
- 已知问题

从 v1.1.0/v1.1.1 升级到 v1.1.2

危险

如果你的机器具有 Intel E810 网卡，请不要将正在运行的集群升级到 v1.1.2。有用户报告网卡在添加到 bonding 设备时会出现问题。请检查此 issue 以获取更多信息：https://github.com/harvester/harvester/issues/3860。

通用信息

一旦有了可升级的版本，Harvester GUI Dashboard 页面将显示一个升级按钮。有关详细信息，请参阅开始升级。

对于离线环境升级，请参阅准备离线升级。

已知问题

1. 升级卡在预清空节点状态

从 v1.1.0 开始，Harvester 将等待所有卷状态都是健康（节点数量 >= 3 时）后再升级节点。通常，如果升级卡在 “pre-draining” 状态，你可以检查卷的运行状况。

访问 “Access Embedded Longhorn” 了解如何访问嵌入式 Longhorn GUI。

你还可以检查预清空作业日志。请参考故障排除指南中的阶段 4：升级节点。

2. 升级卡在预清空节点状态（情况 2）

升级卡在下图所示的状态：

此外，你还观察到多个节点的状态为 SchedulingDisabled。

$ kubectl get nodes
NAME    STATUS                     ROLES                       AGE   VERSION
node1   Ready                      control-plane,etcd,master   20d   v1.24.7+rke2r1
node2   Ready,SchedulingDisabled   control-plane,etcd,master   20d   v1.24.7+rke2r1
node3   Ready,SchedulingDisabled   control-plane,etcd,master   20d   v1.24.7+rke2r1

相关 issue：
- [BUG] Multiple nodes pre-drains in an upgrade
解决方法：
- https://github.com/harvester/harvester/issues/3216#issuecomment-1328607004

3. 升级卡在第一个节点：Job was active longer than specified deadline

升级失败，如下截图所示：

相关 issue：
- [BUG] Upgrade stuck in upgrading first node: Job was active longer than specified deadline
解决方法：
- https://github.com/harvester/harvester/issues/2894#issuecomment-1274069690

4. 升级后，Fleet 包的状态为 `ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]`

升级后，Fleet 管理的包的状态可能是 ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]。要检查是否发生了这种情况，请运行以下命令：

kubectl get bundles -A

如果你看到以下输出，你的集群可能遇到了该问题：

NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   local-managed-system-agent                    1/1
fleet-local   mcc-harvester                                 1/1
fleet-local   mcc-harvester-crd                             1/1
fleet-local   mcc-local-managed-system-upgrade-controller   1/1
fleet-local   mcc-rancher-logging                           1/1
fleet-local   mcc-rancher-logging-crd                       1/1
fleet-local   mcc-rancher-monitoring                        1/1
fleet-local   mcc-rancher-monitoring-crd                    1/1

相关 issue：
- [BUG] Harvester single node upgrade will get another operation (install/upgrade/rollback) is in progress error
解决方法：
- https://github.com/harvester/harvester/issues/3616#issuecomment-1489892688

5. 无法检索 harvester-release.yaml 文件导致升级停止

升级停止，错误信息为 Get "http://upgrade-repo-hvst-upgrade-mldzx.harvester-system/harvester-iso/harvester-release.yaml": context deadline exceeded (Client.Timeout exceeded while awaiting headers)。

我们已在 v1.1.2 中修复此问题。v1.1.0 和 v1.1.1 用户可以通过重新开始升级来解决该问题。请参阅重新开始升级。

相关 issue：
- https://github.com/harvester/harvester/issues/3729
解决方法：
- 重新开始升级

6. 升级卡在 Pre-drained 状态

你可能会看到升级卡在“Pre-drained”状态：

这可能是由错误配置的 PDB 引起的。要检查是否属于这种情况，请执行以下步骤：

假设卡住的节点是 harvester-node-1。

检查节点上的 instance-manager-e 或 instance-manager-r pod 名称：

$ kubectl get pods -n longhorn-system --field-selector spec.nodeName=harvester-node-1 | grep instance-manager
instance-manager-r-d4ed2788          1/1     Running   0              3d8h

上方的输出表示 instance-manager-r-d4ed2788 pod 在节点上。

检查 Rancher 日志并验证 instance-manager-e 或 instance-manager-r pod 不会被清空：

$ kubectl logs deployment/rancher -n cattle-system
...
2023-03-28T17:10:52.199575910Z 2023/03/28 17:10:52 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-4f8cb698b24a,custom-a0f714579def
2023-03-28T17:10:55.034453029Z evicting pod longhorn-system/instance-manager-r-d4ed2788
2023-03-28T17:10:55.080933607Z error when evicting pods/"instance-manager-r-d4ed2788" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

运行以下命令，检查是否存在与卡住的节点关联的 PDB：

$ kubectl get pdb -n longhorn-system -o yaml | yq '.items[] | select(.spec.selector.matchLabels."longhorn.io/node"=="harvester-node-1") | .metadata.name'
instance-manager-r-466e3c7f

检查此 PDB 的实例管理器的所有者：
```
$ kubectl get instancemanager instance-manager-r-466e3c7f -n longhorn-system -o yaml | yq -e '.spec.nodeID'
harvester-node-2
```
如果输出与卡住节点不匹配（在此示例输出中，即 harvester-node-2 与卡住节点 harvester-node-1 不匹配），我们可以认为这个问题发生了。
在应用解决方法之前，请检查所有卷是否正常：
```
kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached")| .status.robustness'
```
输出应该都是 healthy。如果不是这种情况，你可能需要取消封锁节点来让卷恢复 healthy 状态。

删除配置错误的 PDB：

kubectl delete pdb instance-manager-r-466e3c7f -n longhorn-system

相关 issue：
- [BUG] 3 Node AirGapped Cluster Upgrade Stuck v1.1.0->v1.1.2-rc4

从 v1.1.0/v1.1.1 升级到 v1.1.2

从 v1.1.0/v1.1.1 升级到 v1.1.2

通用信息

已知问题

1. 升级卡在预清空节点状态

2. 升级卡在预清空节点状态（情况 2）

3. 升级卡在第一个节点：Job was active longer than specified deadline

4. 升级后，Fleet 包的状态为 ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]

5. 无法检索 harvester-release.yaml 文件导致升级停止

6. 升级卡在 Pre-drained 状态

4. 升级后，Fleet 包的状态为 `ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]`