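Upgrade from v1.0.2 to v1.0.3
General information
The Dashboard page of the Harvester GUI has an upgrade button. For details, see Start an upgrade.
For upgrades in an air-gapped environment, see Prepare an air-gapped upgrade.
Known issues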
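1. Fail to download the upgrade image
Description
The download of the upgrade image never completes, or it fails with an error.
Related issues
Workaround
Delete the current upgrade and start over. See Start over an upgrade.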
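2. An upgrade is stuck, a node is in “Pre-drained” state (case 1)
Description
Users might see a node stuck in the Pre-drained state for a long time (more than 30 minutes).
This might be caused by instance-manager-r-* pods on the node harvester-z7j2g that cannot be drained. To verify this:
Check the Rancher server logs: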
kubectl logs deployment/rancher -n cattle-system
Example output:
error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-r-10dd59c4
error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-r-10dd59c4
error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod longhorn-system/instance-manager-r-10dd59c4
error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
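The Rancher log repeats the same eviction error every few seconds, so it can help to reduce it to the distinct pods that are blocked. The following is a minimal sketch, not part of the official procedure: blocked_pods is a hypothetical helper name, and it assumes the error lines keep the wording shown in the example output above.

```shell
# Hypothetical helper: reads Rancher server logs on stdin and prints every
# pod whose eviction was rejected as "<namespace>/<pod>", without duplicates.
# Assumes the 'error when evicting pods/"<pod>" -n "<namespace>"' wording
# seen in the example output.
blocked_pods() {
  sed -n 's|.*error when evicting pods/"\([^"]*\)" -n "\([^"]*\)".*|\2/\1|p' | sort -u
}
```

It could be used as: kubectl logs deployment/rancher -n cattle-system | blocked_pods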
Verify that the pod longhorn-system/instance-manager-r-10dd59c4 is on the stuck node:
kubectl get pod instance-manager-r-10dd59c4 -n longhorn-system -o=jsonpath='{.spec.nodeName}'
Example output:
harvester-z7j2g
Check for degraded volumes:
kubectl get volumes -n longhorn-system
Example output:
NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE              AGE
pvc-08c34593-8225-4be6-9899-10a978df6ea1   attached   healthy      True        10485760      harvester-279l2   3d13h
pvc-526600f5-bde2-4244-bb8e-7910385cbaeb   attached   healthy      True        21474836480   harvester-x9jqw   3d1h
pvc-7b3fc2c3-30eb-48b8-8a98-11913f8314c2   attached   healthy      True        10737418240   harvester-x9jqw   3d
pvc-8065ed6c-a077-472c-920e-5fe9eacff96e   attached   healthy      True        21474836480   harvester-x9jqw   3d
pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599   attached   degraded     True        10737418240   harvester-x9jqw   2d23h
pvc-9a6539b8-44e5-430e-9b24-ea8290cb13b7   attached   healthy      True        53687091200   harvester-x9jqw   3d13h
We can see that the volume pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 is degraded.
Note
Users need to check all degraded volumes, not only the first one found.
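On a cluster with many volumes, picking out the degraded ones by eye is error-prone. As a rough sketch, the listing can be filtered with awk. The helper name degraded_volumes is hypothetical, and the filter assumes the column order shown in the example output, where ROBUSTNESS is the third column.

```shell
# Hypothetical helper: reads the output of
# "kubectl get volumes -n longhorn-system" on stdin and prints the names of
# the volumes whose ROBUSTNESS column (assumed to be column 3) is "degraded".
degraded_volumes() {
  awk 'NR > 1 && $3 == "degraded" { print $1 }'
}
```

It could be used as: kubectl get volumes -n longhorn-system | degraded_volumes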
Check the replica state of the degraded volume:
kubectl get replicas -n longhorn-system --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -o json | jq '.items[] | {replica: .metadata.name, healthyAt: .spec.healthyAt, nodeID: .spec.nodeID, state: .status.currentState}'
Example output:
{
  "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246",
  "healthyAt": "2022-07-25T07:33:16Z",
  "nodeID": "harvester-z7j2g",
  "state": "running"
}
{
  "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-22974d0f",
  "healthyAt": "",
  "nodeID": "harvester-279l2",
  "state": "running"
}
{
  "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5",
  "healthyAt": "",
  "nodeID": "harvester-x9jqw",
  "state": "stopped"
}
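The only healthy replica here (the only one with a non-empty healthyAt) is pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246, and it resides on the node harvester-z7j2g. This confirms that the instance-manager-r-* pod on harvester-z7j2g hosts the volume's only healthy replica, so evicting it would violate the pod disruption budget and the drain cannot proceed.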
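Related issues
Workaround
We need to start the replica that is in the “Stopped” state. In the previous example, the stopped replica is named pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5.
Check the Longhorn manager logs, which should show a replica waiting for a backing image. First, get the name of the manager pod on the node that hosts the stopped replica: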
kubectl get pods -n longhorn-system --selector app=longhorn-manager --field-selector spec.nodeName=harvester-x9jqw
Example output:
NAME READY STATUS RESTARTS AGE
longhorn-manager-zmfbw 1/1 Running 0 3d10h
Get the pod logs:
kubectl logs longhorn-manager-zmfbw -n longhorn-system | grep pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5
Example output:
(...)
time="2022-07-28T04:35:34Z" level=debug msg="Prepare to create instance pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"
time="2022-07-28T04:35:34Z" level=debug msg="Replica pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5 is waiting for backing image harvester-system-harvester-iso-n7bxh downloading file to node harvester-x9jqw disk 3830342d-c13d-4e55-ac74-99cad529e9d4, the current state is in-progress" controller=longhorn-replica dataPath= node=harvester-x9jqw nodeID=harvester-x9jqw ownerID=harvester-x9jqw replica=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5
time="2022-07-28T04:35:34Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5\", UID:\"c511630f-2fe2-4cf9-97a4-21bce73782b1\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"632926\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"
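The relevant detail in these logs is buried in a long debug line. The following sketch extracts the backing image name, node, and disk UUID from such lines. The helper name waiting_backing_images is hypothetical, and the pattern assumes the exact message wording shown in the example output, which may differ between Longhorn versions.

```shell
# Hypothetical helper: reads Longhorn manager logs on stdin and prints
# "<backing-image> <node> <disk-uuid>" for every replica that is reported
# as waiting for a backing image, without duplicates.
# Assumes the "is waiting for backing image ... downloading file to node
# ... disk ...," wording seen in the example output.
waiting_backing_images() {
  sed -n 's/.*is waiting for backing image \([^ ]*\) downloading file to node \([^ ]*\) disk \([^,]*\),.*/\1 \2 \3/p' | sort -u
}
```

It could be used as: kubectl logs longhorn-manager-zmfbw -n longhorn-system | waiting_backing_images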
From the log output, we can determine that the replica is waiting for the backing image harvester-system-harvester-iso-n7bxh. Get the disk file status map from the backing image:
kubectl describe backingimage harvester-system-harvester-iso-n7bxh -n longhorn-system
Example output:
(...)
  Disk File Status Map:
    3830342d-c13d-4e55-ac74-99cad529e9d4:
      Last State Transition Time:  2022-07-25T08:30:34Z
      Message:
      Progress:                    29
      State:                       in-progress
    3aa804e1-229d-4141-8816-1f6a7c6c3096:
      Last State Transition Time:  2022-07-25T08:33:20Z
      Message:
      Progress:                    100
      State:                       ready
    92726efa-bfb3-478e-8553-3206ad34ce70:
      Last State Transition Time:  2022-07-28T04:31:49Z
      Message:
      Progress:                    100
      State:                       ready
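When a backing image is spread over many disks, the entries that are not ready can be filtered out of the describe output. This is a minimal sketch under the assumption that the status map is printed as above, with each disk UUID on its own line followed by its State field. The helper name stuck_disk_files is hypothetical.

```shell
# Hypothetical helper: reads "kubectl describe backingimage ..." output on
# stdin and prints "<disk-uuid> <state>" for every disk file whose state is
# not "ready". A line is treated as a disk UUID when it is a single
# 36-character field of hex digits and dashes ending in a colon.
stuck_disk_files() {
  awk '
    NF == 1 && $1 ~ /^[0-9a-f-]+:$/ {
      candidate = substr($1, 1, length($1) - 1)
      if (length(candidate) == 36) uuid = candidate
    }
    $1 == "State:" && uuid != "" && $2 != "ready" { print uuid, $2 }
  '
}
```

It could be used as: kubectl describe backingimage harvester-system-harvester-iso-n7bxh -n longhorn-system | stuck_disk_files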
The disk file with the UUID 3830342d-c13d-4e55-ac74-99cad529e9d4 is stuck in the in-progress state. Next, we need to find the backing-image-manager pod that holds this disk file:
kubectl get pod -n longhorn-system --selector=longhorn.io/disk-uuid=3830342d-c13d-4e55-ac74-99cad529e9d4
Example output:
NAME READY STATUS RESTARTS AGE
backing-image-manager-c00e-3830 1/1 Running 0 3d1h
Restart the backing-image-manager by deleting its pod:
kubectl delete pod -n longhorn-system backing-image-manager-c00e-3830
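The lookup and the delete can also be folded into one call, because kubectl delete accepts the same label selector as kubectl get. The following is a sketch rather than an official procedure: the function name and the KUBECTL override variable are hypothetical, and the longhorn.io/disk-uuid label is taken from the lookup command above. Note that it deletes every pod matching the label.

```shell
# Hypothetical helper: deletes the backing-image-manager pod(s) labeled with
# the given disk UUID so that they are recreated.
# KUBECTL is an assumed override hook; set KUBECTL=echo to print the command
# instead of running it.
restart_backing_image_manager() {
  disk_uuid=$1
  if [ -z "$disk_uuid" ]; then
    echo "usage: restart_backing_image_manager <disk-uuid>" >&2
    return 1
  fi
  ${KUBECTL:-kubectl} delete pod -n longhorn-system \
    --selector="longhorn.io/disk-uuid=${disk_uuid}"
}
```

A dry run could look like: KUBECTL=echo restart_backing_image_manager 3830342d-c13d-4e55-ac74-99cad529e9d4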
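3. An upgrade is stuck, a node is in “Pre-drained” state (case 2)
Description
Users might see a node stuck in the Pre-drained state for a long time (more than 30 minutes).
Follow these steps to verify whether this issue occurred:
Visit the Longhorn GUI: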
https://{{VIP}}/k8s/clusters/local/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/#/volume
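(replace VIP with the appropriate value) and check for degraded volumes. A degraded volume might contain only one healthy replica (shown with a blue background), and that healthy replica resides on the “Pre-drained” node. Hover over the red scheduled icon to see the reason, which is toomanysnapshots.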
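Related issues
Workaround
In the “Snapshots and Backup” panel, toggle the “Show System Hidden” switch and delete the latest system snapshot (the one immediately before the “Volume Head”).
The volume then continues rebuilding, which allows the upgrade to resume.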