Troubleshooting upgrade

For a high-level overview of the upgrade lifecycle and its components, please refer to the related documentation.

Rancher side

In this example we upgraded the cluster nodes with the following ManagedOSImage definition:

    apiVersion: elemental.cattle.io/v1beta1
    kind: ManagedOSImage
    metadata:
      name: my-upgrade
      namespace: fleet-default
    spec:
      # Set to the new Elemental version you would like to upgrade to or track the latest tag
      osImage: "registry.suse.com/rancher/elemental-teal/5.4:latest"
      clusterTargets:
      - clusterName: my-cluster
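
If you are working from the command line instead of the Rancher UI, one way to apply this definition is to save it to a file and apply it against the Rancher management cluster (the file name below is illustrative):

    # Apply the ManagedOSImage to the Rancher management cluster
    kubectl apply -f managedosimage.yaml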

Once the ManagedOSImage is applied, the elemental-operator will verify it and generate a related Bundle.
The Bundle name is the ManagedOSImage name prefixed with mos-; in this case, mos-my-upgrade.
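
To locate the generated Bundle, you can list the Bundles in the fleet-default namespace:

    kubectl -n fleet-default get bundles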

In the Bundle definition, you will find the details about the upgrade plan and the desired target.
For example:

    kubectl -n fleet-default get bundle mos-my-upgrade -o yaml

Example

    apiVersion: fleet.cattle.io/v1alpha1
    kind: Bundle
    metadata:
      creationTimestamp: "2023-06-16T09:01:47Z"
      generation: 1
      name: mos-my-upgrade
      namespace: fleet-default
      ownerReferences:
      - apiVersion: elemental.cattle.io/v1beta1
        controller: true
        kind: ManagedOSImage
        name: my-upgrade
        uid: e468ed21-23bb-487a-a022-dbc7ef753720
      resourceVersion: "1038645"
      uid: 35e83fc4-28c8-4b10-8059-cae6cdff2cda
    spec:
      resources:
      - content: '{"kind":"ClusterRole","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"os-upgrader-my-upgrade","creationTimestamp":null},"rules":[{"verbs":["update","get","list","watch","patch"],"apiGroups":[""],"resources":["nodes"]},{"verbs":["list"],"apiGroups":[""],"resources":["pods"]}]}'
        name: ClusterRole--os-upgrader-my-upgrade-296a3abf3451.yaml
      - content: '{"kind":"ClusterRoleBinding","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"os-upgrader-my-upgrade","creationTimestamp":null},"subjects":[{"kind":"ServiceAccount","name":"os-upgrader-my-upgrade","namespace":"cattle-system"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"os-upgrader-my-upgrade"}}'
        name: ClusterRoleBinding--os-upgrader-my-upgrade-f63eaecde935.yaml
      - content: '{"kind":"ServiceAccount","apiVersion":"v1","metadata":{"name":"os-upgrader-my-upgrade","namespace":"cattle-system","creationTimestamp":null}}'
        name: ServiceAccount-cattle-system-os-upgrader-my-upgrade-ce93d-01096.yaml
      - content: '{"kind":"Secret","apiVersion":"v1","metadata":{"name":"os-upgrader-my-upgrade","namespace":"cattle-system","creationTimestamp":null},"data":{"cloud-config":""}}'
        name: Secret-cattle-system-os-upgrader-my-upgrade-a997ee6a67ef.yaml
      - content: '{"kind":"Plan","apiVersion":"upgrade.cattle.io/v1","metadata":{"name":"os-upgrader-my-upgrade","namespace":"cattle-system","creationTimestamp":null},"spec":{"concurrency":1,"nodeSelector":{},"serviceAccountName":"os-upgrader-my-upgrade","version":"latest","secrets":[{"name":"os-upgrader-my-upgrade","path":"/run/data"}],"tolerations":[{"operator":"Exists"}],"cordon":true,"upgrade":{"image":"registry.suse.com/rancher/elemental-teal/5.4","command":["/usr/sbin/suc-upgrade"]}},"status":{}}'
        name: Plan-cattle-system-os-upgrader-my-upgrade-273c2c09afca.yaml
      targets:
      - clusterName: my-cluster
      .
      .
      .
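
Each entry in spec.resources embeds a full Kubernetes manifest as a JSON string. If you only want a quick overview of which resources the Bundle carries, one possible jsonpath query is:

    # List the names of the resources embedded in the Bundle
    kubectl -n fleet-default get bundle mos-my-upgrade \
      -o jsonpath='{range .spec.resources[*]}{.name}{"\n"}{end}'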

Elemental Cluster side

Any Elemental Teal node correctly registered and part of the target cluster will fetch the bundle and start applying it.
This operation is performed by Rancher's system-upgrade-controller running on the Elemental Cluster.
To monitor the correct operation of this controller, you can read its logs:

    kubectl -n cattle-system logs deployment/system-upgrade-controller

If everything is correct, the system-upgrade-controller will create an upgrade Plan on the cluster:

    kubectl -n cattle-system get plans
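
To inspect a specific Plan in detail, for example the one generated for this page's example:

    kubectl -n cattle-system get plan os-upgrader-my-upgrade -o yaml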

For each Plan, the controller will orchestrate the jobs that will apply it on each targeted node.
The job names will use the Plan name (os-upgrader-my-upgrade) and the target machine hostname (my-host) for easy discoverability.
For example: apply-os-upgrader-my-upgrade-on-my-host-7a25e
You can monitor these jobs with:

    kubectl -n cattle-system get jobs

Each job will use a privileged: true container with the Elemental Teal image specified in the ManagedOSImage definition. This container will try to upgrade the system and perform a reboot.
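
If you want to double-check what a job is actually going to run, you can read the image and the privileged flag from its pod template. A sketch, using the example job name from this page (substitute one from your own kubectl get jobs output):

    # Container image used by the upgrade job
    kubectl -n cattle-system get job apply-os-upgrader-my-upgrade-on-my-host-7a25e \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

    # Whether the upgrade container runs privileged
    kubectl -n cattle-system get job apply-os-upgrader-my-upgrade-on-my-host-7a25e \
      -o jsonpath='{.spec.template.spec.containers[0].securityContext.privileged}{"\n"}'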

If a job fails, you can investigate by examining its logs:

    kubectl -n cattle-system logs job.batch/apply-os-upgrader-my-upgrade-on-my-host-7a25e

Two-stage job process

Note that the upgrade process is performed in two stages.
You will notice that the same job is run twice: the first run ends in the Unknown status and will not complete.
This is expected, as Elemental Teal relies on the job being run again after the machine restarts, so that it can verify the new version was installed correctly.
The second run of the job then completes correctly.

    kubectl -n cattle-system get jobs
    NAMESPACE       NAME                                            COMPLETIONS   DURATION   AGE
    cattle-system   apply-os-upgrader-my-upgrade-on-my-host-0b392   1/1           2m34s      6m23s
    cattle-system   apply-os-upgrader-my-upgrade-on-my-host-7a25e   0/1           6m23s      6m23s
    kubectl -n cattle-system get pods
    NAME                                            READY   STATUS      RESTARTS   AGE
    apply-os-upgrader-my-upgrade-on-my-host-zbkrh   0/1     Completed   0          9m40s
    apply-os-upgrader-my-upgrade-on-my-host-zvrff   0/1     Unknown     0          12m
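
Once the second run has completed, you can optionally confirm that the nodes report the expected OS image in the OS-IMAGE column:

    kubectl get nodes -o wide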

Recovering from failures

It is possible that the ManagedOSImage upgrade process fails, leaving one or more nodes in a faulty state.
For example, if the image to upgrade to is missing from the registry or is otherwise broken, the upgrade job running on the downstream clusters will not succeed.

When this is the case, the nodes running the failing upgrade job will stay cordoned.
You can update the ManagedOSImage with a working osImage, or alternatively delete it to stop any further upgrade attempts.
In any case, to restore the affected nodes and allow subsequent upgrades to be scheduled, they need to be uncordoned manually, as shown in the sketch below.
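
A sketch of the recovery steps, assuming the failing ManagedOSImage is the my-upgrade example from this page and the affected node is named my-host (the managedosimage resource name is assumed to be the lowercase form registered by the elemental-operator CRD):

    # On the Rancher management cluster: optionally delete the failing
    # ManagedOSImage to stop further upgrade attempts
    kubectl -n fleet-default delete managedosimage my-upgrade

    # On the affected Elemental cluster: cordoned nodes show up as SchedulingDisabled
    kubectl get nodes

    # Uncordon each affected node so following upgrades can be scheduled again
    kubectl uncordon my-host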