Topology Aware Lifecycle Manager for cluster updates

You can use the Topology Aware Lifecycle Manager (TALM) to manage the software lifecycle of multiple single-node OpenShift clusters. TALM uses Red Hat Advanced Cluster Management (RHACM) policies to perform changes on the target clusters.

About the Topology Aware Lifecycle Manager configuration

The Topology Aware Lifecycle Manager (TALM) manages the deployment of Red Hat Advanced Cluster Management (RHACM) policies for one or more OKD clusters. Using TALM in a large network of clusters allows the phased rollout of policies to the clusters in limited batches. This helps to minimize possible service disruptions when updating. With TALM, you can control the following actions:

  • The timing of the update

  • The number of RHACM-managed clusters

  • The subset of managed clusters to apply the policies to

  • The update order of the clusters

  • The set of policies remediated to the cluster

  • The order of policies remediated to the cluster

  • The assignment of a canary cluster

For single-node OpenShift, the Topology Aware Lifecycle Manager (TALM) can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications.

TALM supports the orchestration of the OKD y-stream and z-stream updates, and day-two operations on y-streams and z-streams.

About managed policies used with Topology Aware Lifecycle Manager

The Topology Aware Lifecycle Manager (TALM) uses RHACM policies for cluster updates.

TALM can be used to manage the rollout of any policy CR where the remediationAction field is set to inform. Supported use cases include the following:

  • Manual user creation of policy CRs

  • Automatically generated policies from the PolicyGenTemplate custom resource definition (CRD)

For policies that update an Operator subscription with manual approval, TALM provides additional functionality that approves the installation of the updated Operator.
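For such policies, the Operator subscription on the managed cluster is typically configured with manual InstallPlan approval, so that the update proceeds only when TALM approves it. A minimal sketch of such a Subscription (the Operator name and channel are illustrative placeholders, not taken from this document):

```yaml
# Illustrative Subscription with manual InstallPlan approval.
# The Operator name and channel below are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator-subscription
  namespace: openshift-operators
spec:
  channel: "stable"
  name: example-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual   # TALM approves the resulting InstallPlan
```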

For more information about managed policies, see Policy Overview in the RHACM documentation.

For more information about the PolicyGenTemplate CRD, see the “About the PolicyGenTemplate” section in “Deploying distributed units at scale in a disconnected environment”.

Installing the Topology Aware Lifecycle Manager by using the web console

You can use the OKD web console to install the Topology Aware Lifecycle Manager.

Prerequisites

  • Install the latest version of the RHACM Operator.

  • Set up a hub cluster with a disconnected registry.

  • Log in as a user with cluster-admin privileges.

Procedure

  1. In the OKD web console, navigate to Operators → OperatorHub.

  2. Search for the Topology Aware Lifecycle Manager from the list of available Operators, and then click Install.

  3. Keep the default selection of Installation mode [“All namespaces on the cluster (default)”] and Installed Namespace (“openshift-operators”) to ensure that the Operator is installed properly.

  4. Click Install.

Verification

To confirm that the installation is successful:

  1. Navigate to the Operators → Installed Operators page.

  2. Check that the Operator is installed in the All Namespaces namespace and its status is Succeeded.

If the Operator is not installed successfully:

  1. Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.

  2. Navigate to the Workloads → Pods page and check the logs in any containers in the cluster-group-upgrades-controller-manager pod that are reporting issues.

Installing the Topology Aware Lifecycle Manager by using the CLI

You can use the OpenShift CLI (oc) to install the Topology Aware Lifecycle Manager (TALM).

Prerequisites

  • Install the OpenShift CLI (oc).

  • Install the latest version of the RHACM Operator.

  • Set up a hub cluster with a disconnected registry.

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a Subscription CR:

    1. Define the Subscription CR and save the YAML file, for example, talm-subscription.yaml:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-topology-aware-lifecycle-manager-subscription
        namespace: openshift-operators
      spec:
        channel: "stable"
        name: topology-aware-lifecycle-manager
        source: redhat-operators
        sourceNamespace: openshift-marketplace
    2. Create the Subscription CR by running the following command:

      $ oc create -f talm-subscription.yaml

Verification

  1. Verify that the installation succeeded by inspecting the CSV resource:

    $ oc get csv -n openshift-operators

    Example output

    NAME                                      DISPLAY                            VERSION   REPLACES   PHASE
    topology-aware-lifecycle-manager.4.12.x   Topology Aware Lifecycle Manager   4.12.x               Succeeded
  2. Verify that the TALM is up and running:

    $ oc get deploy -n openshift-operators

    Example output

    NAMESPACE             NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
    openshift-operators   cluster-group-upgrades-controller-manager   1/1     1            1           14s

About the ClusterGroupUpgrade CR

The Topology Aware Lifecycle Manager (TALM) builds the remediation plan from the ClusterGroupUpgrade CR for a group of clusters. You can define the following specifications in a ClusterGroupUpgrade CR:

  • Clusters in the group

  • Blocking ClusterGroupUpgrade CRs

  • Applicable list of managed policies

  • Number of concurrent updates

  • Applicable canary updates

  • Actions to perform before and after the update

  • Update timing

You can control the start time of an update using the enable field in the ClusterGroupUpgrade CR. For example, if you have a scheduled maintenance window of four hours, you can prepare a ClusterGroupUpgrade CR with the enable field set to false.

You can set the timeout by configuring the spec.remediationStrategy.timeout setting as follows:

  spec:
    remediationStrategy:
      maxConcurrency: 1
      timeout: 240

You can use the batchTimeoutAction field to determine what happens if an update fails for a cluster. You can specify continue to skip the failing cluster and continue to upgrade other clusters, or abort to stop policy remediation for all clusters. When the timeout elapses, TALM removes all enforce policies to ensure that no further updates are made to clusters.
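Expressed in the CR, the two settings described above sit at the spec level; a minimal sketch with illustrative values:

```yaml
spec:
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240                  # overall timeout, in minutes
  batchTimeoutAction: continue    # or abort, to stop remediation for all clusters
```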

To apply the changes, set the enable field to true.

For more information, see the “Applying update policies to managed clusters” section.

As TALM works through remediation of the policies to the specified clusters, the ClusterGroupUpgrade CR can report true or false statuses for a number of conditions.

After TALM completes a cluster update, the cluster does not update again under the control of the same ClusterGroupUpgrade CR. You must create a new ClusterGroupUpgrade CR in the following cases:

  • When you need to update the cluster again

  • When the cluster changes to non-compliant with the inform policy after being updated

Selecting clusters

TALM builds a remediation plan and selects clusters based on the following fields:

  • The clusterLabelSelector field specifies the labels of the clusters that you want to update. It consists of a list of the standard label selectors from k8s.io/apimachinery/pkg/apis/meta/v1. Each selector in the list uses either label value pairs or label expressions. Matches from each selector are added to the final list of clusters, along with the matches from the clusterSelector field and the clusters field.

  • The clusters field specifies a list of clusters to update.

  • The canaries field specifies the clusters for canary updates.

  • The maxConcurrency field specifies the number of clusters to update in a batch.

You can use the clusters, clusterLabelSelector, and clusterSelector fields together to create a combined list of clusters.

The remediation plan starts with the clusters listed in the canaries field. Each canary cluster forms a single-cluster batch.
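A minimal sketch combining these selection fields (cluster names and labels are illustrative); spoke1 is remediated first as a single-cluster canary batch:

```yaml
spec:
  clusters:
  - spoke1
  - spoke2
  clusterLabelSelectors:
  - matchLabels:
      group: group-du      # illustrative label
  remediationStrategy:
    canaries:
    - spoke1
    maxConcurrency: 2
```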

Sample ClusterGroupUpgrade CR with the enabled field set to false

  apiVersion: ran.openshift.io/v1alpha1
  kind: ClusterGroupUpgrade
  metadata:
    creationTimestamp: '2022-11-18T16:27:15Z'
    finalizers:
    - ran.openshift.io/cleanup-finalizer
    generation: 1
    name: talm-cgu
    namespace: talm-namespace
    resourceVersion: '40451823'
    uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusters: (1)
    - spoke1
    enable: false (2)
    managedPolicies: (3)
    - talm-policy
    preCaching: false
    remediationStrategy: (4)
      canaries: (5)
      - spoke1
      maxConcurrency: 2 (6)
      timeout: 240
    clusterLabelSelectors: (7)
    - matchExpressions:
      - key: label1
        operator: In
        values:
        - value1a
        - value1b
    batchTimeoutAction: (8)
  status: (9)
    computedMaxConcurrency: 2
    conditions:
    - lastTransitionTime: '2022-11-18T16:27:15Z'
      message: All selected clusters are valid
      reason: ClusterSelectionCompleted
      status: 'True'
      type: ClustersSelected (10)
    - lastTransitionTime: '2022-11-18T16:27:15Z'
      message: Completed validation
      reason: ValidationCompleted
      status: 'True'
      type: Validated (11)
    - lastTransitionTime: '2022-11-18T16:37:16Z'
      message: Not enabled
      reason: NotEnabled
      status: 'False'
      type: Progressing
    managedPoliciesForUpgrade:
    - name: talm-policy
      namespace: talm-namespace
    managedPoliciesNs:
      talm-policy: talm-namespace
    remediationPlan:
    - - spoke1
    - - spoke2
      - spoke3
    status:
1 Defines the list of clusters to update.
2 The enable field is set to false.
3 Lists the user-defined set of policies to remediate.
4 Defines the specifics of the cluster updates.
5 Defines the clusters for canary updates.
6 Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters, plus the number of remaining clusters divided by the maxConcurrency value, rounded up. Clusters that are already compliant with all the managed policies are excluded from the remediation plan.
7 Displays the parameters for selecting clusters.
8 Controls what happens if a batch times out. Possible values are abort or continue. If unspecified, the default is continue.
9 Displays information about the status of the updates.
10 The ClustersSelected condition shows that all selected clusters are valid.
11 The Validated condition shows that all selected clusters have been validated.

Any failure during the update of a canary cluster stops the update process.

When the remediation plan is successfully created, you can set the enable field to true, and TALM starts to update the non-compliant clusters with the specified managed policies.

You can only make changes to the spec fields if the enable field of the ClusterGroupUpgrade CR is set to false.
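As a worked example of the batch calculation from the maxConcurrency callout in the sample above (the cluster counts below are hypothetical): one canary cluster plus five remaining clusters with maxConcurrency: 2 yields four batches.

```shell
# Hypothetical counts, for illustration only.
canaries=1
remaining=5          # non-canary, non-compliant clusters
max_concurrency=2

# Each canary forms a single-cluster batch; the remaining clusters are
# grouped into batches of at most max_concurrency (ceiling division).
batches=$(( canaries + (remaining + max_concurrency - 1) / max_concurrency ))
echo "$batches"      # prints 4
```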

Validating

TALM checks that all specified managed policies are available and correct, and uses the Validated condition to report the status and reasons as follows:

  • true

    Validation is completed.

  • false

    Policies are missing or invalid, or an invalid platform image has been specified.

Pre-caching

Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed. You can use pre-caching to avoid this. The container image pre-caching starts when you create a ClusterGroupUpgrade CR with the preCaching field set to true.
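Pre-caching is requested in the CR itself; a minimal sketch, with illustrative cluster and policy names. Pre-caching begins while the CR is still disabled:

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: precache-example      # illustrative name
  namespace: default
spec:
  clusters:
  - spoke1
  enable: false               # pre-caching runs before the update is enabled
  managedPolicies:
  - example-policy            # illustrative policy name
  preCaching: true
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
```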

TALM uses the PrecacheSpecValid condition to report status information as follows:

  • true

    The pre-caching spec is valid and consistent.

  • false

    The pre-caching spec is incomplete.

TALM uses the PrecachingSucceeded condition to report status information as follows:

  • true

    TALM has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.

  • false

    Pre-caching is still in progress for one or more clusters or has failed for all clusters.

For more information, see the “Using the container image pre-cache feature” section.

Creating a backup

For single-node OpenShift, TALM can create a backup of a deployment before an update. If the update fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications. To use the backup feature, first create a ClusterGroupUpgrade CR with the backup field set to true. To ensure that the contents of the backup are up to date, the backup is not taken until you set the enable field in the ClusterGroupUpgrade CR to true.
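Combining the two fields described above, the relevant spec fragment looks like this sketch (values illustrative); the backup runs only after enable is switched to true:

```yaml
spec:
  backup: true    # request a pre-update backup on each single-node OpenShift cluster
  enable: false   # the backup is deferred until this field is set to true
```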

TALM uses the BackupSucceeded condition to report the status and reasons as follows:

  • true

    Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update fails for that cluster but proceeds for all other clusters.

  • false

    Backup is still in progress for one or more clusters or has failed for all clusters.

For more information, see the “Creating a backup of cluster resources before upgrade” section.

Updating clusters

TALM enforces the policies following the remediation plan. Enforcing the policies for subsequent batches starts immediately after all the clusters of the current batch are compliant with all the managed policies. If the batch times out, TALM moves on to the next batch. The timeout value of a batch is the spec.timeout field divided by the number of batches in the remediation plan.
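As a worked example of the per-batch timeout rule above (the numbers are hypothetical): a 240-minute spec.timeout with a 4-batch remediation plan gives each batch 60 minutes.

```shell
# Hypothetical values, for illustration only.
timeout=240          # spec.remediationStrategy.timeout, in minutes
num_batches=4        # batches in the remediation plan

# Per-batch timeout is the overall timeout divided by the number of batches.
batch_timeout=$(( timeout / num_batches ))
echo "$batch_timeout"    # prints 60
```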

TALM uses the Progressing condition to report the status and reasons as follows:

  • true

    TALM is remediating non-compliant policies.

  • false

    The update is not in progress. Possible reasons for this are:

    • All clusters are compliant with all the managed policies.

    • The update has timed out as policy remediation took too long.

    • Blocking CRs are missing from the system or have not yet completed.

    • The ClusterGroupUpgrade CR is not enabled.

    • Backup is still in progress.

The managed policies apply in the order that they are listed in the managedPolicies field in the ClusterGroupUpgrade CR. One managed policy is applied to the specified clusters at a time. When a cluster complies with the current policy, the next managed policy is applied to it.
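Because the order of the managedPolicies list is significant, the policies in the sketch below (names illustrative) are remediated top to bottom, one at a time per cluster:

```yaml
spec:
  managedPolicies:                # applied in list order, one policy at a time
  - example-version-policy        # illustrative name: remediated first
  - example-operator-sub-policy   # illustrative name: applied once the first is compliant
```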

Sample ClusterGroupUpgrade CR in the Progressing state

  apiVersion: ran.openshift.io/v1alpha1
  kind: ClusterGroupUpgrade
  metadata:
    creationTimestamp: '2022-11-18T16:27:15Z'
    finalizers:
    - ran.openshift.io/cleanup-finalizer
    generation: 1
    name: talm-cgu
    namespace: talm-namespace
    resourceVersion: '40451823'
    uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusters:
    - spoke1
    enable: true
    managedPolicies:
    - talm-policy
    preCaching: true
    remediationStrategy:
      canaries:
      - spoke1
      maxConcurrency: 2
      timeout: 240
    clusterLabelSelectors:
    - matchExpressions:
      - key: label1
        operator: In
        values:
        - value1a
        - value1b
    batchTimeoutAction:
  status:
    clusters:
    - name: spoke1
      state: complete
    computedMaxConcurrency: 2
    conditions:
    - lastTransitionTime: '2022-11-18T16:27:15Z'
      message: All selected clusters are valid
      reason: ClusterSelectionCompleted
      status: 'True'
      type: ClustersSelected
    - lastTransitionTime: '2022-11-18T16:27:15Z'
      message: Completed validation
      reason: ValidationCompleted
      status: 'True'
      type: Validated
    - lastTransitionTime: '2022-11-18T16:37:16Z'
      message: Remediating non-compliant policies
      reason: InProgress
      status: 'True'
      type: Progressing (1)
    managedPoliciesForUpgrade:
    - name: talm-policy
      namespace: talm-namespace
    managedPoliciesNs:
      talm-policy: talm-namespace
    remediationPlan:
    - - spoke1
    - - spoke2
      - spoke3
    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        spoke2:
          state: Completed
        spoke3:
          policyIndex: 0
          state: InProgress
      currentBatchStartedAt: '2022-11-18T16:27:16Z'
      startedAt: '2022-11-18T16:27:15Z'
1 The Progressing fields show that TALM is in the process of remediating policies.

Update status

TALM uses the Succeeded condition to report the status and reasons as follows:

  • true

    All clusters are compliant with the specified managed policies.

  • false

    Policy remediation failed as there were no clusters available for remediation, or because policy remediation took too long for one of the following reasons:

    • The current batch contains canary updates and the cluster in the batch does not comply with all the managed policies within the batch timeout.

    • Clusters did not comply with the managed policies within the timeout value specified in the remediationStrategy field.

Sample ClusterGroupUpgrade CR in the Succeeded state

  apiVersion: ran.openshift.io/v1alpha1
  kind: ClusterGroupUpgrade
  metadata:
    name: cgu-upgrade-complete
    namespace: default
  spec:
    clusters:
    - spoke1
    - spoke4
    enable: true
    managedPolicies:
    - policy1-common-cluster-version-policy
    - policy2-common-pao-sub-policy
    remediationStrategy:
      maxConcurrency: 1
      timeout: 240
  status: (3)
    clusters:
    - name: spoke1
      state: complete
    - name: spoke4
      state: complete
    conditions:
    - message: All selected clusters are valid
      reason: ClusterSelectionCompleted
      status: "True"
      type: ClustersSelected
    - message: Completed validation
      reason: ValidationCompleted
      status: "True"
      type: Validated
    - message: All clusters are compliant with all the managed policies
      reason: Completed
      status: "False"
      type: Progressing (1)
    - message: All clusters are compliant with all the managed policies
      reason: Completed
      status: "True"
      type: Succeeded (2)
    managedPoliciesForUpgrade:
    - name: policy1-common-cluster-version-policy
      namespace: default
    - name: policy2-common-pao-sub-policy
      namespace: default
    remediationPlan:
    - - spoke1
    - - spoke4
    status:
      completedAt: '2022-11-18T16:27:16Z'
      startedAt: '2022-11-18T16:27:15Z'
1 In the Progressing fields, the status is false as the update has completed; clusters are compliant with all the managed policies.
2 The Succeeded fields show that the update completed successfully.
3 The status field includes a list of clusters and their respective statuses. The status of a cluster can be complete or timedout.

Sample ClusterGroupUpgrade CR in the timedout state

  apiVersion: ran.openshift.io/v1alpha1
  kind: ClusterGroupUpgrade
  metadata:
    creationTimestamp: '2022-11-18T16:27:15Z'
    finalizers:
    - ran.openshift.io/cleanup-finalizer
    generation: 1
    name: talm-cgu
    namespace: talm-namespace
    resourceVersion: '40451823'
    uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusters:
    - spoke1
    - spoke2
    enable: true
    managedPolicies:
    - talm-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 2
      timeout: 240
  status:
    clusters:
    - name: spoke1
      state: complete
    - currentPolicy: (1)
        name: talm-policy
        status: NonCompliant
      name: spoke2
      state: timedout
    computedMaxConcurrency: 2
    conditions:
    - lastTransitionTime: '2022-11-18T16:27:15Z'
      message: All selected clusters are valid
      reason: ClusterSelectionCompleted
      status: 'True'
      type: ClustersSelected
    - lastTransitionTime: '2022-11-18T16:27:15Z'
      message: Completed validation
      reason: ValidationCompleted
      status: 'True'
      type: Validated
    - lastTransitionTime: '2022-11-18T16:37:16Z'
      message: Policy remediation took too long
      reason: TimedOut
      status: 'False'
      type: Progressing
    - lastTransitionTime: '2022-11-18T16:37:16Z'
      message: Policy remediation took too long
      reason: TimedOut
      status: 'False'
      type: Succeeded (2)
    managedPoliciesForUpgrade:
    - name: talm-policy
      namespace: talm-namespace
    managedPoliciesNs:
      talm-policy: talm-namespace
    remediationPlan:
    - - spoke1
      - spoke2
    status:
      startedAt: '2022-11-18T16:27:15Z'
      completedAt: '2022-11-18T20:27:15Z'
1 If a cluster’s state is timedout, the currentPolicy field shows the name of the policy and the policy status.
2 The status for Succeeded is false and the message indicates that policy remediation took too long.

Blocking ClusterGroupUpgrade CRs

You can create multiple ClusterGroupUpgrade CRs and control their order of application.

For example, if you create ClusterGroupUpgrade CR C that blocks the start of ClusterGroupUpgrade CR A, then ClusterGroupUpgrade CR A cannot start until the status of ClusterGroupUpgrade CR C becomes UpgradeComplete.

One ClusterGroupUpgrade CR can have multiple blocking CRs. In this case, all the blocking CRs must complete before the upgrade for the current CR can start.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).

  • Provision one or more managed clusters.

  • Log in as a user with cluster-admin privileges.

  • Create RHACM policies in the hub cluster.

Procedure

  1. Save the content of the ClusterGroupUpgrade CRs in the cgu-a.yaml, cgu-b.yaml, and cgu-c.yaml files.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-a
      namespace: default
    spec:
      blockingCRs: (1)
      - name: cgu-c
        namespace: default
      clusters:
      - spoke1
      - spoke2
      - spoke3
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      remediationStrategy:
        canaries:
        - spoke1
        maxConcurrency: 2
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR is not enabled
        reason: UpgradeNotStarted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      placementBindings:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      placementRules:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      remediationPlan:
      - - spoke1
      - - spoke2
    1 Defines the blocking CRs. The cgu-a update cannot start until cgu-c is complete.
    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-b
      namespace: default
    spec:
      blockingCRs: (1)
      - name: cgu-a
        namespace: default
      clusters:
      - spoke4
      - spoke5
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR is not enabled
        reason: UpgradeNotStarted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke4
      - - spoke5
      status: {}
    1 The cgu-b update cannot start until cgu-a is complete.
    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-c
      namespace: default
    spec: (1)
      clusters:
      - spoke6
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR is not enabled
        reason: UpgradeNotStarted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      managedPoliciesCompliantBeforeUpgrade:
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke6
      status: {}
    1 The cgu-c update does not have any blocking CRs. TALM starts the cgu-c update when the enable field is set to true.
  2. Create the ClusterGroupUpgrade CRs by running the following command for each relevant CR:

    $ oc apply -f <name>.yaml
  3. Start the update process by running the following command for each relevant CR:

    $ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/<name> \
    --type merge -p '{"spec":{"enable":true}}'

    The following examples show ClusterGroupUpgrade CRs where the enable field is set to true:

    Example for cgu-a with blocking CRs

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-a
      namespace: default
    spec:
      blockingCRs:
      - name: cgu-c
        namespace: default
      clusters:
      - spoke1
      - spoke2
      - spoke3
      enable: true
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      remediationStrategy:
        canaries:
        - spoke1
        maxConcurrency: 2
        timeout: 240
    status:
      conditions:
      - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet
          completed: [cgu-c]' (1)
        reason: UpgradeCannotStart
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      placementBindings:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      placementRules:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      remediationPlan:
      - - spoke1
      - - spoke2
      status: {}
    1 Shows the list of blocking CRs.

    Example for cgu-b with blocking CRs

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-b
      namespace: default
    spec:
      blockingCRs:
      - name: cgu-a
        namespace: default
      clusters:
      - spoke4
      - spoke5
      enable: true
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet
          completed: [cgu-a]' (1)
        reason: UpgradeCannotStart
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke4
      - - spoke5
      status: {}
    1 Shows the list of blocking CRs.

    Example for cgu-c without blocking CRs

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-c
      namespace: default
    spec:
      clusters:
      - spoke6
      enable: true
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant (1)
        reason: UpgradeNotCompleted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      managedPoliciesCompliantBeforeUpgrade:
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke6
      status:
        currentBatch: 1
        remediationPlanForBatch:
          spoke6: 0
    1 The cgu-c update does not have any blocking CRs.

Update policies on managed clusters

The Topology Aware Lifecycle Manager (TALM) remediates a set of inform policies for the clusters specified in the ClusterGroupUpgrade CR. TALM remediates inform policies by making enforce copies of the managed RHACM policies. Each copied policy has its own corresponding RHACM placement rule and RHACM placement binding.

One by one, TALM adds each cluster from the current batch to the placement rule that corresponds with the applicable managed policy. If a cluster is already compliant with a policy, TALM skips applying that policy on the compliant cluster. TALM then moves on to applying the next policy to the non-compliant cluster. After TALM completes the updates in a batch, all clusters are removed from the placement rules associated with the copied policies. Then, the update of the next batch starts.
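The batching described above can be sketched in a few lines. This is an illustrative sketch with a hypothetical helper name, not TALM's actual implementation:

```python
def build_remediation_plan(clusters, max_concurrency):
    """Split the ordered cluster list into update batches of at most
    max_concurrency clusters, mirroring the remediationPlan status field."""
    return [clusters[i:i + max_concurrency]
            for i in range(0, len(clusters), max_concurrency)]

# With four clusters and maxConcurrency: 2, TALM reports two batches:
plan = build_remediation_plan(["spoke1", "spoke2", "spoke5", "spoke6"], 2)
# plan == [['spoke1', 'spoke2'], ['spoke5', 'spoke6']]
```

This matches the `remediationPlan` shown in the status examples in this section, where clusters are listed batch by batch.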

If a spoke cluster does not report any compliant state to RHACM, the managed policies on the hub cluster can be missing status information that TALM needs. TALM handles these cases in the following ways:

  • If a policy’s status.compliant field is missing, TALM ignores the policy and adds a log entry. Then, TALM continues looking at the policy’s status.status field.

  • If a policy’s status.status is missing, TALM produces an error.

  • If a cluster’s compliance status is missing in the policy’s status.status field, TALM considers that cluster to be non-compliant with that policy.
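The rules above can be summarized in a short sketch. This is a hypothetical helper for illustration, not TALM source code; it assumes the RHACM policy status shape, where `status.status` is a list of per-cluster entries with `clustername` and `compliant` fields:

```python
def cluster_compliance(policy_status, cluster):
    """Determine a cluster's compliance from a policy's .status dict,
    following the rules above. Raises if status.status is missing."""
    if "compliant" not in policy_status:
        # Missing status.compliant: log and fall through to status.status.
        print("warning: policy has no status.compliant field")
    per_cluster = policy_status.get("status")
    if per_cluster is None:
        raise ValueError("policy has no status.status field")
    for entry in per_cluster:
        if entry.get("clustername") == cluster:
            return entry.get("compliant")
    # A cluster missing from status.status is treated as non-compliant.
    return "NonCompliant"
```

For example, with `{"compliant": "NonCompliant", "status": [{"clustername": "spoke1", "compliant": "Compliant"}]}`, the helper returns `Compliant` for `spoke1` and `NonCompliant` for a cluster that reports no status.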

The ClusterGroupUpgrade CR’s batchTimeoutAction determines what happens if an upgrade fails for a cluster. You can specify continue to skip the failing cluster and continue to upgrade other clusters, or specify abort to stop the policy remediation for all clusters. Once the timeout elapses, TALM removes all enforce policies to ensure that no further updates are made to clusters.
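The batchTimeoutAction semantics can be sketched as follows. This is an illustrative sketch of the documented behavior (hypothetical helper, not TALM code):

```python
def handle_batch_timeout(batch_timeout_action, failing, remaining_clusters):
    """Decide which clusters still get remediated after a batch times out."""
    if batch_timeout_action == "abort":
        return []  # stop policy remediation for all remaining clusters
    # Default "continue": skip the failing clusters, keep upgrading the rest.
    return [c for c in remaining_clusters if c not in failing]
```

With `continue`, only the failing clusters drop out of the plan; with `abort`, remediation stops for every remaining cluster.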

For more information about RHACM policies, see Policy overview.

Additional resources

For more information about the PolicyGenTemplate CRD, see About the PolicyGenTemplate CRD.

Applying update policies to managed clusters

You can update your managed clusters by applying your policies.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).

  • Provision one or more managed clusters.

  • Log in as a user with cluster-admin privileges.

  • Create RHACM policies in the hub cluster.

Procedure

  1. Save the contents of the ClusterGroupUpgrade CR in the cgu-1.yaml file.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-1
      namespace: default
    spec:
      managedPolicies: (1)
      - policy1-common-cluster-version-policy
      - policy2-common-nto-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      enable: false
      clusters: (2)
      - spoke1
      - spoke2
      - spoke5
      - spoke6
      remediationStrategy:
        maxConcurrency: 2 (3)
        timeout: 240 (4)
      batchTimeoutAction: (5)
    (1) The names of the policies to apply.
    (2) The list of clusters to update.
    (3) The maxConcurrency field specifies the number of clusters that are updated at the same time.
    (4) The update timeout in minutes.
    (5) Controls what happens if a batch times out. Possible values are abort or continue. If unspecified, the default is continue.
  2. Create the ClusterGroupUpgrade CR by running the following command:

    $ oc create -f cgu-1.yaml
    1. Check if the ClusterGroupUpgrade CR was created in the hub cluster by running the following command:

      $ oc get cgu --all-namespaces

      Example output

      NAMESPACE   NAME    AGE    STATE        DETAILS
      default     cgu-1   8m55   NotEnabled   Not Enabled
    2. Check the status of the update by running the following command:

      $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

      Example output

      {
        "computedMaxConcurrency": 2,
        "conditions": [
          {
            "lastTransitionTime": "2022-02-25T15:34:07Z",
            "message": "Not enabled", (1)
            "reason": "NotEnabled",
            "status": "False",
            "type": "Progressing"
          }
        ],
        "copiedPolicies": [
          "cgu-policy1-common-cluster-version-policy",
          "cgu-policy2-common-nto-sub-policy",
          "cgu-policy3-common-ptp-sub-policy",
          "cgu-policy4-common-sriov-sub-policy"
        ],
        "managedPoliciesContent": {
          "policy1-common-cluster-version-policy": "null",
          "policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]",
          "policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]",
          "policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]"
        },
        "managedPoliciesForUpgrade": [
          {
            "name": "policy1-common-cluster-version-policy",
            "namespace": "default"
          },
          {
            "name": "policy2-common-nto-sub-policy",
            "namespace": "default"
          },
          {
            "name": "policy3-common-ptp-sub-policy",
            "namespace": "default"
          },
          {
            "name": "policy4-common-sriov-sub-policy",
            "namespace": "default"
          }
        ],
        "managedPoliciesNs": {
          "policy1-common-cluster-version-policy": "default",
          "policy2-common-nto-sub-policy": "default",
          "policy3-common-ptp-sub-policy": "default",
          "policy4-common-sriov-sub-policy": "default"
        },
        "placementBindings": [
          "cgu-policy1-common-cluster-version-policy",
          "cgu-policy2-common-nto-sub-policy",
          "cgu-policy3-common-ptp-sub-policy",
          "cgu-policy4-common-sriov-sub-policy"
        ],
        "placementRules": [
          "cgu-policy1-common-cluster-version-policy",
          "cgu-policy2-common-nto-sub-policy",
          "cgu-policy3-common-ptp-sub-policy",
          "cgu-policy4-common-sriov-sub-policy"
        ],
        "precaching": {
          "spec": {}
        },
        "remediationPlan": [
          [
            "spoke1",
            "spoke2"
          ],
          [
            "spoke5",
            "spoke6"
          ]
        ],
        "status": {}
      }
      (1) The spec.enable field in the ClusterGroupUpgrade CR is set to false.
    3. Check the status of the policies by running the following command:

      $ oc get policies -A

      Example output

      NAMESPACE   NAME                                        REMEDIATION ACTION   COMPLIANCE STATE   AGE
      default     cgu-policy1-common-cluster-version-policy   enforce                                 17m (1)
      default     cgu-policy2-common-nto-sub-policy           enforce                                 17m
      default     cgu-policy3-common-ptp-sub-policy           enforce                                 17m
      default     cgu-policy4-common-sriov-sub-policy         enforce                                 17m
      default     policy1-common-cluster-version-policy       inform               NonCompliant       15h
      default     policy2-common-nto-sub-policy               inform               NonCompliant       15h
      default     policy3-common-ptp-sub-policy               inform               NonCompliant       18m
      default     policy4-common-sriov-sub-policy             inform               NonCompliant       18m
      (1) The spec.remediationAction field of policies currently applied on the clusters is set to enforce. The managed policies in inform mode from the ClusterGroupUpgrade CR remain in inform mode during the update.
  3. Change the value of the spec.enable field to true by running the following command:

    $ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-1 \
      --patch '{"spec":{"enable":true}}' --type=merge

Verification

  1. Check the status of the update again by running the following command:

    $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

    Example output

    {
      "computedMaxConcurrency": 2,
      "conditions": [ (1)
        {
          "lastTransitionTime": "2022-02-25T15:33:07Z",
          "message": "All selected clusters are valid",
          "reason": "ClusterSelectionCompleted",
          "status": "True",
          "type": "ClustersSelected"
        },
        {
          "lastTransitionTime": "2022-02-25T15:33:07Z",
          "message": "Completed validation",
          "reason": "ValidationCompleted",
          "status": "True",
          "type": "Validated"
        },
        {
          "lastTransitionTime": "2022-02-25T15:34:07Z",
          "message": "Remediating non-compliant policies",
          "reason": "InProgress",
          "status": "True",
          "type": "Progressing"
        }
      ],
      "copiedPolicies": [
        "cgu-policy1-common-cluster-version-policy",
        "cgu-policy2-common-nto-sub-policy",
        "cgu-policy3-common-ptp-sub-policy",
        "cgu-policy4-common-sriov-sub-policy"
      ],
      "managedPoliciesContent": {
        "policy1-common-cluster-version-policy": "null",
        "policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]",
        "policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]",
        "policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]"
      },
      "managedPoliciesForUpgrade": [
        {
          "name": "policy1-common-cluster-version-policy",
          "namespace": "default"
        },
        {
          "name": "policy2-common-nto-sub-policy",
          "namespace": "default"
        },
        {
          "name": "policy3-common-ptp-sub-policy",
          "namespace": "default"
        },
        {
          "name": "policy4-common-sriov-sub-policy",
          "namespace": "default"
        }
      ],
      "managedPoliciesNs": {
        "policy1-common-cluster-version-policy": "default",
        "policy2-common-nto-sub-policy": "default",
        "policy3-common-ptp-sub-policy": "default",
        "policy4-common-sriov-sub-policy": "default"
      },
      "placementBindings": [
        "cgu-policy1-common-cluster-version-policy",
        "cgu-policy2-common-nto-sub-policy",
        "cgu-policy3-common-ptp-sub-policy",
        "cgu-policy4-common-sriov-sub-policy"
      ],
      "placementRules": [
        "cgu-policy1-common-cluster-version-policy",
        "cgu-policy2-common-nto-sub-policy",
        "cgu-policy3-common-ptp-sub-policy",
        "cgu-policy4-common-sriov-sub-policy"
      ],
      "precaching": {
        "spec": {}
      },
      "remediationPlan": [
        [
          "spoke1",
          "spoke2"
        ],
        [
          "spoke5",
          "spoke6"
        ]
      ],
      "status": {
        "currentBatch": 1,
        "currentBatchStartedAt": "2022-02-25T15:54:16Z",
        "remediationPlanForBatch": {
          "spoke1": 0,
          "spoke2": 1
        },
        "startedAt": "2022-02-25T15:54:16Z"
      }
    }
    (1) Reflects the update progress of the current batch. Run this command again to receive updated information about the progress.
  2. If the policies include Operator subscriptions, you can check the installation progress directly on the single-node cluster.

    1. Export the KUBECONFIG file of the single-node cluster you want to check the installation progress for by running the following command:

      $ export KUBECONFIG=<cluster_kubeconfig_absolute_path>
    2. Check all the subscriptions present on the single-node cluster and look for the one in the policy you are trying to install through the ClusterGroupUpgrade CR by running the following command:

      $ oc get subs -A | grep -i <subscription_name>

      Example output for cluster-logging policy

      NAMESPACE           NAME              PACKAGE           SOURCE             CHANNEL
      openshift-logging   cluster-logging   cluster-logging   redhat-operators   stable
  3. If one of the managed policies includes a ClusterVersion CR, check the status of platform updates in the current batch by running the following command against the spoke cluster:

    $ oc get clusterversion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.9.5     True        True          43s     Working towards 4.9.7: 71 of 735 done (9% complete)
  4. Check the Operator subscription by running the following command:

    $ oc get subs -n <operator_namespace> <operator_subscription> -ojsonpath="{.status}"
  5. Check the install plans present on the single-node cluster that is associated with the desired subscription by running the following command:

    $ oc get installplan -n <subscription_namespace>

    Example output for cluster-logging Operator

    NAMESPACE           NAME            CSV                       APPROVAL   APPROVED
    openshift-logging   install-6khtw   cluster-logging.5.3.3-4   Manual     true (1)
    (1) The install plans have their Approval field set to Manual and their Approved field changes from false to true after TALM approves the install plan.
  6. Check if the cluster service version for the Operator of the policy that the ClusterGroupUpgrade is installing reached the Succeeded phase by running the following command:

    $ oc get csv -n <operator_namespace>

    Example output for OpenShift Logging Operator

    NAME                    DISPLAY                     VERSION   REPLACES   PHASE
    cluster-logging.5.4.2   Red Hat OpenShift Logging   5.4.2                Succeeded

Creating a backup of cluster resources before upgrade

For single-node OpenShift, the Topology Aware Lifecycle Manager (TALM) can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications.

To use the backup feature, first create a ClusterGroupUpgrade CR with the backup field set to true. To ensure that the contents of the backup are up to date, the backup is not taken until you set the enable field in the ClusterGroupUpgrade CR to true.

TALM uses the BackupSucceeded condition to report the status and reasons as follows:

  • true

    Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update does not proceed for that cluster.

  • false

    Backup is still in progress for one or more clusters or has failed for all clusters. The backup process running in the spoke clusters can have the following statuses:

    • PreparingToStart

      The first reconciliation pass is in progress. TALM deletes any spoke backup namespace and hub view resources that were created in a previously failed upgrade attempt.

    • Starting

      The backup prerequisites and backup job are being created.

    • Active

      The backup is in progress.

    • Succeeded

      The backup succeeded.

    • BackupTimeout

      Artifact backup is partially done.

    • UnrecoverableError

      The backup has ended with a non-zero exit code.

If the backup of a cluster fails and enters the BackupTimeout or UnrecoverableError state, the cluster update does not proceed for that cluster. Updates to other clusters are not affected and continue.
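The per-cluster gating on the backup result can be sketched like this. This is a hypothetical helper for illustration, not TALM's implementation:

```python
def clusters_allowed_to_update(backup_status):
    """Only clusters whose backup succeeded proceed to policy remediation;
    a failed backup blocks that cluster without affecting the others."""
    return [cluster for cluster, state in backup_status.items()
            if state == "Succeeded"]

# With one succeeded and one failed backup, only the first cluster
# continues to the update; the other is skipped.
allowed = clusters_allowed_to_update(
    {"cnfdb1": "Succeeded", "cnfdb2": "UnrecoverableError"})
```

This mirrors the verification output later in this section, where the update proceeds for the cluster whose backup reports Succeeded.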

Creating a ClusterGroupUpgrade CR with backup

For single-node OpenShift, you can create a backup of a deployment before an upgrade. If the upgrade fails, you can use the upgrade-recovery.sh script generated by Topology Aware Lifecycle Manager (TALM) to return the system to its preupgrade state. The backup consists of the following items:

Cluster backup

A snapshot of etcd and static pod manifests.

Content backup

Backups of folders, for example, /etc, /usr/local, /var/lib/kubelet.

Changed files backup

Any files managed by machine-config that have been changed.

Deployment

A pinned ostree deployment.

Images (Optional)

Any container images that are in use.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).

  • Provision one or more managed clusters.

  • Log in as a user with cluster-admin privileges.

  • Install Red Hat Advanced Cluster Management (RHACM).

It is highly recommended that you create a recovery partition. The following is an example SiteConfig custom resource (CR) for a recovery partition of 50 GB:

  nodes:
  - hostName: snonode.sno-worker-0.e2e.bos.redhat.com
    role: master
    rootDeviceHints:
      hctl: 0:2:0:0
      deviceName: /dev/sda
    ...
    #Disk /dev/sda: 893.3 GiB, 959119884288 bytes, 1873281024 sectors
    diskPartition:
    - device: /dev/sda
      partitions:
      - mount_point: /var/recovery
        size: 51200
        start: 800000

Procedure

  1. Save the contents of the ClusterGroupUpgrade CR with the backup and enable fields set to true in the clustergroupupgrades-group-du.yaml file:

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: du-upgrade-4918
      namespace: ztp-group-du-sno
    spec:
      preCaching: true
      backup: true
      clusters:
      - cnfdb1
      - cnfdb2
      enable: true
      managedPolicies:
      - du-upgrade-platform-upgrade
      remediationStrategy:
        maxConcurrency: 2
        timeout: 240
  2. To start the update, apply the ClusterGroupUpgrade CR by running the following command:

    $ oc apply -f clustergroupupgrades-group-du.yaml

Verification

  • Check the status of the upgrade in the hub cluster by running the following command:

    $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'

    Example output

    {
      "backup": {
        "clusters": [
          "cnfdb2",
          "cnfdb1"
        ],
        "status": {
          "cnfdb1": "Succeeded",
          "cnfdb2": "Failed" (1)
        }
      },
      "computedMaxConcurrency": 1,
      "conditions": [
        {
          "lastTransitionTime": "2022-04-05T10:37:19Z",
          "message": "Backup failed for 1 cluster", (2)
          "reason": "PartiallyDone", (3)
          "status": "True", (4)
          "type": "Succeeded"
        }
      ],
      "precaching": {
        "spec": {}
      },
      "status": {}
    }
    (1) Backup has failed for one cluster.
    (2) The message confirms that the backup failed for one cluster.
    (3) The backup was partially successful.
    (4) The backup process has finished.

Recovering a cluster after a failed upgrade

If an upgrade of a cluster fails, you can manually log in to the cluster and use the backup to return the cluster to its preupgrade state. There are two stages:

Rollback

If the attempted upgrade included a change to the platform OS deployment, you must roll back to the previous version before running the recovery script.

Recovery

The recovery shuts down containers and uses files from the backup partition to relaunch containers and restore clusters.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).

  • Provision one or more managed clusters.

  • Install Red Hat Advanced Cluster Management (RHACM).

  • Log in as a user with cluster-admin privileges.

  • Run an upgrade that is configured for backup.

Procedure

  1. Delete the previously created ClusterGroupUpgrade custom resource (CR) by running the following command:

    $ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
  2. Log in to the cluster that you want to recover.

  3. Check the status of the platform OS deployment by running the following command:

    $ ostree admin status

    Example outputs

    [root@lab-test-spoke2-node-0 core]# ostree admin status
    * rhcos c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9.0
        Version: 49.84.202202230006-0
        Pinned: yes (1)
        origin refspec: c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9
    (1) The current deployment is pinned. A platform OS deployment rollback is not necessary.

    [root@lab-test-spoke2-node-0 core]# ostree admin status
    * rhcos f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa.0
        Version: 410.84.202204050541-0
        origin refspec: f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa
      rhcos ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca.0 (rollback) (1)
        Version: 410.84.202203290245-0
        Pinned: yes (2)
        origin refspec: ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca
    (1) This platform OS deployment is marked for rollback.
    (2) The previous deployment is pinned and can be rolled back.
  4. To trigger a rollback of the platform OS deployment, run the following command:

    $ rpm-ostree rollback -r
  5. The first phase of the recovery shuts down containers and restores files from the backup partition to the targeted directories. To begin the recovery, run the following command:

    $ /var/recovery/upgrade-recovery.sh
  6. When prompted, reboot the cluster by running the following command:

    $ systemctl reboot
  7. After the reboot, restart the recovery by running the following command:

    $ /var/recovery/upgrade-recovery.sh --resume

If the recovery utility fails, you can retry with the --restart option:

  $ /var/recovery/upgrade-recovery.sh --restart

Verification

  • To check the status of the recovery run the following command:

    $ oc get clusterversion,nodes,clusteroperator

    Example output

    NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    clusterversion.config.openshift.io/version   4.9.23    True        False         86d     Cluster version is 4.9.23 (1)

    NAME                          STATUS   ROLES           AGE   VERSION
    node/lab-test-spoke1-node-0   Ready    master,worker   86d   v1.22.3+b93fd35 (2)

    NAME                                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    clusteroperator.config.openshift.io/authentication   4.9.23    True        False         False      2d7h (3)
    clusteroperator.config.openshift.io/baremetal        4.9.23    True        False         False      86d
    ..............
    (1) The cluster version is available and has the correct version.
    (2) The node status is Ready.
    (3) The ClusterOperator object’s availability is True.

Using the container image pre-cache feature

Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.

TALM does not schedule the update time. You apply the ClusterGroupUpgrade CR at the beginning of the update, either manually or through external automation.

The container image pre-caching starts when the preCaching field is set to true in the ClusterGroupUpgrade CR.

TALM uses the PrecacheSpecValid condition to report status information as follows:

  • true

    The pre-caching spec is valid and consistent.

  • false

    The pre-caching spec is incomplete.

TALM uses the PrecachingSucceeded condition to report status information as follows:

  • true

    TALM has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.

  • false

    Pre-caching is still in progress for one or more clusters or has failed for all clusters.

After a successful pre-caching process, you can start remediating policies. The remediation actions start when the enable field is set to true. If there is a pre-caching failure on a cluster, the upgrade fails for that cluster. The upgrade process continues for all other clusters that have a successful pre-cache.
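The PrecachingSucceeded semantics described above can be sketched as follows. This is an illustrative sketch of the documented behavior (hypothetical helper, not TALM code), using the per-cluster states listed below:

```python
def precaching_succeeded(status_by_cluster):
    """Return the condition status plus the clusters blocked by a failed
    pre-cache. "False" while any cluster is still running or when the
    pre-cache failed for all clusters; "True" once the process concluded."""
    active = {"NotStarted", "PreparingToStart", "Starting", "Active"}
    failed = [c for c, s in status_by_cluster.items()
              if s in ("PrecacheTimeout", "UnrecoverableError")]
    in_progress = any(s in active for s in status_by_cluster.values())
    if in_progress or len(failed) == len(status_by_cluster):
        return "False", failed
    return "True", failed  # concluded; failed clusters skip the upgrade
```

For example, while one cluster is still Active the condition is "False"; once every cluster reaches a terminal state, the condition is "True" and the failed clusters are excluded from the upgrade.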

The pre-caching process can be in the following statuses:

  • NotStarted

    This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the ClusterGroupUpgrade CR. In this state, TALM deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. TALM then creates a new ManagedClusterView resource for the spoke pre-caching namespace to verify its deletion in the PrecachePreparing state.

  • PreparingToStart

    Cleaning up any remaining resources from previous incomplete updates is in progress.

  • Starting

    Pre-caching job prerequisites and the job are created.

  • Active

    The job is in “Active” state.

  • Succeeded

    The pre-cache job succeeded.

  • PrecacheTimeout

    The artifact pre-caching is partially done.

  • UnrecoverableError

    The job ends with a non-zero exit code.

Creating a ClusterGroupUpgrade CR with pre-caching

The pre-cache feature allows the required container images to be present on the spoke cluster before the update starts.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).

  • Provision one or more managed clusters.

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Save the contents of the ClusterGroupUpgrade CR with the preCaching field set to true in the clustergroupupgrades-group-du.yaml file:

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: du-upgrade-4918
      namespace: ztp-group-du-sno
    spec:
      preCaching: true (1)
      clusters:
      - cnfdb1
      - cnfdb2
      enable: false
      managedPolicies:
      - du-upgrade-platform-upgrade
      remediationStrategy:
        maxConcurrency: 2
        timeout: 240
    (1) The preCaching field is set to true, which enables TALM to pull the container images before starting the update.
  2. When you want to start pre-caching, apply the ClusterGroupUpgrade CR by running the following command:

    $ oc apply -f clustergroupupgrades-group-du.yaml

Verification

  1. Check if the ClusterGroupUpgrade CR exists in the hub cluster by running the following command:

    $ oc get cgu -A

    Example output

    NAMESPACE          NAME              AGE   STATE        DETAILS
    ztp-group-du-sno   du-upgrade-4918   10s   InProgress   Precaching is required and not done (1)
    (1) The CR is created.
  2. Check the status of the pre-caching task by running the following command:

    $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'

    Example output

    {
      "conditions": [
        {
          "lastTransitionTime": "2022-01-27T19:07:24Z",
          "message": "Precaching is required and not done",
          "reason": "InProgress",
          "status": "False",
          "type": "PrecachingSucceeded"
        },
        {
          "lastTransitionTime": "2022-01-27T19:07:34Z",
          "message": "Pre-caching spec is valid and consistent",
          "reason": "PrecacheSpecIsWellFormed",
          "status": "True",
          "type": "PrecacheSpecValid"
        }
      ],
      "precaching": {
        "clusters": [
          "cnfdb1", (1)
          "cnfdb2"
        ],
        "spec": {
          "platformImage": "image.example.io"
        },
        "status": {
          "cnfdb1": "Active",
          "cnfdb2": "Succeeded"
        }
      }
    }
    (1) Displays the list of identified clusters.
  3. Check the status of the pre-caching job by running the following command on the spoke cluster:

    $ oc get jobs,pods -n openshift-talm-pre-cache

    Example output

    NAME                  COMPLETIONS   DURATION   AGE
    job.batch/pre-cache   0/1           3m10s      3m10s

    NAME                     READY   STATUS    RESTARTS   AGE
    pod/pre-cache--1-9bmlr   1/1     Running   0          3m10s
  4. Check the status of the ClusterGroupUpgrade CR by running the following command:

    $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'

    Example output

    "conditions": [
      {
        "lastTransitionTime": "2022-01-27T19:30:41Z",
        "message": "The ClusterGroupUpgrade CR has all clusters compliant with all the managed policies",
        "reason": "UpgradeCompleted",
        "status": "True",
        "type": "Ready"
      },
      {
        "lastTransitionTime": "2022-01-27T19:28:57Z",
        "message": "Precaching is completed",
        "reason": "PrecachingCompleted",
        "status": "True",
        "type": "PrecachingSucceeded" (1)
      }
    ]
    (1) The pre-cache tasks are done.

Troubleshooting the Topology Aware Lifecycle Manager

The Topology Aware Lifecycle Manager (TALM) is an OKD Operator that remediates RHACM policies. When issues occur, use the oc adm must-gather command to gather details and logs and to take steps in debugging the issues.


General troubleshooting

You can determine the cause of the problem by reviewing the following questions:

To ensure that the ClusterGroupUpgrade configuration is functional, you can do the following:

  1. Create the ClusterGroupUpgrade CR with the spec.enable field set to false.

  2. Wait for the status to be updated and go through the troubleshooting questions.

  3. If everything looks as expected, set the spec.enable field to true in the ClusterGroupUpgrade CR.

After you set the spec.enable field to true in the ClusterUpgradeGroup CR, the update procedure starts and you cannot edit the CR’s spec fields anymore.

Cannot modify the ClusterUpgradeGroup CR

Issue

You cannot edit the ClusterUpgradeGroup CR after enabling the update.

Resolution

Restart the procedure by performing the following steps:

  1. Remove the old ClusterGroupUpgrade CR by running the following command:

    $ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>
  2. Check and fix the existing issues with the managed clusters and policies.

    1. Ensure that all the clusters are managed clusters and available.

    2. Ensure that all the policies exist and have the spec.remediationAction field set to inform.

  3. Create a new ClusterGroupUpgrade CR with the correct configurations.

    $ oc apply -f <ClusterGroupUpgradeCR_YAML>

Managed policies

Checking managed policies on the system

Issue

You want to check if you have the correct managed policies on the system.

Resolution

Run the following command:

  $ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'

Example output

  ["group-du-sno-validator-du-validator-policy", "policy2-common-nto-sub-policy", "policy3-common-ptp-sub-policy"]

Checking remediationAction mode

Issue

You want to check if the remediationAction field is set to inform in the spec of the managed policies.

Resolution

Run the following command:

  $ oc get policies --all-namespaces

Example output

  NAMESPACE   NAME                                    REMEDIATION ACTION   COMPLIANCE STATE   AGE
  default     policy1-common-cluster-version-policy   inform               NonCompliant       5d21h
  default     policy2-common-nto-sub-policy           inform               Compliant          5d21h
  default     policy3-common-ptp-sub-policy           inform               NonCompliant       5d21h
  default     policy4-common-sriov-sub-policy         inform               NonCompliant       5d21h

Checking policy compliance state

Issue

You want to check the compliance state of policies.

Resolution

Run the following command:

  $ oc get policies --all-namespaces

Example output

  NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
  default policy1-common-cluster-version-policy inform NonCompliant 5d21h
  default policy2-common-nto-sub-policy inform Compliant 5d21h
  default policy3-common-ptp-sub-policy inform NonCompliant 5d21h
  default policy4-common-sriov-sub-policy inform NonCompliant 5d21h
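To narrow a long listing down to the policies that still need remediation, you can filter on the COMPLIANCE STATE column. This sketch parses sample lines in the layout shown above; in practice, pipe in the output of `oc get policies --all-namespaces --no-headers`.

```shell
# Sketch: print only the policies still reporting NonCompliant.
# The sample lines mirror the example output; replace them with:
#   oc get policies --all-namespaces --no-headers
sample='default policy1-common-cluster-version-policy inform NonCompliant 5d21h
default policy2-common-nto-sub-policy inform Compliant 5d21h
default policy3-common-ptp-sub-policy inform NonCompliant 5d21h'
echo "$sample" | awk '$4 == "NonCompliant" {print $2}'
```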

Clusters

Checking if managed clusters are present

Issue

You want to check if the clusters in the ClusterGroupUpgrade CR are managed clusters.

Resolution

Run the following command:

  $ oc get managedclusters

Example output

  NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
  local-cluster true https://api.hub.example.com:6443 True Unknown 13d
  spoke1 true https://api.spoke1.example.com:6443 True True 13d
  spoke3 true https://api.spoke3.example.com:6443 True True 27h

Alternatively, check the TALM manager logs:

    1. Get the name of the TALM manager by running the following command:

      $ oc get pod -n openshift-operators

      Example output

      NAME READY STATUS RESTARTS AGE
      cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp 2/2 Running 0 45m
    2. Check the TALM manager logs by running the following command:

      $ oc logs -n openshift-operators \
      cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager

      Example output

      ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} (1)
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      (1) The error message shows that the cluster is not a managed cluster.

Checking if managed clusters are available

Issue

You want to check if the managed clusters specified in the ClusterGroupUpgrade CR are available.

Resolution

Run the following command:

  $ oc get managedclusters

Example output

  NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
  local-cluster true https://api.hub.testlab.com:6443 True Unknown 13d
  spoke1 true https://api.spoke1.testlab.com:6443 True True 13d (1)
  spoke3 true https://api.spoke3.testlab.com:6443 True True 27h (1)
(1) The value of the AVAILABLE field is True for the managed clusters.
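To surface only the clusters that are not available, you can filter on the AVAILABLE column. This is a sketch over sample lines in the layout shown above; in practice, pipe in the output of `oc get managedclusters --no-headers`.

```shell
# Sketch: flag managed clusters whose AVAILABLE column is not "True".
# The sample lines mirror the example output; replace them with:
#   oc get managedclusters --no-headers
sample='local-cluster true https://api.hub.testlab.com:6443 True Unknown 13d
spoke1 true https://api.spoke1.testlab.com:6443 True True 13d'
echo "$sample" | awk '$5 != "True" {print $1 " is not available (" $5 ")"}'
```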

Checking clusterLabelSelector

Issue

You want to check if the clusterLabelSelector field specified in the ClusterGroupUpgrade CR matches at least one of the managed clusters.

Resolution

Run the following command:

  $ oc get managedcluster --selector=upgrade=true (1)
(1) The label for the clusters you want to update is upgrade:true.

Example output

  NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
  spoke1 true https://api.spoke1.testlab.com:6443 True True 13d
  spoke3 true https://api.spoke3.testlab.com:6443 True True 27h

Checking if canary clusters are present

Issue

You want to check if the canary clusters are present in the list of clusters.

Example ClusterGroupUpgrade CR

  spec:
    remediationStrategy:
      canaries:
        - spoke3
      maxConcurrency: 2
      timeout: 240
    clusterLabelSelectors:
      - matchLabels:
          upgrade: true

Resolution

Run the following commands:

  $ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'

Example output

  ["spoke1", "spoke3"]

Check if the canary clusters are present in the list of clusters that match clusterLabelSelector labels by running the following command:

    $ oc get managedcluster --selector=upgrade=true

    Example output

    NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
    spoke1 true https://api.spoke1.testlab.com:6443 True True 13d
    spoke3 true https://api.spoke3.testlab.com:6443 True True 27h

A cluster can be present in spec.clusters and also be matched by the spec.clusterLabelSelector label.
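You can also verify the overlap mechanically. The sketch below checks that each canary name appears in the resolved cluster list; the names are taken from the examples above, and in practice you would fill the two variables from the commands shown in this section.

```shell
# Sketch: confirm each canary from spec.remediationStrategy.canaries appears in
# the resolved cluster list. The names come from the examples above; fill them
# in practice from:
#   oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.canaries}'
#   oc get managedcluster --selector=upgrade=true --no-headers
canaries='spoke3'
clusters='spoke1
spoke3'
for c in $canaries; do
  if printf '%s\n' "$clusters" | grep -qx "$c"; then
    echo "canary $c: present"
  else
    echo "canary $c: missing"
  fi
done
```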

Checking the pre-caching status on spoke clusters

Issue

You want to check the status of pre-caching on the spoke clusters.

Resolution

Check the status of pre-caching by running the following command on the spoke cluster:

  $ oc get jobs,pods -n openshift-talo-pre-cache
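To interpret the result without scanning the table by eye, you can check the COMPLETIONS column of the pre-cache job. This sketch parses a sample line in the default `oc get jobs` column layout (NAME COMPLETIONS DURATION AGE); the sample values are illustrative.

```shell
# Sketch: report whether the pre-cache job has completed.
# The sample line is illustrative; replace it with:
#   oc get jobs -n openshift-talo-pre-cache --no-headers
sample='pre-cache 1/1 102s 16m'
echo "$sample" | awk '{print $1 ($2 == "1/1" ? " completed" : " not finished (" $2 ")")}'
```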

Remediation strategy

Checking if remediationStrategy is present in the ClusterGroupUpgrade CR

Issue

You want to check if the remediationStrategy is present in the ClusterGroupUpgrade CR.

Resolution

Run the following command:

  $ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'

Example output

  {"maxConcurrency":2, "timeout":240}

Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR

Issue

You want to check if the maxConcurrency is specified in the ClusterGroupUpgrade CR.

Resolution

Run the following command:

  $ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'

Example output

  2

Topology Aware Lifecycle Manager

Checking condition message and status in the ClusterGroupUpgrade CR

Issue

You want to check the value of the status.conditions field in the ClusterGroupUpgrade CR.

Resolution

Run the following command:

  $ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'

Example output

  {"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
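To pull a single field out of a condition entry, a simple pattern match is often enough. The sketch below extracts the reason from the example condition above; in practice, pipe in the output of the oc get cgu command shown in the resolution.

```shell
# Sketch: extract the "reason" field from a condition entry.
# The JSON here is the example condition above; replace it with the output of:
#   oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
cond='{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}'
echo "$cond" | grep -o '"reason":"[^"]*"'
```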

Checking corresponding copied policies

Issue

You want to check if every policy from status.managedPoliciesForUpgrade has a corresponding policy in status.copiedPolicies.

Resolution

Run the following command:

  $ oc get cgu lab-upgrade -oyaml

Example output

  status:
    copiedPolicies:
      - lab-upgrade-policy3-common-ptp-sub-policy
    managedPoliciesForUpgrade:
      - name: policy3-common-ptp-sub-policy
        namespace: default
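As the example shows, each copied policy is named <ClusterGroupUpgrade-name>-<policy-name>. The sketch below checks that convention for each managed policy; the names come from the example output above.

```shell
# Sketch: check that each policy in status.managedPoliciesForUpgrade has a copy
# named <ClusterGroupUpgrade-name>-<policy-name> in status.copiedPolicies.
# The names below come from the example output above.
cgu='lab-upgrade'
copied='lab-upgrade-policy3-common-ptp-sub-policy'
managed='policy3-common-ptp-sub-policy'
for p in $managed; do
  if printf '%s\n' "$copied" | grep -qx "${cgu}-${p}"; then
    echo "$p: copy found"
  else
    echo "$p: copy missing"
  fi
done
```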

Checking if status.remediationPlan was computed

Issue

You want to check if status.remediationPlan is computed.

Resolution

Run the following command:

  $ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'

Example output

  [["spoke2", "spoke3"]]
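Each inner array in status.remediationPlan is one batch: clusters in a batch are remediated in parallel, the batch size is bounded by spec.remediationStrategy.maxConcurrency, and any canary clusters are remediated first. The sketch below counts the batches in a plan value; the plan string is the example output above.

```shell
# Sketch: count the batches in a remediationPlan value. Each inner array is one
# batch of clusters remediated in parallel.
# The plan string is the example output above; replace it with the output of:
#   oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'
plan='[["spoke2", "spoke3"]]'
echo "$plan" | grep -o '\[[^][]*\]' | wc -l
```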

Errors in the TALM manager container

Issue

You want to check the logs of the manager container of TALM.

Resolution

Run the following command:

  $ oc logs -n openshift-operators \
  cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager

Example output

  ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} (1)
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
(1) Displays the error.

Clusters are not compliant with some policies after a ClusterGroupUpgrade CR has completed

Issue

The policy compliance status that TALM uses to decide if remediation is needed has not yet fully updated for all clusters. This may be because:

  • The ClusterGroupUpgrade CR was run too soon after a policy was created or updated.

  • The remediation of a policy affects the compliance of subsequent policies in the ClusterGroupUpgrade CR.

Resolution

Create and apply a new ClusterGroupUpgrade CR with the same specification.

Additional resources