- Managing hosted control planes
- Updates for hosted control planes
- Updating node pools for hosted control planes
- Pausing the reconciliation of a hosted cluster and hosted control plane
- Configuring metrics sets for hosted control planes
- Creating monitoring dashboards for hosted clusters
- Scaling down the data plane to zero
- Deleting a hosted cluster
Managing hosted control planes
After you configure your environment for hosted control planes and create a hosted cluster, you can further manage your clusters and nodes.
Updates for hosted control planes
Updates for hosted control planes involve updating the hosted cluster and the node pools. For a cluster to remain fully operational during an update process, you must meet the requirements of the Kubernetes version skew policy while completing the control plane and node updates.
Updates for the hosted cluster
The spec.release value dictates the version of the control plane. The HostedCluster object transmits the intended spec.release value to the HostedControlPlane.spec.release value and runs the appropriate Control Plane Operator version.
The hosted control plane manages the rollout of the new version of the control plane components along with any OKD components through the new version of the Cluster Version Operator (CVO).
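For example, to move the control plane to a new version, you might patch the HostedCluster resource so that spec.release.image points to the release image that you want. The namespace, cluster name, and image in this sketch are placeholders:
$ oc patch -n <hosted_cluster_namespace> hostedclusters/<hosted_cluster_name> -p '{"spec":{"release":{"image":"<release_image>"}}}' --type=merge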
Updates for node pools
With node pools, you can configure the software that is running in the nodes by exposing the spec.release and spec.config values. You can start a rolling node pool update in the following ways:
- Changing the spec.release or spec.config values.
- Changing any platform-specific field, such as the AWS instance type, as shown in the sketch after this list. The result is a set of new instances with the new type.
- Changing the cluster configuration, if the change propagates to the node.
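As one sketch of the platform-specific case, on AWS you might change the instance type of a node pool with a patch like the following. The namespace, node pool name, and instance type are placeholder values, and the field path assumes the AWS platform section of the NodePool API:
$ oc -n <hosted_cluster_namespace> patch nodepool <node_pool_name> --type=merge --patch '{"spec":{"platform":{"aws":{"instanceType":"m5.2xlarge"}}}}'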
Node pools support replace updates and in-place updates. The nodepool.spec.release value dictates the version of any particular node pool. A NodePool object completes a replace or an in-place rolling update according to the .spec.management.upgradeType value.
After you create a node pool, you cannot change the update type. If you want to change the update type, you must create a new node pool and delete the original one.
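You set the update type when you create the node pool. The following minimal sketch of a NodePool manifest shows only the relevant field and uses placeholder names:
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
  name: <node_pool_name>
  namespace: <hosted_cluster_namespace>
spec:
  management:
    upgradeType: Replace # or InPlace
# ...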
Replace updates for node pools
A replace update creates instances in the new version while it removes old instances from the previous version. This update type is effective in cloud environments where this level of immutability is cost effective.
Replace updates do not preserve any manual changes because the node is entirely re-provisioned.
In-place updates for node pools
An in-place update directly updates the operating systems of the instances. This type is suitable for environments where the infrastructure constraints are higher, such as bare metal.
In-place updates can preserve manual changes, but they report errors if you modify any file system or operating system configuration that the cluster directly manages, such as kubelet certificates.
Updating node pools for hosted control planes
On hosted control planes, you update your version of OKD by updating the node pools. The node pool version must not surpass the hosted control plane version.
Procedure
To start the process to update to a new version of OKD, change the spec.release.image value of the node pool by entering the following command:
$ oc -n <hosted_cluster_namespace> patch nodepool <node_pool_name> --patch '{"spec":{"release":{"image": "<release_image>"}}}' --type=merge
Verification
- To verify that the new version was rolled out, check the .status.version value and the status conditions.
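For example, you might inspect the node pool directly and confirm that .status.version reports the new version; the namespace and node pool name are placeholders:
$ oc -n <hosted_cluster_namespace> get nodepool <node_pool_name> -o yaml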
Pausing the reconciliation of a hosted cluster and hosted control plane
If you are a cluster instance administrator, you can pause the reconciliation of a hosted cluster and hosted control plane. You might want to pause reconciliation when you back up and restore an etcd database or when you need to debug problems with a hosted cluster or hosted control plane.
Procedure
To pause reconciliation for a hosted cluster and hosted control plane, populate the pausedUntil field of the HostedCluster resource, as shown in the following examples. In the examples, the value for pausedUntil is defined in an environment variable prior to the command.
To pause the reconciliation until a specific time, specify an RFC 3339 timestamp:
PAUSED_UNTIL="2022-03-03T03:28:48Z"
oc patch -n <hosted-cluster-namespace> hostedclusters/<hosted-cluster-name> -p '{"spec":{"pausedUntil":"'${PAUSED_UNTIL}'"}}' --type=merge
The reconciliation is paused until the specified time has passed.
To pause the reconciliation indefinitely, pass a Boolean value of true:
PAUSED_UNTIL="true"
oc patch -n <hosted-cluster-namespace> hostedclusters/<hosted-cluster-name> -p '{"spec":{"pausedUntil":"'${PAUSED_UNTIL}'"}}' --type=merge
The reconciliation is paused until you remove the field from the HostedCluster resource.
When the pause reconciliation field is populated for the HostedCluster resource, the field is automatically added to the associated HostedControlPlane resource.
To remove the pausedUntil field, enter the following patch command:
oc patch -n <hosted-cluster-namespace> hostedclusters/<hosted-cluster-name> -p '{"spec":{"pausedUntil":null}}' --type=merge
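To confirm that the pause setting propagated, you might inspect the associated HostedControlPlane resource and check its spec.pausedUntil value; the control plane namespace in this sketch is a placeholder:
$ oc get hostedcontrolplane -n <control_plane_namespace> -o yaml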
Configuring metrics sets for hosted control planes
Hosted control planes for OKD create ServiceMonitor resources in each control plane namespace that allow a Prometheus stack to gather metrics from the control planes. The ServiceMonitor resources use metrics relabelings to define which metrics are included or excluded from a particular component, such as etcd or the Kubernetes API server. The number of metrics that are produced by control planes directly impacts the resource requirements of the monitoring stack that gathers them.
Instead of producing a fixed number of metrics that apply to all situations, you can configure a metrics set that identifies a set of metrics to produce for each control plane. The following metrics sets are supported:
- Telemetry: These metrics are needed for telemetry. This set is the default set and is the smallest set of metrics.
- SRE: This set includes the necessary metrics to produce alerts and allow the troubleshooting of control plane components.
- All: This set includes all of the metrics that are produced by standalone OKD control plane components.
To configure a metrics set, set the METRICS_SET environment variable in the HyperShift Operator deployment by entering the following command:
$ oc set env -n hypershift deployment/operator METRICS_SET=All
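As a quick check that the variable took effect, you might list the environment variables that are set on the HyperShift Operator deployment:
$ oc set env -n hypershift deployment/operator --list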
Configuring the SRE metrics set
When you specify the SRE metrics set, the HyperShift Operator looks for a config map named sre-metric-set with a single key: config. The value of the config key must contain a set of RelabelConfigs that are organized by control plane component.
You can specify the following components:
- etcd
- kubeAPIServer
- kubeControllerManager
- openshiftAPIServer
- openshiftControllerManager
- openshiftRouteControllerManager
- cvo
- olm
- catalogOperator
- registryOperator
- nodeTuningOperator
- controlPlaneOperator
- hostedClusterConfigOperator
A configuration of the SRE metrics set is illustrated in the following example:
kubeAPIServer:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_controller_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_step_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "transformation_(transformation_latencies_microseconds|failures_total)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "network_plugin_operations_latency_microseconds|sync_proxy_rules_latency_microseconds|rest_client_request_latency_seconds"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
  sourceLabels: ["__name__", "le"]
kubeControllerManager:
- action: "drop"
  regex: "etcd_(debugging|disk|request|server).*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "rest_client_request_latency_seconds_(bucket|count|sum)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "root_ca_cert_publisher_sync_duration_seconds_(bucket|count|sum)"
  sourceLabels: ["__name__"]
openshiftAPIServer:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_controller_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_step_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
  sourceLabels: ["__name__", "le"]
openshiftControllerManager:
- action: "drop"
  regex: "etcd_(debugging|disk|request|server).*"
  sourceLabels: ["__name__"]
openshiftRouteControllerManager:
- action: "drop"
  regex: "etcd_(debugging|disk|request|server).*"
  sourceLabels: ["__name__"]
olm:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
catalogOperator:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
cvo:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
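If you save a configuration like this example to a file, one way to supply it is to create the sre-metric-set config map with the required config key from that file. The file path is a placeholder, and this sketch assumes that the config map belongs in the HyperShift Operator namespace (hypershift):
$ oc create configmap sre-metric-set -n hypershift --from-file=config=<path_to_config_file>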
Creating monitoring dashboards for hosted clusters
The HyperShift Operator can create or delete monitoring dashboards in the management cluster for each hosted cluster that it manages.
Enabling monitoring dashboards
To enable monitoring dashboards in a hosted cluster, complete the following steps:
Procedure
Create the hypershift-operator-install-flags config map in the local-cluster namespace, being sure to specify the --monitoring-dashboards flag in the data.installFlagsToAdd section. For example:
kind: ConfigMap
apiVersion: v1
metadata:
  name: hypershift-operator-install-flags
  namespace: local-cluster
data:
  installFlagsToAdd: "--monitoring-dashboards"
  installFlagsToRemove: ""
Wait a couple of minutes for the HyperShift Operator deployment in the hypershift namespace to be updated to include the following environment variable:
- name: MONITORING_DASHBOARDS
  value: "1"
When monitoring dashboards are enabled, for each hosted cluster that the HyperShift Operator manages, the Operator creates a config map named cp-[NAMESPACE]-[NAME] in the openshift-config-managed namespace, where NAMESPACE is the namespace of the hosted cluster and NAME is the name of the hosted cluster. As a result, a new dashboard is added in the administrative console of the management cluster.
To view the dashboard, log in to the management cluster’s console and go to the dashboard for the hosted cluster by clicking Observe → Dashboards.
Optional: To disable monitoring dashboards in a hosted cluster, remove the --monitoring-dashboards flag from the hypershift-operator-install-flags config map. When you delete a hosted cluster, its corresponding dashboard is also deleted.
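To verify that a dashboard config map was generated for a hosted cluster, you might list the config maps in the openshift-config-managed namespace and look for the cp-[NAMESPACE]-[NAME] entry:
$ oc get configmaps -n openshift-config-managed | grep '^cp-'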
Dashboard customization
To generate dashboards for each hosted cluster, the HyperShift Operator uses a template that is stored in the monitoring-dashboard-template config map in the operator namespace (hypershift). This template contains a set of Grafana panels that contain the metrics for the dashboard. You can edit the content of the config map to customize the dashboards.
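For example, you might open the template for editing with the following command and adjust the panel definitions that it contains:
$ oc edit configmap monitoring-dashboard-template -n hypershift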
When a dashboard is generated, the following strings are replaced with values that correspond to a specific hosted cluster:
Name | Description
__NAME__ | The name of the hosted cluster
__NAMESPACE__ | The namespace of the hosted cluster
__CONTROL_PLANE_NAMESPACE__ | The namespace where the control plane pods of the hosted cluster are placed
__CLUSTER_ID__ | The UUID of the hosted cluster, which matches the _id label of the hosted cluster metrics
Scaling down the data plane to zero
If you are not using the hosted control plane, you can scale down the data plane to zero to save resources and cost.
Ensure that you are prepared to scale down the data plane to zero, because the workloads on the worker nodes disappear after you scale down.
Procedure
Set the kubeconfig file to access the hosted cluster by running the following command:
$ export KUBECONFIG=<install_directory>/auth/kubeconfig
Get the name of the NodePool resource associated with your hosted cluster by running the following command:
$ oc get nodepool --namespace <HOSTED_CLUSTER_NAMESPACE>
Optional: To prevent the pods from draining, add the nodeDrainTimeout field in the NodePool resource by running the following command:
$ oc edit NodePool <nodepool> -o yaml --namespace <HOSTED_CLUSTER_NAMESPACE>
Example output
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
# ...
  name: nodepool-1
  namespace: clusters
# ...
spec:
  arch: amd64
  clusterName: clustername (1)
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: Replace
  nodeDrainTimeout: 0s (2)
# ...
1 Defines the name of your hosted cluster.
2 Specifies the total amount of time that the controller spends to drain a node. By default, the nodeDrainTimeout: 0s setting blocks the node draining process. To allow the node draining process to continue for a certain period of time, you can set the value of the nodeDrainTimeout field accordingly, for example, nodeDrainTimeout: 1m.
Scale down the NodePool resource associated with your hosted cluster by running the following command:
$ oc scale nodepool/<NODEPOOL_NAME> --namespace <HOSTED_CLUSTER_NAMESPACE> --replicas=0
After scaling down the data plane to zero, some pods in the control plane stay in the Pending status, and the hosted control plane stays up and running. If necessary, you can scale up the NodePool resource.
Optional: Scale up the NodePool resource associated with your hosted cluster by running the following command:
$ oc scale nodepool/<NODEPOOL_NAME> --namespace <HOSTED_CLUSTER_NAMESPACE> --replicas=1
After rescaling the NodePool resource, wait a couple of minutes for the NodePool resource to become available in a Ready state.
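After you scale the NodePool resource, you might watch its status until the replica count and conditions settle, for example by confirming that .status.replicas matches the desired count; the namespace and node pool name are placeholders:
$ oc get nodepool <NODEPOOL_NAME> --namespace <HOSTED_CLUSTER_NAMESPACE> -o yaml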
Deleting a hosted cluster
The steps to delete a hosted cluster differ depending on which provider you use.
Procedure
If the cluster is on AWS, follow the instructions in Destroying a hosted cluster on AWS.
If the cluster is on bare metal, follow the instructions in Destroying a hosted cluster on bare metal.
If the cluster is on OKD Virtualization, follow the instructions in Destroying a hosted cluster on OpenShift Virtualization.
Next steps
If you want to disable the hosted control plane feature, see Disabling the hosted control plane feature.