Operator Upgrade Process
This page details the process of upgrading the operator to a new version.
Please check the compatibility page for the complete overview of the backward compatibility guarantees before upgrading to new versions.
Upgrading from the preview/experimental v1alpha1 release to v1beta1 requires a one-time manual process. Please check the related section.
Normal Upgrade Process
If you are upgrading from kubernetes-operator-1.0.0 or later, please refer to the following two steps:
- Upgrading the CRDs
- Upgrading the Helm deployment
We will cover these steps in detail in the next sections.
1. Upgrading the CRD
The first step of the upgrade process is upgrading the CRDs for FlinkDeployment and FlinkSessionJob resources. This step must be completed manually and is not part of the helm installation logic.
kubectl replace -f helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml
kubectl replace -f helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml
Please note that we are using the replace command here, which ensures that running deployments are unaffected.
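To confirm that the replacement took effect, one option is to check which versions the cluster now reports for each CRD. The following is a minimal sketch, not part of the operator tooling: crd_has_version is a hypothetical helper name, and it relies only on kubectl get crd with a jsonpath output.

```shell
# Hypothetical helper: succeed if the named CRD declares the given version.
# Usage: crd_has_version <crd-name> <version>
crd_has_version() {
  local crd="$1" wanted="$2" versions v
  # Ask the API server for the version names declared by the CRD.
  versions="$(kubectl get crd "$crd" -o jsonpath='{.spec.versions[*].name}')" || return 1
  for v in $versions; do
    [ "$v" = "$wanted" ] && return 0
  done
  return 1
}
```

For example, crd_has_version flinkdeployments.flink.apache.org v1beta1 returns a zero exit status when v1beta1 is among the declared versions.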
2. Upgrading the Helm deployment
Before upgrading, please compare the YAML generated by the new chart version with the YAML currently running in the cluster. The resulting diff can be used for backup and restore.
helm template flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings> | kubectl diff -f -
Once the new CRD versions are in place, we can upgrade the Helm deployment:
helm upgrade flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings>
or
# Uninstall running Helm deployment and install new version
helm uninstall flink-kubernetes-operator
helm install flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings>
The exact installation/upgrade command depends on your current environment and settings. Please see the helm page for details.
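The two steps of the normal upgrade process can also be chained so that the Helm upgrade only runs if both CRD replacements succeed. This is a sketch under stated assumptions: upgrade_operator is a hypothetical function name, the CRD paths are the ones used above and assume the command is run from a checkout of the operator repository, and the chart reference and any custom settings are passed in by the caller.

```shell
# Hypothetical wrapper: replace both CRDs, then upgrade the Helm release.
# Usage: upgrade_operator <helm-repo>/flink-kubernetes-operator [custom settings...]
upgrade_operator() {
  if [ "$#" -lt 1 ]; then
    echo "usage: upgrade_operator <chart> [helm args...]" >&2
    return 2
  fi
  local chart="$1"
  shift
  local crd_dir="helm/flink-kubernetes-operator/crds"
  # Each step runs only if the previous one succeeded.
  kubectl replace -f "$crd_dir/flinkdeployments.flink.apache.org-v1.yml" &&
    kubectl replace -f "$crd_dir/flinksessionjobs.flink.apache.org-v1.yml" &&
    helm upgrade flink-kubernetes-operator "$chart" "$@"
}
```

Stopping at the first failed CRD replacement avoids running a new operator version against an outdated CRD.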
Upgrading from v1alpha1 -> v1beta1
If you are upgrading from kubernetes-operator-0.1.0, please refer to the following steps. The first stable v1beta1 release introduced some breaking changes on the operator side relative to the preview (v1alpha1) release. These changes require a one-time manual upgrade process for the running jobs.
1. Upgrading without existing FlinkDeployments
In an environment without any FlinkDeployments, you need to uninstall the operator and delete the v1alpha1 CRD.
# Uninstall helm deployment
helm uninstall flink-kubernetes-operator
# Delete CRD
kubectl delete crd flinkdeployments.flink.apache.org
# Now reinstall the operator with the new v1beta1 version
helm install flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings>
2. Upgrading with existing FlinkDeployments
The following steps demonstrate the CRD upgrade process from v1alpha1 to v1beta1 in an environment with an existing stateful job that uses the old v1alpha1 apiVersion. After the CRD upgrade, the job will be resumed from the savepoint. Here is a reference example of upgrading a basic-checkpoint-ha-example deployment.
1. Suspend the job and create a savepoint:
   kubectl patch flinkdeployment/basic-checkpoint-ha-example --type=merge -p '{"spec": {"job": {"state": "suspended", "upgradeMode": "savepoint"}}}'
   Verify that deploy/basic-checkpoint-ha-example has terminated and that flinkdeployment/basic-checkpoint-ha-example has a Last Savepoint Location similar to file:/flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata. This file will be used to restore the job. See stateful and stateless application upgrade for more detail.
2. Delete the job:
   kubectl delete flinkdeployment/basic-checkpoint-ha-example
3. Uninstall the flink-kubernetes-operator helm chart and the CRD with the old v1alpha1 version:
   helm uninstall flink-kubernetes-operator
   kubectl delete crd flinkdeployments.flink.apache.org
4. Reinstall the flink-kubernetes-operator helm chart with the v1beta1 CRD:
   helm repo update flink-operator-repo
   helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator
5. Verify that the deploy/flink-kubernetes-operator log has:
   2022-04-13 06:09:40,761 i.j.o.Operator [INFO ] Registered reconciler: 'flinkdeploymentcontroller' for resource: 'class org.apache.flink.kubernetes.operator.crd.FlinkDeployment' for namespace(s): [all namespaces]
   2022-04-13 06:09:40,943 i.f.k.c.i.VersionUsageUtils [WARN ] The client is using resource type 'flinksessionjobs' with unstable version 'v1beta1'
   2022-04-13 06:09:41,461 i.j.o.Operator [INFO ] Registered reconciler: 'flinksessionjobcontroller' for resource: 'class org.apache.flink.kubernetes.operator.crd.FlinkSessionJob' for namespace(s): [all namespaces]
   2022-04-13 06:09:41,464 i.j.o.Operator [INFO ] Operator SDK 2.1.2 (commit: a3a81ef) built on 2022-03-15T09:59:42.000+0000 starting...
   2022-04-13 06:09:41,464 i.j.o.Operator [INFO ] Client version: 5.12.1
   2022-04-13 06:09:41,499 i.f.k.c.i.VersionUsageUtils [WARN ] The client is using resource type 'flinkdeployments' with unstable version 'v1beta1'
6. Restore the job:
   Deploy the previously deleted job using this FlinkDeployment with v1beta1 and explicitly set job.initialSavepointPath to the savepoint location obtained in step 1.
   spec:
     ...
     job:
       initialSavepointPath: /flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata
       upgradeMode: savepoint
     ...
   Alternatively, we may use this command to edit and deploy the manifest:
   wget -qO - https://raw.githubusercontent.com/apache/flink-kubernetes-operator/main/examples/basic-checkpoint-ha.yaml | yq w - "spec.job.initialSavepointPath" "/flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata" | kubectl apply -f -
   Finally, verify that the deploy/basic-checkpoint-ha-example log has a line similar to:
   Starting job 00000000000000000000000000000000 from savepoint /flink-data/savepoints/savepoint-000000-2f40a9c8e4b9/_metadata
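The savepoint location recorded when the job was suspended has to be carried over by hand into the restored manifest, so it can help to capture it in a variable before deleting the job. The sketch below is an assumption, not a documented procedure: savepoint_location is a hypothetical helper, and it assumes the kubectl describe output contains a line whose label includes "Location:" followed by the savepoint path. The exact layout of that output may differ between operator versions.

```shell
# Hypothetical helper: read `kubectl describe` style text on stdin and print
# the last field of the first line that contains "Location:".
savepoint_location() {
  awk '/Location:/ { print $NF; exit }'
}
```

A possible use is kubectl describe flinkdeployment/basic-checkpoint-ha-example | savepoint_location, run after the job has been suspended and before the FlinkDeployment is deleted.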
3. Changes of default values of FlinkDeployment
The default values of some FlinkDeployment fields have changed in v1beta1:
- Default value of crd.spec.Resource#cpu is 1.0.
- Default value of crd.spec.JobManagerSpec#replicas is 1.
- crd.spec.FlinkDeploymentSpec#serviceAccount has no default value, and users must specify its value explicitly.
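Since serviceAccount no longer has a default, it may be worth checking existing manifests before redeploying them against v1beta1. The following is a rough textual check only, not a YAML parser: manifest_sets_service_account is a hypothetical helper, and it merely looks for an indented serviceAccount key with a non-empty value in the given file.

```shell
# Hypothetical helper: succeed if the manifest file contains an indented
# "serviceAccount:" key followed by a non-empty, non-comment value.
# Usage: manifest_sets_service_account <manifest.yaml>
manifest_sets_service_account() {
  grep -Eq '^[[:space:]]+serviceAccount:[[:space:]]*[^[:space:]#]' "$1"
}
```

A non-zero exit status suggests the manifest needs an explicit serviceAccount entry under spec before it is applied.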