Operator Upgrade Process

This page details the process of upgrading the operator to a new version.

Please check the compatibility page for the complete overview of the backward compatibility guarantees before upgrading to new versions.

Upgrading from the preview/experimental v1alpha1 release to v1beta1 requires a one-time manual process. See the section on upgrading from v1alpha1 to v1beta1 below.

Normal Upgrade Process

If you are upgrading from kubernetes-operator-1.0.0 or later, please refer to the following two steps:

  1. Upgrading the CRDs
  2. Upgrading the Helm deployment

We will cover these steps in detail in the next sections.

1. Upgrading the CRDs

The first step of the upgrade process is upgrading the CRDs for FlinkDeployment and FlinkSessionJob resources. This step must be completed manually and is not part of the helm installation logic.

  kubectl replace -f helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml
  kubectl replace -f helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml

Please note that we are using the replace command here which ensures that running deployments are unaffected.
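
The two replace commands can be wrapped in a small helper that stops at the first failure. This is only a sketch: the function name upgrade_crds and the CRD_DIR variable are illustrative and not part of the operator tooling, and it assumes the commands run from the root of a checkout of the operator repository at the target version.

```shell
# Hypothetical helper: replace both operator CRDs, stopping on the first error.
# CRD_DIR is an assumed variable; it defaults to the path used in the commands above.
upgrade_crds() {
    crd_dir="${CRD_DIR:-helm/flink-kubernetes-operator/crds}"
    for crd in flinkdeployments flinksessionjobs; do
        kubectl replace -f "${crd_dir}/${crd}.flink.apache.org-v1.yml" || return 1
    done
}
```

Calling upgrade_crds runs the same two kubectl replace commands shown above and returns a non-zero status if either one fails.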

2. Upgrading the Helm deployment

Before upgrading, compare the manifests rendered by the new chart version against the resources currently running in the cluster. The diff shows what the upgrade will change and can serve as a reference for backup and restore.

  helm template flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings> | kubectl diff -f -

Once the new CRD versions are in place, we can upgrade the Helm deployment:

  helm upgrade flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings>

or

  # Uninstall running Helm deployment and install new version
  helm uninstall flink-kubernetes-operator
  helm install flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings>

The exact installation/upgrade command depends on your current environment and settings. Please see the helm page for details.
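
For the backup mentioned at the start of this section, one option is to save what Helm recorded for the running release before upgrading or uninstalling it. The sketch below is an assumption about how that could be done, not a documented step: the function name backup_release and the default backup directory are hypothetical. It relies on the helm get manifest and helm get values subcommands. If the release lives in a non-default namespace, the namespace flag would need to be added to both calls.

```shell
# Hypothetical helper: save the rendered manifest and the user-supplied values
# of the running release so they can be inspected or re-applied later.
# The default release name and backup directory are assumptions for illustration.
backup_release() {
    release="${1:-flink-kubernetes-operator}"
    backup_dir="${2:-./operator-backup}"
    mkdir -p "${backup_dir}" || return 1
    helm get manifest "${release}" > "${backup_dir}/manifest.yaml" || return 1
    helm get values "${release}" -o yaml > "${backup_dir}/values.yaml" || return 1
}
```

Running backup_release before helm upgrade leaves manifest.yaml and values.yaml in the backup directory for later comparison.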

Upgrading from v1alpha1 -> v1beta1

If you are upgrading from kubernetes-operator-0.1.0, please follow the steps below. The first stable v1beta1 release introduced some breaking changes on the operator side relative to the preview (v1alpha1) release, and these changes require a one-time manual upgrade process for running jobs.

1. Upgrading without existing FlinkDeployments

In an environment without any FlinkDeployments you need to uninstall the operator and delete the v1alpha1 CRD.

  # Uninstall helm deployment
  helm uninstall flink-kubernetes-operator
  # Delete CRD
  kubectl delete crd flinkdeployments.flink.apache.org
  # Now reinstall the operator with the new v1beta1 version
  helm install flink-kubernetes-operator <helm-repo>/flink-kubernetes-operator <custom settings>
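
Because deleting the CRD also deletes every resource of that kind, it can be worth confirming that the cluster really has no FlinkDeployments first. The following guard is a sketch under that assumption. The function name no_flinkdeployments is hypothetical, and it relies on kubectl get printing nothing to standard output with -o name when no resources exist.

```shell
# Hypothetical guard: succeed only if no FlinkDeployment exists in any namespace.
# Returns 2 if the kubectl query itself fails (for example, the CRD is missing).
no_flinkdeployments() {
    found="$(kubectl get flinkdeployments.flink.apache.org --all-namespaces -o name)" || return 2
    [ -z "${found}" ]
}
```

It could be combined with the delete step as: no_flinkdeployments && kubectl delete crd flinkdeployments.flink.apache.org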

2. Upgrading with existing FlinkDeployments

The following steps demonstrate the CRD upgrade process from v1alpha1 to v1beta1 in an environment with an existing stateful job that uses the old v1alpha1 apiVersion. After the CRD upgrade, the job will be resumed from the savepoint. The steps use a basic-checkpoint-ha-example deployment as a reference example.

  1. Suspend the job and create savepoint:

    kubectl patch flinkdeployment/basic-checkpoint-ha-example --type=merge -p '{"spec": {"job": {"state": "suspended", "upgradeMode": "savepoint"}}}'

    Verify that deploy/basic-checkpoint-ha-example has terminated and that flinkdeployment/basic-checkpoint-ha-example reports a Last Savepoint Location similar to file:/flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata. This savepoint will be used to restore the job. See stateful and stateless application upgrade for more detail.

  2. Delete the job:

    kubectl delete flinkdeployment/basic-checkpoint-ha-example

  3. Uninstall the flink-kubernetes-operator helm chart and the CRD with the old v1alpha1 version:

    helm uninstall flink-kubernetes-operator
    kubectl delete crd flinkdeployments.flink.apache.org

  4. Reinstall the flink-kubernetes-operator helm chart with the v1beta1 CRD:

    helm repo update flink-operator-repo
    helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator

    Verify the deploy/flink-kubernetes-operator log has:

    2022-04-13 06:09:40,761 i.j.o.Operator [INFO ] Registered reconciler: 'flinkdeploymentcontroller' for resource: 'class org.apache.flink.kubernetes.operator.crd.FlinkDeployment' for namespace(s): [all namespaces]
    2022-04-13 06:09:40,943 i.f.k.c.i.VersionUsageUtils [WARN ] The client is using resource type 'flinksessionjobs' with unstable version 'v1beta1'
    2022-04-13 06:09:41,461 i.j.o.Operator [INFO ] Registered reconciler: 'flinksessionjobcontroller' for resource: 'class org.apache.flink.kubernetes.operator.crd.FlinkSessionJob' for namespace(s): [all namespaces]
    2022-04-13 06:09:41,464 i.j.o.Operator [INFO ] Operator SDK 2.1.2 (commit: a3a81ef) built on 2022-03-15T09:59:42.000+0000 starting...
    2022-04-13 06:09:41,464 i.j.o.Operator [INFO ] Client version: 5.12.1
    2022-04-13 06:09:41,499 i.f.k.c.i.VersionUsageUtils [WARN ] The client is using resource type 'flinkdeployments' with unstable version 'v1beta1'

  5. Restore the job:

    Deploy the previously deleted job using a FlinkDeployment with the v1beta1 apiVersion, and explicitly set job.initialSavepointPath to the savepoint location obtained in step 1.

    spec:
      ...
      job:
        initialSavepointPath: /flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata
        upgradeMode: savepoint
        ...

    Alternatively, we may use this command to edit and deploy the manifest:

    wget -qO - https://raw.githubusercontent.com/apache/flink-kubernetes-operator/main/examples/basic-checkpoint-ha.yaml | yq w - "spec.job.initialSavepointPath" "/flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata" | kubectl apply -f -

    Finally, verify that deploy/basic-checkpoint-ha-example log has:

    Starting job 00000000000000000000000000000000 from savepoint /flink-data/savepoints/savepoint-000000-2f40a9c8e4b9/_metadata
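
The savepoint location noted in step 1 can also be read from the resource status and converted into the form used for initialSavepointPath in step 5, where the example path appears without the file: prefix. The helpers below are a sketch only. Both function names are hypothetical, and the status field path in the jsonpath expression is an assumption that may differ between operator versions. It should be checked against the output of kubectl get flinkdeployment/<name> -o yaml, and the lookup has to run before the job is deleted in step 2.

```shell
# Strip a leading "file:" scheme from a savepoint location; other locations
# are printed unchanged.
savepoint_path() {
    printf '%s\n' "${1#file:}"
}

# Hypothetical helper: print the last savepoint location of a FlinkDeployment.
# ASSUMPTION: the location is exposed under the status field path used below.
last_savepoint_location() {
    kubectl get "flinkdeployment/$1" \
        -o jsonpath='{.status.jobStatus.savepointInfo.lastSavepoint.location}'
}
```

For the example deployment this might be used as: savepoint_path "$(last_savepoint_location basic-checkpoint-ha-example)"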

3. Changes of default values of FlinkDeployment

The default values of some FlinkDeployment fields have changed or been improved in v1beta1:

  1. Default value of crd.spec.Resource#cpu is 1.0.
  2. Default value of crd.spec.JobManagerSpec#replicas is 1.
  3. crd.spec.FlinkDeploymentSpec#serviceAccount has no default value; users must specify it explicitly.
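
Since serviceAccount no longer has a default, a manifest migrated from v1alpha1 may be missing it. The check below is a rough sketch for catching that before applying a manifest. The function name has_service_account is hypothetical, and the check is a plain text match on the file rather than a real YAML parse, so it does not verify where in the document the key appears.

```shell
# Hypothetical pre-flight check: succeed if the manifest file contains a
# "serviceAccount:" key with a non-empty value on the same line.
has_service_account() {
    grep -Eq '^[[:space:]]*serviceAccount:[[:space:]]*[^[:space:]#]' "$1"
}
```

A manifest could then be applied only when the check passes, for example: has_service_account deployment.yaml && kubectl apply -f deployment.yaml (deployment.yaml is a placeholder file name).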