Overview

Flink Kubernetes Operator acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. Although Flink’s native Kubernetes integration already allows you to directly deploy Flink applications on a running Kubernetes(k8s) cluster, custom resources and the operator pattern have also become central to a Kubernetes native deployment experience.

Flink Kubernetes Operator aims to capture the responsibilities of a human operator who is managing Flink deployments. Human operators have deep knowledge of how Flink deployments ought to behave, how to start clusters, how to deploy jobs, how to upgrade them and how to react if there are problems. The main goal of the operator is the automation of these activities, which cannot be achieved through the Flink native integration alone.

Features

Core

  • Fully-automated Job Lifecycle Management
    • Running, suspending and deleting applications
    • Stateful and stateless application upgrades
    • Triggering and managing savepoints
    • Handling errors, rolling-back broken upgrades
  • Multiple Flink version support: v1.16, v1.17, v1.18, v1.19, v1.20
  • Deployment Modes:
    • Application cluster
    • Session cluster
    • Session job
  • Built-in High Availability
  • Extensible framework
  • Advanced Configuration management
    • Default configurations with dynamic updates
    • Per job configuration
    • Environment variables
  • POD augmentation via Pod Templates
    • Native Kubernetes POD definitions
    • Layering (Base/JobManager/TaskManager overrides)
  • Job Autoscaler
    • Collect lag and utilization metrics
    • Scale job vertices to the ideal parallelism
    • Scale up and down as the load changes
  • Snapshot management
    • Manage snapshots via Kubernetes CRs

Operations

Built-in Examples

The operator project comes with a wide variety of built in examples to show you how to use the operator functionality. The examples are maintained as part of the operator repo and can be found here.

What is covered:

  • Application, Session and SessionJob submission
  • Checkpointing and HA configuration
  • Java, SQL and Python Flink jobs
  • Ingress, logging and metrics configuration
  • Advanced operator deployment techniques using Kustomize
  • And some more…

Known Issues & Limitations

JobManager High-availability

The Operator supports both Kubernetes HA Services and Zookeeper HA Services for providing High-availability for Flink jobs. The HA solution can benefit form using additional Standby replicas, it will result in a faster recovery time, but Flink jobs will still restart when the Leader JobManager goes down.

JobResultStore Resource Leak

To mitigate the impact of FLINK-27569 the operator introduced a workaround FLINK-27573 by setting job-result-store.delete-on-commit=false and a unique value for job-result-store.storage-path for every cluster launch. The storage path for older runs must be cleaned up manually, keeping the latest directory always:

  1. ls -lth /tmp/flink/ha/job-result-store/basic-checkpoint-ha-example/
  2. total 0
  3. drwxr-xr-x 2 9999 9999 40 May 12 09:51 119e0203-c3a9-4121-9a60-d58839576f01 <- must be retained
  4. drwxr-xr-x 2 9999 9999 60 May 12 09:46 a6031ec7-ab3e-4b30-ba77-6498e58e6b7f
  5. drwxr-xr-x 2 9999 9999 60 May 11 15:11 b6fb2a9c-d1cd-4e65-a9a1-e825c4b47543

AuditUtils can log sensitive information present in the custom resources

As reported in FLINK-30306 when Flink custom resources change the operator logs the change, which could include sensitive information. We suggest ingesting secrets to Flink containers during runtime to mitigate this. Also note that anyone who has access to the custom resources already had access to the potentially sensitive information in question, but folks who only have access to the logs could also see them now. We are planning to introduce redaction rules to AuditUtils to improve this in a later release.