Overview
Flink Kubernetes Operator acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. Although Flink’s native Kubernetes integration already allows you to directly deploy Flink applications on a running Kubernetes(k8s) cluster, custom resources and the operator pattern have also become central to a Kubernetes native deployment experience.
Flink Kubernetes Operator aims to capture the responsibilities of a human operator who is managing Flink deployments. Human operators have deep knowledge of how Flink deployments ought to behave, how to start clusters, how to deploy jobs, how to upgrade them and how to react if there are problems. The main goal of the operator is the automation of these activities, which cannot be achieved through the Flink native integration alone.
Features
Core
- Fully-automated Job Lifecycle Management
- Running, suspending and deleting applications
- Stateful and stateless application upgrades
- Triggering and managing savepoints
- Handling errors, rolling-back broken upgrades
- Multiple Flink version support: v1.16, v1.17, v1.18, v1.19, v1.20
- Deployment Modes:
- Application cluster
- Session cluster
- Session job
- Built-in High Availability
- Extensible framework
- Advanced Configuration management
- Default configurations with dynamic updates
- Per job configuration
- Environment variables
- POD augmentation via Pod Templates
- Native Kubernetes POD definitions
- Layering (Base/JobManager/TaskManager overrides)
- Job Autoscaler
- Collect lag and utilization metrics
- Scale job vertices to the ideal parallelism
- Scale up and down as the load changes
- Snapshot management
- Manage snapshots via Kubernetes CRs
Operations
- Operator Metrics
- Utilizes the well-established Flink Metric System
- Pluggable metrics reporters
- Detailed resources and kubernetes api access metrics
- Fully-customizable Logging
- Default log configuration
- Per job log configuration
- Sidecar based log forwarders
- Flink Web UI and REST Endpoint Access
- Fully supported Flink Native Kubernetes service expose types
- Dynamic Ingress templates
- Helm based installation
- Automated RBAC configuration
- Advanced customization techniques
- Up-to-date public repositories
- GitHub Container Registry ghcr.io/apache/flink-kubernetes-operator
- DockerHub https://hub.docker.com/r/apache/flink-kubernetes-operator
Built-in Examples
The operator project comes with a wide variety of built in examples to show you how to use the operator functionality. The examples are maintained as part of the operator repo and can be found here.
What is covered:
- Application, Session and SessionJob submission
- Checkpointing and HA configuration
- Java, SQL and Python Flink jobs
- Ingress, logging and metrics configuration
- Advanced operator deployment techniques using Kustomize
- And some more…
Known Issues & Limitations
JobManager High-availability
The Operator supports both Kubernetes HA Services and Zookeeper HA Services for providing High-availability for Flink jobs. The HA solution can benefit form using additional Standby replicas, it will result in a faster recovery time, but Flink jobs will still restart when the Leader JobManager goes down.
JobResultStore Resource Leak
To mitigate the impact of FLINK-27569 the operator introduced a workaround FLINK-27573 by setting job-result-store.delete-on-commit=false
and a unique value for job-result-store.storage-path
for every cluster launch. The storage path for older runs must be cleaned up manually, keeping the latest directory always:
ls -lth /tmp/flink/ha/job-result-store/basic-checkpoint-ha-example/
total 0
drwxr-xr-x 2 9999 9999 40 May 12 09:51 119e0203-c3a9-4121-9a60-d58839576f01 <- must be retained
drwxr-xr-x 2 9999 9999 60 May 12 09:46 a6031ec7-ab3e-4b30-ba77-6498e58e6b7f
drwxr-xr-x 2 9999 9999 60 May 11 15:11 b6fb2a9c-d1cd-4e65-a9a1-e825c4b47543
AuditUtils can log sensitive information present in the custom resources
As reported in FLINK-30306 when Flink custom resources change the operator logs the change, which could include sensitive information. We suggest ingesting secrets to Flink containers during runtime to mitigate this. Also note that anyone who has access to the custom resources already had access to the potentially sensitive information in question, but folks who only have access to the logs could also see them now. We are planning to introduce redaction rules to AuditUtils to improve this in a later release.