基于 Table API 实现实时报表

Apache Flink offers a Table API as a unified, relational API for batch and stream processing, i.e., queries are executed with the same semantics on unbounded, real-time streams or bounded, batch data sets and produce the same results. The Table API in Flink is commonly used to ease the definition of data analytics, data pipelining, and ETL applications.

What Will You Be Building?

In this tutorial, you will learn how to build a real-time dashboard to track financial transactions by account. The pipeline will read data from Kafka and write the results to MySQL visualized via Grafana.

Prerequisites

This walkthrough assumes that you have some familiarity with Java or Scala, but you should be able to follow along even if you come from a different programming language. It also assumes that you are familiar with basic relational concepts such as SELECT and GROUP BY clauses.

Help, I’m Stuck!

If you get stuck, check out the community support resources. In particular, Apache Flink’s user mailing list consistently ranks as one of the most active of any Apache project and a great way to get help quickly.

If running docker on windows and your data generator container is failing to start, then please ensure that you’re using the right shell. For example docker-entrypoint.sh for table-walkthrough_data-generator_1 container requires bash. If unavailable, it will throw an error standard_init_linux.go:211: exec user process caused “no such file or directory”. A workaround is to switch the shell to sh on the first line of docker-entrypoint.sh.

How To Follow Along

If you want to follow along, you will require a computer with:

  • Java 8 or 11
  • Maven
  • Docker

The required configuration files are available in the flink-playgrounds repository. Once downloaded, open the project flink-playground/table-walkthrough in your IDE and navigate to the file SpendReport.

  1. EnvironmentSettings settings = EnvironmentSettings.newInstance().build();
  2. TableEnvironment tEnv = TableEnvironment.create(settings);
  3. tEnv.executeSql("CREATE TABLE transactions (\n" +
  4. " account_id BIGINT,\n" +
  5. " amount BIGINT,\n" +
  6. " transaction_time TIMESTAMP(3),\n" +
  7. " WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND\n" +
  8. ") WITH (\n" +
  9. " 'connector' = 'kafka',\n" +
  10. " 'topic' = 'transactions',\n" +
  11. " 'properties.bootstrap.servers' = 'kafka:9092',\n" +
  12. " 'format' = 'csv'\n" +
  13. ")");
  14. tEnv.executeSql("CREATE TABLE spend_report (\n" +
  15. " account_id BIGINT,\n" +
  16. " log_ts TIMESTAMP(3),\n" +
  17. " amount BIGINT\n," +
  18. " PRIMARY KEY (account_id, log_ts) NOT ENFORCED" +
  19. ") WITH (\n" +
  20. " 'connector' = 'jdbc',\n" +
  21. " 'url' = 'jdbc:mysql://mysql:3306/sql-demo',\n" +
  22. " 'table-name' = 'spend_report',\n" +
  23. " 'driver' = 'com.mysql.jdbc.Driver',\n" +
  24. " 'username' = 'sql-demo',\n" +
  25. " 'password' = 'demo-sql'\n" +
  26. ")");
  27. Table transactions = tEnv.from("transactions");
  28. report(transactions).executeInsert("spend_report");

Breaking Down The Code

The Execution Environment

The first two lines set up your TableEnvironment. The table environment is how you can set properties for your Job, specify whether you are writing a batch or a streaming application, and create your sources. This walkthrough creates a standard table environment that uses the streaming execution.

  1. EnvironmentSettings settings = EnvironmentSettings.newInstance().build();
  2. TableEnvironment tEnv = TableEnvironment.create(settings);

Registering Tables

Next, tables are registered in the current catalog that you can use to connect to external systems for reading and writing both batch and streaming data. A table source provides access to data stored in external systems, such as a database, a key-value store, a message queue, or a file system. A table sink emits a table to an external storage system. Depending on the type of source and sink, they support different formats such as CSV, JSON, Avro, or Parquet.

  1. tEnv.executeSql("CREATE TABLE transactions (\n" +
  2. " account_id BIGINT,\n" +
  3. " amount BIGINT,\n" +
  4. " transaction_time TIMESTAMP(3),\n" +
  5. " WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND\n" +
  6. ") WITH (\n" +
  7. " 'connector' = 'kafka',\n" +
  8. " 'topic' = 'transactions',\n" +
  9. " 'properties.bootstrap.servers' = 'kafka:9092',\n" +
  10. " 'format' = 'csv'\n" +
  11. ")");

Two tables are registered; a transaction input table, and a spend report output table. The transactions (transactions) table lets us read credit card transactions, which contain account ID’s (account_id), timestamps (transaction_time), and US$ amounts (amount). The table is a logical view over a Kafka topic called transactions containing CSV data.

  1. tEnv.executeSql("CREATE TABLE spend_report (\n" +
  2. " account_id BIGINT,\n" +
  3. " log_ts TIMESTAMP(3),\n" +
  4. " amount BIGINT\n," +
  5. " PRIMARY KEY (account_id, log_ts) NOT ENFORCED" +
  6. ") WITH (\n" +
  7. " 'connector' = 'jdbc',\n" +
  8. " 'url' = 'jdbc:mysql://mysql:3306/sql-demo',\n" +
  9. " 'table-name' = 'spend_report',\n" +
  10. " 'driver' = 'com.mysql.jdbc.Driver',\n" +
  11. " 'username' = 'sql-demo',\n" +
  12. " 'password' = 'demo-sql'\n" +
  13. ")");

The second table, spend_report, stores the final results of the aggregation. Its underlying storage is a table in a MySql database.

The Query

With the environment configured and tables registered, you are ready to build your first application. From the TableEnvironment you can read from an input table to read its rows and then write those results into an output table using executeInsert. The report function is where you will implement your business logic. It is currently unimplemented.

  1. Table transactions = tEnv.from("transactions");
  2. report(transactions).executeInsert("spend_report");

Testing

The project contains a secondary testing class SpendReportTest that validates the logic of the report. It creates a table environment in batch mode.

  1. EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
  2. TableEnvironment tEnv = TableEnvironment.create(settings);

One of Flink’s unique properties is that it provides consistent semantics across batch and streaming. This means you can develop and test applications in batch mode on static datasets, and deploy to production as streaming applications.

Attempt One

Now with the skeleton of a Job set-up, you are ready to add some business logic. The goal is to build a report that shows the total spend for each account across each hour of the day. This means the timestamp column needs be be rounded down from millisecond to hour granularity.

Flink supports developing relational applications in pure SQL or using the Table API. The Table API is a fluent DSL inspired by SQL, that can be written in Python, Java, or Scala and supports strong IDE integration. Just like a SQL query, Table programs can select the required fields and group by your keys. These features, allong with built-in functions like floor and sum, you can write this report.

  1. public static Table report(Table transactions) {
  2. return transactions.select(
  3. $("account_id"),
  4. $("transaction_time").floor(TimeIntervalUnit.HOUR).as("log_ts"),
  5. $("amount"))
  6. .groupBy($("account_id"), $("log_ts"))
  7. .select(
  8. $("account_id"),
  9. $("log_ts"),
  10. $("amount").sum().as("amount"));
  11. }

User Defined Functions

Flink contains a limited number of built-in functions, and sometimes you need to extend it with a user-defined function. If floor wasn’t predefined, you could implement it yourself.

  1. import java.time.LocalDateTime;
  2. import java.time.temporal.ChronoUnit;
  3. import org.apache.flink.table.annotation.DataTypeHint;
  4. import org.apache.flink.table.functions.ScalarFunction;
  5. public class MyFloor extends ScalarFunction {
  6. public @DataTypeHint("TIMESTAMP(3)") LocalDateTime eval(
  7. @DataTypeHint("TIMESTAMP(3)") LocalDateTime timestamp) {
  8. return timestamp.truncatedTo(ChronoUnit.HOURS);
  9. }
  10. }

And then quickly integrate it in your application.

  1. public static Table report(Table transactions) {
  2. return transactions.select(
  3. $("account_id"),
  4. call(MyFloor.class, $("transaction_time")).as("log_ts"),
  5. $("amount"))
  6. .groupBy($("account_id"), $("log_ts"))
  7. .select(
  8. $("account_id"),
  9. $("log_ts"),
  10. $("amount").sum().as("amount"));
  11. }

This query consumes all records from the transactions table, calculates the report, and outputs the results in an efficient, scalable manner. Running the test with this implementation will pass.

Adding Windows

Grouping data based on time is a typical operation in data processing, especially when working with infinite streams. A grouping based on time is called a window and Flink offers flexible windowing semantics. The most basic type of window is called a Tumble window, which has a fixed size and whose buckets do not overlap.

  1. public static Table report(Table transactions) {
  2. return transactions
  3. .window(Tumble.over(lit(1).hour()).on($("transaction_time")).as("log_ts"))
  4. .groupBy($("account_id"), $("log_ts"))
  5. .select(
  6. $("account_id"),
  7. $("log_ts").start().as("log_ts"),
  8. $("amount").sum().as("amount"));
  9. }

This defines your application as using one hour tumbling windows based on the timestamp column. So a row with timestamp 2019-06-01 01:23:47 is put in the 2019-06-01 01:00:00 window.

Aggregations based on time are unique because time, as opposed to other attributes, generally moves forward in a continuous streaming application. Unlike floor and your UDF, window functions are intrinsics, which allows the runtime to apply additional optimizations. In a batch context, windows offer a convenient API for grouping records by a timestamp attribute.

Running the test with this implementation will also pass.

Once More, With Streaming!

And that’s it, a fully functional, stateful, distributed streaming application! The query continuously consumes the stream of transactions from Kafka, computes the hourly spendings, and emits results as soon as they are ready. Since the input is unbounded, the query keeps running until it is manually stopped. And because the Job uses time window-based aggregations, Flink can perform specific optimizations such as state clean up when the framework knows that no more records will arrive for a particular window.

The table playground is fully dockerized and runnable locally as streaming application. The environment contains a Kafka topic, a continuous data generator, MySql, and Grafana.

From within the table-walkthrough folder start the docker-compose script.

  1. $ docker-compose build
  2. $ docker-compose up -d

You can see information on the running job via the Flink console.

![Flink Console]({% link /fig/spend-report-console.png %}){:height=“400px” width=“800px”}

Explore the results from inside MySQL.

  1. $ docker-compose exec mysql mysql -Dsql-demo -usql-demo -pdemo-sql
  2. mysql> use sql-demo;
  3. Database changed
  4. mysql> select count(*) from spend_report;
  5. +----------+
  6. | count(*) |
  7. +----------+
  8. | 110 |
  9. +----------+

Finally, go to Grafana to see the fully visualized result!

Grafana