DataSet API 编程指南

DataSet programs in Flink are regular programs that implement transformations on data sets (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources (e.g., by reading files, or from local collections). Results are returned via sinks, which may for example write the data to (distributed) files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines.

Please refer to the DataStream API overview for an introduction to the basic concepts of the Flink API. That overview is for the DataStream API but the basic concepts of the two APIs are the same.

In order to create your own Flink DataSet program, we encourage you to start with the anatomy of a Flink Program and gradually add your own transformations. The remaining sections act as references for additional operations and advanced features.

Starting with Flink 1.12 the DataSet has been soft deprecated. We recommend that you use the DataStream API with BATCH execution mode. The linked section also outlines cases where it makes sense to use the DataSet API but those cases will become rarer as development progresses and the DataSet API will eventually be removed. Please also see FLIP-131 for background information on this decision.

Example Program

The following program is a complete, working example of WordCount. You can copy & paste the code to run it locally. You only have to include the correct Flink’s library into your project and specify the imports. Then you are ready to go!

Java

  1. public class WordCountExample {
  2. public static void main(String[] args) throws Exception {
  3. final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  4. DataSet<String> text = env.fromElements(
  5. "Who's there?",
  6. "I think I hear them. Stand, ho! Who's there?");
  7. DataSet<Tuple2<String, Integer>> wordCounts = text
  8. .flatMap(new LineSplitter())
  9. .groupBy(0)
  10. .sum(1);
  11. wordCounts.print();
  12. }
  13. public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
  14. @Override
  15. public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
  16. for (String word : line.split(" ")) {
  17. out.collect(new Tuple2<String, Integer>(word, 1));
  18. }
  19. }
  20. }
  21. }

Scala

  1. import org.apache.flink.api.scala._
  2. object WordCount {
  3. def main(args: Array[String]) {
  4. val env = ExecutionEnvironment.getExecutionEnvironment
  5. val text = env.fromElements(
  6. "Who's there?",
  7. "I think I hear them. Stand, ho! Who's there?")
  8. val counts = text
  9. .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  10. .map { (_, 1) }
  11. .groupBy(0)
  12. .sum(1)
  13. counts.print()
  14. }
  15. }

DataSet Transformations

Data transformations transform one or more DataSets into a new DataSet. Programs can combine multiple transformations into sophisticated assemblies.

Map

Takes one element and produces one element.

Java

  1. data.map(new MapFunction<String, Integer>() {
  2. public Integer map(String value) { return Integer.parseInt(value); }
  3. });

Scala

  1. data.map { x => x.toInt }

FlatMap

Takes one element and produces zero, one, or more elements.

Java

  1. data.flatMap(new FlatMapFunction<String, String>() {
  2. public void flatMap(String value, Collector<String> out) {
  3. for (String s : value.split(" ")) {
  4. out.collect(s);
  5. }
  6. }
  7. });

Scala

  1. data.flatMap { str => str.split(" ") }

MapPartition

Transforms a parallel partition in a single function call. The function gets the partition as an Iterable stream and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree-of-parallelism and previous operations.

Java

  1. data.mapPartition(new MapPartitionFunction<String, Long>() {
  2. public void mapPartition(Iterable<String> values, Collector<Long> out) {
  3. long c = 0;
  4. for (String s : values) {
  5. c++;
  6. }
  7. out.collect(c);
  8. }
  9. });

Scala

  1. data.mapPartition { in => in map { (_, 1) } }

Filter

Evaluates a boolean function for each element and retains those for which the function returns true. IMPORTANT: The system assumes that the function does not modify the element on which the predicate is applied. Violating this assumption can lead to incorrect results.

Java

  1. data.filter(new FilterFunction<Integer>() {
  2. public boolean filter(Integer value) { return value > 1000; }
  3. });

Scala

  1. data.filter { _ > 1000 }

Reduce

Combines a group of elements into a single element by repeatedly combining two elements into one. Reduce may be applied on a full data set or on a grouped data set.

Java

  1. data.reduce(new ReduceFunction<Integer> {
  2. public Integer reduce(Integer a, Integer b) { return a + b; }
  3. });

Scala

  1. data.reduce { _ + _ }

If the reduce was applied to a grouped data set then you can specify the way that the runtime executes the combine phase of the reduce by supplying a CombineHint to setCombineHint. The hash-based strategy should be faster in most cases, especially if the number of different keys is small compared to the number of input elements (eg. 1/10).

ReduceGroup

Combines a group of elements into one or more elements. ReduceGroup may be applied on a full data set, or on a grouped data set.

Java

  1. data.reduceGroup(new GroupReduceFunction<Integer, Integer> {
  2. public void reduce(Iterable<Integer> values, Collector<Integer> out) {
  3. int prefixSum = 0;
  4. for (Integer i : values) {
  5. prefixSum += i;
  6. out.collect(prefixSum);
  7. }
  8. }
  9. });

Scala

  1. data.reduceGroup { elements => elements.sum }

Aggregate

Aggregates a group of values into a single value. Aggregation functions can be thought of as built-in reduce functions. Aggregate may be applied on a full data set, or on a grouped data set.

Java

  1. Dataset<Tuple3<Integer, String, Double>> input = // [...]
  2. DataSet<Tuple3<Integer, String, Double>> output = input.aggregate(SUM, 0).and(MIN, 2);

Scala

  1. val input: DataSet[(Int, String, Double)] = // [...]
  2. val output: DataSet[(Int, String, Double)] = input.aggregate(SUM, 0).aggregate(MIN, 2)

Distinct

Returns the distinct elements of a data set. It removes the duplicate entries from the input DataSet, with respect to all fields of the elements, or a subset of fields.

Java

  1. data.distinct()

Scala

  1. data.distinct()

Join

Joins two data sets by creating all pairs of elements that are equal on their keys. Optionally uses a JoinFunction to turn the pair of elements into a single element, or a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.

Java

  1. result = input1.join(input2)
  2. .where(0) // key of the first input (tuple field 0)
  3. .equalTo(1); // key of the second input (tuple field 1)

Scala

  1. // In this case tuple fields are used as keys. "0" is the join field on the first tuple
  2. // "1" is the join field on the second tuple.
  3. val result = input1.join(input2).where(0).equalTo(1)

You can specify the way that the runtime executes the join via Join Hints. The hints describe whether the join happens through partitioning or broadcasting, and whether it uses a sort-based or a hash-based algorithm. Please refer to the Transformations Guide for a list of possible hints and an example. If no hint is specified, the system will try to make an estimate of the input sizes and pick the best strategy according to those estimates.

Java

  1. // This executes a join by broadcasting the first data set
  2. // using a hash table for the broadcast data
  3. result = input1.join(input2, JoinHint.BROADCAST_HASH_FIRST)
  4. .where(0).equalTo(1);

Scala

  1. // This executes a join by broadcasting the first data set
  2. // using a hash table for the broadcast data
  3. val result = input1.join(input2, JoinHint.BROADCAST_HASH_FIRST)
  4. .where(0).equalTo(1)

Note that the join transformation works only for equi-joins. Other join types need to be expressed using OuterJoin or CoGroup.

OuterJoin

Performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, records of the “outer” side (left, right, or both in case of full) are preserved if no matching key is found in the other side. Matching pairs of elements (or one element and a null value for the other input) are given to a JoinFunction to turn the pair of elements into a single element, or to a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.

Java

  1. input1.leftOuterJoin(input2) // rightOuterJoin or fullOuterJoin for right or full outer joins
  2. .where(0) // key of the first input (tuple field 0)
  3. .equalTo(1) // key of the second input (tuple field 1)
  4. .with(new JoinFunction<String, String, String>() {
  5. public String join(String v1, String v2) {
  6. // NOTE:
  7. // - v2 might be null for leftOuterJoin
  8. // - v1 might be null for rightOuterJoin
  9. // - v1 OR v2 might be null for fullOuterJoin
  10. }
  11. });

Scala

  1. val joined = left.leftOuterJoin(right).where(0).equalTo(1) {
  2. (left, right) =>
  3. val a = if (left == null) "none" else left._1
  4. (a, right)
  5. }

CoGroup

The two-dimensional variant of the reduce operation. Groups each input on one or more fields and then joins the groups. The transformation function is called per pair of groups. See the keys section to learn how to define coGroup keys.

Java

  1. data1.coGroup(data2)
  2. .where(0)
  3. .equalTo(1)
  4. .with(new CoGroupFunction<String, String, String>() {
  5. public void coGroup(Iterable<String> in1, Iterable<String> in2, Collector<String> out) {
  6. out.collect(...);
  7. }
  8. });

Scala

  1. data1.coGroup(data2).where(0).equalTo(1)

Cross

Builds the Cartesian product (cross product) of two inputs, creating all pairs of elements. Optionally uses a CrossFunction to turn the pair of elements into a single element

Java

  1. DataSet<Integer> data1 = // [...]
  2. DataSet<String> data2 = // [...]
  3. DataSet<Tuple2<Integer, String>> result = data1.cross(data2);

Scala

  1. val data1: DataSet[Int] = // [...]
  2. val data2: DataSet[String] = // [...]
  3. val result: DataSet[(Int, String)] = data1.cross(data2)

Cross is potentially a very compute-intensive operation which can challenge even large compute clusters! It is advised to hint the system with the DataSet sizes by using crossWithTiny() and crossWithHuge().

Union

Produces the union of two data sets.

Java

  1. data.union(data2)

Scala

  1. data.union(data2)

Rebalance

Evenly rebalances the parallel partitions of a data set to eliminate data skew. Only Map-like transformations may follow a rebalance transformation.

Java

  1. DataSet<Int> data1 = // [...]
  2. DataSet<Tuple2<Int, String>> result = data1.rebalance().map(...)

Scala

  1. val data1: DataSet[Int] = // [...]
  2. val result: DataSet[(Int, String)] = data1.rebalance().map(...)

Hash-Partition

Hash-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.

Java

  1. DataSet<Tuple2<String,Integer>> in = // [...]
  2. DataSet<Integer> result = in.partitionByHash(0)
  3. .mapPartition(new PartitionMapper());

Scala

  1. val in: DataSet[(Int, String)] = // [...]
  2. val result = in.partitionByHash(0).mapPartition { ... }

Range-Partition

Range-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.

Java

  1. DataSet<Tuple2<String,Integer>> in = // [...]
  2. DataSet<Integer> result = in.partitionByRange(0)
  3. .mapPartition(new PartitionMapper());

Scala

  1. val in: DataSet[(Int, String)] = // [...]
  2. val result = in.partitionByRange(0).mapPartition { ... }

Custom Partitioning

Assigns records based on a key to a specific partition using a custom Partitioner function. The key can be specified as position key, expression key, and key selector function. Note: This method only works with a single field key.

Java

  1. DataSet<Tuple2<String,Integer>> in = // [...]
  2. DataSet<Integer> result = in.partitionCustom(partitioner, key)
  3. .mapPartition(new PartitionMapper());

Scala

  1. val in: DataSet[(Int, String)] = // [...]
  2. val result = in
  3. .partitionCustom(partitioner, key).mapPartition { ... }

Sort Partitioning

Locally sorts all partitions of a data set on a specified field in a specified order. Fields can be specified as tuple positions or field expressions. Sorting on multiple fields is done by chaining sortPartition() calls.

Java

  1. DataSet<Tuple2<String,Integer>> in = // [...]
  2. DataSet<Integer> result = in.sortPartition(1, Order.ASCENDING)
  3. .mapPartition(new PartitionMapper());

Scala

  1. val in: DataSet[(Int, String)] = // [...]
  2. val result = in.sortPartition(1, Order.ASCENDING).mapPartition { ... }

First-N

Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key-selector functions or field position keys.

Java

  1. DataSet<Tuple2<String,Integer>> in = // [...]
  2. // regular data set
  3. DataSet<Tuple2<String,Integer>> result1 = in.first(3);
  4. // grouped data set
  5. DataSet<Tuple2<String,Integer>> result2 = in.groupBy(0)
  6. .first(3);
  7. // grouped-sorted data set
  8. DataSet<Tuple2<String,Integer>> result3 = in.groupBy(0)
  9. .sortGroup(1, Order.ASCENDING)
  10. .first(3);

Scala

  1. val in: DataSet[(Int, String)] = // [...]
  2. // regular data set
  3. val result1 = in.first(3)
  4. // grouped data set
  5. val result2 = in.groupBy(0).first(3)
  6. // grouped-sorted data set
  7. val result3 = in.groupBy(0).sortGroup(1, Order.ASCENDING).first(3)

Project

Selects a subset of fields from tuples.

Java

  1. DataSet<Tuple3<Integer, Double, String>> in = // [...]
  2. DataSet<Tuple2<String, Integer>> out = in.project(2,0);

Scala

This feature is not available in the Scala API

MinBy / MaxBy

Selects a tuple from a group of tuples whose values of one or more fields are minimum (maximum). The fields which are used for comparison must be valid key fields, i.e., comparable. If multiple tuples have minimum (maximum) field values, an arbitrary tuple of these tuples is returned. MinBy (MaxBy) may be applied on a full data set or a grouped data set.

Java

  1. DataSet<Tuple3<Integer, Double, String>> in = // [...]
  2. // a DataSet with a single tuple with minimum values for the Integer and String fields.
  3. DataSet<Tuple3<Integer, Double, String>> out = in.minBy(0, 2);
  4. // a DataSet with one tuple for each group with the minimum value for the Double field.
  5. DataSet<Tuple3<Integer, Double, String>> out2 = in.groupBy(2)
  6. .minBy(1);

Scala

  1. val in: DataSet[(Int, Double, String)] = // [...]
  2. // a data set with a single tuple with minimum values for the Int and String fields.
  3. val out: DataSet[(Int, Double, String)] = in.minBy(0, 2)
  4. // a data set with one tuple for each group with the minimum value for the Double field.
  5. val out2: DataSet[(Int, Double, String)] = in.groupBy(2)
  6. .minBy(1)

Specifying Keys

Some transformations (join, coGroup, groupBy) require that a key be defined on a collection of elements. Other transformations (Reduce, GroupReduce, Aggregate) allow data being grouped on a key before they are applied.

A DataSet is grouped as

  1. DataSet<...> input = // [...]
  2. DataSet<...> reduced = input
  3. .groupBy(/*define key here*/)
  4. .reduceGroup(/*do something*/);

The data model of Flink is not based on key-value pairs. Therefore, you do not need to physically pack the data set types into keys and values. Keys are “virtual”: they are defined as functions over the actual data to guide the grouping operator.

Define keys for Tuples

The simplest case is grouping Tuples on one or more fields of the Tuple:

Java

  1. DataSet<Tuple3<Integer,String,Long>> input = // [...]
  2. UnsortedGrouping<Tuple3<Integer,String,Long>,Tuple> keyed = input.groupBy(0)

Scala

  1. val input: DataSet[(Int, String, Long)] = // [...]
  2. val keyed = input.groupBy(0)

Tuples are grouped on the first field (the one of Integer type).

Java

  1. DataSet<Tuple3<Integer,String,Long>> input = // [...]
  2. UnsortedGrouping<Tuple3<Integer,String,Long>,Tuple> keyed = input.groupBy(0,1)

Scala

  1. val input: DataSet[(Int, String, Long)] = // [...]
  2. val grouped = input.groupBy(0,1)

Here, we group the tuples on a composite key consisting of the first and the second field.

A note on nested Tuples: If you have a DataSet with a nested tuple, such as:

  1. DataSet<Tuple3<Tuple2<Integer, Float>,String,Long>> ds;

Specifying groupBy(0) will cause the system to use the full Tuple2 as a key (with the Integer and Float being the key). If you want to “navigate” into the nested Tuple2, you have to use field expression keys which are explained below.

Define keys using Field Expressions

You can use String-based field expressions to reference nested fields and define keys for grouping, sorting, joining, or coGrouping. Field expressions make it very easy to select fields in (nested) composite types such as Tuple and POJO types.

In the example below, we have a WC POJO with two fields “word” and “count”. To group by the field word, we just pass its name to the groupBy() function.

Java

  1. // some ordinary POJO (Plain old Java Object)
  2. public class WC {
  3. public String word;
  4. public int count;
  5. }
  6. DataSet<WC> words = // [...]
  7. DataSet<WC> wordCounts = words.groupBy("word")

Scala

  1. // some ordinary POJO (Plain old Java Object)
  2. class WC(var word: String, var count: Int) {
  3. def this() { this("", 0L) }
  4. }
  5. val words: DataSet[WC] = // [...]
  6. val wordCounts = words.groupBy("word")
  7. // or, as a case class, which is less typing
  8. case class WC(word: String, count: Int)
  9. val words: DataSet[WC] = // [...]
  10. val wordCounts = words.groupBy("word")

Field Expression Syntax:

  • Select POJO fields by their field name. For example “user” refers to the “user” field of a POJO type.

  • Select Tuple fields by their 1-offset field name or 0-offset field index. For example “_1” and “5” refer to the first and sixth field of a Scala Tuple type, respectively.

  • You can select nested fields in POJOs and Tuples. For example “user.zip” refers to the “zip” field of a POJO which is stored in the “user” field of a POJO type. Arbitrary nesting and mixing of POJOs and Tuples is supported such as “_2.user.zip” or “user._4.1.zip”.

  • You can select the full type using the “_” wildcard expressions. This does also work for types which are not Tuple or POJO types.

Field Expression Example:

Java

  1. public static class WC {
  2. public ComplexNestedClass complex; //nested POJO
  3. private int count;
  4. // getter / setter for private field (count)
  5. public int getCount() {
  6. return count;
  7. }
  8. public void setCount(int c) {
  9. this.count = c;
  10. }
  11. }
  12. public static class ComplexNestedClass {
  13. public Integer someNumber;
  14. public float someFloat;
  15. public Tuple3<Long, Long, String> word;
  16. public IntWritable hadoopCitizen;
  17. }

Scala

  1. class WC(var complex: ComplexNestedClass, var count: Int) {
  2. def this() { this(null, 0) }
  3. }
  4. class ComplexNestedClass(
  5. var someNumber: Int,
  6. someFloat: Float,
  7. word: (Long, Long, String),
  8. hadoopCitizen: IntWritable) {
  9. def this() { this(0, 0, (0, 0, ""), new IntWritable(0)) }
  10. }

These are valid field expressions for the example code above:

  • “count”: The count field in the WC class.

  • “complex”: Recursively selects all fields of the field complex of POJO type ComplexNestedClass.

  • “complex.word.f2”: Selects the last field of the nested Tuple3.

  • “complex.hadoopCitizen”: Selects the Hadoop IntWritable type.

Define keys using Key Selector Functions

An additional way to define keys are “key selector” functions. A key selector function takes a single element as input and returns the key for the element. The key can be of any type and be derived from deterministic computations.

The following example shows a key selector function that simply returns the field of an object:

Java

  1. // some ordinary POJO
  2. public class WC {public String word; public int count;}
  3. DataSet<WC> words = // [...]
  4. UnsortedGrouping<WC> keyed = words
  5. .groupBy(new KeySelector<WC, String>() {
  6. public String getKey(WC wc) { return wc.word; }
  7. });

Scala

  1. // some ordinary case class
  2. case class WC(word: String, count: Int)
  3. val words: DataSet[WC] = // [...]
  4. val keyed = words.groupBy( _.word )

Data Sources

Data sources create the initial data sets, such as from files or from Java collections. The general mechanism of creating data sets is abstracted behind an InputFormat. Flink comes with several built-in formats to create data sets from common file formats. Many of them have shortcut methods on the ExecutionEnvironment.

File-based:

  • readTextFile(path) / TextInputFormat - Reads files line wise and returns them as Strings.

  • readTextFileWithValue(path) / TextValueInputFormat - Reads files line wise and returns them as StringValues. StringValues are mutable strings.

  • readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic java types and their Value counterparts as field types.

  • readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.

  • readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.

Collection-based:

  • fromCollection(Collection) - Creates a data set from a Java.util.Collection. All elements in the collection must be of the same type.

  • fromCollection(Iterator, Class) - Creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.

  • fromElements(T …) - Creates a data set from the given sequence of objects. All objects must be of the same type.

  • fromParallelCollection(SplittableIterator, Class) - Creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.

  • generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.

Generic:

  • readFile(inputFormat, path) / FileInputFormat - Accepts a file input format.

  • createInput(inputFormat) / InputFormat - Accepts a generic input format.

    Java

  1. ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  2. // read text file from local files system
  3. DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");
  4. // read text file from an HDFS running at nnHost:nnPort
  5. DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
  6. // read a CSV file with three fields
  7. DataSet<Tuple3<Integer, String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
  8. .types(Integer.class, String.class, Double.class);
  9. // read a CSV file with five fields, taking only two of them
  10. DataSet<Tuple2<String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
  11. .includeFields("10010") // take the first and the fourth field
  12. .types(String.class, Double.class);
  13. // read a CSV file with three fields into a POJO (Person.class) with corresponding fields
  14. DataSet<Person>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
  15. .pojoType(Person.class, "name", "age", "zipcode");
  16. // read a file from the specified path of type SequenceFileInputFormat
  17. DataSet<Tuple2<IntWritable, Text>> tuples =
  18. env.createInput(HadoopInputs.readSequenceFile(IntWritable.class, Text.class, "hdfs://nnHost:nnPort/path/to/file"));
  19. // creates a set from some given elements
  20. DataSet<String> value = env.fromElements("Foo", "bar", "foobar", "fubar");
  21. // generate a number sequence
  22. DataSet<Long> numbers = env.generateSequence(1, 10000000);
  23. // Read data from a relational database using the JDBC input format
  24. DataSet<Tuple2<String, Integer> dbData =
  25. env.createInput(
  26. JdbcInputFormat.buildJdbcInputFormat()
  27. .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")
  28. .setDBUrl("jdbc:derby:memory:persons")
  29. .setQuery("select name, age from persons")
  30. .setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO))
  31. .finish()
  32. );
  33. // Note: Flink's program compiler needs to infer the data types of the data items which are returned
  34. // by an InputFormat. If this information cannot be automatically inferred, it is necessary to
  35. // manually provide the type information as shown in the examples above.

Scala

  1. val env = ExecutionEnvironment.getExecutionEnvironment
  2. // read text file from local files system
  3. val localLines = env.readTextFile("file:///path/to/my/textfile")
  4. // read text file from an HDFS running at nnHost:nnPort
  5. val hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile")
  6. // read a CSV file with three fields
  7. val csvInput = env.readCsvFile[(Int, String, Double)]("hdfs:///the/CSV/file")
  8. // read a CSV file with five fields, taking only two of them
  9. val csvInput = env.readCsvFile[(String, Double)](
  10. "hdfs:///the/CSV/file",
  11. includedFields = Array(0, 3)) // take the first and the fourth field
  12. // CSV input can also be used with Case Classes
  13. case class MyCaseClass(str: String, dbl: Double)
  14. val csvInput = env.readCsvFile[MyCaseClass](
  15. "hdfs:///the/CSV/file",
  16. includedFields = Array(0, 3)) // take the first and the fourth field
  17. // read a CSV file with three fields into a POJO (Person) with corresponding fields
  18. val csvInput = env.readCsvFile[Person](
  19. "hdfs:///the/CSV/file",
  20. pojoFields = Array("name", "age", "zipcode"))
  21. // create a set from some given elements
  22. val values = env.fromElements("Foo", "bar", "foobar", "fubar")
  23. // generate a number sequence
  24. val numbers = env.generateSequence(1, 10000000)
  25. // read a file from the specified path of type SequenceFileInputFormat
  26. val tuples = env.createInput(HadoopInputs.readSequenceFile(classOf[IntWritable], classOf[Text],
  27. "hdfs://nnHost:nnPort/path/to/file"))

Configuring CSV Parsing

Flink offers a number of configuration options for CSV parsing:

  • types(Class … types) specifies the types of the fields to parse. It is mandatory to configure the types of the parsed fields. In case of the type class Boolean.class, “True” (case-insensitive), “False” (case-insensitive), “1” and “0” are treated as booleans.

  • lineDelimiter(String del) specifies the delimiter of individual records. The default line delimiter is the new-line character ‘\n’.

  • fieldDelimiter(String del) specifies the delimiter that separates fields of a record. The default field delimiter is the comma character ‘,’.

  • includeFields(boolean … flag), includeFields(String mask), or includeFields(long bitMask) defines which fields to read from the input file (and which to ignore). By default the first n fields (as defined by the number of types in the types() call) are parsed.

  • parseQuotedStrings(char quoteChar) enables quoted string parsing. Strings are parsed as quoted strings if the first character of the string field is the quote character (leading or tailing whitespaces are not trimmed). Field delimiters within quoted strings are ignored. Quoted string parsing fails if the last character of a quoted string field is not the quote character or if the quote character appears at some point which is not the start or the end of the quoted string field (unless the quote character is escaped using ‘’). If quoted string parsing is enabled and the first character of the field is not the quoting string, the string is parsed as unquoted string. By default, quoted string parsing is disabled.

  • ignoreComments(String commentPrefix) specifies a comment prefix. All lines that start with the specified comment prefix are not parsed and ignored. By default, no lines are ignored.

  • ignoreInvalidLines() enables lenient parsing, i.e., lines that cannot be correctly parsed are ignored. By default, lenient parsing is disabled and invalid lines raise an exception.

  • ignoreFirstLine() configures the InputFormat to ignore the first line of the input file. By default no line is ignored.

Recursive Traversal of the Input Path Directory

For file-based inputs, when the input path is a directory, nested files are not enumerated by default. Instead, only the files inside the base directory are read, while nested files are ignored. Recursive enumeration of nested files can be enabled through the recursive.file.enumeration configuration parameter, like in the following example.

Java

  1. // enable recursive enumeration of nested input files
  2. ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  3. // create a configuration object
  4. Configuration parameters = new Configuration();
  5. // set the recursive enumeration parameter
  6. parameters.setBoolean("recursive.file.enumeration", true);
  7. // pass the configuration to the data source
  8. DataSet<String> logs = env.readTextFile("file:///path/with.nested/files")
  9. .withParameters(parameters);

Scala

  1. // enable recursive enumeration of nested input files
  2. val env = ExecutionEnvironment.getExecutionEnvironment
  3. // create a configuration object
  4. val parameters = new Configuration
  5. // set the recursive enumeration parameter
  6. parameters.setBoolean("recursive.file.enumeration", true)
  7. // pass the configuration to the data source
  8. env.readTextFile("file:///path/with.nested/files").withParameters(parameters)

Read Compressed Files

Flink currently supports transparent decompression of input files if these are marked with an appropriate file extension. In particular, this means that no further configuration of the input formats is necessary and any FileInputFormat support the compression, including custom input formats. Please notice that compressed files might not be read in parallel, thus impacting job scalability.

The following table lists the currently supported compression methods.

Compressed MethodFile ExtensionsParallelizable
DEFLATE.deflateno
GZip.gz, .gzipno
Bzip2.bz2no
XZ.xzno
ZStandart.zstno

Data Sinks

Data sinks consume DataSets and are used to store or return them. Data sink operations are described using an OutputFormat. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataSet:

  • writeAsText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling the toString() method of each element.
  • writeAsFormattedText() / TextOutputFormat - Write elements line-wise as Strings. The Strings are obtained by calling a user-defined format() method for each element.
  • writeAsCsv(…) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects. print() / printToErr() / print(String msg) / printToErr(String msg) - Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to print. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output.
  • write() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.
  • output()/ OutputFormat - Most generic output method, for data sinks that are not file based (such as storing the result in a database).

A DataSet can be input to multiple operations. Programs can write or print a data set and at the same time run additional transformations on them.

Java

  1. // text data
  2. DataSet<String> textData = // [...]
  3. // write DataSet to a file on the local file system
  4. textData.writeAsText("file:///my/result/on/localFS");
  5. // write DataSet to a file on an HDFS with a namenode running at nnHost:nnPort
  6. textData.writeAsText("hdfs://nnHost:nnPort/my/result/on/localFS");
  7. // write DataSet to a file and overwrite the file if it exists
  8. textData.writeAsText("file:///my/result/on/localFS", WriteMode.OVERWRITE);
  9. // tuples as lines with pipe as the separator "a|b|c"
  10. DataSet<Tuple3<String, Integer, Double>> values = // [...]
  11. values.writeAsCsv("file:///path/to/the/result/file", "\n", "|");
  12. // this writes tuples in the text formatting "(a, b, c)", rather than as CSV lines
  13. values.writeAsText("file:///path/to/the/result/file");
  14. // this writes values as strings using a user-defined TextFormatter object
  15. values.writeAsFormattedText("file:///path/to/the/result/file",
  16. new TextFormatter<Tuple2<Integer, Integer>>() {
  17. public String format (Tuple2<Integer, Integer> value) {
  18. return value.f1 + " - " + value.f0;
  19. }
  20. });

Scala

  1. // text data
  2. val textData: DataSet[String] = // [...]
  3. // write DataSet to a file on the local file system
  4. textData.writeAsText("file:///my/result/on/localFS")
  5. // write DataSet to a file on an HDFS with a namenode running at nnHost:nnPort
  6. textData.writeAsText("hdfs://nnHost:nnPort/my/result/on/localFS")
  7. // write DataSet to a file and overwrite the file if it exists
  8. textData.writeAsText("file:///my/result/on/localFS", WriteMode.OVERWRITE)
  9. // tuples as lines with pipe as the separator "a|b|c"
  10. val values: DataSet[(String, Int, Double)] = // [...]
  11. values.writeAsCsv("file:///path/to/the/result/file", "\n", "|")
  12. // this writes tuples in the text formatting "(a, b, c)", rather than as CSV lines
  13. values.writeAsText("file:///path/to/the/result/file")
  14. // this writes values as strings using a user-defined formatting
  15. values map { tuple => tuple._1 + " - " + tuple._2 }
  16. .writeAsText("file:///path/to/the/result/file")

Or with a custom output format:

  1. DataSet<Tuple3<String, Integer, Double>> myResult = [...]
  2. // write Tuple DataSet to a relational database
  3. myResult.output(
  4. // build and configure OutputFormat
  5. JdbcOutputFormat.buildJdbcOutputFormat()
  6. .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")
  7. .setDBUrl("jdbc:derby:memory:persons")
  8. .setQuery("insert into persons (name, age, height) values (?,?,?)")
  9. .finish()
  10. );

Locally Sorted Output

The output of a data sink can be locally sorted on specified fields in specified orders using tuple field positions or field expressions. This works for every output format.

The following examples show how to use this feature:

  1. DataSet<Tuple3<Integer, String, Double>> tData = // [...]
  2. DataSet<Tuple2<BookPojo, Double>> pData = // [...]
  3. DataSet<String> sData = // [...]
  4. // sort output on String field in ascending order
  5. tData.sortPartition(1, Order.ASCENDING).print();
  6. // sort output on Double field in descending and Integer field in ascending order
  7. tData.sortPartition(2, Order.DESCENDING).sortPartition(0, Order.ASCENDING).print();
  8. // sort output on the "author" field of nested BookPojo in descending order
  9. pData.sortPartition("f0.author", Order.DESCENDING).writeAsText(...);
  10. // sort output on the full tuple in ascending order
  11. tData.sortPartition("*", Order.ASCENDING).writeAsCsv(...);
  12. // sort atomic type (String) output in descending order
  13. sData.sortPartition("*", Order.DESCENDING).writeAsText(...);

Globally sorted output is not supported.

Iteration Operators

Iterations implement loops in Flink programs. The iteration operators encapsulate a part of the program and execute it repeatedly, feeding back the result of one iteration (the partial solution) into the next iteration. There are two types of iterations in Flink: BulkIteration and DeltaIteration.

This section provides quick examples on how to use both operators. Check out the Introduction to Iterations page for a more detailed introduction.

Java

Bulk Iterations

To create a BulkIteration call the iterate(int) method of the DataSet the iteration should start at. This will return an IterativeDataSet, which can be transformed with the regular operators. The single argument to the iterate call specifies the maximum number of iterations.

To specify the end of an iteration call the closeWith(DataSet) method on the IterativeDataSet to specify which transformation should be fed back to the next iteration. You can optionally specify a termination criterion with closeWith(DataSet, DataSet), which evaluates the second DataSet and terminates the iteration, if this DataSet is empty. If no termination criterion is specified, the iteration terminates after the given maximum number iterations.

The following example iteratively estimates the number Pi. The goal is to count the number of random points, which fall into the unit circle. In each iteration, a random point is picked. If this point lies inside the unit circle, we increment the count. Pi is then estimated as the resulting count divided by the number of iterations multiplied by 4.

  1. final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  2. // Create initial IterativeDataSet
  3. IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);
  4. DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
  5. @Override
  6. public Integer map(Integer i) throws Exception {
  7. double x = Math.random();
  8. double y = Math.random();
  9. return i + ((x * x + y * y < 1) ? 1 : 0);
  10. }
  11. });
  12. // Iteratively transform the IterativeDataSet
  13. DataSet<Integer> count = initial.closeWith(iteration);
  14. count.map(new MapFunction<Integer, Double>() {
  15. @Override
  16. public Double map(Integer count) throws Exception {
  17. return count / (double) 10000 * 4;
  18. }
  19. }).print();
  20. env.execute("Iterative Pi Example");

Delta Iterations

Delta iterations exploit the fact that certain algorithms do not change every data point of the solution in each iteration.

In addition to the partial solution that is fed back (called workset) in every iteration, delta iterations maintain state across iterations (called solution set), which can be updated through deltas. The result of the iterative computation is the state after the last iteration. Please refer to the Introduction to Iterations for an overview of the basic principle of delta iterations.

Defining a DeltaIteration is similar to defining a BulkIteration. For delta iterations, two data sets form the input to each iteration (workset and solution set), and two data sets are produced as the result (new workset, solution set delta) in each iteration.

To create a DeltaIteration call the iterateDelta(DataSet, int, int) (or iterateDelta(DataSet, int, int[]) respectively). This method is called on the initial solution set. The arguments are the initial delta set, the maximum number of iterations and the key positions. The returned DeltaIteration object gives you access to the DataSets representing the workset and solution set via the methods iteration.getWorkset() and iteration.getSolutionSet().

Below is an example for the syntax of a delta iteration

  1. // read the initial data sets
  2. DataSet<Tuple2<Long, Double>> initialSolutionSet = // [...]
  3. DataSet<Tuple2<Long, Double>> initialDeltaSet = // [...]
  4. int maxIterations = 100;
  5. int keyPosition = 0;
  6. DeltaIteration<Tuple2<Long, Double>, Tuple2<Long, Double>> iteration = initialSolutionSet
  7. .iterateDelta(initialDeltaSet, maxIterations, keyPosition);
  8. DataSet<Tuple2<Long, Double>> candidateUpdates = iteration.getWorkset()
  9. .groupBy(1)
  10. .reduceGroup(new ComputeCandidateChanges());
  11. DataSet<Tuple2<Long, Double>> deltas = candidateUpdates
  12. .join(iteration.getSolutionSet())
  13. .where(0)
  14. .equalTo(0)
  15. .with(new CompareChangesToCurrent());
  16. DataSet<Tuple2<Long, Double>> nextWorkset = deltas
  17. .filter(new FilterByThreshold());
  18. iteration.closeWith(deltas, nextWorkset)
  19. .writeAsCsv(outputPath);

Scala

Bulk Iterations

To create a BulkIteration call the iterate(int) method of the DataSet the iteration should start at and also specify a step function. The step function gets the input DataSet for the current iteration and must return a new DataSet. The parameter of the iterate call is the maximum number of iterations after which to stop.

There is also the iterateWithTermination(int) function that accepts a step function that returns two DataSets: The result of the iteration step and a termination criterion. The iterations are stopped once the termination criterion DataSet is empty.

The following example iteratively estimates the number Pi. The goal is to count the number of random points, which fall into the unit circle. In each iteration, a random point is picked. If this point lies inside the unit circle, we increment the count. Pi is then estimated as the resulting count divided by the number of iterations multiplied by 4.

  1. val env = ExecutionEnvironment.getExecutionEnvironment()
  2. // Create initial DataSet
  3. val initial = env.fromElements(0)
  4. val count = initial.iterate(10000) { iterationInput: DataSet[Int] =>
  5. val result = iterationInput.map { i =>
  6. val x = Math.random()
  7. val y = Math.random()
  8. i + (if (x * x + y * y < 1) 1 else 0)
  9. }
  10. result
  11. }
  12. val result = count map { c => c / 10000.0 * 4 }
  13. result.print()
  14. env.execute("Iterative Pi Example")

Delta Iterations

Delta iterations exploit the fact that certain algorithms do not change every data point of the solution in each iteration.

In addition to the partial solution that is fed back (called workset) in every iteration, delta iterations maintain state across iterations (called solution set), which can be updated through deltas. The result of the iterative computation is the state after the last iteration. Please refer to the Introduction to Iterations for an overview of the basic principle of delta iterations.

Defining a DeltaIteration is similar to defining a BulkIteration. For delta iterations, two data sets form the input to each iteration (workset and solution set), and two data sets are produced as the result (new workset, solution set delta) in each iteration.

To create a DeltaIteration call the iterateDelta(initialWorkset, maxIterations, key) on the initial solution set. The step function takes two parameters: (solutionSet, workset), and must return two values: (solutionSetDelta, newWorkset).

Below is an example for the syntax of a delta iteration

  1. // read the initial data sets
  2. val initialSolutionSet: DataSet[(Long, Double)] = // [...]
  3. val initialWorkset: DataSet[(Long, Double)] = // [...]
  4. val maxIterations = 100
  5. val keyPosition = 0
  6. val result = initialSolutionSet.iterateDelta(initialWorkset, maxIterations, Array(keyPosition)) {
  7. (solution, workset) =>
  8. val candidateUpdates = workset.groupBy(1).reduceGroup(new ComputeCandidateChanges())
  9. val deltas = candidateUpdates.join(solution).where(0).equalTo(0)(new CompareChangesToCurrent())
  10. val nextWorkset = deltas.filter(new FilterByThreshold())
  11. (deltas, nextWorkset)
  12. }
  13. result.writeAsCsv(outputPath)
  14. env.execute()

Operating on Data Objects in Functions

Flink’s runtime exchanges data with user functions in form of Java objects. Functions receive input objects from the runtime as method parameters and return output objects as result. Because these objects are accessed by user functions and runtime code, it is very important to understand and follow the rules about how the user code may access, i.e., read and modify, these objects.

User functions receive objects from Flink’s runtime either as regular method parameters (like a MapFunction) or through an Iterable parameter (like a GroupReduceFunction). We refer to objects that the runtime passes to a user function as input objects. User functions can emit objects to the Flink runtime either as a method return value (like a MapFunction) or through a Collector (like a FlatMapFunction). We refer to objects which have been emitted by the user function to the runtime as output objects.

Flink’s DataSet API features two modes that differ in how Flink’s runtime creates or reuses input objects. This behavior affects the guarantees and constraints for how user functions may interact with input and output objects. The following sections define these rules and give coding guidelines to write safe user function code.

Object-Reuse Disabled (DEFAULT)

By default, Flink operates in object-reuse disabled mode. This mode ensures that functions always receive new input objects within a function call. The object-reuse disabled mode gives better guarantees and is safer to use. However, it comes with a certain processing overhead and might cause higher Java garbage collection activity. The following table explains how user functions may access input and output objects in object-reuse disabled mode.

OperationGuarantees and Restrictions
Reading Input ObjectsWithin a method call it is guaranteed that the value of an input object does not change. This includes objects served by an Iterable. For example it is safe to collect input objects served by an Iterable in a List or Map. Note that objects may be modified after the method call is left. It is not safe to remember objects across function calls.
Modifying Input ObjectsYou may modify input objects.
Emitting Input ObjectsYou may emit input objects. The value of an input object may have changed after it was emitted. It is not safe to read an input object after it was emitted.
Reading Output ObjectsAn object that was given to a Collector or returned as method result might have changed its value. It is not safe to read an output object.
Modifying Output ObjectsYou may modify an object after it was emitted and emit it again.

Coding guidelines for the object-reuse disabled (default) mode:

  • Do not remember the read input objects across method calls.
  • Do not read objects after you emitted them.

Object-Reuse Enabled

In object-reuse enabled mode, Flink’s runtime minimizes the number of object instantiations. This can improve the performance and can reduce the Java garbage collection pressure. The object-reuse enabled mode is activated by calling ExecutionConfig.enableObjectReuse(). The following table explains how user functions may access input and output objects in object-reuse enabled mode.

OperationGuarantees and Restrictions
Reading input objects received as regular method parametersInput objects received as regular method arguments are not modified within a function call. Objects may be modified after method call is left. It is not safe to remember objects across function calls.
Reading input objects received from an Iterable parameterInput objects received from an Iterable are only valid until the next() method is called. An Iterable or Iterator may serve the same object instance multiple times. It is not safe to remember input objects received from an Iterable, e.g., by putting them in a List or Map.
Modifying Input ObjectsYou must not modify input objects, except for input objects of MapFunction, FlatMapFunction, MapPartitionFunction, GroupReduceFunction, GroupCombineFunction, CoGroupFunction, and InputFormat.next(reuse).
Emitting Input ObjectsYou must not emit input objects, except for input objects of MapFunction, FlatMapFunction, MapPartitionFunction, GroupReduceFunction, GroupCombineFunction, CoGroupFunction, and InputFormat.next(reuse).
Reading output ObjectsAn object that was given to a Collector or returned as method result might have changed its value. It is not safe to read an output object.
Modifying Output ObjectsYou may modify an output object and emit it again.

Coding guidelines for object-reuse enabled:

  • Do not remember input objects received from an Iterable.
  • Do not remember and read input objects across method calls.
  • Do not modify or emit input objects, except for input objects of MapFunction, FlatMapFunction, MapPartitionFunction, GroupReduceFunction, GroupCombineFunction, CoGroupFunction, and InputFormat.next(reuse).
  • To reduce object instantiations, you can always emit a dedicated output object which is repeatedly modified but never read.

Debugging

Before running a data analysis program on a large data set in a distributed cluster, it is a good idea to make sure that the implemented algorithm works as desired. Hence, implementing data analysis programs is usually an incremental process of checking results, debugging, and improving.

Flink provides a few nice features to significantly ease the development process of data analysis programs by supporting local debugging from within an IDE, injection of test data, and collection of result data. This section give some hints how to ease the development of Flink programs.

Local Execution Envronment

A LocalEnvironment starts a Flink system within the same JVM process it was created in. If you start the LocalEnvironment from an IDE, you can set breakpoints in your code and easily debug your program.

A LocalEnvironment is created and used as follows:

Java

  1. final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
  2. DataSet<String> lines = env.readTextFile(pathToTextFile);
  3. // build your program
  4. env.execute();

Scala

  1. val env = ExecutionEnvironment.createLocalEnvironment()
  2. val lines = env.readTextFile(pathToTextFile)
  3. // build your program
  4. env.execute()

Collection Data Sources and Sinks

Providing input for an analysis program and checking its output is cumbersome when done by creating input files and reading output files. Flink features special data sources and sinks which are backed by Java collections to ease testing. Once a program has been tested, the sources and sinks can be easily replaced by sources and sinks that read from / write to external data stores such as HDFS.

Collection data sources can be used as follows:

Java

  1. final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
  2. // Create a DataSet from a list of elements
  3. DataSet<Integer> myInts = env.fromElements(1, 2, 3, 4, 5);
  4. // Create a DataSet from any Java collection
  5. List<Tuple2<String, Integer>> data = ...
  6. DataSet<Tuple2<String, Integer>> myTuples = env.fromCollection(data);
  7. // Create a DataSet from an Iterator
  8. Iterator<Long> longIt = ...
  9. DataSet<Long> myLongs = env.fromCollection(longIt, Long.class);

Scala

  1. val env = ExecutionEnvironment.createLocalEnvironment()
  2. // Create a DataSet from a list of elements
  3. val myInts = env.fromElements(1, 2, 3, 4, 5)
  4. // Create a DataSet from any Collection
  5. val data: Seq[(String, Int)] = ...
  6. val myTuples = env.fromCollection(data)
  7. // Create a DataSet from an Iterator
  8. val longIt: Iterator[Long] = ...
  9. val myLongs = env.fromCollection(longIt)

Note: Currently, the collection data source requires that data types and iterators implement Serializable. Furthermore, collection data sources can not be executed in parallel ( parallelism = 1).

Broadcast Variables

Broadcast variables allow you to make a data set available to all parallel instances of an operation, in addition to the regular input of the operation. This is useful for auxiliary data sets, or data-dependent parameterization. The data set will then be accessible at the operator as a Collection.

  • Broadcast: broadcast sets are registered by name via withBroadcastSet(DataSet, String), and
  • Access: accessible via getRuntimeContext().getBroadcastVariable(String) at the target operator.

    Java

  1. Java
  2. Scala
  3. // 1. The DataSet to be broadcast
  4. DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);
  5. DataSet<String> data = env.fromElements("a", "b");
  6. data.map(new RichMapFunction<String, String>() {
  7. @Override
  8. public void open(Configuration parameters) throws Exception {
  9. // 3. Access the broadcast DataSet as a Collection
  10. Collection<Integer> broadcastSet = getRuntimeContext().getBroadcastVariable("broadcastSetName");
  11. }
  12. @Override
  13. public String map(String value) throws Exception {
  14. ...
  15. }
  16. }).withBroadcastSet(toBroadcast, "broadcastSetName"); // 2. Broadcast the DataSet

Scala

  1. // 1. The DataSet to be broadcast
  2. val toBroadcast = env.fromElements(1, 2, 3)
  3. val data = env.fromElements("a", "b")
  4. data.map(new RichMapFunction[String, String]() {
  5. var broadcastSet: Traversable[String] = null
  6. override def open(config: Configuration): Unit = {
  7. // 3. Access the broadcast DataSet as a Collection
  8. broadcastSet = getRuntimeContext().getBroadcastVariable[String]("broadcastSetName").asScala
  9. }
  10. def map(in: String): String = {
  11. ...
  12. }
  13. }).withBroadcastSet(toBroadcast, "broadcastSetName") // 2. Broadcast the DataSet

Make sure that the names (broadcastSetName in the previous example) match when registering and accessing broadcast data sets. For a complete example program, have a look at K-Means Algorithm.

Note: As the content of broadcast variables is kept in-memory on each node, it should not become too large. For simpler things like scalar values you can simply make parameters part of the closure of a function, or use the withParameters(…) method to pass in a configuration.

Distributed Cache

Flink offers a distributed cache, similar to Apache Hadoop, to make files locally accessible to parallel instances of user functions. This functionality can be used to share files that contain static external data such as dictionaries or machine-learned regression models.

The cache works as follows. A program registers a file or directory of a local or remote filesystem such as HDFS or S3 under a specific name in its ExecutionEnvironment as a cached file. When the program is executed, Flink automatically copies the file or directory to the local filesystem of all workers. A user function can look up the file or directory under the specified name and access it from the worker’s local filesystem.

The distributed cache is used as follows:

Java

  1. ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  2. // register a file from HDFS
  3. env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile")
  4. // register a local executable file (script, executable, ...)
  5. env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true)
  6. // define your program and execute
  7. ...
  8. DataSet<String> input = ...
  9. DataSet<Integer> result = input.map(new MyMapper());
  10. ...
  11. env.execute();

Scala

  1. val env = ExecutionEnvironment.getExecutionEnvironment
  2. // register a file from HDFS
  3. env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile")
  4. // register a local executable file (script, executable, ...)
  5. env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true)
  6. // define your program and execute
  7. ...
  8. val input: DataSet[String] = ...
  9. val result: DataSet[Integer] = input.map(new MyMapper())
  10. ...
  11. env.execute()

Access the cached file in a user function (here a MapFunction). The function must extend a RichFunction class because it needs access to the RuntimeContext.

Java

  1. // extend a RichFunction to have access to the RuntimeContext
  2. public final class MyMapper extends RichMapFunction<String, Integer> {
  3. @Override
  4. public void open(Configuration config) {
  5. // access cached file via RuntimeContext and DistributedCache
  6. File myFile = getRuntimeContext().getDistributedCache().getFile("hdfsFile");
  7. // read the file (or navigate the directory)
  8. ...
  9. }
  10. @Override
  11. public Integer map(String value) throws Exception {
  12. // use content of cached file
  13. ...
  14. }
  15. }

Scala

  1. // extend a RichFunction to have access to the RuntimeContext
  2. class MyMapper extends RichMapFunction[String, Int] {
  3. override def open(config: Configuration): Unit = {
  4. // access cached file via RuntimeContext and DistributedCache
  5. val myFile: File = getRuntimeContext.getDistributedCache.getFile("hdfsFile")
  6. // read the file (or navigate the directory)
  7. ...
  8. }
  9. override def map(value: String): Int = {
  10. // use content of cached file
  11. ...
  12. }
  13. }

Passing Parameters to Functions

Parameters can be passed to functions using either the constructor or the withParameters(Configuration) method. The parameters are serialized as part of the function object and shipped to all parallel task instances.

Via Constructor

Java

  1. DataSet<Integer> toFilter = env.fromElements(1, 2, 3);
  2. toFilter.filter(new MyFilter(2));
  3. private static class MyFilter implements FilterFunction<Integer> {
  4. private final int limit;
  5. public MyFilter(int limit) {
  6. this.limit = limit;
  7. }
  8. @Override
  9. public boolean filter(Integer value) throws Exception {
  10. return value > limit;
  11. }
  12. }

Scala

  1. val toFilter = env.fromElements(1, 2, 3)
  2. toFilter.filter(new MyFilter(2))
  3. class MyFilter(limit: Int) extends FilterFunction[Int] {
  4. override def filter(value: Int): Boolean = {
  5. value > limit
  6. }
  7. }

Via withParameters(Configuration)

Java

  1. DataSet<Integer> toFilter = env.fromElements(1, 2, 3);
  2. Configuration config = new Configuration();
  3. config.setInteger("limit", 2);
  4. toFilter.filter(new RichFilterFunction<Integer>() {
  5. private int limit;
  6. @Override
  7. public void open(Configuration parameters) throws Exception {
  8. limit = parameters.getInteger("limit", 0);
  9. }
  10. @Override
  11. public boolean filter(Integer value) throws Exception {
  12. return value > limit;
  13. }
  14. }).withParameters(config);

Scala

  1. val toFilter = env.fromElements(1, 2, 3)
  2. val c = new Configuration()
  3. c.setInteger("limit", 2)
  4. toFilter.filter(new RichFilterFunction[Int]() {
  5. var limit = 0
  6. override def open(config: Configuration): Unit = {
  7. limit = config.getInteger("limit", 0)
  8. }
  9. def filter(in: Int): Boolean = {
  10. in > limit
  11. }
  12. }).withParameters(c)

Globally via the ExecutionConfig

Flink also allows to pass custom configuration values to the ExecutionConfig interface of the environment. Since the execution config is accessible in all (rich) user functions, the custom configuration will be available globally in all functions.

Setting a custom global configuration

Java

  1. Configuration conf = new Configuration();
  2. conf.setString("mykey","myvalue");
  3. final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  4. env.getConfig().setGlobalJobParameters(conf);

Scala

  1. val env = ExecutionEnvironment.getExecutionEnvironment
  2. val conf = new Configuration()
  3. conf.setString("mykey", "myvalue")
  4. env.getConfig.setGlobalJobParameters(conf)

Please note that you can also pass a custom class extending the ExecutionConfig.GlobalJobParameters class as the global job parameters to the execution config. The interface allows to implement the Map<String, String> toMap() method which will in turn show the values from the configuration in the web frontend.

Accessing values from the global configuration

  1. public static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {
  2. private String mykey;
  3. @Override
  4. public void open(Configuration parameters) throws Exception {
  5. ExecutionConfig.GlobalJobParameters globalParams = getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
  6. Configuration globConf = (Configuration) globalParams;
  7. mykey = globConf.getString("mykey", null);
  8. }