Hadoop Compatibility Beta
Flink is compatible with Apache Hadoop MapReduce interfaces and therefore allows reusing code that was implemented for Hadoop MapReduce.
You can:
- use Hadoop’s Writable data types in Flink programs.
- use any Hadoop InputFormat as a DataSource.
- use any Hadoop OutputFormat as a DataSink.
- use a Hadoop Mapper as FlatMapFunction.
- use a Hadoop Reducer as GroupReduceFunction.
This document shows how to use existing Hadoop MapReduce code with Flink. Please refer to the Connecting to other systems guide for reading from Hadoop supported file systems.
Project Configuration
Support for Hadoop input/output formats is part of the flink-java and flink-scala Maven modules that are always required when writing Flink jobs. The code is located in org.apache.flink.api.java.hadoop and org.apache.flink.api.scala.hadoop in an additional sub-package for the mapred and mapreduce API.
Support for Hadoop Mappers and Reducers is contained in the flink-hadoop-compatibility Maven module. This code resides in the org.apache.flink.hadoopcompatibility package.
Add the following dependency to your pom.xml if you want to reuse Mappers and Reducers.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hadoop-compatibility_2.11</artifactId>
    <version>1.9.0</version>
</dependency>
Using Hadoop Data Types
Flink supports all Hadoop Writable and WritableComparable data types out-of-the-box. You do not need to include the Hadoop Compatibility dependency if you only want to use your Hadoop data types. See the Programming Guide for more details.
Using Hadoop InputFormats
To use Hadoop InputFormats with Flink, the format must first be wrapped using either readHadoopFile or createHadoopInput of the HadoopInputs utility class. The former is used for input formats derived from FileInputFormat, while the latter has to be used for general-purpose input formats. The resulting InputFormat can be used to create a data source by using ExecutionEnvironment#createInput.
The resulting DataSet contains 2-tuples where the first field is the key and the second field is the value retrieved from the Hadoop InputFormat.
The following example shows how to use Hadoop’s TextInputFormat.
Java:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<LongWritable, Text>> input =
    env.createInput(HadoopInputs.readHadoopFile(new TextInputFormat(),
        LongWritable.class, Text.class, textPath));
// Do something with the data.
[...]
Scala:
val env = ExecutionEnvironment.getExecutionEnvironment
val input: DataSet[(LongWritable, Text)] =
  env.createInput(HadoopInputs.readHadoopFile(
    new TextInputFormat, classOf[LongWritable], classOf[Text], textPath))
// Do something with the data.
[...]
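For orientation: Hadoop’s TextInputFormat keys each line by the byte offset at which the line starts in the file and uses the line’s text as the value, which is why the DataSet above holds (LongWritable, Text) pairs. The following plain-Java sketch imitates that record shape for an in-memory string. It does not use Hadoop or Flink; the class TextRecordsSketch and its records method are hypothetical names chosen for illustration, and the sketch assumes UTF-8 text, "\n" line endings, and a single unsplit input.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in (not a Hadoop or Flink class): imitates the
// (byte offset, line) pairs a TextInputFormat-backed data set contains.
final class TextRecordsSketch {

    // Splits the content on '\n' and pairs every line with the UTF-8 byte
    // offset at which that line starts.
    static List<Map.Entry<Long, String>> records(String content) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0L;
        int start = 0;
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            boolean hasNewline = end >= 0;
            if (!hasNewline) {
                end = content.length();
            }
            String line = content.substring(start, end);
            out.add(new SimpleEntry<>(offset, line));
            // Advance by the line's byte length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + (hasNewline ? 1 : 0);
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints "0 -> to be" and then "6 -> or not".
        for (Map.Entry<Long, String> r : records("to be\nor not\n")) {
            System.out.println(r.getKey() + " -> " + r.getValue());
        }
    }
}
```

In a real job the key is often ignored and only the Text value is processed further.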
Using Hadoop OutputFormats
Flink provides a compatibility wrapper for Hadoop OutputFormats. Any class that implements org.apache.hadoop.mapred.OutputFormat or extends org.apache.hadoop.mapreduce.OutputFormat is supported. The OutputFormat wrapper expects its input data to be a DataSet containing 2-tuples of key and value. These are to be processed by the Hadoop OutputFormat.
The following example shows how to use Hadoop’s TextOutputFormat.
Java:
// Obtain the result we want to emit
DataSet<Tuple2<Text, IntWritable>> hadoopResult = [...]
// Create the Hadoop Job that holds the output configuration.
Job job = Job.getInstance();
// Set up the Hadoop TextOutputFormat.
HadoopOutputFormat<Text, IntWritable> hadoopOF =
    // create the Flink wrapper.
    new HadoopOutputFormat<Text, IntWritable>(
        // set the Hadoop OutputFormat and specify the job.
        new TextOutputFormat<Text, IntWritable>(), job
    );
hadoopOF.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");
TextOutputFormat.setOutputPath(job, new Path(outputPath));
// Emit data using the Hadoop TextOutputFormat.
hadoopResult.output(hadoopOF);
Scala:
// Obtain your result to emit.
val hadoopResult: DataSet[(Text, IntWritable)] = [...]
val hadoopOF = new HadoopOutputFormat[Text, IntWritable](
  new TextOutputFormat[Text, IntWritable],
  new JobConf)
hadoopOF.getJobConf.set("mapred.textoutputformat.separator", " ")
FileOutputFormat.setOutputPath(hadoopOF.getJobConf, new Path(resultPath))
hadoopResult.output(hadoopOF)
Using Hadoop Mappers and Reducers
Hadoop Mappers are semantically equivalent to Flink’s FlatMapFunctions and Hadoop Reducers are equivalent to Flink’s GroupReduceFunctions. Flink provides wrappers for implementations of Hadoop MapReduce’s Mapper and Reducer interfaces, i.e., you can reuse your Hadoop Mappers and Reducers in regular Flink programs. At the moment, only the Mapper and Reducer interfaces of Hadoop’s mapred API (org.apache.hadoop.mapred) are supported.
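The equivalence stated above can be illustrated without any Hadoop or Flink dependency. The sketch below is a plain-Java model, not the real API: MapReduceShapeSketch, tokenize, count, and wordCount are hypothetical names, and a BiConsumer stands in for Hadoop’s output collector. It shows that a Mapper-style function emits zero or more pairs per input record (the flatMap contract) and that a Reducer-style function sees one key together with all of its values (the groupReduce contract).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Plain-Java model only: hypothetical stand-ins, not the Hadoop Mapper/Reducer
// interfaces and not Flink's FlatMapFunction/GroupReduceFunction.
final class MapReduceShapeSketch {

    // Mapper-like step: one input line may emit zero or more (word, 1) pairs.
    static void tokenize(String line, BiConsumer<String, Long> collector) {
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                collector.accept(word, 1L);
            }
        }
    }

    // Reducer-like step: receives one key with all of its values.
    static void count(String key, Iterable<Long> values, BiConsumer<String, Long> collector) {
        long sum = 0L;
        for (long v : values) {
            sum += v;
        }
        collector.accept(key, sum);
    }

    static Map<String, Long> wordCount(List<String> lines) {
        // "flatMap": apply the mapper-like function to every record.
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            tokenize(line, (k, v) -> pairs.add(new SimpleEntry<>(k, v)));
        }
        // "groupBy(0)": collect the values that share a key.
        Map<String, List<Long>> groups = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // "reduceGroup": apply the reducer-like function once per group.
        Map<String, Long> result = new TreeMap<>();
        groups.forEach((k, vs) -> count(k, vs, result::put));
        return result;
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
    }
}
```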
The wrappers take a DataSet<Tuple2<KEYIN,VALUEIN>> as input and produce a DataSet<Tuple2<KEYOUT,VALUEOUT>> as output, where KEYIN and KEYOUT are the keys and VALUEIN and VALUEOUT are the values of the Hadoop key-value pairs that are processed by the Hadoop functions. For Reducers, Flink offers a wrapper for a GroupReduceFunction with (HadoopReduceCombineFunction) and without a Combiner (HadoopReduceFunction). The wrappers accept an optional JobConf object to configure the Hadoop Mapper or Reducer.
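To show what a Combiner changes, here is a minimal plain-Java sketch under stated assumptions. It models partitions as in-memory lists and uses summation, which is associative and commutative, so pre-aggregating per partition cannot change the final result. CombinerSketch and all of its methods are hypothetical names; nothing here is part of the Flink or Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model only: hypothetical stand-in for the reduce and combine steps.
final class CombinerSketch {

    static Map.Entry<String, Long> pair(String key, long value) {
        return new SimpleEntry<>(key, value);
    }

    // Sums the values per key. Input and output have the same shape, so the
    // same logic can serve as the combine step and as the reduce step.
    static Map<String, Long> sum(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            out.merge(p.getKey(), p.getValue(), Long::sum);
        }
        return out;
    }

    // Without a combiner: every pair is moved to the reducer unchanged.
    static Map<String, Long> withoutCombiner(List<List<Map.Entry<String, Long>>> partitions) {
        List<Map.Entry<String, Long>> all = new ArrayList<>();
        partitions.forEach(all::addAll);
        return sum(all);
    }

    // With a combiner: each partition is pre-aggregated first, so at most one
    // pair per key and partition reaches the final reduce step.
    static Map<String, Long> withCombiner(List<List<Map.Entry<String, Long>>> partitions) {
        List<Map.Entry<String, Long>> partials = new ArrayList<>();
        for (List<Map.Entry<String, Long>> partition : partitions) {
            partials.addAll(sum(partition).entrySet());
        }
        return sum(partials);
    }

    // Two illustrative partitions of (word, 1) pairs.
    static List<List<Map.Entry<String, Long>>> demoPartitions() {
        return Arrays.asList(
            Arrays.asList(pair("to", 1), pair("be", 1), pair("to", 1)),
            Arrays.asList(pair("be", 1), pair("to", 1)));
    }

    public static void main(String[] args) {
        // Both lines print {be=2, to=3}
        System.out.println(withoutCombiner(demoPartitions()));
        System.out.println(withCombiner(demoPartitions()));
    }
}
```

A function whose result depends on seeing all values of a key at once (for example a median) would not be safe to use as a combiner in this way.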
Flink’s function wrappers are org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction, org.apache.flink.hadoopcompatibility.mapred.HadoopReduceFunction, and org.apache.flink.hadoopcompatibility.mapred.HadoopReduceCombineFunction. They can be used as regular Flink FlatMapFunctions or GroupReduceFunctions.
The following example shows how to use Hadoop Mapper and Reducer functions.
// Obtain data to process somehow.
DataSet<Tuple2<LongWritable, Text>> text = [...]
DataSet<Tuple2<Text, LongWritable>> result = text
    // use Hadoop Mapper (Tokenizer) as FlatMapFunction
    .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(
        new Tokenizer()
    ))
    .groupBy(0)
    // use Hadoop Reducer (Counter) as Reduce- and CombineFunction
    .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
        new Counter(), new Counter()
    ));
Please note: The Reducer wrapper works on groups as defined by Flink’s groupBy() operation. It does not consider any custom partitioners, sort or grouping comparators you might have set in the JobConf.
Complete Hadoop WordCount Example
The following example shows a complete WordCount implementation using Hadoop data types, Input- and OutputFormats, and Mapper and Reducer implementations.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Set up the Hadoop TextInputFormat.
Job job = Job.getInstance();
HadoopInputFormat<LongWritable, Text> hadoopIF =
    new HadoopInputFormat<LongWritable, Text>(
        new TextInputFormat(), LongWritable.class, Text.class, job
    );
TextInputFormat.addInputPath(job, new Path(inputPath));
// Read data using the Hadoop TextInputFormat.
DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopIF);
DataSet<Tuple2<Text, LongWritable>> result = text
    // use Hadoop Mapper (Tokenizer) as FlatMapFunction
    .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(
        new Tokenizer()
    ))
    .groupBy(0)
    // use Hadoop Reducer (Counter) as Reduce- and CombineFunction
    .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
        new Counter(), new Counter()
    ));
// Set up the Hadoop TextOutputFormat.
HadoopOutputFormat<Text, LongWritable> hadoopOF =
    new HadoopOutputFormat<Text, LongWritable>(
        new TextOutputFormat<Text, LongWritable>(), job
    );
hadoopOF.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");
TextOutputFormat.setOutputPath(job, new Path(outputPath));
// Emit data using the Hadoop TextOutputFormat.
result.output(hadoopOF);
// Execute Program
env.execute("Hadoop WordCount");