Hadoop Integration

Referencing a Hadoop configuration

You can reference a Hadoop configuration by setting the environment variable HADOOP_CONF_DIR.

  HADOOP_CONF_DIR=/path/to/etc/hadoop
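A mistyped path is easy to miss, so it can help to check the directory before exporting the variable. The following is a minimal sketch, not part of Flink: the function name set_hadoop_conf_dir is hypothetical, and checking for core-site.xml (one of the standard Hadoop configuration files) is only an illustrative sanity check.

```shell
#!/usr/bin/env bash
# Sketch: export HADOOP_CONF_DIR only after a basic sanity check.
# set_hadoop_conf_dir is a hypothetical helper, not a Flink or Hadoop command.

set_hadoop_conf_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # core-site.xml is one of the standard Hadoop configuration files;
    # its absence suggests the path points at the wrong directory.
    if [ ! -f "$dir/core-site.xml" ]; then
        echo "no core-site.xml in $dir" >&2
        return 1
    fi
    export HADOOP_CONF_DIR="$dir"
}

# Example (path is a placeholder):
#   set_hadoop_conf_dir /path/to/etc/hadoop
```

The variable has to be exported in the environment of the shell that later starts the Flink processes for it to have any effect.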

Referencing the HDFS configuration in the Flink configuration is deprecated.

Another way to provide the Hadoop configuration is to have it on the classpath of the Flink process; see below for details.

Providing Hadoop classes

In order to use Hadoop features (e.g., YARN, HDFS) it is necessary to provide Flink with the required Hadoop classes, as these are not bundled by default.

This can be done by

  1) Adding the Hadoop classpath to Flink
  2) Putting the required jar files into the /lib directory of the Flink distribution

Option 1) requires very little work, integrates nicely with existing Hadoop setups and should be the preferred approach. However, Hadoop has a large dependency footprint that increases the risk of dependency conflicts. If such a conflict occurs, use option 2) instead.

The following subsections explain these approaches in detail.

Adding Hadoop Classpaths

Flink will use the environment variable HADOOP_CLASSPATH to augment the classpath that is used when starting Flink components such as the Client, JobManager, or TaskManager. Most Hadoop distributions and cloud environments will not set this variable by default, so if the Hadoop classpath should be picked up by Flink, the environment variable must be exported on all machines that are running Flink components.

When running on YARN, this is usually not a problem because the components running inside YARN will be started with the Hadoop classpaths, but it can happen that the Hadoop dependencies must be on the classpath when submitting a job to YARN. For this, it is usually enough to run

  export HADOOP_CLASSPATH=`hadoop classpath`

in the shell. Note that hadoop is the hadoop binary and that classpath is an argument that will make it print the configured Hadoop classpath.

Putting the Hadoop configuration in the same class path as the Hadoop libraries makes Flink pick up that configuration.
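The export above fails quietly if the hadoop binary is not on the PATH, leaving the variable empty. The sketch below wraps the same `hadoop classpath` call in a guard and optionally prepends a configuration directory, following the note that a configuration on the same classpath is picked up. The function name build_hadoop_classpath and its optional argument are hypothetical; in many installations the output of `hadoop classpath` already contains the configuration directory, in which case the argument is unnecessary.

```shell
#!/usr/bin/env bash
# Sketch: derive a value for HADOOP_CLASSPATH from `hadoop classpath`.
# build_hadoop_classpath is a hypothetical helper, not a Flink or Hadoop command.

build_hadoop_classpath() {
    local conf_dir="${1:-}"   # optional configuration directory to prepend
    local cp
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop binary not found on PATH" >&2
        return 1
    fi
    # `hadoop classpath` prints the configured Hadoop classpath.
    cp="$(hadoop classpath)" || return 1
    if [ -n "$conf_dir" ]; then
        cp="$conf_dir:$cp"
    fi
    printf '%s\n' "$cp"
}

# Example (path is a placeholder):
#   export HADOOP_CLASSPATH="$(build_hadoop_classpath /path/to/etc/hadoop)"
```

As with the plain export, this has to run on every machine that starts a Flink component.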

Adding Hadoop to /lib

The Flink project releases Hadoop distributions for specific versions that relocate or exclude several dependencies to reduce the risk of dependency clashes. These can be found in the Additional Components section of the download page. For these versions it is sufficient to download the corresponding Pre-bundled Hadoop component and put it into the /lib directory of the Flink distribution.

If the Hadoop version you use is not listed on the download page (possibly because it is a vendor-specific version), then it is necessary to build flink-shaded against this version. You can find the source code for this project in the Additional Components section of the download page.

Note: If you want to build flink-shaded against a vendor-specific Hadoop version, you first have to configure the vendor-specific Maven repository in your local Maven setup as described here.

Run the following command to build and install flink-shaded against your desired Hadoop version (e.g., for version 2.6.5-custom):

  mvn clean install -Dhadoop.version=2.6.5-custom

After this step is complete, put the flink-shaded-hadoop-2-uber jar into the /lib directory of the Flink distribution.
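The copy step can be scripted. The sketch below searches a flink-shaded source tree for the built jar and copies it into the Flink distribution. The function name install_shaded_hadoop_jar and both arguments are hypothetical, and the sketch assumes that the jar ends up in a Maven target directory somewhere under the source tree; the exact module path is not stated here, so it is located with find rather than hard-coded.

```shell
#!/usr/bin/env bash
# Sketch: copy the built flink-shaded-hadoop-2-uber jar into <flink>/lib.
# install_shaded_hadoop_jar is a hypothetical helper; the location of the jar
# below a Maven "target" directory is an assumption.

install_shaded_hadoop_jar() {
    local build_dir="$1"    # root of the flink-shaded source tree
    local flink_home="$2"   # root of the Flink distribution
    local jar
    if [ ! -d "$flink_home/lib" ]; then
        echo "no lib directory in $flink_home" >&2
        return 1
    fi
    # Take the first jar whose file name starts with flink-shaded-hadoop-2-uber.
    jar="$(find "$build_dir" -type f -path '*/target/*' \
        -name 'flink-shaded-hadoop-2-uber*.jar' | head -n 1)"
    if [ -z "$jar" ]; then
        echo "no flink-shaded-hadoop-2-uber jar found under $build_dir" >&2
        return 1
    fi
    cp "$jar" "$flink_home/lib/"
}

# Example (both paths are placeholders):
#   install_shaded_hadoop_jar /path/to/flink-shaded /path/to/flink
```

If the build produces several matching jars, the first match may not be the one you want, so checking the contents of /lib afterwards is advisable.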

Running a job locally

To run a job locally as one JVM process using the mini cluster, the required Hadoop dependencies have to be explicitly added to the classpath of the started JVM process.

To run an application using Maven (also from an IDE as a Maven project), the required Hadoop dependencies can be added to the pom.xml with the provided scope, e.g.:

  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.8.3</version>
      <scope>provided</scope>
  </dependency>

This way the application should work both locally and on a cluster, where the provided dependencies are added by other means as described above.

To run or debug an application in IntelliJ IDEA, the provided dependencies can be included in the classpath in the “Run|Edit Configurations” window.