Apache Spark and Hop
Due to conflicting libraries it is currently not possible to support all Beam runtime engines out of the box from the GUI and Hop server. This is a known problem and we are looking for a solution to support all runners out of the box. The good news is that the Apache Flink and Google Dataflow will work, but for spark some changes to the installation are needed. This guide will explain what you need to change to a Hop 2.0 installation to get Spark running.
When applying these changes the Flink and Google Dataflow runner will stop working. |
Installation guide
The first step is to download Apache Spark, this can be done from their official download page. Make sure you fetch the version that is listed as supported in our documentation we will be using this package to add jars to the Hop installation. You can also fetch these jars from your Spark cluster if needed.
Cleaning up Hop
For this to work we will both need to remove and add some files to the Beam plugin located under /plugins/engines/beam/lib
folder
Files to remove
Following files have to be removed from the /plugins/engines/beam/lib
folder
flink-shaded-jackson-*
jackson-module-scala*
scala-java8-compat*
scala-library*
scala-parser-combinators*
Files to add
Following files have to be added to the /plugins/engines/beam/lib
folder, you can find them in <spark installation>/jars folder
json4s-ast*
json4s-core*
json4s-jackson*
json4s-scalap*
log4j*
scala-compiler*
scala-library*
scala-parser-combinators*
scala-reflect*
scala-xml*
spark-unsafe*
xbean-asm7-shaded*
Files to fetch from Maven Central
Due to version conflicts following file has to be fetched from maven central
- jackson-module-scala* (Direct link)
Rebuild the fat jar
The final step would be to build the fat-jar this can be done from the GUI Tools → Generate a Hop fat jar
or you can use the command line ./hop-conf.sh -fj <location>/fat-jar.jar