Getting Started with PySpark
MLeap PySpark integration provides serialization of PySpark-trained ML pipelines to MLeap Bundles. MLeap also provides several extensions to Spark, including enhanced one-hot encoding and one-vs-rest models. Unlike the MLeap<>Spark integration, MLeap doesn't yet provide PySpark integration with Spark Extensions transformers.
Adding MLeap PySpark to Your Project
Before adding MLeap PySpark to your project, you first have to compile and add MLeap Spark.
MLeap PySpark is available in the combust/mleap GitHub repository, in the python package.
To add MLeap to your PySpark project, clone the git repo, add the mleap/python path, and import mleap.pyspark:
```bash
git clone git@github.com:combust/mleap.git
```
Then, in your Python environment, do:
```python
import sys
sys.path.append('<git directory>/mleap/python')
import mleap.pyspark
```
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
Note: if you are working from a notebook environment, be sure to take a look at the instructions on how to set up MLeap PySpark there.
Using PIP
Alternatively, PIP support for MLeap PySpark is available at https://pypi.python.org/pypi/mleap.
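A minimal install, assuming pip points at the same Python environment your Spark driver uses:

```bash
pip install mleap
```

After installing this way, `import mleap.pyspark` works without any sys.path changes.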
To use MLeap extensions to PySpark:
- See build instructions to build MLeap from source.
- See core concepts for an overview of ML pipelines.
- See Spark documentation to learn how to train ML pipelines in Spark.
- See the demo notebook on how to use PySpark and MLeap to serialize your pipeline to Bundle.ml (a minimal serialization sketch follows below).
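As a quick orientation before diving into those resources, here is a minimal sketch of end-to-end serialization. It assumes a local Spark session with the MLeap Spark jars on the classpath; the jar coordinates, toy data, and pipeline stages below are illustrative, not taken from the demo notebook:

```python
import mleap.pyspark  # must be imported before any other PySpark libraries
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Illustrative jar coordinates; match them to your Spark and Scala versions.
spark = (SparkSession.builder
         .appName("mleap-pyspark-example")
         .config("spark.jars.packages",
                 "ml.combust.mleap:mleap-spark_2.12:0.20.0")
         .getOrCreate())

# Toy training data: two numeric features and a label.
df = spark.createDataFrame(
    [(1.0, 2.0, 3.5), (2.0, 3.0, 5.5), (3.0, 4.0, 7.5)],
    ["f1", "f2", "label"])

# An ordinary Spark ML pipeline, trained as usual.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

# The mleap.pyspark import adds serializeToBundle to fitted pipelines;
# it writes an MLeap Bundle zip, here to a local file.
model.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                        model.transform(df))
```

The second argument to serializeToBundle is a transformed DataFrame, which MLeap uses to infer the schema of the serialized pipeline.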