MLeap Bundles
MLeap Bundles are a graph-based, portable file format for serializing and de-serializing:
- Machine learning data pipelines - any transformer-based data pipeline
- Algorithms (Regressions, Tree-Based models, Bayesian models, Neural Nets, Clustering)
Bundles make it easy to share the results of your training pipeline: generate a bundle file and send it by email to a colleague, or just view the metadata of your data pipeline and algorithm.
Bundles also make deployments simple: just export your bundle and load it into your Spark, Scikit-learn, or MLeap-based application.
Features of MLeap Bundles
- Serialize to a directory or a zip file
- Entirely JSON and Protobuf-based format
- Serialize as pure JSON, pure Protobuf, or mixed mode
- Highly extensible, including easy integration with new transformers
Common Format For Spark, Scikit-Learn, TensorFlow
MLeap provides a serialization format for common transformers that are found in Spark, Scikit-learn and TensorFlow. For example, consider the Standard Scaler transformer (tf.random_normal_initializer in TensorFlow). It performs the same operation on all three platforms, so in theory it can be serialized, deserialized and used interchangeably between them.
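To make the idea concrete, here is a minimal sketch of standard scaling in plain Python. It is not taken from MLeap, Spark or Scikit-learn; the function names are hypothetical, and it assumes centering on the mean and dividing by the population standard deviation. Libraries differ on such details (for instance, sample versus population standard deviation), so treat this as an illustration of why a fitted scaler reduces to a few numbers that a common format can carry.

```python
import math


def fit_standard_scaler(values):
    """Compute the mean and population standard deviation of a column."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def transform_standard_scaler(values, mean, std):
    """Center each value on the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]


# The fitted parameters (mean, std) are all a serialized scaler needs to store.
mean, std = fit_standard_scaler([1.0, 2.0, 3.0])
scaled = transform_standard_scaler([1.0, 2.0, 3.0], mean, std)
```

Any platform that can read back the two fitted parameters can apply the same transformation, which is the property the paragraph above relies on.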
Bundle Structure
At its root directory, a bundle has a bundle.json file, which provides basic metadata about the serialization of the bundle. It also has a root/ directory, which contains the root transformer of the ML pipeline. The root transformer can be any type of transformer supported by MLeap, but is most commonly going to be a Pipeline transformer.
Let’s take a look at an example MLeap Bundle. The pipeline consists of string indexing a set of categorical features, followed by one hot encoding them, assembling the results into a feature vector and finally executing a linear regression on the features. Here is what the bundle looks like:
├── bundle.json
└── root
    ├── linReg_7a946be681a8.node
    │   ├── model.json
    │   └── node.json
    ├── model.json
    ├── node.json
    ├── oneHot_4b815730d602.node
    │   ├── model.json
    │   └── node.json
    ├── strIdx_ac9c3f9c6d3a.node
    │   ├── model.json
    │   └── node.json
    └── vecAssembler_9eb71026cd11.node
        ├── model.json
        └── node.json
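The layout above can be checked mechanically. The following is a minimal sketch using only Python's standard library, not the MLeap API; the function name missing_bundle_files is hypothetical, and it assumes a JSON-format bundle stored as a directory in which every *.node directory under root holds a model.json and a node.json, as in the listing. File names may differ for other serialization formats.

```python
from pathlib import Path


def missing_bundle_files(bundle_dir):
    """Return relative paths of expected files that are absent from a bundle directory."""
    bundle = Path(bundle_dir)
    root = bundle / "root"
    # Files the example layout shows at the top level and for the root transformer.
    expected = [bundle / "bundle.json", root / "model.json", root / "node.json"]
    if root.is_dir():
        for entry in sorted(root.iterdir()):
            # Each child transformer lives in its own "<uid>.node" directory.
            if entry.is_dir() and entry.name.endswith(".node"):
                expected.append(entry / "model.json")
                expected.append(entry / "node.json")
    return [p.relative_to(bundle).as_posix() for p in expected if not p.is_file()]
```

An empty list means the directory matches the structure shown; any entries name the files that are missing.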
bundle.json
{
  "uid": "7b4eaab4-7d84-4f52-9351-5de98f9d5d04",
  "name": "pipeline_43ec54dff5b2",
  "timestamp": "2017-09-03T17:41:25.206",
  "format": "json",
  "version": "0.14.0"
}
- uid is a Java UUID that is automatically generated as a unique ID for the bundle
- name is the uid of the root transformer
- format is the serialization format used to serialize this bundle
- version is a reference to the version of MLeap used to serialize the bundle
- timestamp defines when the bundle was serialized
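Because bundle.json is plain JSON, these fields can be read back with nothing more than a JSON parser. The sketch below uses only Python's standard library, not the MLeap API; the names read_bundle_metadata and EXPECTED_KEYS are hypothetical, and it assumes that a zipped bundle keeps bundle.json at the top level of the archive.

```python
import json
import zipfile
from pathlib import Path

# Keys shown in the example bundle.json above.
EXPECTED_KEYS = {"uid", "name", "timestamp", "format", "version"}


def read_bundle_metadata(bundle_path):
    """Parse bundle.json from a bundle directory or a zip archive."""
    path = Path(bundle_path)
    if path.is_dir():
        text = (path / "bundle.json").read_text(encoding="utf-8")
    else:
        # Assumption: the archive keeps bundle.json at its top level.
        with zipfile.ZipFile(path) as archive:
            text = archive.read("bundle.json").decode("utf-8")
    metadata = json.loads(text)
    missing = EXPECTED_KEYS - set(metadata)
    if missing:
        raise ValueError("bundle.json is missing keys: " + ", ".join(sorted(missing)))
    return metadata
```

Calling read_bundle_metadata on either form of a bundle returns a dictionary, so a check such as metadata["format"] == "json" can gate any further parsing.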
model.json
For the pipeline:
{
  "op": "pipeline",
  "attributes": {
    "nodes": {
      "type": "list",
      "string": ["strIdx_ac9c3f9c6d3a", "oneHot_4b815730d602", "vecAssembler_9eb71026cd11", "linReg_7a946be681a8"]
    }
  }
}
For the linear regression:
{
  "op": "linear_regression",
  "attributes": {
    "coefficients": {
      "double": [7274.194347379634, 4326.995162668048, 9341.604695180558, 1691.794448740186, 2162.2199731255423, 2342.150297286721, 0.18287261938061752],
      "shape": {
        "dimensions": [{
          "size": 7,
          "name": ""
        }]
      },
      "type": "tensor"
    },
    "intercept": {
      "double": 8085.6026142683095
    }
  }
}
- op specifies the operation to be executed; there is one op name for each transformer supported by MLeap
- attributes contains the values needed by the operation in order to execute
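To illustrate how the attributes feed the operation, here is a minimal sketch that applies a parsed linear_regression model.json to a feature vector by computing the intercept plus the dot product of coefficients and features. It is not MLeap's implementation; the function name predict_linear_regression is hypothetical, and the small example model uses made-up values in the same layout as the file above.

```python
def predict_linear_regression(model, features):
    """Apply a parsed linear_regression model.json to one feature vector."""
    if model["op"] != "linear_regression":
        raise ValueError("unexpected op: " + model["op"])
    attributes = model["attributes"]
    coefficients = attributes["coefficients"]["double"]
    intercept = attributes["intercept"]["double"]
    if len(features) != len(coefficients):
        raise ValueError("feature vector length does not match coefficients")
    # Prediction = intercept + sum of coefficient * feature.
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical two-feature model, laid out like the model.json shown above.
example_model = {
    "op": "linear_regression",
    "attributes": {
        "coefficients": {"double": [2.0, 3.0], "type": "tensor"},
        "intercept": {"double": 1.0},
    },
}
prediction = predict_linear_regression(example_model, [1.0, 1.0])
```

The op check mirrors the role of the op field: it tells a reader which operation the attributes belong to before any values are used.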
node.json
For the one hot encoder:
{
  "name": "oneHot_4b815730d602",
  "shape": {
    "inputs": [{
      "name": "fico_index",
      "port": "input"
    }],
    "outputs": [{
      "name": "fico",
      "port": "output"
    }]
  }
}
- name specifies the name of the node in the execution graph
- shape specifies the inputs and outputs of the node, and how they are to be used internally by the operation
In this case, the fico_index column is to be used as the input column of the one hot encoder, and fico will be the result column.
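The mapping from ports to columns can be pulled out of a parsed node.json in a few lines. This is a minimal sketch rather than MLeap code; the function name column_mapping is hypothetical, and it assumes each entry under inputs and outputs carries a name and a port, as in the example above.

```python
def column_mapping(node):
    """Map each port of a parsed node.json to the column name bound to it."""
    shape = node["shape"]
    inputs = {socket["port"]: socket["name"] for socket in shape["inputs"]}
    outputs = {socket["port"]: socket["name"] for socket in shape["outputs"]}
    return inputs, outputs


# The one hot encoder node from the example above, as a Python dictionary.
one_hot_node = {
    "name": "oneHot_4b815730d602",
    "shape": {
        "inputs": [{"name": "fico_index", "port": "input"}],
        "outputs": [{"name": "fico", "port": "output"}],
    },
}
inputs, outputs = column_mapping(one_hot_node)
```

Here inputs binds the input port to fico_index and outputs binds the output port to fico, matching the description of the one hot encoder node.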
MLeap Bundle Examples
Here are some examples of serialized bundle files. They are not meant to be useful pipelines, but rather to illustrate what these files actually look like. The pipelines were generated when running our Spark parity tests, which ensure that MLeap transformers and Spark transformers produce exactly the same outputs.
MLeap/Spark Parity Bundle Examples
NOTE: right click and “Save As…”, Gitbook prevents directly clicking on the link.