MLeap Bundles

MLeap Bundles are a graph-based, portable file format for serializing and deserializing:

  1. Machine learning data pipelines - any transformer-based data pipeline
  2. Algorithms (Regressions, Tree-Based models, Bayesian models, Neural Nets, Clustering)

Bundles make it easy to share the results of your training pipeline: generate a bundle file and email it to a colleague, or simply view the metadata of your data pipeline and algorithm.

Bundles also make deployments simple: just export your bundle and load it into your Spark, Scikit-learn, or MLeap-based application.

Features of MLeap Bundles

  1. Serialize to a directory or a zip file
  2. Entirely JSON and Protobuf-based format
  3. Serialize as pure JSON, pure Protobuf, or mixed mode
  4. Highly extensible, including easy integration with new transformers

Common Format For Spark, Scikit-Learn, TensorFlow

MLeap provides a serialization format for common transformers that are found in Spark, Scikit-learn and TensorFlow. For example, consider the Standard Scaler transformer. It performs the same operation on each platform, so in theory it can be serialized, deserialized and used interchangeably between them.

Figure: Common Serialization

Bundle Structure

At its root directory, a bundle has a bundle.json file, which provides basic metadata about the serialization of the bundle. It also has a root/ directory, which contains the root transformer of the ML pipeline. The root transformer can be any type of transformer supported by MLeap, but is most commonly a Pipeline transformer.

Let’s take a look at an example MLeap Bundle. The pipeline consists of string indexing a set of categorical features, followed by one hot encoding them, assembling the results into a feature vector and finally executing a linear regression on the features. Here is what the bundle looks like:

    ├── bundle.json
    └── root
        ├── linReg_7a946be681a8.node
        │   ├── model.json
        │   └── node.json
        ├── model.json
        ├── node.json
        ├── oneHot_4b815730d602.node
        │   ├── model.json
        │   └── node.json
        ├── strIdx_ac9c3f9c6d3a.node
        │   ├── model.json
        │   └── node.json
        └── vecAssembler_9eb71026cd11.node
            ├── model.json
            └── node.json
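
The layout above can be explored programmatically. The following sketch, which uses only the Python standard library, recreates the example tree with empty placeholder files in a temporary directory and then lists each .node directory under root/. The helper names touch, build_example_bundle and list_nodes are hypothetical.

```python
import os
import tempfile

NODE_DIRS = [
    "linReg_7a946be681a8.node",
    "oneHot_4b815730d602.node",
    "strIdx_ac9c3f9c6d3a.node",
    "vecAssembler_9eb71026cd11.node",
]


def touch(path):
    """Create an empty placeholder file."""
    with open(path, "w", encoding="utf-8"):
        pass


def build_example_bundle(base):
    """Recreate the example directory layout with placeholder files."""
    root = os.path.join(base, "root")
    os.makedirs(root)
    touch(os.path.join(base, "bundle.json"))
    for name in ("model.json", "node.json"):
        touch(os.path.join(root, name))
    for node_dir in NODE_DIRS:
        os.makedirs(os.path.join(root, node_dir))
        for name in ("model.json", "node.json"):
            touch(os.path.join(root, node_dir, name))


def list_nodes(base):
    """Return the sorted names of child nodes, without the .node suffix."""
    root = os.path.join(base, "root")
    return sorted(
        entry[: -len(".node")]
        for entry in os.listdir(root)
        if entry.endswith(".node") and os.path.isdir(os.path.join(root, entry))
    )


bundle_dir = tempfile.mkdtemp()
build_example_bundle(bundle_dir)
nodes = list_nodes(bundle_dir)
```

Note that the directory listing is alphabetical and does not reflect execution order; the order of the stages is recorded in the pipeline's own model.json.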

bundle.json

    {
      "uid": "7b4eaab4-7d84-4f52-9351-5de98f9d5d04",
      "name": "pipeline_43ec54dff5b2",
      "timestamp": "2017-09-03T17:41:25.206",
      "format": "json",
      "version": "0.14.0"
    }

  1. uid is a Java UUID that is automatically generated as a unique ID for the bundle
  2. name is the uid of the root transformer
  3. format is the serialization format used to serialize this bundle
  4. version is a reference to the version of MLeap used to serialize the bundle
  5. timestamp defines when the bundle was serialized
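
To make the fields concrete, here is a small Python sketch that parses the example and interprets three of its fields. It uses only the standard library. Treating uid as a standard UUID string, timestamp as a date-time with milliseconds and no zone offset, and version as dot-separated integers are assumptions based on the sample values, not a documented contract.

```python
import json
import uuid
from datetime import datetime

BUNDLE_JSON = """
{
  "uid": "7b4eaab4-7d84-4f52-9351-5de98f9d5d04",
  "name": "pipeline_43ec54dff5b2",
  "timestamp": "2017-09-03T17:41:25.206",
  "format": "json",
  "version": "0.14.0"
}
"""

meta = json.loads(BUNDLE_JSON)

# The uid parses as a standard UUID (assumption based on the sample value).
bundle_uid = uuid.UUID(meta["uid"])

# The sample timestamp carries milliseconds and no time-zone offset.
serialized_at = datetime.strptime(meta["timestamp"], "%Y-%m-%dT%H:%M:%S.%f")

# Split the version string into integers for simple comparisons.
version = tuple(int(part) for part in meta["version"].split("."))
```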

model.json

For the pipeline:

    {
      "op": "pipeline",
      "attributes": {
        "nodes": {
          "type": "list",
          "string": ["strIdx_ac9c3f9c6d3a", "oneHot_4b815730d602", "vecAssembler_9eb71026cd11", "linReg_7a946be681a8"]
        }
      }
    }

For the linear regression:

    {
      "op": "linear_regression",
      "attributes": {
        "coefficients": {
          "double": [7274.194347379634, 4326.995162668048, 9341.604695180558, 1691.794448740186, 2162.2199731255423, 2342.150297286721, 0.18287261938061752],
          "shape": {
            "dimensions": [{
              "size": 7,
              "name": ""
            }]
          },
          "type": "tensor"
        },
        "intercept": {
          "double": 8085.6026142683095
        }
      }
    }

  1. op specifies the operation to be executed; there is one op name for each transformer supported by MLeap
  2. attributes contains the values needed by the operation in order to execute
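
As an illustration of how these attributes could be consumed, the sketch below applies the usual linear regression formula (the dot product of the coefficients and the features, plus the intercept) to the values from the example. This is the textbook formula rather than MLeap's actual runtime code, and the helper name predict and the feature vectors are made up for the example.

```python
import json

MODEL_JSON = """
{
  "op": "linear_regression",
  "attributes": {
    "coefficients": {
      "double": [7274.194347379634, 4326.995162668048, 9341.604695180558,
                 1691.794448740186, 2162.2199731255423, 2342.150297286721,
                 0.18287261938061752],
      "shape": {"dimensions": [{"size": 7, "name": ""}]},
      "type": "tensor"
    },
    "intercept": {"double": 8085.6026142683095}
  }
}
"""

model = json.loads(MODEL_JSON)
attrs = model["attributes"]
coefficients = attrs["coefficients"]["double"]
intercept = attrs["intercept"]["double"]

# The declared tensor shape should match the number of stored values.
declared_size = attrs["coefficients"]["shape"]["dimensions"][0]["size"]


def predict(features):
    """Textbook linear regression: dot(coefficients, features) + intercept."""
    if len(features) != len(coefficients):
        raise ValueError("expected {} features".format(len(coefficients)))
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


# Hypothetical feature vectors, chosen only to make the arithmetic obvious.
zero_prediction = predict([0.0] * 7)
first_only = predict([1.0] + [0.0] * 6)
```

With an all-zero feature vector the prediction reduces to the intercept, which is a quick sanity check that the attributes were read correctly.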

node.json

For the one hot encoder:

    {
      "name": "oneHot_4b815730d602",
      "shape": {
        "inputs": [{
          "name": "fico_index",
          "port": "input"
        }],
        "outputs": [{
          "name": "fico",
          "port": "output"
        }]
      }
    }

  1. name specifies the name of the node in the execution graph
  2. shape specifies the inputs and outputs of the node, and how they are to be used internally by the operation

In this case, the fico_index column is to be used as the input column of the one hot encoder, and fico will be the result column.
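
One way to picture the shape is as a mapping from port names to column names. The sketch below extracts that mapping from the example and applies a stand-in operation to a single row represented as a dictionary. The row value, the helper names ports_to_columns and run_node, and the stand-in operation are all hypothetical; a real one hot encoder would emit a vector rather than a string.

```python
import json

NODE_JSON = """
{
  "name": "oneHot_4b815730d602",
  "shape": {
    "inputs": [{"name": "fico_index", "port": "input"}],
    "outputs": [{"name": "fico", "port": "output"}]
  }
}
"""

node = json.loads(NODE_JSON)


def ports_to_columns(sockets):
    """Map each port name to the column name bound to it."""
    return {socket["port"]: socket["name"] for socket in sockets}


inputs = ports_to_columns(node["shape"]["inputs"])
outputs = ports_to_columns(node["shape"]["outputs"])


def run_node(row, operation):
    """Read the input column, apply the operation, write the output column."""
    result = dict(row)
    result[outputs["output"]] = operation(row[inputs["input"]])
    return result


# Hypothetical row and stand-in operation, used only to show the wiring.
row = {"fico_index": 2.0}
new_row = run_node(row, lambda value: "encoded({})".format(value))
```

Keeping the port-to-column mapping outside the operation means the same operation can be rewired to different columns by editing node.json alone.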

MLeap Bundle Examples

Here are some examples of serialized bundle files. They are not meant to be useful pipelines, but rather to illustrate what these files actually look like. The pipelines were generated when running our Spark parity tests, which ensure that MLeap transformers and Spark transformers produce exactly the same outputs.

MLeap/Spark Parity Bundle Examples

NOTE: Right-click and choose “Save As…”; Gitbook prevents clicking the link directly.