Get the Model Metadata When Generating Prediction Workflow Codes

Background

Currently, we have saved the model structure and weights after training in the SQLFlow. However, there are some requirements that we have to save the model metadata too.

Let us take an example of the TO TRAIN and TO PREDICT statements.

  1. SELECT * FROM my_db.train_table
  2. TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
  3. ...
  4. LABEL class
  5. INTO my_model;
  6. SELECT * FROM my_db.test_table
  7. TO PREDICT my_db.test_table_prediction.class
  8. USING my_model;
  • We should save the docker image name which is used in the TO TRAIN statement so that we can use the same docker image in the TO PREDICT statement.

    The TO PREDICT statement should use the same docker image as the TO TRAIN statement, i.e., my_docker_registry/my_docker_image:latest in the above example. Therefore, we should save the docker image name used in the TO TRAIN statement at the end of the training.

  • When running TO PREDICT statement, we should know whether the trained model is a TensorFlow or XGBoost model, so that we can use different ways to generate the Python code.

    The code generation may be quite different for TensorFlow and XGBoost models. When running the TO TRAIN statement, we can use the estimator name (i.e., MyDNNClassifier in the above example) to distinguish whether the model to train is a TensorFlow or XGBoost model. But when running the TO PREDICT statement, we can only get the trained model name (i.e. my_model in the above example) but not the estimator name. Therefore, we do not know whether the model is a TensorFlow or XGBoost model, and we cannot know how to generate the prediction Python code. As a result, we should save the estimator name at the end of the training.

What Data Should Be Saved As the Model Metadata

We propose to save all fields in the ir.TrainStmt, so that we can get all necessary metadata of the TO TRAIN statement.

How to Save the Model Metadata

Open Source Version

In the open source version, we can save the metadata in the DBMS along with the model structure and weights. Suppose that users write the following SQL statement:

  1. SELECT * FROM my_db.train_table
  2. TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
  3. ...
  4. LABEL class
  5. INTO my_db.my_trained_dnn_model;

We should save the metadata, model structure and weights in the DBMS table my_db.my_trained_dnn_model together.

In the current implementation, we had saved model structure and weights in the format of:

  1. +-----+-----------------------------+
  2. | id | block |
  3. +-----|-----------------------------+
  4. | 0 | |
  5. | 1 | model structure and weights |
  6. | ... | |
  7. +-----+-----------------------------+

The model structure and weights were deserialized into byte stream and saved in multiple rows inside the DBMS table.

In the new design, we propose the saved data in the DBMS table is in the format of:

  1. +-----+----------------------------------------------------------+
  2. | id | block |
  3. +-----|----------------------------------------------------------+
  4. | 0 | |
  5. | 1 | (metadata_length, metadata, model structure and weights) |
  6. | ... | |
  7. +-----+----------------------------------------------------------+

The first 64 bit is the metadata’s length, then the second field is metadata, and the last field is model structure and weights. This design is almost the same with the current implementation except for the leading metadata fields. In this way, we can get metadata easily without loading the entire model.

PAI Platform Version

In the PAI platform version, we can save the metadata in the OSS bucket along with the model structure and weights. Suppose that users write the following SQL statement:

  1. SELECT * FROM my_db.train_table
  2. TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
  3. ...
  4. LABEL class
  5. INTO my_pai_trained_dnn_model;

We propose to save the model metadata in the OSS bucket oss://sqlflow-models/user_id/my_pai_trained_dnn_model/metadata.json, and to save the model structure and weights in the OSS bucket oss://sqlflow-models/user_id/my_pai_trained_dnn_model/model_save.

How to Get the Model Metadata in Prediction Workflow Codegen

There are 2 situations when we do prediction:

  • Case 1: we have trained a model beforehand, and we only run one TO PREDICT statement. That is to say, the total SQL statements to run contain only one SQL statement:

    1. SELECT * FROM my_db.test_table
    2. TO PREDICT my_db.test_table_prediction.class
    3. USING my_model;

    Since we have trained the model my_model beforehand, we can get the metadata from the DBMS table or OSS bucket when running these statements above.

  • Case 2: the TO PREDICT statement uses the trained model from the previous workflow step. That is to say, the total SQL statements to run contain two SQL statements:

    1. SELECT * FROM my_db.train_table
    2. TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
    3. ...
    4. LABEL class
    5. INTO my_model;
    6. SELECT * FROM my_db.test_table
    7. TO PREDICT my_db.test_table_prediction.class
    8. USING my_model;

    Since the the trained model my_model will be only generated after the first workflow step runs, we cannot get the model metadata from the DBMS or OSS when we generate the workflow codes for the TO PREDICT statement. In this case, we should do some dependency analysis for the SQL statements. That is to say, we should check if there is any TO TRAIN statement that will generates the trained model, and get the model metadata from that TO TRAIN statement.

In conclusion, the way we try to get the model metadata when generating the workflow codes for the TO PREDICT statement is:

  • Check if there is any TO TRAIN statement that generates the trained model used by the TO PREDICT statement.
  • If yes, use the metadata from the TO TRAIN statement directly.
  • If no, try to get the model metadata from the DBMS table or OSS bucket.