Explain the Machine Learning Model in SQLFlow

Concept

Although machine learning models are widely used in many fields, they remain mostly black boxes. SHAP is widely used by data scientists to explain the output of any machine learning model.

This design doc describes how SQLFlow supports the Explain SQL with SHAP as the backend and displays the visualization image to the user.

User Interface

Users usually train a model with a TO TRAIN SQL and then explain the model with a TO EXPLAIN SQL. A simple pipeline looks like this:

Train SQL:

  SELECT * FROM train_table
  TO TRAIN xgboost.Estimator
  WITH
      train.objective = "reg:linear"
  COLUMN x
  LABEL y
  INTO my_model;

Explain SQL:

  SELECT * FROM train_table
  TO EXPLAIN my_model
  WITH
      plots = force
  USING TreeExplainer

where:

  • train_table is the table of training data.
  • my_model is the trained model.
  • force and summary are the visualization methods; the example above uses force.
  • TreeExplainer is the explainer type.
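
To make the roles of these clauses concrete, here is a minimal Python sketch that splits the Explain SQL shown above into its parts. It is only an illustration: the regular expression, the `parse_explain` function, and the assumption that the WITH and USING clauses are optional are all hypothetical, and the actual SQLFlow parser is not implemented this way.

```python
import re

# Hypothetical, simplified pattern for the Explain SQL shown above.
# It only illustrates the clauses; it is not the SQLFlow grammar.
EXPLAIN_RE = re.compile(
    r"""
    (?P<select>SELECT\b.*?)               # standard SQL that selects the data
    \s+TO\s+EXPLAIN\s+(?P<model>[\w.]+)   # name of the trained model
    (?:\s+WITH\s+(?P<attrs>.*?))?         # attribute list (assumed optional)
    (?:\s+USING\s+(?P<explainer>\w+))?    # explainer type (assumed optional)
    \s*;?\s*$
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def parse_explain(sql):
    """Return the select statement, model, attributes, and explainer."""
    match = EXPLAIN_RE.match(sql.strip())
    if match is None:
        raise ValueError("not an Explain SQL statement")

    # Attributes are comma-separated "key = value" pairs.
    attrs = {}
    for pair in (match.group("attrs") or "").split(","):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")
        attrs[key.strip()] = value.strip().strip('"')

    return {
        "select": " ".join(match.group("select").split()),
        "model": match.group("model"),
        "attrs": attrs,
        "explainer": match.group("explainer"),
    }


statement = parse_explain("""
SELECT * FROM train_table
TO EXPLAIN my_model
WITH
    plots = force
USING TreeExplainer
""")
print(statement)
```

The resulting dictionary carries the fields that a later code generation step would need: the data query, the model name, the plot type, and the explainer type.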

The Explain SQL displays the visualization image in Jupyter, as shown in Figure 1.

Implementation Details

  • Enhance the SQLFlow parser to support the Explain keyword.
  • Implement codegen_shap.go to generate a SHAP Python program. The SQLFlow Executor module runs the Python program, which prints the visualization image in HTML format to stdout. The Go program captures this output using CombinedOutput.
  • For each Explain SQL request from the SQLFlow magic command, the SQLFlow server responds with the HTML text as a single message, and the visualization image is then displayed in the Jupyter Notebook.
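
The generate, run, and capture steps above can be sketched in a few lines. The sketch below is a Python analogue, not the Go implementation: `PROGRAM_TEMPLATE`, `generate_program`, and `run_program` are hypothetical names, and the generated program is a stub that prints a placeholder HTML fragment instead of calling SHAP, so that the example runs without extra dependencies. Redirecting stderr into stdout plays the role of Go's CombinedOutput.

```python
import subprocess
import sys
from string import Template

# Stand-in for the program that codegen_shap.go would emit. A real generated
# program would load the model, call SHAP, and print the plot as HTML; this
# stub only prints a placeholder fragment so the sketch runs without SHAP.
PROGRAM_TEMPLATE = Template('''\
import sys
print("<div class='shap-plot' data-explainer='$explainer' data-plot='$plot'>")
print("placeholder for the $plot plot of $model")
print("</div>")
print("warning: stub program, no SHAP values computed", file=sys.stderr)
''')


def generate_program(model, explainer, plot):
    """Fill the template with the fields taken from the Explain SQL."""
    return PROGRAM_TEMPLATE.substitute(model=model, explainer=explainer, plot=plot)


def run_program(source):
    """Run the generated program and return stdout and stderr as one string."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, like CombinedOutput
        text=True,
        check=True,                # raise if the generated program fails
    )
    return result.stdout


html = run_program(generate_program("my_model", "TreeExplainer", "force"))
print(html)
```

One consequence of capturing both streams together is visible here: the warning written to stderr ends up in the same text as the HTML, so anything the generated program logs becomes part of the message returned to the notebook.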

Note

  • For the current milestone, SQLFlow supports only DeepExplainer for Keras models and TreeExplainer for XGBoost. More explainers and model types will be supported in the future.
  • We don’t use the bare keyword Explain, even though it is the more relevant one, because Explain is already used throughout various SQL databases.
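
The support matrix in the first note can be expressed as a small lookup. The following Python sketch is an assumption about how such a check might look, not SQLFlow's actual code: the model-family labels "keras" and "xgboost", the `resolve_explainer` function, and the rule that an omitted USING clause falls back to the only supported explainer are all hypothetical.

```python
# Explainers named in the note above, keyed by a hypothetical model-family label.
SUPPORTED_EXPLAINERS = {
    "keras": "DeepExplainer",
    "xgboost": "TreeExplainer",
}


def resolve_explainer(model_family, requested=None):
    """Return the explainer to use, or raise if the combination is unsupported."""
    if model_family not in SUPPORTED_EXPLAINERS:
        raise ValueError(f"no explainer is supported yet for {model_family!r} models")

    supported = SUPPORTED_EXPLAINERS[model_family]
    # Assumption: when no explainer is requested, use the supported one.
    if requested is not None and requested != supported:
        raise ValueError(
            f"{requested} is not supported for {model_family!r} models; use {supported}"
        )
    return supported


print(resolve_explainer("xgboost", "TreeExplainer"))
print(resolve_explainer("keras"))
```

Keeping the pairs in one table means that supporting a new explainer or model type later only requires a new entry, although a one-to-many mapping would be needed once a model type accepts several explainers.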