Explaining Model on SQLFlow Tutorial

Open In PAI-DSW

The Explainer is designed to explain the machine learning model in SQLFlow. In this tutorial, you will learn how to,

You can find more SQLFlow usages from the Language Guide.

We implement the explainer based on SHAP. By SQLFlow, an EXPLAIN SQL will be translated to the SHAP code. SQLFlow enables the code to read the dataset and load the trained model, then draws a figure to explain the model. At this stage, SQLFlow supports using the TreeExplainer to draw a summary plot.

Syntax

  1. SELECT * FROM tbl
  2. TO EXPLAIN a_trained_model
  3. WITH
  4. summary.parameter_i=value_i ...
  5. USING TreeExplainer;
  • A standard SQL statement used to specify the dataset.
  • TO EXPLAIN introduces the model to explain.
  • USING TreeExplainer restricts the explainer.
  • By WITH, we specify the parameters to summary_plot with a prefix summary. like: summary.plot_type=\"bar\".

The Dataset

We use the boston housing as the demonstration dataset. First, we train a model to fit the dataset. Next, we write an EXPLAIN SQL to get an overview of which features are most important for the model.

Train a Model

  1. %%sqlflow
  2. SELECT * FROM boston.train
  3. TO TRAIN xgboost.gbtree
  4. WITH
  5. objective="reg:squarederror",
  6. train.num_boost_round = 30
  7. COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
  8. LABEL medv
  9. INTO sqlflow_models.my_xgb_regression_model;

Explain the Model

We can plot the SHAP values of every feature for every sample.

  1. %%sqlflow
  2. SELECT *
  3. FROM boston.train
  4. TO EXPLAIN sqlflow_models.my_xgb_regression_model
  5. WITH
  6. summary.plot_type="dot",
  7. summary.alpha=1,
  8. summary.sort=True
  9. USING TreeExplainer;

Explaining Model on SQLFlow Tutorial - 图2

The plot above sorts features by the sum of SHAP value magnitudes over all samples, and use SHAP values to show the distribution of the impacts each feature has on the model output. The color represents the feature values(red high, blue low). This reveals for example that a high LSTAT(% lower status of the population) lowers the predicted home price.

We can also just take the mean absolute value of the SHAP values for each feature to get a standard bar plot:

  1. %%sqlflow
  2. SELECT *
  3. FROM boston.train
  4. TO EXPLAIN sqlflow_models.my_xgb_regression_model
  5. WITH
  6. summary.plot_type="bar",
  7. summary.alpha=1,
  8. summary.sort=True
  9. USING TreeExplainer;

Explaining Model on SQLFlow Tutorial - 图3