# Feature Column Support in SQLFlow XGBoost Models

## Overall
SQLFlow extends the SQL grammar to support data pre-processing using `COLUMN` clauses. For example, we can use `CATEGORY_HASH` to map a string column to an integer column, which is a common data pre-processing operation in NLP tasks:
```sql
SELECT string_column1, int_column2, class FROM train_table
TO TRAIN xgboost.gbtree
COLUMN INDICATOR(CATEGORY_HASH(string_column1, 10)), int_column2
LABEL class
INTO sqlflow_xgboost_model.my_model;
```
Currently, `COLUMN` clauses are supported in SQLFlow TensorFlow models: the `COLUMN` clauses are transformed into TensorFlow feature column API calls inside the SQLFlow codegen implementation.

However, XGBoost has no feature column APIs similar to TensorFlow's. Currently, XGBoost models only support simple column names like `c1, c2, c3` in `COLUMN` clauses, and no data pre-processing is supported. As a result, we cannot use XGBoost to train models that accept string columns as input.

This design explains how SQLFlow supports feature columns in XGBoost models.
## Supported feature columns in TensorFlow models

SQLFlow `COLUMN` clauses support the feature columns listed below, which are implemented by TensorFlow APIs.
| SQLFlow keyword | TensorFlow API | Description |
|---|---|---|
| DENSE | `tf.feature_column.numeric_column` | Raw numeric feature column without any pre-processing |
| BUCKET | `tf.feature_column.bucketized_column` | Transform an input integer into a bucket id according to the given boundaries |
| CATEGORY_ID | `tf.feature_column.categorical_column_with_identity` | Identity mapping of an integer feature column |
| CATEGORY_HASH | `tf.feature_column.categorical_column_with_hash_bucket` | Use a hash algorithm to map a string or integer to a category id |
| SEQ_CATEGORY_ID | `tf.feature_column.sequence_categorical_column_with_identity` | Sequence-data version of CATEGORY_ID |
| CROSS | `tf.feature_column.crossed_column` | Combine multiple category features using a hash algorithm |
| INDICATOR | `tf.feature_column.indicator_column` | Transform a category id into a multi-hot representation |
| EMBEDDING | `tf.feature_column.embedding_column` | Transform a category id into an embedding representation |
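For reference, here is a hedged sketch of what the `COLUMN INDICATOR(CATEGORY_HASH(string_column1, 10)), int_column2` clause from the earlier example corresponds to in terms of these TensorFlow APIs (the column names and bucket size come from that example):

```python
import tensorflow as tf

# CATEGORY_HASH(string_column1, 10): hash the string column into 10 buckets.
hashed = tf.feature_column.categorical_column_with_hash_bucket(
    "string_column1", hash_bucket_size=10)

# INDICATOR(...): turn the hashed category ids into a multi-hot vector.
indicator = tf.feature_column.indicator_column(hashed)

# int_column2 without any keyword is a raw numeric (DENSE) column.
numeric = tf.feature_column.numeric_column("int_column2")

feature_columns = [indicator, numeric]
```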
## Feature column design in XGBoost models

The training process of an XGBoost model inside SQLFlow is as follows:

- Step 1: Read data from the database. Call the `db.db_generator()` method to get a Python generator that yields each row of the database table.
- Step 2: Dump the SVM file. Call the `dump_dmatrix()` method to write the raw data into an SVM (LibSVM-format) file. This file is ready to be loaded as an XGBoost DMatrix.
- Step 3: Training. Load the dumped file as an XGBoost DMatrix and start training. The training process is performed by calling the `xgboost.train` API.
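For illustration, here is a minimal sketch of these three steps; `db_generator` and `dump_dmatrix` below are simplified stand-ins for SQLFlow's internal `db.db_generator()` and `dump_dmatrix()` helpers, whose real signatures differ:

```python
import xgboost as xgb

def db_generator():
    # Stand-in for db.db_generator(): yields (features, label) for each row.
    yield [1.0, 2.0, 3.0], 0
    yield [4.0, 5.0, 6.0], 1

def dump_dmatrix(path, generator):
    # Stand-in for dump_dmatrix(): write the rows in LibSVM format.
    with open(path, "w") as f:
        for features, label in generator:
            cols = " ".join("%d:%g" % (i, v) for i, v in enumerate(features))
            f.write("%s %s\n" % (label, cols))

dump_dmatrix("train.svm", db_generator())  # Steps 1 and 2
# Step 3: newer XGBoost versions may need "train.svm?format=libsvm" here.
dtrain = xgb.DMatrix("train.svm")
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)
```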
As discussed in the "COLUMN clause for XGBoost" design, there are 3 candidate ways to support feature columns in XGBoost models:
Method 1. Perform the feature column transformation during step 1 and step 2, so that data pre-processing is done before dumping the SVM file. This method is suitable for offline training (both standalone and distributed), prediction, and evaluation, since the same transformation code in Python can be generated for all three. But it is not suitable for online serving, because online serving usually uses other libraries or languages (like C++/Java), which cannot run the transformation code that SQLFlow generates in Python.

Method 2. Modify the training iteration of XGBoost and insert the transformation code into each iteration. However, modifying the training iteration of XGBoost is not easy. Moreover, it is also not suitable for online serving, for the same reason as Method 1.

Method 3. Combine data pre-processing and model training into a scikit-learn pipeline. Since a scikit-learn pipeline can be saved as a PMML file by sklearn2pmml or Nyoka, this method is suitable for standalone training, offline prediction, offline evaluation, and online serving. Distributed training of a scikit-learn pipeline can be performed using Dask. However, a distributed training pipeline cannot be saved as a PMML file directly, because Dask mocks the native scikit-learn APIs rather than using them to build the pipeline, and the mocked APIs cannot be saved. Another problem is that scikit-learn pipelines support very few data pre-processing transformers; for example, hashing a string to a single integer is not supported in scikit-learn. Of course, we can add more data pre-processing transformers, but these newly added transformers cannot be saved into a PMML file.

In summary, one of the most critical issues is how the data pre-processing transformers can be saved for online serving. After investigating the online serving platforms in the company (Arks, etc.), we found that data pre-processing steps are usually not saved in PMML or Treelite files. The online serving platform provides plugins for users to choose their data pre-processing steps instead of loading them from PMML or Treelite files. Therefore, we prefer Method 1 to implement feature columns in XGBoost models.
The feature column transformers in Python can be implemented as:
```python
class BaseFeatureColumnTransformer(object):
    def __call__(self, inputs):
        raise NotImplementedError()

    def set_column_names(self, column_names):
        self.column_names = column_names


class NumericColumnTransformer(BaseFeatureColumnTransformer):
    # `key` is the column name inside the `SELECT` statement
    def __init__(self, key):
        self.key = key

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        self.index = self.column_names.index(self.key)

    # `inputs` contains all raw column data.
    # NumericColumnTransformer only takes the column indicated by `index`.
    def __call__(self, inputs):
        return inputs[self.index]


# CategoricalColumnTransformer is the base class of all category columns.
# This base class is designed to do some checks. For example, `INDICATOR`
# only accepts a category column as its input.
class CategoricalColumnTransformer(BaseFeatureColumnTransformer):
    pass


class BucketizedColumnTransformer(CategoricalColumnTransformer):
    def __init__(self, key, boundaries, default_value=None):
        self.key = key
        self.boundaries = boundaries
        self.default_value = default_value

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        self.index = self.column_names.index(self.key)

    def __call__(self, inputs):
        value = inputs[self.index]
        # The bucket id is the number of boundaries that `value` reaches or exceeds.
        for idx, boundary in enumerate(self.boundaries):
            if value < boundary:
                return idx
        return len(self.boundaries)


class CrossedColumnTransformer(BaseFeatureColumnTransformer):
    def __init__(self, keys, hash_bucket_size):
        self.keys = keys
        self.hash_bucket_size = hash_bucket_size

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        self.column_indices = [self.column_names.index(key) for key in self.keys]

    def _cross(self, transformed_inputs): ...

    def __call__(self, inputs):
        selected_inputs = [inputs[idx] for idx in self.column_indices]
        return self._cross(selected_inputs)


class ComposedFeatureColumnTransformer(BaseFeatureColumnTransformer):
    def __init__(self, *transformers):
        self.transformers = transformers

    def set_column_names(self, column_names):
        BaseFeatureColumnTransformer.set_column_names(self, column_names)
        for t in self.transformers:
            t.set_column_names(column_names)

    def __call__(self, inputs):
        return [t(inputs) for t in self.transformers]
```
For example, the column clause `COLUMN INDICATOR(CATEGORY_HASH(string_column1, 10)), int_column2` would finally be transformed into the following Python calls:
```python
transform_fn = ComposedFeatureColumnTransformer(
    IndicatorColumnTransformer(
        CategoryColumnWithHashBucketTransformer(key="string_column1", hash_bucket_size=10)),
    NumericColumnTransformer(key="int_column2"))
```
Then we pass `transform_fn` to the `runtime.xgboost.train` method. Inside `runtime.xgboost.train`, we transform the raw data from `db.db_generator(...)` by calling the `transform_fn.__call__` method. The `set_column_names` method is called once when the table schema is obtained at runtime, so that the index of `key` can be inferred in the Python runtime. The transformed data is written into the SVM file, which is then loaded in the following training step.
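A hedged sketch of how `transform_fn` might be applied when dumping the SVM file, reusing the hypothetical `db_generator` stand-in from the earlier sketch and assuming each transformer returns a scalar or a flat list (the real SQLFlow runtime code differs in detail):

```python
column_names = ["string_column1", "int_column2"]  # obtained from the table schema
transform_fn.set_column_names(column_names)

def flatten(values):
    # Flatten per-transformer outputs (e.g. a multi-hot list from INDICATOR).
    for v in values:
        if isinstance(v, (list, tuple)):
            for x in v:
                yield x
        else:
            yield v

with open("train.svm", "w") as f:
    for raw_row, label in db_generator():
        features = list(flatten(transform_fn(raw_row)))
        cols = " ".join("%d:%g" % (i, v) for i, v in enumerate(features))
        f.write("%s %s\n" % (label, cols))
```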
Another concern is that we must perform the same data pre-processing in the prediction/evaluation stage. Therefore, we should save the feature columns used for training, so that they can be loaded in the prediction/evaluation stage. Besides, the codegen for the prediction/evaluation stage should generate the same transformation code as the training stage.
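One possible way to persist the feature columns, sketched here with plain pickling (the actual SQLFlow mechanism for saving model metadata may differ):

```python
import pickle

# Training stage: persist the feature column transformers next to the model.
with open("feature_columns.pkl", "wb") as f:
    pickle.dump(transform_fn, f)

# Prediction/evaluation stage: restore the exact same transform_fn.
with open("feature_columns.pkl", "rb") as f:
    transform_fn = pickle.load(f)
```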
It should be noted that `EMBEDDING` is not supported in this design doc, because the `EMBEDDING` feature column may contain trainable parameters, and these parameters cannot be updated during the XGBoost training process.
## Export the XGBoost models to PMML/Treelite files
XGBoost supports 2 kinds of APIs to train a model:

- `xgboost.train`. We use this API in our current implementation. The returned Booster can be saved in a format that can be loaded by the Treelite APIs, but not by the PMML APIs.
- `xgboost.XGBClassifier/XGBRegressor/XGBRanker`. Sklearn2pmml or Nyoka can only export models built by these APIs to PMML format. But these APIs may not be easy to use, because:
  - We must know beforehand whether the model is a classifier/regressor/ranker.
  - The constructors of `xgboost.XGBClassifier/XGBRegressor/XGBRanker` mix the Booster parameters and the training parameters together. For example, `booster` is a Booster parameter and `n_estimators` is a training parameter, but both appear in the constructors. This makes it hard to distinguish model parameters from training parameters in the SQLFlow code.
  - Some parameter names in `xgboost.XGBClassifier/XGBRegressor/XGBRanker` differ from those in `xgboost.train`. For example, `n_estimators` in `xgboost.XGBClassifier/XGBRegressor/XGBRanker` is the same as `num_boost_round` in `xgboost.train`.
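To illustrate the naming mismatch, the following two calls configure the same number of boosting rounds (a sketch; `params`, the file name, and the commented `fit` call are placeholders):

```python
import xgboost as xgb

params = {"max_depth": 3, "objective": "binary:logistic"}
dtrain = xgb.DMatrix("train.svm")

# Low-level API: the number of trees is a training argument.
booster = xgb.train(params, dtrain, num_boost_round=100)

# Scikit-learn style API: the same setting is a constructor argument with a
# different name, mixed together with Booster parameters such as max_depth.
clf = xgb.XGBClassifier(max_depth=3, n_estimators=100)
# clf.fit(X, y)  # trained from in-memory arrays instead of a DMatrix
```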
Therefore, we prefer to use the `xgboost.train` API to perform training in this design, and export PMML/Treelite files in the following ways:
- A PMML file can be exported by the following steps:
  - Call `Booster.load_model()` to load the trained model.
  - Check the Booster objective to build one of `xgboost.XGBClassifier/XGBRegressor/XGBRanker`.
  - Call `xgboost.XGBClassifier/XGBRegressor/XGBRanker.load_model()` to load the trained model again.
  - Build a sklearn pipeline using the pre-built `xgboost.XGBClassifier/XGBRegressor/XGBRanker`.
  - Save the pipeline as a PMML file using Sklearn2pmml or Nyoka.
- A Treelite file can be exported by `Model.from_xgboost` using the Booster saved by `Booster.save_model()`.
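A hedged sketch of the PMML export path for a binary classifier, assuming the Booster was saved to `my_model.bin` by `Booster.save_model()` (file names are illustrative, and the objective check is only indicated by a comment):

```python
import xgboost as xgb
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Load the Booster trained by xgboost.train and saved via Booster.save_model().
booster = xgb.Booster(model_file="my_model.bin")

# Check the objective to decide which scikit-learn wrapper to build. Here we
# simply assume a classification objective; real code would inspect the Booster
# configuration or the recorded training parameters.
clf = xgb.XGBClassifier()
clf.load_model("my_model.bin")

# Build a pipeline around the pre-built classifier and save it as PMML.
pipeline = PMMLPipeline([("classifier", clf)])
sklearn2pmml(pipeline, "my_model.pmml")

# A Treelite file can be produced from the same Booster, e.g. with
# treelite.Model.from_xgboost(booster) in older Treelite releases.
```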