ALPS Submitter

ALPS (Ant Learning and Prediction Suite) is a common algorithm framework at Ant Group. It focuses on providing users with an efficient, easy-to-use machine learning programming framework and machine learning algorithm solutions for financial scenarios.

This module is used to submit ALPS machine learning training tasks in SQLFlow.

Preconditions

  1. For machine learning models, we only consider TensorFlow premade estimators.
  2. To simplify the design, we only execute training without evaluation in the estimator.
  3. If a table cell is encoded, we assume the user always provides enough decoding information, such as dense/sparse format and shape, via expressions such as DENSE and SPARSE.

Data Pipeline

Standard Select -> Train Input Table -> Decoding -> Input Fn and TrainSpec

The standard select query is executed in a SQL engine such as ODPS or SparkSQL, and we take the result table as the input of training.

If a table cell is encoded, we assume the user always provides enough decoding information, such as dense/sparse format, shape, and delimiter, via expressions such as DENSE and SPARSE.

The decode expression must appear in the COLUMN block.

Dense Format

Encoded data is in the dense format when a single table cell contains multiple numeric features.

For example, numeric features such as price, count, and frequency may be spliced into one string using a comma delimiter.

In this situation, the DENSE expression should be used to declare the decoding information, such as the shape and delimiter.

```
DENSE(table_column, shape, dtype, delimiter)

Args:
    table_column: column name
    shape(int): shape of the dense feature
    dtype: data type of the feature values (e.g. float)
    delimiter(string): delimiter of the encoded feature
```
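
For illustration, decoding a dense cell such as "10,2,4" amounts to the following sketch; `decode_dense` is a hypothetical helper we introduce here, not part of the ALPS API.

```python
import numpy as np

def decode_dense(cell, shape, dtype='float32', delimiter=','):
    """Hypothetical sketch: decode a dense-encoded cell into an array."""
    # "10,2,4" -> array([10., 2., 4.], dtype=float32) with shape (3,)
    return np.array(cell.split(delimiter), dtype=dtype).reshape(shape)
```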

Sparse Format

Encoded data is in the sparse format when a single table cell not only contains multiple features but also maps each feature to a group/value pair.

For example, we have features such as city, gender, and interest, each of which has multiple values.

The values of city are beijing, hangzhou, and chengdu.

The values of gender are male and female.

The values of interest are book and movie.

Each of these values is mapped to an integer and associated with a group:

  1. `beijing` -> group 1, value 1
  2. `hangzhou` -> group 1, value 2
  3. `chengdu` -> group 1, value 3
  4. `male` -> group 1, value 4
  5. `female` -> group 1, value 5
  6. `book` -> group 2, value 1
  7. `movie` -> group 2, value 2

If we use a colon as the group/value delimiter and a comma as the feature delimiter, “3:1, 4:1, 2:2” means “chengdu, male, movie”.

```
SPARSE(table_column, shape, dtype, delimiter, group_delimiter)

Args:
    table_column: column name
    shape(list): list of embedding shapes, one for each group
    dtype: data type of the feature values (e.g. float)
    delimiter(string): delimiter between features
    group_delimiter(string): delimiter between value and group
```
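
To make the format concrete, here is a sketch of decoding a sparse cell into parallel value/group lists; `decode_sparse` is hypothetical, since the real decoding happens inside ALPS.

```python
def decode_sparse(cell, delimiter=',', group_delimiter=':'):
    """Hypothetical sketch: decode "3:1,4:1,2:2" into value and group ids."""
    values, groups = [], []
    for feature in cell.split(delimiter):
        value, group = feature.split(group_delimiter)
        values.append(int(value))
        groups.append(int(group))
    return values, groups

# Prints ([3, 4, 2], [1, 1, 2]), i.e. chengdu, male, movie.
print(decode_sparse('3:1,4:1,2:2'))
```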

Decoding

The actual decoding does not happen in this submitter but inside ALPS.

What we do here is just generate the ALPS configuration file.

Let’s take an example of training a classifier model for a credit card fraud case.

Table Data

| column/row | c1 | c2 | c3 |
| --- | --- | --- | --- |
| r1 | 10,2,4 | 3:1,4:1,2:2 | 0 |
| r2 | 500,20,10 | 2:1,5:1,2:1 | 1 |

Column c1 is dense encoded, c2 is sparse encoded, and c3 is the label column.

SQL

```sql
select
    c1, c2, c3 as class
from kaggle_credit_fraud_training_data
TO TRAIN DNNClassifier
WITH
    ...
COLUMN
    DENSE(c1, shape=3, dtype=float, delimiter=','),
    SPARSE(c2, shape=[10, 10], dtype=float, delimiter=',', group_delimiter=':')
LABEL class
```

ALPS Configuration

```
# graph.conf
...
schema = [c1, c2, c3]
x = [
    {feature_name: c1, type: dense, shape: [3], dtype: float, separator: ","},
    {feature_name: c2, type: sparse, shape: [10, 10], dtype: float, separator: ",", group_separator: ":"}
]
y = {feature_name: c3, type: dense, shape: [1], dtype: int}
...
```
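
The mapping from a parsed COLUMN expression to a graph.conf entry is mechanical; a minimal sketch follows, assuming the parser yields a dict per expression (the dict layout and `to_config_entry` are illustrative, not the submitter's actual code).

```python
def to_config_entry(expr):
    """Sketch: turn a parsed DENSE/SPARSE expression into a config entry."""
    # expr example: {'op': 'DENSE', 'column': 'c1', 'shape': [3],
    #                'dtype': 'float', 'delimiter': ','}
    entry = {
        'feature_name': expr['column'],
        'type': expr['op'].lower(),   # dense or sparse
        'shape': expr['shape'],
        'dtype': expr['dtype'],
        'separator': expr['delimiter'],
    }
    if expr['op'] == 'SPARSE':
        entry['group_separator'] = expr['group_delimiter']
    return entry
```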

Feature Pipeline

Feature Expr -> Semantic Analysis -> Feature Columns Code Generation -> Estimator

Feature Expressions

In SQLFlow, we use Feature Expressions to represent the feature engineering process and convert them into a code snippet using the TensorFlow Feature Column API.

```
DENSE(key, shape)
BUCKETIZED(source_column, boundaries)
CATEGORICAL_IDENTITY(key, num_buckets, default_value)
CATEGORICAL_HASH(key, hash_bucket_size, dtype)
CATEGORICAL_VOCABULARY_LIST(key, vocabulary_list, dtype, default_value, num_oov_buckets)
CATEGORICAL_VOCABULARY_FILE(key, vocabulary_file, vocabulary_size, num_oov_buckets, default_value, dtype)
CROSS(keys, hash_bucket_size, hash_key)
```
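
Each expression corresponds to one call in the TensorFlow Feature Column API; a lookup table like the following sketch could drive the code generation (the table itself is illustrative, not the submitter's actual data structure).

```python
import tensorflow as tf

# Assumed one-to-one mapping from feature expressions to the
# TensorFlow Feature Column API calls they generate.
FEATURE_EXPR_TO_TF = {
    'DENSE': tf.feature_column.numeric_column,
    'BUCKETIZED': tf.feature_column.bucketized_column,
    'CATEGORICAL_IDENTITY': tf.feature_column.categorical_column_with_identity,
    'CATEGORICAL_HASH': tf.feature_column.categorical_column_with_hash_bucket,
    'CATEGORICAL_VOCABULARY_LIST':
        tf.feature_column.categorical_column_with_vocabulary_list,
    'CATEGORICAL_VOCABULARY_FILE':
        tf.feature_column.categorical_column_with_vocabulary_file,
    'CROSS': tf.feature_column.crossed_column,
}
```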

The feature expressions must appear in the COLUMN block.

Here is an example that applies BUCKETIZED to c2 and then crosses the result with c1.

```sql
select
    c1, c2, c3 as class
from kaggle_credit_fraud_training_data
TO TRAIN DNNClassifier
WITH
    ...
COLUMN
    CROSS([DENSE(c1), BUCKETIZED(DENSE(c2), [0, 10, 100])])
LABEL class
```

Semantic Analysis

Feature Expressions that cannot be expressed with the TensorFlow Feature Column API should raise an error.

```sql
/* Not supported */
select * from kaggle_credit_fraud_training_data
TO TRAIN DNNClassifier
WITH
    ...
COLUMN
    DENSE(f1 * 10)
```
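
A minimal sketch of this check, assuming the parser yields the expression name and its raw argument strings (`check_feature_expr` is a hypothetical helper, not the submitter's actual code):

```python
SUPPORTED = {'DENSE', 'BUCKETIZED', 'CATEGORICAL_IDENTITY', 'CATEGORICAL_HASH',
             'CATEGORICAL_VOCABULARY_LIST', 'CATEGORICAL_VOCABULARY_FILE',
             'CROSS'}

def check_feature_expr(op, args):
    """Hypothetical sketch: reject expressions the Feature Column API
    cannot express, such as DENSE(f1 * 10)."""
    if op not in SUPPORTED:
        raise ValueError('unsupported feature expression: ' + op)
    key = args[0]
    # Arithmetic on columns has no Feature Column equivalent.
    if any(ch in key for ch in '+-*/'):
        raise ValueError('column arithmetic is not supported: ' + key)
```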

Feature Columns Code Generation

We transform the feature column expressions into a code snippet and wrap it in a CustomFCBuilder class, which extends alps.feature.FeatureColumnsBuilder.

Reviewing the above example, the generated code snippet is:

```python
import tensorflow as tf
from alps.feature import FeatureColumnsBuilder

class CustomFCBuilder(FeatureColumnsBuilder):
    def build_feature_columns(self):
        fc1 = tf.feature_column.numeric_column('c1')
        fc2 = tf.feature_column.numeric_column('c2')
        fc3 = tf.feature_column.bucketized_column(fc2, boundaries=[0, 10, 100])
        # Cross c1 with the bucketized c2, matching the COLUMN clause above;
        # hash_bucket_size is required by the API, its value is illustrative.
        fc4 = tf.feature_column.crossed_column([fc1, fc3], hash_bucket_size=1000)
        return [fc4]
```

The ALPS framework will execute this code snippet and pass the result to the constructor of the estimator.
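
Conceptually, this is equivalent to the following sketch (simplified; the real plumbing, including how the estimator class and its parameters are resolved, lives inside ALPS):

```python
import tensorflow as tf

# Build the feature columns generated from the COLUMN block ...
feature_columns = CustomFCBuilder().build_feature_columns()

# ... and hand them to the premade estimator's constructor, together with
# constructor parameters such as hidden_units (see Parameters below).
estimator = tf.estimator.DNNClassifier(
    hidden_units=[10, 20],
    feature_columns=feature_columns)
```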

Parameters

We use the WITH block to set the parameters of training.

If the name is prefixed with estimator, it is a parameter of the Estimator's constructor.

If the name is prefixed with train_spec, it is a parameter of the TrainSpec's constructor.

If the name is prefixed with input_fn, it is a parameter of the input_fn.
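
A sketch of this prefix routing, assuming the WITH block has been parsed into a flat dict (`route_params` is a name we introduce for illustration):

```python
def route_params(with_params):
    """Sketch: split WITH parameters into estimator / train_spec /
    input_fn groups by their name prefix."""
    groups = {'estimator': {}, 'train_spec': {}, 'input_fn': {}}
    for name, value in with_params.items():
        prefix, _, param = name.partition('.')
        groups[prefix][param] = value
    return groups

params = route_params({'estimator.hidden_units': [10, 20],
                       'train_spec.max_steps': 2000,
                       'input_fn.batch_size': 512})
# params['estimator'] == {'hidden_units': [10, 20]}
```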

Let’s create a DNNClassifier example; the minimum parameters of its constructor are hidden_units and feature_columns.

```sql
select
    c1, c2, c3 as class
from kaggle_credit_fraud_training_data
TO TRAIN DNNClassifier
WITH
    estimator.hidden_units = [10, 20],
    train_spec.max_steps = 2000,
    input_fn.batch_size = 512
COLUMN
    CROSS([DENSE(c1), BUCKETIZED(DENSE(c2), [0, 10, 100])])
LABEL class
...
```

For now, we pass the result of the generated code snippet as the feature_columns parameter; until the AS syntax is supported in SQLFlow, this will raise an error if the estimator expects the feature columns under a different name.

```sql
select
    c1, c2, c3, c4, c5 as class
from kaggle_credit_fraud_training_data
TO TRAIN DNNLinearCombinedClassifier
WITH
    linear_feature_columns = [fc1, fc2],
    dnn_feature_columns = [fc3]
    ...
COLUMN
    DENSE(f1) as fc1,
    BUCKETIZED(fc1, [0, 10, 100]) as fc2,
    CROSS([fc1, fc2, f3]) as fc3
LABEL class
...
```