Information Necessary for Code Generators in SQLFlow

SQLFlow extends the syntax of the SELECT statement of SQL to support training a model:

  SELECT * FROM kaggle_credit_fraud_training_data
  LIMIT 1000
  TO TRAIN DNNClassifier       /* a pre-defined TensorFlow estimator, tf.estimator.DNNClassifier */
  WITH layers=[100, 200],      /* a parameter of the Estimator class constructor */
       train.batch_size = 8    /* a parameter of the Estimator.train method */
  COLUMN *,                    /* all columns as raw features */
         cross(v1, v9, v28)    /* plus a derived (crossed) column */
  LABEL class
  INTO sqlflow_models.my_model_table;  /* saves trained model parameters and features into a table */

Currently, we have the following syntax that allows users to provide the necessary information for training.

  1. SELECT STATEMENT
  2. TO TRAIN ...
  3. WITH ...
  4. COLUMN ... [FOR ...]
  5. LABEL ...
  6. INTO ...

The SQLFlow server passes the above information to code generators like sql/codegen.go and sql/codegen_alps.go, each of which generates the training program following its own, and sometimes conflicting, rules.

Things will become even more difficult if we add other kinds of sql/codegen_**.go in the future.

In this document, we summarize information necessary for the code generators.

Necessary Information for Training

Model Name

The model name is a string written after the TO TRAIN keyword. It can be the name of a TensorFlow pre-made estimator or the full package path of a customized Estimator/KerasModel.

If the model name is the full package path of a customized model, the model should be installed according to How to install the customized model in SQLFlow.

Model Constructor Parameters

The code generator constructs the model according to the model name, and the constructor parameters can be set in the WITH block.

Here are some rules:

  1. The name of a constructor parameter must begin with the prefix model.; the rest of the name is the real parameter name.
  2. To simplify the implementation, the value of a parameter must be a numeric, a string, or a list.

Take DNNClassifier as an example:

  TO TRAIN DNNClassifier
  WITH
      model.hidden_units = [10, 20, 10]
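The two rules above can be implemented with a simple prefix split over the parsed WITH attributes. Below is a minimal Go sketch; the function name splitWithAttrs and its signature are hypothetical, not SQLFlow's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// splitWithAttrs separates the attributes of a WITH block into model
// constructor parameters (names prefixed with "model.") and the remaining
// training attributes. The "model." prefix is stripped, so the key in
// modelParams is the real constructor parameter name.
func splitWithAttrs(attrs map[string]interface{}) (map[string]interface{}, map[string]interface{}) {
	modelParams := map[string]interface{}{}
	trainAttrs := map[string]interface{}{}
	for name, value := range attrs {
		if strings.HasPrefix(name, "model.") {
			modelParams[strings.TrimPrefix(name, "model.")] = value
		} else {
			trainAttrs[name] = value
		}
	}
	return modelParams, trainAttrs
}

func main() {
	modelParams, trainAttrs := splitWithAttrs(map[string]interface{}{
		"model.hidden_units": []int{10, 20, 10},
		"train.batch_size":   8,
	})
	fmt.Println(modelParams) // map[hidden_units:[10 20 10]]
	fmt.Println(trainAttrs)  // map[train.batch_size:8]
}
```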

Training Hyper-Parameters

The training hyper-parameters can also be set in the WITH block.

Here is the list of hyper-parameters we plan to support:

| parameter | name | type of value | default value | comment |
|---|---|---|---|---|
| batch size | train.batch_size | integer | 512 | |
| drop remainder | train.drop_remainder | bool | true | |
| cache | train.cache | bool/string | false | bool: whether to cache in memory; string: enable caching and name a directory on the filesystem in which to cache tensors |
| epoch | train.epoch | integer | 1 | |
| shard | train.shard | integer | 1 | distributed training if greater than 1; the number of parameter servers (PS) is fixed |
| shuffle | train.shuffle | bool/integer | false | bool: whether to shuffle; integer: the shuffle buffer size |
| max steps | train.max_steps | integer | None | |
| eval steps | eval.steps | integer | None | |
| start delay secs | eval.start_delay_secs | integer | 120 | |
| throttle secs | eval.throttle_secs | integer | 600 | |
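Resolving these hyper-parameters amounts to starting from the defaults in the table and overriding them with the user-supplied WITH attributes, rejecting any unknown name so that a bad statement fails before code generation. A minimal Go sketch (only the train.* subset of the table is shown; resolveTrainAttrs is a hypothetical helper):

```go
package main

import "fmt"

// resolveTrainAttrs starts from the table's default values, overrides them
// with user-supplied attributes, and rejects unknown attribute names.
func resolveTrainAttrs(attrs map[string]interface{}) (map[string]interface{}, error) {
	resolved := map[string]interface{}{
		"train.batch_size":     512,
		"train.drop_remainder": true,
		"train.cache":          false,
		"train.epoch":          1,
		"train.shard":          1,
		"train.shuffle":        false,
	}
	for name, value := range attrs {
		if _, ok := resolved[name]; !ok {
			return nil, fmt.Errorf("unknown training attribute %q", name)
		}
		resolved[name] = value
	}
	return resolved, nil
}

func main() {
	resolved, err := resolveTrainAttrs(map[string]interface{}{"train.batch_size": 8})
	if err != nil {
		panic(err)
	}
	// train.batch_size is overridden; train.epoch keeps its default of 1.
	fmt.Println(resolved["train.batch_size"], resolved["train.epoch"])
}
```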

COLUMN FOR

The expressions in the COLUMN clause contain the feature column information. There can be multiple COLUMN blocks in one SQL statement.

The identifier after the FOR keyword specifies the constructor parameter to which the feature columns are assigned.

For example, the following SQL

  ...
  TO TRAIN
      DNNLinearCombinedClassifier
  WITH
      ...
  COLUMN
      DENSE(...) FOR linear_feature_columns
  COLUMN
      BUCKET(...) FOR dnn_feature_columns
  ...

will be translated to

  estimator = tf.estimator.DNNLinearCombinedClassifier(
      linear_feature_columns=[tf.feature_column.numeric(...)],
      dnn_feature_columns=[tf.feature_column.bucket(...)],
      ...
  )

The COLUMN expressions are not only translated into TensorFlow feature columns; they also carry the data encoding information, such as DENSE or SPARSE.

For example, the following SQL

  SELECT
  ...
  COLUMN
      DENSE(c1, ...)
      SPARSE(c2, ...)

represents that the c1 field is in the dense format and the c2 field is in the sparse format.

Here is the list of supported expressions:

| expression | arguments of expression | example |
|---|---|---|
| dense | 1. field name (str); 2. dense shape (list of integers); 3. separator (str) | dense(c1, [100, 200], comma) |
| sparse | 1. field name (str); 2. sparse shape (integer); 3. separator (str) | sparse(c2, 10000, comma) |
| bucket | 1. key (numeric); 2. bucket size (integer) | bucket(numeric(c1, 100), 20) |
| cat_id | 1. field name (str); 2. bucket size (integer) | cat_id(c2, 10000) |
| embedding | 1. key (cat_id); 2. dimension (integer); 3. combiner (str) | embedding(cat_id(c2, 10000), mean) |
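A code generator must carry this encoding information alongside the feature columns. As a minimal Go sketch, a per-field record for the DENSE and SPARSE examples above might look like the following; the type name featureSpec matches the struct field in this document, but its members are illustrative assumptions, not SQLFlow's actual definition.

```go
package main

import "fmt"

// featureSpec records how a field is encoded in the table: whether it is
// dense or sparse, its shape, and the separator used to split the cell text.
type featureSpec struct {
	FieldName string
	IsSparse  bool
	Shape     []int
	Separator string
}

func main() {
	// DENSE(c1, [100, 200], comma) and SPARSE(c2, 10000, comma) would
	// resolve to records like these:
	specs := []featureSpec{
		{FieldName: "c1", IsSparse: false, Shape: []int{100, 200}, Separator: ","},
		{FieldName: "c2", IsSparse: true, Shape: []int{10000}, Separator: ","},
	}
	for _, s := range specs {
		fmt.Printf("%s sparse=%v shape=%v\n", s.FieldName, s.IsSparse, s.Shape)
	}
}
```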

Independent Module for Resolving Training Parameters

Since the format of the training parameters has been unified, it is better to have an independent module in SQLFlow that does the resolving according to these rules, instead of doing it in each codegen_**.go.

The advantages of an independent module include:

  1. A unified resolving process, avoiding differences in processing between codegen modules.
  2. Failing fast during resolving, before code generation.

Here we propose a module, say resolving.go, which takes the trainClause as input and outputs a struct named resolvedTrainClause for all kinds of codegen_**.go to use for code generation.

The struct resolvedTrainClause looks like this:

  package sql

  type resolvedTrainClause struct {
      IsPreMadeModel             bool
      ModelName                  string
      ModelConstructorParameters map[string]interface{}
      BatchSize                  int
      DropRemainder              bool
      EnableCache                bool
      CachePath                  string
      Epoch                      int
      Shard                      int
      EnableShuffle              bool
      ShuffleBufferSize          int
      MaxSteps                   int
      EvalSteps                  int
      EvalStartDelay             int
      EvalThrottle               int
      FeatureColumns             map[string][]featureColumn
      FeatureSpecs               map[string][]featureSpec
  }
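The resolving step can be sketched end to end as below: classify the model name, split the WITH attributes by the model. prefix, apply defaults, and fail fast on anything unknown or ill-typed. Only a few fields are shown, and the function resolveTrainClause and its error messages are assumptions, not SQLFlow's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// resolvedTrainClause mirrors the struct above (only a few fields shown).
type resolvedTrainClause struct {
	IsPreMadeModel             bool
	ModelName                  string
	ModelConstructorParameters map[string]interface{}
	BatchSize                  int
}

// resolveTrainClause validates the parsed TO TRAIN information and fills a
// resolvedTrainClause, returning an error before any code is generated.
func resolveTrainClause(modelName string, attrs map[string]interface{}) (*resolvedTrainClause, error) {
	r := &resolvedTrainClause{
		// A pre-made estimator has a bare name; a customized model is a
		// full package path and therefore contains a dot.
		IsPreMadeModel:             !strings.Contains(modelName, "."),
		ModelName:                  modelName,
		ModelConstructorParameters: map[string]interface{}{},
		BatchSize:                  512, // default from the hyper-parameter table
	}
	for name, value := range attrs {
		switch {
		case strings.HasPrefix(name, "model."):
			r.ModelConstructorParameters[strings.TrimPrefix(name, "model.")] = value
		case name == "train.batch_size":
			v, ok := value.(int)
			if !ok {
				return nil, fmt.Errorf("train.batch_size must be an integer, got %v", value)
			}
			r.BatchSize = v
		default:
			return nil, fmt.Errorf("unknown attribute %q", name)
		}
	}
	return r, nil
}

func main() {
	r, err := resolveTrainClause("DNNClassifier", map[string]interface{}{
		"model.hidden_units": []int{10, 20},
		"train.batch_size":   8,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(r.IsPreMadeModel, r.BatchSize) // true 8
}
```

Every codegen_**.go then consumes the same resolvedTrainClause, so differences between generators are confined to code emission rather than parameter interpretation.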