Intermediate Representation

Overview

As SQLFlow supports more and more machine learning toolkits, the corresponding code generation logic is better organized into separate packages. An intermediate representation (IR) of SQL jobs becomes necessary to connect these separate packages with the core sql package.

The core sql package should include the following functionalities:

  1. The entry point of running extended SQL statements.
  2. The parsing of extended SQL statements.
  3. The verification of extended SQL statements, including verifying the syntax and the existence of the selected fields.
  4. The feature derivation, including the name, type, shape, and preprocessing method of the selected fields.
  5. The training data and validation data split.
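
As a rough illustration of items 3 and 4 above, the following sketch checks that every selected field exists in a table schema and derives a per-field description. The names FieldMeta and deriveFeatures, the column-type strings, and the type-to-preprocessing mapping are all hypothetical; they are assumptions for illustration and are not taken from the SQLFlow code base.

```go
package main

import "fmt"

// FieldMeta is a hypothetical record of what feature derivation produces
// for one selected field: its name, type, shape, and preprocessing method.
type FieldMeta struct {
	Name       string
	DType      string // e.g. "float32" or "string"
	Shape      []int
	Preprocess string // e.g. "none" or "hash"
}

// deriveFeatures verifies that each selected field exists in the schema
// (column name -> database type) and derives a FieldMeta for it.
func deriveFeatures(schema map[string]string, selected []string) ([]FieldMeta, error) {
	metas := make([]FieldMeta, 0, len(selected))
	for _, name := range selected {
		dbType, ok := schema[name]
		if !ok {
			return nil, fmt.Errorf("selected field %q does not exist", name)
		}
		meta := FieldMeta{Name: name, Shape: []int{1}}
		switch dbType {
		case "FLOAT", "DOUBLE":
			meta.DType, meta.Preprocess = "float32", "none"
		case "INT", "BIGINT":
			meta.DType, meta.Preprocess = "int64", "none"
		default:
			// Assumption: any other type is treated as a categorical string.
			meta.DType, meta.Preprocess = "string", "hash"
		}
		metas = append(metas, meta)
	}
	return metas, nil
}

func main() {
	schema := map[string]string{"sepal_length": "FLOAT", "class": "VARCHAR"}
	metas, err := deriveFeatures(schema, []string{"sepal_length", "class"})
	fmt.Println(metas, err)
}
```

Returning an error for an unknown field, rather than skipping it, keeps the verification step strict, so that a typo in a field name fails before any code is generated.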

With these functionalities, the sql package can translate user-typed extended SQL statements into an IR, exposed as a Go struct. The codegen package takes the IR and returns a generated Python program for the sql package to execute.
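
The hand-off described above can be sketched as follows. This is a minimal illustration only: the TrainIR struct, its fields, the Generate function, and the Python template are hypothetical and far simpler than a real IR or code generator would be. It uses the standard text/template package to render the IR into Python source.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// TrainIR is a hypothetical, simplified IR for a training statement.
type TrainIR struct {
	Select    string   // the standard SQL part that selects training data
	Estimator string   // model class name
	Features  []string // names of the selected feature fields
	Label     string   // name of the label field
}

// pyTemplate is a hypothetical Python template; a real generator would
// emit a complete training program rather than a print statement.
const pyTemplate = `# generated training program (sketch)
select = """{{.Select}}"""
features = [{{range $i, $f := .Features}}{{if $i}}, {{end}}"{{$f}}"{{end}}]
label = "{{.Label}}"
print("train", "{{.Estimator}}", "on", features, "->", label)
`

// Generate renders the IR into Python source code.
func Generate(ir TrainIR) (string, error) {
	t, err := template.New("train").Parse(pyTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, ir); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	ir := TrainIR{
		Select:    "SELECT * FROM iris.train",
		Estimator: "DNNClassifier",
		Features:  []string{"sepal_length", "sepal_width"},
		Label:     "class",
	}
	code, err := Generate(ir)
	if err != nil {
		panic(err)
	}
	fmt.Print(code)
}
```

In this sketch the generator depends only on the IR struct, so the sql package needs no knowledge of any particular toolkit's Python API.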

Code Structure

We propose the following code structure.

  sql/
    ...
  codegen/
    feature_column.go
    intermediate_representation.go
    tensorflow/
      ...
    xgboost/
      ...

The IR and feature column definitions will reside in codegen. Each code generator package forms a subdirectory of codegen, such as codegen/tensorflow/.

Intermediate Representation

Please refer to codegen/intermediate_representation.go and codegen/feature_column.go for implementation details.