SQLFlow Code Generator
SQLFlow is a compiler that compiles a SQL program to an Argo workflow as the following pipeline:
parser -> AST -> sematic -> IR -> optimizer -> code generator
↑ ↓
sql program .YAML
The Argo controller running on Kubernetes is the executor that executes the workflow. This is a design doc about how to implement the code generator.
The High-level Design of the Code Generator
As mentioned above, SQLFlow compiler generates the .YAML
file as the following, you can check more detail about SQLFlow workflow from here.
steps:
name: step-1
command: ["python", "-c"]
args: |
from runtime import tensorflow
tensorflow.train(....)
env:
name: SQLFLOW_OSS_AK
value: "xxxxxx"
From the above workflow .YAML
file, each workflow step contains three parts:
- The execution command as the
command
spec to execute the program. - The execution program, which can be written in Python, R, or Bash. The program submits an AI task on an AI platform .e.g, ElasticDL, Alibaba PAI or just runs on a host by involving the SQLFlow
runtime
library. - The runtime environment variables with the
env
spec.
SQLFlow compiler provides the code generator component to generate the step program, the code generation is divided into the following stages:
- Target Submitter Registry, register a Code Generator in SQLFlow compiler.
- CodeGenerator Interface is a Go interface that all code generators should implement.
- Code Generation provides an assembler API to generate a step program.
Target Submitter Register
For a new code generator, develops should register it in SQLFlow compiler as the following pseudo-code:
cgMapping = map[string]CodeGenerator {
"paiTensorFlow": PAITensorFlow{},
"paiXGBoost", PAIXGBoost{},
...
}
Code Generator Interface
For each code generator implementation, you should care about all IR types, different IR types have different behaviors and generate different submitter program. Each code generator owns an ExecutionCtx
instance to tell Argo workflow on how to execute the target code.
type ExecutionCtx struct {
ExecCommand []string // How to execute the target code, .e.g ["python" "-c"]
Env map[string]string // The environment variables for execution
}
type CodeGenerator interface {
GenerateExecCtx(*ir.SQLStmt) ExecutionCtx
EmitNormal(*ir.NormalStmt) (string, error)
EmitTrain(*ir.TrainStmt) (string, error)
EmitPredict(*ir.PredictStmt) (string, error)
EmitExplain(*ir.ExplainStmt) (string, error)
EmitEvaluate(*ir.EvaluateStmt) (string, error)
EmitShowTrain(*ir.ShowTrainStmt) (string, error)
EmitOptimize(*ir.OptimizeStmt) (string, error)
EmitRun(*ir.RunStmt) (string, error)
}
Code Generation
The code generation phase is responsible for generating target code from a SQL statement IR, this is an assembler API that routes to a specified code generator, the pseudo-code is as the following:
func Generate(session *pb.Session, stmt *ir.SQLStatement) (string, error) {
// routing to a specified code generator from session.submitter
cf := cgMapping[session.submitter]
switch v := stmt.(type) {
case *ir.TrainStmt:
return cg.EmitTrain(stmt.(*ir.TrainStmt)), cg.GenerateExecCtx(), nil
case *ir.PredictStmt:
return cg.EmitPredict(stmt.(*ir.TrainStmt)), cg.GenerateExecCtx(), nil
...
}
}