Redesign the Numeric Column Clause in SQLFlow
Definitions
Numeric Column
A numeric column is a kind of DBMS table columns, whose cell value is:
- a single number. It includes an integer, a floating-point number, or a string that can be directly converted to an integer or floating-point number. For example,
1
,3.5
or"-102"
. - a string that is encoded from a numeric vector. For example:
- the string
"1.2,3.4,5.6"
is encoded from the numeric dense vector[1.2, 3.4, 5.6]
. - the string
"1:1.2 3:3.4 5:5.6"
is encoded from the numeric sparse vector[0, 1.2, 0, 3.4, 0, 5.6]
(if the length is 6).
- the string
Numeric Column Clause
A numeric column clause is a kind of SQLFlow COLUMN
clauses, which is used to load the data from a numeric column. For example,
SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
COLUMN NUMERIC(c1), NUMERIC(SPARSE(c2)) -- This line is the numeric column clause
LABEL label
INTO result_table;
In this document, we would discuss why and how we should redesign the numeric column clause in SQLFlow.
Background: Why Redesign the Numeric Column Clause
Currently, SQLFlow uses the following methods to support loading data from a numeric column:
If each cell value is a single number, there are 3 equivalent ways to write the
TO TRAIN
SQL statement:Do not write the
COLUMN
clause.SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
INTO result_table;
Write the column name directly in the
COLUMN
clause.SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
COLUMN c1, c2
LABEL label
INTO result_table;
Write
NUMERIC()
orNUMERIC(DENSE())
in theCOLUMN
clause.SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
COLUMN NUMERIC(c1), NUMERIC(DENSE(c2))
LABEL label
INTO result_table;
If each cell value is a string that is encoded from a numeric vector, the decoded vector of the cell value may be dense or sparse. For example, the string
"1,3,5"
may represent (1) a dense vector[1, 3, 5]
, or (2) a sparse vectorx
, in whichx[1] = x[3] = x[5] = 1
and the other values ofx
are all zeros. Users should specify whether the decoded vector is dense or sparse explicitly in theCOLUMN
clause.If the decoded vector is dense, there are 2 equivalent ways to write the
TO TRAIN
SQL statement:Use the
NUMERIC()
in theCOLUMN
clause and specify the length of the decoded vector inNUMERIC()
.SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN NUMERIC(c1, 10), NUMERIC(c2, 10) -- 10 is the length of the decoded dense vector
INTO result_table;
Use the
NUMERIC(DENSE())
in theCOLUMN
clause and specify the length of the decoded vector inDENSE()
.SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN NUMERIC(DENSE(c1, 10)), NUMERIC(DENSE(c2, 10)) -- 10 is the length of the decoded dense vector
INTO result_table;
If the decoded vector is sparse, there is only one way to write the
TO TRAIN
SQL statement.SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN NUMERIC(SPARSE(c1, 10)), NUMERIC(SPARSE(c2, 10)) -- 10 is the dense length of the decoded sparse vector
INTO result_table;
There are some problems in the current design:
- Both
NUMERIC
andDENSE/SPARSE
have thelength/shape
parameter. For example, the expressionCOLUMN NUMERIC(DENSE(c1, 10), 20)
is confused to users. Users do not know whether the length ofc1
is 10 (thelength/shape
parameter ofDENSE
) or 20 (thelength/shape
parameter ofNUMERIC
). NUMERIC
is unnecessary in theCOLUMN
clause.- For the column whose cell value is a single number, we can write
COLUMN c1
,COLUMN NUMERIC(c1)
orCOLUMN NUMERIC(DENSE(c1))
, andCOLUMN c1
is simpler. - For the column whose cell value is a string that is encoded from a numeric vector:
- For dense data, we can write simpler expression
COLUMN DENSE(c1, length)
instead ofCOLUMN NUMERIC(DENSE(c1, length))
. - For sparse data, we can write simpler expression
COLUMN SPARSE(c1, length)
instead ofCOLUMN NUMERIC(SPARSE(c1, length))
.
- For dense data, we can write simpler expression
- For the column whose cell value is a single number, we can write
In conclusion, we do not need the NUMERIC
in the COLUMN
clause. We can load the data from the numeric column just using DENSE
and SPARSE
in the COLUMN
clause.
Design
Changes on the APIs
In the new design, we would remove NUMERIC
in the COLUMN
clause. We can use DENSE/SPARSE
directly in the SQL statements.
For the column whose cell value is a single number, there would be still 3 equivalent ways to write the TO TRAIN
SQL statement.
SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
INTO result_table;
or
SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
COLUMN c1, c2
LABEL label
INTO result_table;
or
SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
COLUMN DENSE(c1), DENSE(c2)
LABEL label
INTO result_table;
For the column whose cell value is a string that is encoded from a numeric vector, there would be only one way to write the TO TRAIN
SQL statement.
For dense data, the SQL statement would be:
SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN DENSE(c1, 10), DENSE(c2, 10)
INTO result_table;
For sparse data, the SQL statement would be:
SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN SPARSE(c1, 10), SPARSE(c2, 10)
INTO result_table;
In the previous design, NUMERIC
can be used with other COLUMN
clauses together in SQL statements, like BUCKET(NUMERIC())
, EMBEDDING(NUMERIC())
, etc. In the new design, the NUMERIC
in these SQL statements can be replaced with the DENSE
or SPARSE
.
Example 1: we want to load the dense data from a numeric column and then transform the data using a
BUCKET
clause. The previous SQL statement would be:SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN BUCKET(NUMERIC(c1, 10), 100)
INTO result_table;
In the new design, the SQL statement would be (just replace
NUMERIC
withDENSE
):SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN BUCKET(DENSE(c1, 10), 100)
INTO result_table;
Example 2: we want to load the sparse data from a numeric column and then transform the data using an
EMBEDDING
clause. The previous SQL statement would be:SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN EMBEDDING(NUMERIC(SPARSE(c1, 10)), 128)
INTO result_table;
In the new design, the SQL statement would be (just remove
NUMERIC
):SELECT c1, c2, label FROM train_table
TO TRAIN DNNRegressor
LABEL label
COLUMN EMBEDDING(SPARSE(c1, 10), 128)
INTO result_table;
Changes on the Implementation
Although we would remove NUMERIC
in the APIs, we can still unify both the DENSE
and SPARSE
feature columns as NumericColumn in Go side, because both of them are numeric features. Moreover, we can still distinguish whether the cell value of the column is dense or sparse by the FieldDesc.IsSparse:
type FieldDesc struct {
Name string
DType int
Delimiter string
Shape []int
IsSparse bool // indicates whether the cell value of the column is dense or sparse
Vocabulary map[string]string
MaxID int64
}
What we need to do is to rewrite the parser codes to parse DENSE
and SPARSE
, and SQLFlow should support some commonly used dense/sparse data formats. We would discuss which data formats should be supported in SQLFlow in future designs.