Compatibility with Various SQL Engines
SQLFlow interacts with SQL engines like MySQL and Hive. Because different SQL engines use variants of SQL syntax, it is important for SQLFlow to have an abstraction layer that hides such differences.
SQLFlow calls Go’s standard database API. The submitter programs generated by SQLFlow call Python’s database API. Both APIs abstract the interface to various SQL engines; however, they are insufficient for SQLFlow to work. In this document, we examine all interactions between SQLFlow and the SQL engine so as to identify what SQLFlow authors have to abstract in addition to calling Go’s and Python’s database APIs.
Data Operations in Go
Data Retrieval
The basic idea of SQLFlow is to extend the SELECT statement of SQL with TRAIN and PREDICT clauses. For more discussion, please refer to the syntax design. SQLFlow translates such “extended SQL statements” into submitter programs, which forward the part from SELECT up to TRAIN or PREDICT, which we call the “standard part”, to the SQL engine. SQLFlow also accepts SELECT statements without TRAIN or PREDICT clauses and forwards such “standard statements” to the engine. Note that the “standard part” and “standard statements” are not actually standardized. For example, various engines use different syntax for FULL OUTER JOIN:
- Hive supports FULL OUTER JOIN directly.
- MySQL doesn’t have FULL OUTER JOIN. However, a user can emulate FULL OUTER JOIN using LEFT JOIN, UNION, and RIGHT JOIN, as sketched below.
Fortunately, because SQLFlow forwards these parts to the engine verbatim, it doesn’t have to care much about such differences.
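For instance, a MySQL user who needs FULL OUTER JOIN writes the emulated query themselves, and SQLFlow forwards it untouched. Below is a sketch of such a query held in a Go string constant, as a submitter program might carry it; the table and column names are hypothetical, and the emulation relies on UNION de-duplicating rows.

// Hypothetical emulation of FULL OUTER JOIN for MySQL; SQLFlow forwards it
// to the engine as-is, as part of the "standard part".
const emulatedFullOuterJoin = `
SELECT * FROM a LEFT JOIN b ON a.id = b.id
UNION
SELECT * FROM a RIGHT JOIN b ON a.id = b.id;`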
Metadata Retrieval
To verify the semantics of users’ inputs, SQLFlow needs to retrieve the schema of tables. For example, the input might be
SELECT
name,
age,
income
FROM employee TRAIN DNNRegressor
WITH hidden_layers=[10,50,10]
COLUMN name, agee LABEL income;
In the above example, the user misspelled the field name age
in the COLUMN clause as “agee”. SQLFlow must be able to find that out.
To do that, SQLFlow needs to query the field names from the SQL engine. However, different engines use various syntax. For example:
- MySQL: DESCRIBE/DESC employee;
- Hive: DESCRIBE FORMATTED employee;
- ODPS: DESC employee;
- SQLite: PRAGMA table_info([employee]);
The returned data formats vary too. Our solution to avoid such differences is not to use them; instead, SQLFlow retrieves the table schema by running a query like SELECT * FROM employee LIMIT 1; and inferring the field types through the DatabaseTypeName mechanism provided by the SQL engine drivers beneath Go’s standard database API.
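Below is a minimal sketch of this approach using Go’s database/sql package; it is illustrative rather than SQLFlow’s actual code, and it assumes an already-opened *sql.DB whose driver is registered elsewhere.

package sketch

import (
	"database/sql"
	"fmt"
)

// describeTable infers field names and engine-specific type names by selecting
// a single row and asking the driver for each column's DatabaseTypeName.
func describeTable(db *sql.DB, table string) (map[string]string, error) {
	rows, err := db.Query(fmt.Sprintf("SELECT * FROM %s LIMIT 1;", table))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	columnTypes, err := rows.ColumnTypes()
	if err != nil {
		return nil, err
	}
	schema := make(map[string]string)
	for _, ct := range columnTypes {
		// e.g., "INT", "DOUBLE", "VARCHAR"; the exact spelling depends on the driver.
		schema[ct.Name()] = ct.DatabaseTypeName()
	}
	return schema, nil
}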
Prepare Prediction Table
A SQLFlow prediction job writes its prediction results into a table. It prepares the prediction table by
- Dropping the previous prediction table:
DROP TABLE IF EXISTS my_table;
- Creating a table with the desired schema:
CREATE TABLE my_table (name1 type1, name2 type2);
Most SQL engines, including MySQL, Hive, ODPS, and SQLite, support both statements.
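A sketch of this preparation step through Go’s database/sql follows; the table name, the columns, and the MySQL-style types are placeholders, and error handling is kept minimal.

package sketch

import "database/sql"

// createPredictionTable drops any previous prediction table and recreates it
// with the desired schema; "my_table" and its columns are placeholders.
func createPredictionTable(db *sql.DB) error {
	if _, err := db.Exec("DROP TABLE IF EXISTS my_table;"); err != nil {
		return err
	}
	_, err := db.Exec("CREATE TABLE my_table (name VARCHAR(255), income DOUBLE);")
	return err
}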
Translate Database Column Type to TensorFlow Feature Column Type
After retrieving the database column type name through DatabaseTypeName, we can derive TensorFlow’s feature column type via a mapping such as {"FLOAT", "DOUBLE"} -> tf.numeric_column.
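The sketch below shows how such a mapping might look in Go; the set of type names and the chosen TensorFlow feature column constructors are illustrative assumptions, not SQLFlow’s exhaustive mapping.

package sketch

import (
	"fmt"
	"strings"
)

// featureColumnFor maps a DatabaseTypeName result to the TensorFlow feature
// column constructor that the generated submitter program would call.
// The mapping here is illustrative, not exhaustive.
func featureColumnFor(dbType string) (string, error) {
	switch strings.ToUpper(dbType) {
	case "FLOAT", "DOUBLE", "INT", "BIGINT", "DECIMAL":
		return "tf.feature_column.numeric_column", nil
	case "CHAR", "VARCHAR", "TEXT", "STRING":
		return "tf.feature_column.categorical_column_with_hash_bucket", nil
	default:
		return "", fmt.Errorf("unsupported database column type %q", dbType)
	}
}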
Save Model
SQLFlow saves a trained ML model by dumping the serialized model directory into a table. It first creates a table by CREATE TABLE IF NOT EXISTS %s (id INT AUTO_INCREMENT, block BLOB, PRIMARY KEY (id)) and inserts blobs by INSERT INTO %s (block) VALUES(?).
Note that Hive and ODPS don’t have the BLOB type, so we need to use BINARY (docs at ODPS, Hive) instead.
Also, note that Hive and ODPS don’t support AUTO_INCREMENT, so we need to implement the auto-increment logic in sqlfs.
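A simplified sketch of this save path follows, assuming MySQL syntax (BLOB and AUTO_INCREMENT); the function name and chunk size are made up for illustration, and the real sqlfs implementation differs.

package sketch

import (
	"database/sql"
	"fmt"
)

// saveModel dumps an already-serialized model directory into a table as a
// sequence of blobs. MySQL syntax is assumed; Hive and ODPS would need BINARY
// columns and hand-rolled id generation instead of AUTO_INCREMENT.
func saveModel(db *sql.DB, table string, serialized []byte) error {
	create := fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS %s (id INT AUTO_INCREMENT, block BLOB, PRIMARY KEY (id))", table)
	if _, err := db.Exec(create); err != nil {
		return err
	}
	const chunkSize = 32 * 1024 // arbitrary block size for this sketch
	insert := fmt.Sprintf("INSERT INTO %s (block) VALUES(?)", table)
	for start := 0; start < len(serialized); start += chunkSize {
		end := start + chunkSize
		if end > len(serialized) {
			end = len(serialized)
		}
		if _, err := db.Exec(insert, serialized[start:end]); err != nil {
			return err
		}
	}
	return nil
}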
Load Model
SQLFlow loads a trained ML model by reading the rows of the table and deserializing the blobs into a model directory.
It reads rows by running SELECT block FROM %s ORDER BY id
, which is supported by most databases.
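A matching sketch of the load path, again illustrative rather than the actual sqlfs code:

package sketch

import (
	"bytes"
	"database/sql"
	"fmt"
)

// loadModel reads the blobs back in insertion order and reassembles the
// serialized model; the caller then deserializes it into a model directory.
func loadModel(db *sql.DB, table string) ([]byte, error) {
	rows, err := db.Query(fmt.Sprintf("SELECT block FROM %s ORDER BY id", table))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var buf bytes.Buffer
	for rows.Next() {
		var block []byte
		if err := rows.Scan(&block); err != nil {
			return nil, err
		}
		buf.Write(block)
	}
	return buf.Bytes(), rows.Err()
}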
Data Operations in Python
Connect to SQL Engines
Thanks to the Python database API, connecting to different databases follows a similar pattern.
import mysql.connector
import sqlite3
from pyhive import hive

conn = mysql.connector.connect(user='scott', password='password',
                               host='127.0.0.1',
                               database='employees')
conn = sqlite3.connect('path/to/your/sqlite/file')
conn = hive.connect('localhost')

cursor = conn.cursor()
cursor.execute('select * from my_table;')
Insert Prediction Result into Prediction Table
The Python database API provides executemany(sql, seq_of_parameters) to insert multiple rows at once, so one can prepare the following insertion statements. Please be aware that MySQL and SQLite use INSERT INTO to insert rows, while Hive and ODPS use INSERT INTO TABLE.
-- MySQL, SQLite
INSERT INTO table_name VALUES (value1, value2, value3, ...);
-- Hive, ODPS
INSERT INTO TABLE table_name VALUES (value1, value2, value3, ...);