File Analysis
SinceVersion 1.2.0
With the Table Value Function feature, Doris is able to query files in object storage or HDFS as simply as querying Tables. In addition, it supports automatic column type inference.
Usage
For more usage details, please see the documentation:
The followings illustrate how file analysis is conducted with the example of S3 Table Value Function.
Automatic Column Type Inference
MySQL [(none)]> DESC FUNCTION s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"ACCESS_KEY"= "minioadmin",
"SECRET_KEY" = "minioadmin",
"Format" = "parquet",
"use_path_style"="true");
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+
An S3 Table Value Function is defined as follows:
s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"ACCESS_KEY"= "minioadmin",
"SECRET_KEY" = "minioadmin",
"Format" = "parquet",
"use_path_style"="true")
It specifies the file path, connection, and authentication.
After defining, you can view the schema of this file using the DESC FUNCTION
statement.
As can be seen, Doris is able to automatically infer column types based on the metadata of the Parquet file.
Besides Parquet, Doris supports analysis and auto column type inference of ORC, CSV, and Json files.
Query and Analysis
You can conduct queries and analysis on this Parquet file using any SQL statements:
SELECT * FROM s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"ACCESS_KEY"= "minioadmin",
"SECRET_KEY" = "minioadmin",
"Format" = "parquet",
"use_path_style"="true")
LIMIT 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
You can put the Table Value Function anywhere that you used to put Table in the SQL, such as in the WITH or FROM clause in CTE. In this way, you can treat the file as a normal table and conduct analysis conveniently.
Data Ingestion
Users can ingest files into Doris tables via INSERT INTO SELECT
for faster file analysis:
// 1. Create Doris internal table
CREATE TABLE IF NOT EXISTS test_table
(
id int,
name varchar(50),
age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
// 2. Insert data using S3 Table Value Function
INSERT INTO test_table (id,name,age)
SELECT cast(id as INT) as id, name, cast (age as INT) as age
FROM s3(
"uri" = "${uri}",
"ACCESS_KEY"= "${ak}",
"SECRET_KEY" = "${sk}",
"format" = "${format}",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style" = "true");