S3

Name

SinceVersion 1.2

s3

description

S3表函数(table-valued-function,tvf),可以让用户像访问关系表格式数据一样,读取并访问 S3 兼容的对象存储上的文件内容。目前支持csv/csv_with_names/csv_with_names_and_types/json/parquet/orc文件格式。

语法

  1. s3(
  2. "uri" = "..",
  3. "s3.access_key" = "...",
  4. "s3.secret_key" = "...",
  5. "s3.region" = "...",
  6. "format" = "csv",
  7. "keyn" = "valuen",
  8. ...
  9. );

参数说明

S3 tvf中的每一个参数都是一个 "key"="value" 对。 访问S3相关参数:

  • uri: (必填) 访问S3的uri,S3表函数会根据 use_path_style 参数来决定是否使用 path style 访问方式,默认为 virtual-hosted style 方式
  • s3.access_key: (必填)
  • s3.secret_key: (必填)
  • s3.region: (选填)。如果Minio服务设置了其他的region,那么必填,否则默认使用us-east-1
  • s3.session_token: (选填)
  • use_path_style:(选填) 默认为false 。S3 SDK 默认使用 virtual-hosted style 方式。但某些对象存储系统可能没开启或没支持virtual-hosted style 方式的访问,此时我们可以添加 use_path_style 参数来强制使用 path style 方式。比如 minio默认情况下只允许path style访问方式,所以在访问minio时要加上use_path_style=true

注意:uri目前支持三种schema:http://, https:// 和 s3://

  1. 如果使用http://或https://, 则会根据 ‘use_path_style’ 参数来决定是否使用’path style’方式访问s3
  2. 如果使用s3://, 则都使用 ‘virtual-hosted style’ 方式访问s3, ‘use_path_style’参数无效。
  3. 如果uri路径不存在或文件都是空文件,s3 tvf将返回空集合

详细使用案例可以参考最下方 Best Practice。

文件格式参数:

  • format:(必填) 目前支持 csv/csv_with_names/csv_with_names_and_types/json/parquet/orc

  • column_separator:(选填) 列分割符, 默认为,

  • line_delimiter:(选填) 行分割符,默认为\n

  • compress_type: (选填) 目前支持 UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE。 默认值为 UNKNOWN, 将会根据 uri 的后缀自动推断类型。

    下面6个参数是用于json格式的导入,具体使用方法可以参照:Json Load

  • read_json_by_line: (选填) 默认为 "true"

  • strip_outer_array: (选填) 默认为 "false"

  • json_root: (选填) 默认为空

  • jsonpaths: (选填) 默认为空

  • num_as_string: (选填) 默认为 false

  • fuzzy_parse: (选填) 默认为 false

    SinceVersion dev下面2个参数是用于csv格式的导入

  • trim_double_quotes: 布尔类型,选填,默认值为 false,为 true 时表示裁剪掉 csv 文件每个字段最外层的双引号

  • skip_lines: 整数类型,选填,默认值为0,含义为跳过csv文件的前几行。当设置format设置为 csv_with_namescsv_with_names_and_types 时,该参数会失效

其他参数:

  • path_partition_keys:(选填)指定文件路径中携带的分区列名,例如/path/to/city=beijing/date=”2023-07-09”, 则填写path_partition_keys="city,date",将会自动从路径中读取相应列名和列值进行导入。

Example

读取并访问 S3 兼容的对象存储上的csv格式文件

  1. select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
  2. "s3.access_key"= "minioadmin",
  3. "s3.secret_key" = "minioadmin",
  4. "format" = "csv",
  5. "use_path_style" = "true") order by c1;

可以配合desc function使用

  1. MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
  2. "s3.access_key"= "minioadmin",
  3. "s3.secret_key" = "minioadmin",
  4. "format" = "csv",
  5. "use_path_style" = "true");

Keywords

  1. s3, table-valued-function, tvf

Best Practice

不同url schema的写法 http:// 、https:// 使用示例:

  1. // 注意URI bucket写法以及use_path_style参数设置,http同理。
  2. // 由于设置了"use_path_style"="true", 所以将采用path style方式访问s3。
  3. select * from s3(
  4. "uri" = "https://endpoint/bucket/file/student.csv",
  5. "s3.access_key"= "ak",
  6. "s3.secret_key" = "sk",
  7. "format" = "csv",
  8. "use_path_style"="true");
  9. // 注意URI bucket写法以及use_path_style参数设置,http同理。
  10. // 由于设置了"use_path_style"="false", 所以将采用virtual-hosted style方式访问s3。
  11. select * from s3(
  12. "uri" = "https://bucket.endpoint/bucket/file/student.csv",
  13. "s3.access_key"= "ak",
  14. "s3.secret_key" = "sk",
  15. "format" = "csv",
  16. "use_path_style"="false");
  17. // 阿里云oss和腾讯云cos采用virtual-hosted style方式访问s3。
  18. // OSS
  19. select * from s3(
  20. "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
  21. "s3.access_key"= "ak",
  22. "s3.secret_key" = "sk",
  23. "s3.region" = "oss-cn-beijing",
  24. "format" = "parquet",
  25. "use_path_style" = "false");
  26. // COS
  27. select * from s3(
  28. "uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
  29. "s3.access_key"= "ak",
  30. "s3.secret_key" = "sk",
  31. "s3.region" = "ap-hongkong",
  32. "format" = "parquet",
  33. "use_path_style" = "false");
  34. // 百度云bos采用兼容s3协议的virtual-hosted style方式访问s3。
  35. // BOS
  36. select * from s3(
  37. "uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
  38. "s3.access_key"= "ak",
  39. "s3.secret_key" = "sk",
  40. "s3.region" = "bj",
  41. "format" = "parquet",
  42. "use_path_style" = "false");

s3:// 使用示例:

  1. // 注意URI bucket写法, 无需设置use_path_style参数。
  2. // 将采用virtual-hosted style方式访问s3。
  3. select * from s3(
  4. "uri" = "s3://bucket.endpoint/file/student.csv",
  5. "s3.access_key"= "ak",
  6. "s3.secret_key" = "sk",
  7. "format" = "csv");

csv format 由于S3 table-valued-function事先并不知道table schema,所以会先读一遍文件来解析出table schema。

csv 格式: S3 table-valued-function 读取S3上的文件并当作csv文件来处理,读取文件中的第一行用于解析table schema。文件第一行的列个数n将作为table schema的列个数,table schema的列名则自动取名为c1, c2, ..., cn ,列类型都设置为 String, 举例:

student1.csv文件内容为:

  1. 1,ftw,12
  2. 2,zs,18
  3. 3,ww,20

使用S3 tvf

  1. MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
  2. -> "s3.access_key"= "minioadmin",
  3. -> "s3.secret_key" = "minioadmin",
  4. -> "format" = "csv",
  5. -> "use_path_style" = "true") order by c1;
  6. +------+------+------+
  7. | c1 | c2 | c3 |
  8. +------+------+------+
  9. | 1 | ftw | 12 |
  10. | 2 | zs | 18 |
  11. | 3 | ww | 20 |
  12. +------+------+------+

可以配合 desc function S3() 来查看table schema

  1. MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
  2. -> "s3.access_key"= "minioadmin",
  3. -> "s3.secret_key" = "minioadmin",
  4. -> "format" = "csv",
  5. -> "use_path_style" = "true");
  6. +-------+------+------+-------+---------+-------+
  7. | Field | Type | Null | Key | Default | Extra |
  8. +-------+------+------+-------+---------+-------+
  9. | c1 | TEXT | Yes | false | NULL | NONE |
  10. | c2 | TEXT | Yes | false | NULL | NONE |
  11. | c3 | TEXT | Yes | false | NULL | NONE |
  12. +-------+------+------+-------+---------+-------+

csv_with_names format csv_with_names格式:解析文件的第一行作为table schema的列个数和列名,列类型则都设置为 String, 举例:

student_with_names.csv文件内容为

  1. id,name,age
  2. 1,ftw,12
  3. 2,zs,18
  4. 3,ww,20

使用S3 tvf

  1. MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
  2. -> "s3.access_key"= "minioadmin",
  3. -> "s3.secret_key" = "minioadmin",
  4. -> "format" = "csv_with_names",
  5. -> "use_path_style" = "true") order by id;
  6. +------+------+------+
  7. | id | name | age |
  8. +------+------+------+
  9. | 1 | ftw | 12 |
  10. | 2 | zs | 18 |
  11. | 3 | ww | 20 |
  12. +------+------+------+

同样配合desc function S3() 可查看table schema

  1. MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
  2. -> "s3.access_key"= "minioadmin",
  3. -> "s3.secret_key" = "minioadmin",
  4. -> "format" = "csv_with_names",
  5. -> "use_path_style" = "true");
  6. +-------+------+------+-------+---------+-------+
  7. | Field | Type | Null | Key | Default | Extra |
  8. +-------+------+------+-------+---------+-------+
  9. | id | TEXT | Yes | false | NULL | NONE |
  10. | name | TEXT | Yes | false | NULL | NONE |
  11. | age | TEXT | Yes | false | NULL | NONE |
  12. +-------+------+------+-------+---------+-------+

csv_with_names_and_types foramt

csv_with_names_and_types格式:目前暂不支持从csv文件中解析出column type。使用该format时,S3 tvf会解析文件的第一行作为table schema的列个数和列名,列类型则都设置为 String,同时将忽略该文件的第二行。

student_with_names_and_types.csv文件内容为

  1. id,name,age
  2. INT,STRING,INT
  3. 1,ftw,12
  4. 2,zs,18
  5. 3,ww,20

使用S3 tvf

  1. MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
  2. -> "s3.access_key"= "minioadmin",
  3. -> "s3.secret_key" = "minioadmin",
  4. -> "format" = "csv_with_names_and_types",
  5. -> "use_path_style" = "true") order by id;
  6. +------+------+------+
  7. | id | name | age |
  8. +------+------+------+
  9. | 1 | ftw | 12 |
  10. | 2 | zs | 18 |
  11. | 3 | ww | 20 |
  12. +------+------+------+

同样配合desc function S3() 可查看table schema

  1. MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
  2. -> "s3.access_key"= "minioadmin",
  3. -> "s3.secret_key" = "minioadmin",
  4. -> "format" = "csv_with_names_and_types",
  5. -> "use_path_style" = "true");
  6. +-------+------+------+-------+---------+-------+
  7. | Field | Type | Null | Key | Default | Extra |
  8. +-------+------+------+-------+---------+-------+
  9. | id | TEXT | Yes | false | NULL | NONE |
  10. | name | TEXT | Yes | false | NULL | NONE |
  11. | age | TEXT | Yes | false | NULL | NONE |
  12. +-------+------+------+-------+---------+-------+

json format

json 格式:json格式涉及到较多的可选参数,各个参数的意义可以参考:Json Load。 S3 tvf查询json格式文件时根据 json_rootjsonpaths 参数定位到一个json对象,将该对象的中的key 作为table schema的列名,列类型都设置为String。举例:

data.json文件

  1. [{"id":1, "name":"ftw", "age":18}]
  2. [{"id":2, "name":"xxx", "age":17}]
  3. [{"id":3, "name":"yyy", "age":19}]

使用S3 tvf查询

  1. MySQL [(none)]> select * from s3(
  2. "uri" = "http://127.0.0.1:9312/test2/data.json",
  3. "s3.access_key"= "minioadmin",
  4. "s3.secret_key" = "minioadmin",
  5. "format" = "json",
  6. "strip_outer_array" = "true",
  7. "read_json_by_line" = "true",
  8. "use_path_style"="true");
  9. +------+------+------+
  10. | id | name | age |
  11. +------+------+------+
  12. | 1 | ftw | 18 |
  13. | 2 | xxx | 17 |
  14. | 3 | yyy | 19 |
  15. +------+------+------+
  16. MySQL [(none)]> select * from s3(
  17. "uri" = "http://127.0.0.1:9312/test2/data.json",
  18. "s3.access_key"= "minioadmin",
  19. "s3.secret_key" = "minioadmin",
  20. "format" = "json",
  21. "strip_outer_array" = "true",
  22. "jsonpaths" = "[\"$.id\", \"$.age\"]",
  23. "use_path_style"="true");
  24. +------+------+
  25. | id | age |
  26. +------+------+
  27. | 1 | 18 |
  28. | 2 | 17 |
  29. | 3 | 19 |
  30. +------+------+

parquet format

parquet 格式:S3 tvf支持从parquet文件中解析出table schema的列名、列类型。举例:

  1. MySQL [(none)]> select * from s3(
  2. "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
  3. "s3.access_key"= "minioadmin",
  4. "s3.secret_key" = "minioadmin",
  5. "format" = "parquet",
  6. "use_path_style"="true") limit 5;
  7. +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  8. | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
  9. +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  10. | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
  11. | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
  12. | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
  13. | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
  14. | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
  15. +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  1. MySQL [(none)]> desc function s3(
  2. "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
  3. "s3.access_key"= "minioadmin",
  4. "s3.secret_key" = "minioadmin",
  5. "format" = "parquet",
  6. "use_path_style"="true");
  7. +---------------+--------------+------+-------+---------+-------+
  8. | Field | Type | Null | Key | Default | Extra |
  9. +---------------+--------------+------+-------+---------+-------+
  10. | p_partkey | INT | Yes | false | NULL | NONE |
  11. | p_name | TEXT | Yes | false | NULL | NONE |
  12. | p_mfgr | TEXT | Yes | false | NULL | NONE |
  13. | p_brand | TEXT | Yes | false | NULL | NONE |
  14. | p_type | TEXT | Yes | false | NULL | NONE |
  15. | p_size | INT | Yes | false | NULL | NONE |
  16. | p_container | TEXT | Yes | false | NULL | NONE |
  17. | p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
  18. | p_comment | TEXT | Yes | false | NULL | NONE |
  19. +---------------+--------------+------+-------+---------+-------+

orc format

orc 格式:和parquet format使用方法一致, 将format参数设置为orc。

  1. MySQL [(none)]> select * from s3(
  2. "uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
  3. "s3.access_key"= "minioadmin",
  4. "s3.secret_key" = "minioadmin",
  5. "format" = "orc",
  6. "use_path_style"="true") limit 5;
  7. +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  8. | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
  9. +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  10. | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
  11. | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
  12. | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
  13. | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
  14. | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
  15. +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+

avro format

avro 格式:S3 tvf支持从avro文件中解析出table schema的列名、列类型。举例:

  1. select * from s3(
  2. "uri" = "http://127.0.0.1:9312/test2/person.avro",
  3. "ACCESS_KEY" = "ak",
  4. "SECRET_KEY" = "sk",
  5. "FORMAT" = "avro");
  6. +--------+--------------+-------------+-----------------+
  7. | name | boolean_type | double_type | long_type |
  8. +--------+--------------+-------------+-----------------+
  9. | Alyssa | 1 | 10.0012 | 100000000221133 |
  10. | Ben | 0 | 5555.999 | 4009990000 |
  11. | lisi | 0 | 5992225.999 | 9099933330 |
  12. +--------+--------------+-------------+-----------------+

uri包含通配符

uri可以使用通配符来读取多个文件。注意:如果使用通配符要保证各个文件的格式是一致的(尤其是csv/csv_with_names/csv_with_names_and_types算做不同的格式),S3 tvf用第一个文件来解析出table schema。 如下两个csv文件:

  1. // file1.csv
  2. 1,aaa,18
  3. 2,qqq,20
  4. 3,qwe,19
  5. // file2.csv
  6. 5,cyx,19
  7. 6,ftw,21

可以在uri上使用通配符来导入。

  1. MySQL [(none)]> select * from s3(
  2. "uri" = "http://127.0.0.1:9312/test2/file*.csv",
  3. "s3.access_key"= "minioadmin",
  4. "s3.secret_key" = "minioadmin",
  5. "format" = "csv",
  6. "use_path_style"="true");
  7. +------+------+------+
  8. | c1 | c2 | c3 |
  9. +------+------+------+
  10. | 1 | aaa | 18 |
  11. | 2 | qqq | 20 |
  12. | 3 | qwe | 19 |
  13. | 5 | cyx | 19 |
  14. | 6 | ftw | 21 |
  15. +------+------+------+

配合 insert intocast 使用 S3 tvf

  1. // 创建doris内部表
  2. CREATE TABLE IF NOT EXISTS ${testTable}
  3. (
  4. id int,
  5. name varchar(50),
  6. age int
  7. )
  8. COMMENT "my first table"
  9. DISTRIBUTED BY HASH(id) BUCKETS 32
  10. PROPERTIES("replication_num" = "1");
  11. // 使用S3插入数据
  12. insert into ${testTable} (id,name,age)
  13. select cast (id as INT) as id, name, cast (age as INT) as age
  14. from s3(
  15. "uri" = "${uri}",
  16. "s3.access_key"= "${ak}",
  17. "s3.secret_key" = "${sk}",
  18. "format" = "${format}",
  19. "strip_outer_array" = "true",
  20. "read_json_by_line" = "true",
  21. "use_path_style" = "true");