Export Query Result

This document describes how to use the SELECT INTO OUTFILE command to export query results.

Syntax

The SELECT INTO OUTFILE statement exports query results to a file. It currently supports exporting to remote storage such as HDFS, S3, BOS, and COS (Tencent Cloud) through the Broker process, or directly through the S3 or HDFS protocol. The syntax is as follows:

  query_stmt
  INTO OUTFILE "file_path"
  [format_as]
  [properties]
  • file_path

    file_path specifies the file path and file name prefix, for example: hdfs://path/to/my_file_.

    The final file name is assembled from my_file_, the file sequence number, and the format suffix. The file sequence number starts from 0 and is determined by the number of file splits, for example:

    my_file_abcdefg_0.csv
    my_file_abcdefg_1.csv
    my_file_abcdefg_2.csv
  • [format_as]

    FORMAT AS CSV

    Specify the export format. The default is CSV.

  • [properties]

    Specify the relevant properties. Currently, export is supported through the Broker process, or directly through the S3 or HDFS protocol.

    • Broker-related properties must be prefixed with broker.. For details, please refer to the Broker document.
    • The HDFS protocol takes HDFS configuration directly. hdfs.fs.defaultFS specifies the namenode address and port, and it is required.
    • The S3 protocol takes S3 configuration directly.
    PROPERTIES
    ("broker.prop_key" = "broker.prop_val", ...)
    or
    ("hdfs.fs.defaultFS" = "xxx", "hdfs.user" = "xxx")
    or
    ("AWS_ENDPOINT" = "xxx", ...)

    Other properties

    PROPERTIES
    ("key1" = "val1", "key2" = "val2", ...)

    The following properties are currently supported (a combined sketch follows this list):

    • column_separator: Column separator, only applicable to CSV format. The default is \t.
    • line_delimiter: Line delimiter, only applicable to CSV format. The default is \n.
    • max_file_size: The maximum size of a single file. The default is 1GB, and the range is 5MB to 2GB. Files exceeding this size will be split.
    • schema: Schema information for PARQUET, only applicable to the PARQUET format. If the exported file format is PARQUET, schema must be specified.
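
    For illustration only, a minimal sketch combining several of these properties with the HDFS protocol configuration described above (the table and path are placeholders; complete statements are given in the Usage example section below):

      -- Sketch: CSV export with custom separators and a smaller split size
      SELECT * FROM tbl
      INTO OUTFILE "hdfs://path/to/result_"
      FORMAT AS CSV
      PROPERTIES
      (
          "hdfs.fs.defaultFS" = "hdfs://namenode_ip:namenode_port",
          "column_separator" = ",",
          "line_delimiter" = "\n",
          "max_file_size" = "512MB"
      );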

Concurrent export

By default, the query result set is exported non-concurrently, that is, from a single point. If you want the result set to be exported concurrently, the following conditions must be met:

  1. Enable concurrent export with the session variable enable_parallel_outfile: set enable_parallel_outfile = true;
  2. The export uses the S3 or HDFS protocol rather than a Broker.
  3. The query itself supports concurrent export; for example, the top level must not contain a single-point node such as a sort. (Example 7 below shows a query whose result set cannot be exported concurrently.)

If the above three conditions are met, concurrent export of the query result set is triggered. Concurrency = be_instance_num * parallel_fragment_exec_instance_num.
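For example, assuming a cluster with 3 BE instances and parallel_fragment_exec_instance_num set to 2, the export concurrency would be 3 * 2 = 6, that is, up to 6 instances writing result files at the same time.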

How to verify that the result set is exported concurrently

After enabling concurrent export through the session variable, you can verify whether the current query can be exported concurrently as follows:

  explain select xxx from xxx where xxx into outfile "s3://xxx" format as csv properties ("AWS_ENDPOINT" = "xxx", ...);

After explaining the query, Doris returns the plan of the query. If RESULT FILE SINK appears in PLAN FRAGMENT 1, concurrent export has been enabled successfully. If RESULT FILE SINK appears in PLAN FRAGMENT 0, the current query cannot be exported concurrently (it does not satisfy all three conditions for concurrent export).

  Planning example for concurrent export:

  +-----------------------------------------------------------------------------+
  | Explain String |
  +-----------------------------------------------------------------------------+
  | PLAN FRAGMENT 0 |
  | OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5> |
  | PARTITION: UNPARTITIONED |
  | |
  | RESULT SINK |
  | |
  | 1:EXCHANGE |
  | |
  | PLAN FRAGMENT 1 |
  | OUTPUT EXPRS:`k1` + `k2` |
  | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` |
  | |
  | RESULT FILE SINK |
  | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ |
  | STORAGE TYPE: S3 |
  | |
  | 0:OlapScanNode |
  | TABLE: multi_tablet |
  +-----------------------------------------------------------------------------+

Usage example

  1. Example 1

    Export simple query results to the file hdfs://path/to/result.txt. Specify the export format as CSV. Use my_broker and set kerberos authentication information. Specify the column separator as , and the line delimiter as \n.

    SELECT * FROM tbl
    INTO OUTFILE "hdfs://path/to/result_"
    FORMAT AS CSV
    PROPERTIES
    (
        "broker.name" = "my_broker",
        "broker.hadoop.security.authentication" = "kerberos",
        "broker.kerberos_principal" = "doris@YOUR.COM",
        "broker.kerberos_keytab" = "/home/doris/my.keytab",
        "column_separator" = ",",
        "line_delimiter" = "\n",
        "max_file_size" = "100MB"
    );

    If the result is less than 100MB, the file will be result_0.csv.

    If it is larger than 100MB, it may be split into result_0.csv, result_1.csv, and so on.

  2. Example 2

    Export simple query results to the file hdfs://path/to/result.parquet. Specify the export format as PARQUET. Use my_broker and set kerberos authentication information.

    SELECT c1, c2, c3 FROM tbl
    INTO OUTFILE "hdfs://path/to/result_"
    FORMAT AS PARQUET
    PROPERTIES
    (
        "broker.name" = "my_broker",
        "broker.hadoop.security.authentication" = "kerberos",
        "broker.kerberos_principal" = "doris@YOUR.COM",
        "broker.kerberos_keytab" = "/home/doris/my.keytab",
        "schema" = "required,int32,c1;required,byte_array,c2;required,byte_array,c3"
    );

    If the exported file format is PARQUET, schema must be specified.

  3. Example 3

    Export the query result of the CTE statement to the file hdfs://path/to/result.txt. The default export format is CSV. Use my_broker and set HDFS high-availability information. Use the default column separator and line delimiter.

    WITH
    x1 AS
    (SELECT k1, k2 FROM tbl1),
    x2 AS
    (SELECT k3 FROM tbl2)
    SELECT k1 FROM x1 UNION SELECT k3 FROM x2
    INTO OUTFILE "hdfs://path/to/result_"
    PROPERTIES
    (
        "broker.name" = "my_broker",
        "broker.username" = "user",
        "broker.password" = "passwd",
        "broker.dfs.nameservices" = "my_ha",
        "broker.dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
        "broker.dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
        "broker.dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
        "broker.dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    );

    If the result is less than 1GB, the file will be result_0.csv.

    If it is larger than 1GB, it may be split into result_0.csv, result_1.csv, and so on.

  4. Example 4

    Export the query results of the UNION statement to the file bos://bucket/result.parquet. Specify the export format as PARQUET. Use my_broker and set the BOS connection information. The PARQUET format does not require specifying the column separator or line delimiter.

    SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
    INTO OUTFILE "bos://bucket/result_"
    FORMAT AS PARQUET
    PROPERTIES
    (
        "broker.name" = "my_broker",
        "broker.bos_endpoint" = "http://bj.bcebos.com",
        "broker.bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
        "broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyyy",
        "schema" = "required,int32,k1;required,byte_array,k2"
    );
  5. Example 5

    Export simple query results to the file cos://${bucket_name}/path/result.txt. Specify the export format as CSV, and create a marker file after the export finishes.

    select k1,k2,v1 from tbl1 limit 100000
    into outfile "s3a://my_bucket/export/my_file_"
    FORMAT AS CSV
    PROPERTIES
    (
        "broker.name" = "hdfs_broker",
        "broker.fs.s3a.access.key" = "xxx",
        "broker.fs.s3a.secret.key" = "xxxx",
        "broker.fs.s3a.endpoint" = "https://cos.xxxxxx.myqcloud.com/",
        "column_separator" = ",",
        "line_delimiter" = "\n",
        "max_file_size" = "1024MB",
        "success_file_name" = "SUCCESS"
    )

    If the result is less than 1GB, the file will be my_file_0.csv.

    If it is larger than 1GB, it may be split into my_file_0.csv, my_file_1.csv, and so on.

    Please Note:

    1. Paths that do not exist are automatically created.
    2. These parameters (access.key / secret.key / endpoint) need to be confirmed with Tencent Cloud COS. In particular, the endpoint value does not need to include the bucket_name.
  6. Example 6

    Use the S3 protocol to export to BOS, with concurrent export enabled.

    set enable_parallel_outfile = true;
    select k1 from tb1 limit 1000
    into outfile "s3://my_bucket/export/my_file_"
    format as csv
    properties
    (
        "AWS_ENDPOINT" = "http://s3.bd.bcebos.com",
        "AWS_ACCESS_KEY" = "xxxx",
        "AWS_SECRET_KEY" = "xxx",
        "AWS_REGION" = "bd"
    )

    The final generated file prefix is my_file_{fragment_instance_id}_

  7. Example 7

    Use the S3 protocol to export to BOS, with the concurrent-export session variable enabled.

    set enable_parallel_outfile = true;
    select k1 from tb1 order by k1 limit 1000
    into outfile "s3://my_bucket/export/my_file_"
    format as csv
    properties
    (
        "AWS_ENDPOINT" = "http://s3.bd.bcebos.com",
        "AWS_ACCESS_KEY" = "xxxx",
        "AWS_SECRET_KEY" = "xxx",
        "AWS_REGION" = "bd"
    )

    However, because the query statement contains a top-level sorting node, it cannot be exported concurrently even though the concurrent-export session variable is enabled.

  8. Example 8

    Use the HDFS protocol to export simple query results to the file hdfs://path/to/result.txt. Specify the export format as CSV.

    select * from tbl
    into outfile "hdfs://path/to/result_"
    format as csv
    properties
    (
        "hdfs.fs.defaultFS" = "hdfs://namenode:port"
    );
  9. Example 9

    Export simple query results to the file hdfs://path/to/result.txt. Specify the export format as CSV. Use the HDFS protocol directly and set Kerberos authentication information.

    SELECT * FROM tbl
    INTO OUTFILE "hdfs://path/to/result_"
    FORMAT AS CSV
    PROPERTIES
    (
        "hdfs.fs.defaultFS" = "hdfs://namenode_ip:namenode_port",
        "hdfs.hadoop.security.authentication" = "kerberos",
        "hdfs.kerberos_principal" = "doris@YOUR.COM",
        "hdfs.kerberos_keytab" = "/home/doris/my.keytab",
        "max_file_size" = "100MB"
    );

    If the result is less than 100MB, the file will be result_0.csv.

    If it is larger than 100MB, it may be split into result_0.csv, result_1.csv, and so on.

Return result

The command is a synchronous command. When the command returns, the operation is complete. A row of results is also returned to show the result of the export execution.

If it exports and returns normally, the result is as follows:

  mysql> select * from tbl1 limit 10 into outfile "file:///home/work/path/result_";
  +------------+-----------+----------+--------------------------------------------------------------------+
  | FileNumber | TotalRows | FileSize | URL                                                                 |
  +------------+-----------+----------+--------------------------------------------------------------------+
  | 1          | 2         | 8        | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
  +------------+-----------+----------+--------------------------------------------------------------------+
  1 row in set (0.05 sec)
  • FileNumber: The number of files finally generated.
  • TotalRows: The number of rows in the result set.
  • FileSize: The total size of the exported file. Unit: bytes.
  • URL: If it is exported to a local disk, the Compute Node to which it is exported is displayed here.

If a concurrent export is performed, multiple rows of data will be returned.

  +------------+-----------+----------+--------------------------------------------------------------------+
  | FileNumber | TotalRows | FileSize | URL                                                                 |
  +------------+-----------+----------+--------------------------------------------------------------------+
  | 1          | 3         | 7        | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
  | 1          | 2         | 4        | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
  +------------+-----------+----------+--------------------------------------------------------------------+
  2 rows in set (2.218 sec)

If the execution is incorrect, an error message will be returned, such as:

  mysql> SELECT * FROM tbl INTO OUTFILE ...
  ERROR 1064 (HY000): errCode = 2, detailMessage = Open broker writer failed ...

Notice

  • The CSV format does not support exporting binary types, such as BITMAP and HLL types. These types will be output as \N, which is null.
  • If you do not enable concurrent export, the query result is exported by a single BE node in a single thread. Therefore, the export time and the export result set size are positively correlated. Turning on concurrent export can reduce the export time.
  • The export command does not check whether the file and file path exist. Whether the path will be automatically created or whether the existing file will be overwritten is entirely determined by the semantics of the remote storage system.
  • If an error occurs during the export process, the exported file may remain on the remote storage system. Doris will not clean these files. The user needs to manually clean up.
  • The timeout of the export command is the same as the timeout of the query. It can be set by SET query_timeout = xxx.
  • A query with an empty result set still produces an empty file.
  • File splitting ensures that a single row of data is stored intact in one file. Therefore, the file size is not strictly equal to max_file_size.
  • For functions whose output is invisible characters, such as BITMAP and HLL types, the output is \N, which is NULL.
  • At present, the output type of some geo functions, such as ST_Point, is VARCHAR, but the actual output value is an encoded binary value, so these functions currently produce garbled output. For geo functions, use ST_AsText for output (see the sketch below).
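
As an illustration of the timeout and geo-function notes above, a minimal sketch is shown below (the table, columns, and HDFS settings are hypothetical; adjust the timeout only if the default is too short for your export):

  -- Optionally raise the query/export timeout (in seconds) for a large result set.
  SET query_timeout = 3600;

  -- Wrap the geo value with ST_AsText so the exported CSV contains readable WKT text
  -- instead of the encoded binary representation produced by functions such as ST_Point.
  SELECT id, ST_AsText(ST_Point(lng, lat)) AS point_wkt
  FROM geo_tbl
  INTO OUTFILE "hdfs://path/to/geo_result_"
  FORMAT AS CSV
  PROPERTIES
  (
      "hdfs.fs.defaultFS" = "hdfs://namenode_ip:namenode_port",
      "hdfs.user" = "hdfs_user"
  );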