Export Query Result

This document describes how to use the SELECT INTO OUTFILE command to export query results.

Syntax

The SELECT INTO OUTFILE statement can export the query results to a file. Currently supports export to remote storage through Broker process, or directly through S3, HDFS protocol such as HDFS, S3, BOS and COS(Tencent Cloud) through the Broker process. The syntax is as follows:

  1. query_stmt
  2. INTO OUTFILE "file_path"
  3. [format_as]
  4. [properties]
  • file_path

    file_path specify the file path and file name prefix. Like: hdfs://path/to/my_file_.

    The final file name will be assembled as my_file_, file seq no and the format suffix. File seq no starts from 0, determined by the number of split.

    1. my_file_abcdefg_0.csv
    2. my_file_abcdefg_1.csv
    3. my_file_abcdegf_2.csv
  • [format_as]

    1. FORMAT AS CSV

    Specify the export format. The default is CSV.

  • [properties]

    Specify the relevant attributes. Currently it supports exporting through the Broker process, or through the S3, HDFS protocol.

    • Broker related attributes need to be prefixed with broker.. For details, please refer to Broker Document.
    • HDFS protocal can directly execute HDFS protocal configuration. hdfs.fs.defaultFS is used to fill in the namenode address and port. It is required.
    • S3 protocol can directly execute S3 protocol configuration.
    1. PROPERTIES
    2. ("broker.prop_key" = "broker.prop_val", ...)
    3. or
    4. ("hdfs.fs.defaultFS" = "xxx", "hdfs.user" = "xxx")
    5. or
    6. ("AWS_ENDPOINT" = "xxx", ...)

    Other properties

    1. PROPERTIES
    2. ("key1" = "val1", "key2" = "val2", ...)

    currently supports the following properties:

    • column_separator: Column separator, only applicable to CSV format. The default is \t.
    • line_delimiter: Line delimiter, only applicable to CSV format. The default is \n.
    • max_file_size: The max size of a single file. Default is 1GB. Range from 5MB to 2GB. Files exceeding this size will be splitted.
    • schema: schema infomation for PARQUET, only applicable to PARQUET format. If the exported file format is PARQUET, schema must be specified.

Concurrent export

By default, the export of the query result set is non-concurrent, that is, a single point of export. If the user wants the query result set to be exported concurrently, the following conditions need to be met:

  1. session variable ‘enable_parallel_outfile’ to enable concurrent export: set enable_parallel_outfile = true;
  2. The export method is S3, HDFS instead of using a broker
  3. The query can meet the needs of concurrent export, for example, the top level does not contain single point nodes such as sort. (I will give an example later, which is a query that does not export the result set concurrently)

If the above three conditions are met, the concurrent export query result set can be triggered. Concurrency = be_instacne_num * parallel_fragment_exec_instance_num

How to verify that the result set is exported concurrently

After the user enables concurrent export through the session variable setting, if you want to verify whether the current query can be exported concurrently, you can use the following method.

  1. explain select xxx from xxx where xxx into outfile "s3://xxx" format as csv properties ("AWS_ENDPOINT" = "xxx", ...);

After explaining the query, Doris will return the plan of the query. If you find that RESULT FILE SINK appears in PLAN FRAGMENT 1, it means that the export concurrency has been opened successfully. If RESULT FILE SINK appears in PLAN FRAGMENT 0, it means that the current query cannot be exported concurrently (the current query does not satisfy the three conditions of concurrent export at the same time).

  1. Planning example for concurrent export:
  2. +-----------------------------------------------------------------------------+
  3. | Explain String |
  4. +-----------------------------------------------------------------------------+
  5. | PLAN FRAGMENT 0 |
  6. | OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5> |
  7. | PARTITION: UNPARTITIONED |
  8. | |
  9. | RESULT SINK |
  10. | |
  11. | 1:EXCHANGE |
  12. | |
  13. | PLAN FRAGMENT 1 |
  14. | OUTPUT EXPRS:`k1` + `k2` |
  15. | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` |
  16. | |
  17. | RESULT FILE SINK |
  18. | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ |
  19. | STORAGE TYPE: S3 |
  20. | |
  21. | 0:OlapScanNode |
  22. | TABLE: multi_tablet |
  23. +-----------------------------------------------------------------------------+

Usage example

For details, please refer to OUTFILE Document.

Return result

The command is a synchronization command. The command returns, which means the operation is over. At the same time, a row of results will be returned to show the exported execution result.

If it exports and returns normally, the result is as follows:

  1. mysql> select * from tbl1 limit 10 into outfile "file:///home/work/path/result_";
  2. +------------+-----------+----------+--------------------------------------------------------------------+
  3. | FileNumber | TotalRows | FileSize | URL |
  4. +------------+-----------+----------+--------------------------------------------------------------------+
  5. | 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
  6. +------------+-----------+----------+--------------------------------------------------------------------+
  7. 1 row in set (0.05 sec)
  • FileNumber: The number of files finally generated.
  • TotalRows: The number of rows in the result set.
  • FileSize: The total size of the exported file. Unit byte.
  • URL: If it is exported to a local disk, the Compute Node to which it is exported is displayed here.

If a concurrent export is performed, multiple rows of data will be returned.

  1. +------------+-----------+----------+--------------------------------------------------------------------+
  2. | FileNumber | TotalRows | FileSize | URL |
  3. +------------+-----------+----------+--------------------------------------------------------------------+
  4. | 1 | 3 | 7 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
  5. | 1 | 2 | 4 | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
  6. +------------+-----------+----------+--------------------------------------------------------------------+
  7. 2 rows in set (2.218 sec)

If the execution is incorrect, an error message will be returned, such as:

  1. mysql> SELECT * FROM tbl INTO OUTFILE ...
  2. ERROR 1064 (HY000): errCode = 2, detailMessage = Open broker writer failed ...

Notice

  • The CSV format does not support exporting binary types, such as BITMAP and HLL types. These types will be output as \N, which is null.
  • If you do not enable concurrent export, the query result is exported by a single BE node in a single thread. Therefore, the export time and the export result set size are positively correlated. Turning on concurrent export can reduce the export time.
  • The export command does not check whether the file and file path exist. Whether the path will be automatically created or whether the existing file will be overwritten is entirely determined by the semantics of the remote storage system.
  • If an error occurs during the export process, the exported file may remain on the remote storage system. Doris will not clean these files. The user needs to manually clean up.
  • The timeout of the export command is the same as the timeout of the query. It can be set by SET query_timeout = xxx.
  • For empty result query, there will be an empty file.
  • File spliting will ensure that a row of data is stored in a single file. Therefore, the size of the file is not strictly equal to max_file_size.
  • For functions whose output is invisible characters, such as BITMAP and HLL types, the output is \N, which is NULL.
  • At present, the output type of some geo functions, such as ST_Point is VARCHAR, but the actual output value is an encoded binary character. Currently these functions will output garbled characters. For geo functions, use ST_AsText for output.