External storage data import

The following mainly introduces how to import data stored in an external system. For example (HDFS, All object stores that support the S3 protocol)

HDFS LOAD

Ready to work

Upload the files to be imported to HDFS. For specific commands, please refer to HDFS upload command

start import

Hdfs load creates an import statement. The import method is basically the same as Broker Load, only need to WITH BROKER broker_name () statement with the following

  1. LOAD LABEL db_name.label_name
  2. (data_desc, ...)
  3. WITH HDFS
  4. [PROPERTIES (key1=value1, ... )]
  1. create a table

    1. CREATE TABLE IF NOT EXISTS load_hdfs_file_test
    2. (
    3. id INT,
    4. age TINYINT,
    5. name VARCHAR(50)
    6. )
    7. unique key(id)
    8. DISTRIBUTED BY HASH(id) BUCKETS 3;
  2. Import data Execute the following command to import HDFS files:

    1. LOAD LABEL demo.label_20220402
    2. (
    3. DATA INFILE("hdfs://host:port/tmp/test_hdfs.txt")
    4. INTO TABLE `load_hdfs_file_test`
    5. COLUMNS TERMINATED BY "\t"
    6. (id,age,name)
    7. )
    8. with HDFS (
    9. "fs.defaultFS"="hdfs://testFs",
    10. "hdfs_user"="user"
    11. )
    12. PROPERTIES
    13. (
    14. "timeout"="1200",
    15. "max_filter_ratio"="0.1"
    16. );

    For parameter introduction, please refer to Broker Load, HA cluster creation syntax, view through HELP BROKER LOAD

  3. Check import status

    Broker load is an asynchronous import method. The specific import results can be accessed through SHOW LOAD command to view

    1. mysql> show load order by createtime desc limit 1\G;
    2. *************************** 1. row ***************************
    3. JobId: 41326624
    4. Label: broker_load_2022_04_15
    5. State: FINISHED
    6. Progress: ETL:100%; LOAD:100%
    7. Type: BROKER
    8. EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=27
    9. TaskInfo: cluster:N/A; timeout(s):1200; max_filter_ratio:0.1
    10. ErrorMsg: NULL
    11. CreateTime: 2022-04-01 18:59:06
    12. EtlStartTime: 2022-04-01 18:59:11
    13. EtlFinishTime: 2022-04-01 18:59:11
    14. LoadStartTime: 2022-04-01 18:59:11
    15. LoadFinishTime: 2022-04-01 18:59:11
    16. URL: NULL
    17. JobDetails: {"Unfinished backends":{"5072bde59b74b65-8d2c0ee5b029adc0":[]},"ScannedRows":27,"TaskNumber":1,"All backends":{"5072bde59b74b65-8d2c0ee5b029adc0":[36728051]},"FileNumber":1,"FileSize":5540}
    18. 1 row in set (0.01 sec)

S3 LOAD

Starting from version 0.14, Doris supports the direct import of data from online storage systems that support the S3 protocol through the S3 protocol.

This document mainly introduces how to import data stored in AWS S3. It also supports the import of other object storage systems that support the S3 protocol.

Applicable scenarios

  • Source data in S3 protocol accessible storage systems, such as S3.
  • Data volumes range from tens to hundreds of GB.

Preparing

  1. Standard AK and SK First, you need to find or regenerate AWS Access keys, you can find the generation method in My Security Credentials of AWS console, as shown in the following figure: AK_SK Select Create New Access Key and pay attention to save and generate AK and SK.
  2. Prepare REGION and ENDPOINT REGION can be selected when creating the bucket or can be viewed in the bucket list. ENDPOINT can be found through REGION on the following page AWS Documentation

Other cloud storage systems can find relevant information compatible with S3 in corresponding documents

Start Loading

Like Broker Load just replace WITH BROKER broker_name () with

  1. WITH S3
  2. (
  3. "AWS_ENDPOINT" = "AWS_ENDPOINT",
  4. "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
  5. "AWS_SECRET_KEY"="AWS_SECRET_KEY",
  6. "AWS_REGION" = "AWS_REGION"
  7. )

example:

  1. LOAD LABEL example_db.exmpale_label_1
  2. (
  3. DATA INFILE("s3://your_bucket_name/your_file.txt")
  4. INTO TABLE load_test
  5. COLUMNS TERMINATED BY ","
  6. )
  7. WITH S3
  8. (
  9. "AWS_ENDPOINT" = "AWS_ENDPOINT",
  10. "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
  11. "AWS_SECRET_KEY"="AWS_SECRET_KEY",
  12. "AWS_REGION" = "AWS_REGION"
  13. )
  14. PROPERTIES
  15. (
  16. "timeout" = "3600"
  17. );

FAQ

S3 SDK uses virtual-hosted style by default. However, some object storage systems may not be enabled or support virtual-hosted style access. At this time, we can add the use_path_style parameter to force the use of path style:

  1. WITH S3
  2. (
  3. "AWS_ENDPOINT" = "AWS_ENDPOINT",
  4. "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
  5. "AWS_SECRET_KEY"="AWS_SECRET_KEY",
  6. "AWS_REGION" = "AWS_REGION",
  7. "use_path_style" = "true"
  8. )