Loading Data - Loading Overview - 《Apache Doris v3.0 Documentation》

Supported Data Sources
- By Scene
- By Loading Method
Supported Data Formats
Import Instructions
Import Atomicity Guarantees
Synchronous and Asynchronous Imports
Import the Data of Array Types
Execution Engine Selected

Supported Data Sources

Doris provides a variety of data import solutions, and you can choose different data import methods for different data sources.

By Scene

Data Source	Loading Method
Object Storage (s3), HDFS	Loading data using Broker
Local file	Loading local data
Kafka	Subscribing to Kafka data
MySQL, PostgreSQL, Oracle, SQLServer	Sync data via external table
Loading via JDBC	Sync data using JDBC
Loading JSON format data	JSON format data Loading
AutoMQ	AutoMQ Load

By Loading Method

Loading method name	Use method
Broker Load	Import external storage data via Broker
Stream Load	Stream import data (local file and memory data)
Routine Load	Import Kafka data
Insert Into	External table imports data through INSERT
S3 Load	Object storage data import of S3 protocol
MySQL Load	Local data import of MySql protocol

Supported Data Formats

Different import methods support slightly different data formats.

Import Methods	Supported Formats
Broker Load	parquet, orc, csv, gzip
Stream Load	csv, json, parquet, orc
Routine Load	csv, json
MySql Load	csv

Import Instructions

The data import implementation of Apache Doris has the following common features, which are introduced here to help you better use the data import function

Import Atomicity Guarantees

Each import job of Doris, whether it is batch import using Broker Load or single import using INSERT statement, is a complete transaction operation. The import transaction can ensure that the data in a batch takes effect atomically, and there will be no partial data writing.

At the same time, an import job will have a Label. This Label is unique under a database (Database) and is used to uniquely identify an import job. Label can be specified by the user, and some import functions will also be automatically generated by the system.

Label is used to ensure that the corresponding import job can only be successfully imported once. A successfully imported Label, when used again, will be rejected with the error Label already used. Through this mechanism, At-Most-Once semantics can be implemented in Doris. If combined with the At-Least-Once semantics of the upstream system, the Exactly-Once semantics of imported data can be achieved.

For best practices on atomicity guarantees, see Importing Transactions and Atomicity.

Synchronous and Asynchronous Imports

Import methods are divided into synchronous and asynchronous. For the synchronous import method, the returned result indicates whether the import succeeds or fails. For the asynchronous import method, a successful return only means that the job was submitted successfully, not that the data was imported successfully. You need to use the corresponding command to check the running status of the import job.

Import the Data of Array Types

For example, in the following import, you need to cast columns b14 and a13 into array<string> type, and then use the array_union function.

LOAD LABEL label_03_14_49_34_898986_19090452100 ( 
  DATA INFILE("hdfs://test.hdfs.com:9000/user/test/data/sys/load/array_test.data") 
  INTO TABLE `test_array_table` 
  COLUMNS TERMINATED BY "|" (`k1`, `a1`, `a2`, `a3`, `a4`, `a5`, `a6`, `a7`, `a8`, `a9`, `a10`, `a11`, `a12`, `a13`, `b14`) 
  SET(a14=array_union(cast(b14 as array<string>), cast(a13 as array<string>))) WHERE size(a2) > 270) 
  WITH BROKER "hdfs" ("username"="test_array", "password"="") 
  PROPERTIES( "max_filter_ratio"="0.8" );

Execution Engine Selected

The Pipeline engine is turned off by default on import, and is enabled by the following two variables:

enable_pipeline_load in FE CONFIG enable_pipeline_load. When enabled, import tasks such as Streamload will try to use the Pipeline engine.
enable_nereids_dml_with_pipeline in Session Variable to enable insert into to try to use the Pipeline engine.

When the above variables are turned on, whether and which set of Pipeline engine is used still depends on the settings of the other two Session Variables enable_pipeline_engine and enable_pipeline_x_engine. When both are enabled, PipelineX is selected in preference to the Pipeline Engine. If neither is enabled, the import will not be executed using the Pipeline engine even if the above variables are set to true.