LOAD DATA
Description
The LOAD DATA statement reads rows from a text file into a table at very high speed. The file can be read from the server host or from S3-compatible object storage. LOAD DATA is the complement of SELECT ... INTO OUTFILE. To write data from a table to a file, use SELECT ... INTO OUTFILE. To read the file back into a table, use LOAD DATA. The syntax of the FIELDS and LINES clauses is the same for both statements.
Syntax
> LOAD DATA [LOCAL]
    INFILE 'file_name'
    INTO TABLE tbl_name
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [SET column_name_1=nullif(column_name_1, expr1), column_name_2=nullif(column_name_2, expr2)...]
    [PARALLEL {'TRUE' | 'FALSE'}]
Input File Location
- LOAD DATA INFILE 'file_name': Indicates that the data file to be loaded is on the same machine as the MatrixOne host server. file_name can be either a relative or an absolute path to the file.
- LOAD DATA LOCAL INFILE 'file_name': Indicates that the data file to be loaded is not on the same machine as the MatrixOne host server; that is, the data file is on the client machine. file_name can be either a relative or an absolute path to the file.
IGNORE LINES
The IGNORE number LINES clause can be used to ignore lines at the start of the file. For example, you can use IGNORE 1 LINES to skip an initial header line containing column names:
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE table1 IGNORE 1 LINES;
Field and Line Handling
For both the LOAD DATA and SELECT ... INTO OUTFILE statements, the syntax of the FIELDS and LINES clauses is the same. Both clauses are optional, but FIELDS must precede LINES if both are specified.
If you specify a FIELDS clause, each of its subclauses (TERMINATED BY, [OPTIONALLY] ENCLOSED BY) is also optional, except that you must specify at least one of them. Arguments to these clauses are permitted to contain only ASCII characters.
If you specify no FIELDS or LINES clause, the defaults are the same as if you had written this:
FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n'
Note
- FIELDS TERMINATED BY ',': only a comma (,), a pipe (|), or a tab (\t) can be used as the field delimiter.
- ENCLOSED BY '"': only a double quote (") can be used as the enclosing character.
- LINES TERMINATED BY '\n': only \n or \r\n can be used as the line separator.
FIELDS TERMINATED BY
FIELDS TERMINATED BY specifies the delimiter for a field. The FIELDS TERMINATED BY value can be more than one character.
For example, to read a comma-delimited file, the correct statement is:
LOAD DATA INFILE 'data.txt' INTO TABLE table1
FIELDS TERMINATED BY ',';
If instead you tried to read the file with the statement shown below, it would not work because it instructs LOAD DATA to look for tabs between fields:
LOAD DATA INFILE 'data.txt' INTO TABLE table1
FIELDS TERMINATED BY '\t';
The likely result is that each input line would be interpreted as a single field. You may encounter an error such as "ERROR 20101 (HY000): internal error: the table column is larger than input data column".
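The difference between the two statements can be pictured with a short Python sketch. It only models splitting one input line on a delimiter; the helper split_fields and the sample line are illustrative assumptions and are not part of MatrixOne.

```python
def split_fields(line, delimiter):
    """Split one input line into fields on the given delimiter (no quoting handled)."""
    return line.split(delimiter)

line = "1,apple,red"  # one line of a comma-delimited file (made-up sample)

# Matching delimiter: three fields, one per table column.
print(split_fields(line, ","))

# Tab delimiter on comma-delimited data: the whole line stays a single field.
print(split_fields(line, "\t"))
```

With the tab delimiter the line yields one field for a table that expects three columns, which is the situation behind the error quoted above.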
FIELDS ENCLOSED BY
The FIELDS ENCLOSED BY option specifies the character that encloses the input values. The ENCLOSED BY value must be a single character. If the input values are not necessarily enclosed within quotation marks, use OPTIONALLY before the ENCLOSED BY option.
For example, if some input values are enclosed within quotation marks and some are not:
LOAD DATA INFILE 'data.txt' INTO TABLE table1
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';
LINES TERMINATED BY
LINES TERMINATED BY specifies the delimiter for a line. The LINES TERMINATED BY value can be more than one character.
For example, if the lines in a csv file are terminated by carriage return/newline pairs, you can load it with LINES TERMINATED BY '\r\n':
LOAD DATA INFILE 'data.txt' INTO TABLE table1
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';
LINES STARTING BY
If all the input lines have a common prefix that you want to ignore, you can use LINES STARTING BY 'prefix_string' to skip the prefix and anything before it. If a line does not include the prefix, the entire line is skipped. Suppose that you issue the following statement:
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE table1
FIELDS TERMINATED BY ',' LINES STARTING BY 'xxx';
If the data file looks like this:
xxx"abc",1
something xxx"def",2
"ghi",3
The resulting rows are ("abc",1) and ("def",2). The third row in the file is skipped because it does not contain the prefix.
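The prefix rule can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not MatrixOne's implementation; the function name strip_line_prefix is an assumption made for illustration.

```python
def strip_line_prefix(line, prefix):
    """Return the text after the first occurrence of prefix, or None if the line lacks it."""
    idx = line.find(prefix)
    if idx == -1:
        return None  # no prefix: the entire line is skipped
    return line[idx + len(prefix):]

lines = ['xxx"abc",1', 'something xxx"def",2', '"ghi",3']

# Keep only the lines that contain the prefix, with the prefix and anything before it removed.
kept = [rest for rest in (strip_line_prefix(l, "xxx") for l in lines) if rest is not None]
print(kept)  # ['"abc",1', '"def",2']
```

The remaining text of each kept line is then split into fields as usual.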
SET
MatrixOne only supports SET column_name=nullif(column_name,expr). That is, when column_name = expr, it returns NULL; otherwise, it returns the original value of column_name. For example, with SET a=nullif(a,1), if a=1, it returns NULL; otherwise, it returns the original value of column a.
For example, you can use SET column_name=nullif(column_name,"null") to load the string null in that column as a NULL value.
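As a rough model of this rule, the following Python sketch applies the same comparison to values read from a file. The function nullif here is a plain Python stand-in written for illustration (with None playing the role of SQL NULL), and the sample rows are assumptions.

```python
def nullif(value, expr):
    """Return None (standing in for SQL NULL) when value equals expr, else value."""
    return None if value == expr else value

# Raw field values as they might be read from a text file (made-up sample rows).
rows = [["1", "weder", "man"], ["2", "tom", "man"], ["null", "wederTom", "man"]]

# Equivalent in spirit to: SET id=nullif(id,"null")
loaded = [[nullif(row[0], "null")] + row[1:] for row in rows]
print(loaded)
```

Only the first column is passed through nullif, so a literal null in any other column would be kept as text.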
Example
The content of the local file test.txt is as follows:
id,user_name,sex
1,"weder","man"
2,"tom","man"
null,wederTom,"man"
Create a table named user in MatrixOne:
create database aaa;
use aaa;
CREATE TABLE `user` (`id` int(11) ,`user_name` varchar(255) ,`sex` varchar(255));
Load test.txt into the table user:
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE user SET id=nullif(id,"null");
The resulting table is as follows:
select * from user;
+------+-----------+------+
| id | user_name | sex |
+------+-----------+------+
| 1 | weder | man |
| 2 | tom | man |
| null | wederTom | man |
+------+-----------+------+
PARALLEL
For a sizeable well-formed file, such as a JSONLines file or a CSV file with no line breaks within a row of data, you can use PARALLEL to load the file in parallel and speed up loading.
For example, to load a 2 GB file with two threads, the second thread first seeks to the 1 GB position and then reads and loads from that point onward. In this way, the two threads read the file at the same time, and each thread handles about 1 GB of data.
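The splitting idea can be sketched in Python as follows. This is a simplified model under stated assumptions, not MatrixOne's actual loader: the function chunk_ranges is hypothetical, it works on an in-memory byte string, and it assumes that every '\n' ends a row.

```python
def chunk_ranges(data, parts):
    """Return (start, end) byte ranges whose boundaries fall right after a newline byte."""
    size = len(data)
    bounds = [0]
    for i in range(1, parts):
        target = size * i // parts
        # Search from target - 1 so a target that is already a line start is kept.
        newline = data.find(b"\n", max(target - 1, bounds[-1]))
        bounds.append(size if newline == -1 else newline + 1)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

data = b"1,a\n2,b\n3,c\n4,d\n"
ranges = chunk_ranges(data, 2)
chunks = [data[start:end] for start, end in ranges]
print(ranges)  # [(0, 8), (8, 16)]
print(chunks)
```

Because each boundary is found by looking for a line terminator, a quoted field that itself contains '\n' would be cut in the middle of a row, which is why parallel loading expects rows without embedded line terminators.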
Enable/Disable Parallel Loading Command Line Example:
-- Enable Parallel Loading
load data infile 'file_name' into table tbl_name FIELDS TERMINATED BY '|' ENCLOSED BY '\"' LINES TERMINATED BY '\n' IGNORE 1 LINES PARALLEL 'TRUE';
-- Disable Parallel Loading
load data infile 'file_name' into table tbl_name FIELDS TERMINATED BY '|' ENCLOSED BY '\"' LINES TERMINATED BY '\n' IGNORE 1 LINES PARALLEL 'FALSE';
-- Parallel loading is disabled by default
load data infile 'file_name' into table tbl_name FIELDS TERMINATED BY '|' ENCLOSED BY '\"' LINES TERMINATED BY '\n' IGNORE 1 LINES;
Note
[PARALLEL {'TRUE' | 'FALSE'}] currently only supports TRUE or FALSE, and the value is not case-sensitive.
Note: If the PARALLEL option is not specified in the LOAD statement, parallel loading is disabled by default for CSV files and enabled by default for JSONLines files. When parallel loading is used, a row of data in the CSV file must not contain the line terminator, such as '\n'; otherwise, data errors may occur when the file is loaded. If the file is too large, it is recommended to split it manually at '\n' boundaries and then enable parallel loading.
Supported file formats
In MatrixOne's current release, LOAD DATA supports files in CSV (comma-separated values) format and JSONLines format. See the full tutorials for loading csv and jsonline.
CSV format standard description
The CSV format loaded by MatrixOne conforms to the RFC4180 standard, and the CSV format is specified as follows:
Each record is on a separate line, separated by a newline character (CRLF):
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
Imported into the table as follows:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| aaa  | bbb  | ccc  |
| zzz  | yyy  | xxx  |
+------+------+------+
The last record in the file can have a terminating newline or no terminating newline (CRLF):
aaa,bbb,ccc CRLF
zzz,yyy,xxx
Imported into the table as follows:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| aaa  | bbb  | ccc  |
| zzz  | yyy  | xxx  |
+------+------+------+
An optional header line appears as the first line of the file and has the same format as a standard record line. For example:
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
Imported into the table as follows:
+------------+------------+------------+
| field_name | field_name | field_name |
+------------+------------+------------+
| aaa        | bbb        | ccc        |
| zzz        | yyy        | xxx        |
+------------+------------+------------+
In the header and each record, there may be one or more fields separated by commas. Whitespace within a field is part of the field and should not be ignored. A comma cannot follow the last field in each record. For example:
aaa,bbb,ccc
Or:
a aa, bbb,cc c
Both examples are correct.
Imported into the table as follows:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| aaa  | bbb  | ccc  |
+------+------+------+
Or:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| a aa |  bbb | cc c |
+------+------+------+
Each field can be enclosed in double quotes or not. Double quotes cannot appear inside a field if the field is not enclosed in double-quotes. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Or:
"aaa","bbb",ccc CRLF
zzz,yyy,xxx
Both examples are correct.
Imported into the table as follows:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| aaa  | bbb  | ccc  |
| zzz  | yyy  | xxx  |
+------+------+------+
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
Imported into the table as follows:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| aaa  | b bb | ccc  |
| zzz  | yyy  | xxx  |
+------+------+------+
If a field is enclosed in double quotation marks, each double quotation mark that appears inside the field must be escaped by preceding it with another double quotation mark. The first of two consecutive double quotation marks in the field is parsed as an escape character, so only a single double quotation mark is kept. For example:
"aaa","b""bb","ccc"
The above CSV will parse "b""bb" into b"bb; if the correct field is b""bb, then it should be written as:
"aaa","b""""bb","ccc"
Or:
"aaa",b""bb,"ccc"
Imported into the table as follows:
+------+-------+------+
| col1 | col2  | col3 |
+------+-------+------+
| aaa  | b""bb | ccc  |
+------+-------+------+
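The quote-doubling rule can be tried out with Python's standard csv module, which applies the same convention by default. This is only an illustration with a different parser, not MatrixOne's own CSV reader; the helper parse_csv_line is an assumption introduced for the example.

```python
import csv
import io

def parse_csv_line(text):
    """Parse a single CSV line with Python's default dialect (comma, double-quote doubling)."""
    return next(csv.reader(io.StringIO(text)))

# A doubled quote inside a quoted field collapses to one quote.
print(parse_csv_line('"aaa","b""bb","ccc"'))    # ['aaa', 'b"bb', 'ccc']

# Four quotes inside a quoted field yield two literal quotes.
print(parse_csv_line('"aaa","b""""bb","ccc"'))  # ['aaa', 'b""bb', 'ccc']

# In an unquoted field, this parser keeps the quotes as they are.
print(parse_csv_line('"aaa",b""bb,"ccc"'))      # ['aaa', 'b""bb', 'ccc']
```

The last two lines are the two spellings given above for the field b""bb, and both produce the same value.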
Examples
The SSB Test is an example of LOAD DATA syntax. See Complete a SSB Test with MatrixOne.
> LOAD DATA INFILE '/ssb-dbgen-path/lineorder_flat.tbl' INTO TABLE lineorder_flat;
The above statement means: load the lineorder_flat.tbl data set under the directory path /ssb-dbgen-path/ into the MatrixOne data table lineorder_flat.
You can also refer to the following syntax examples to quickly understand LOAD DATA:
Example 1: LOAD CSV
Simple example
The data in the local file named char_varchar.csv is as follows:
a|b|c|d
"a"|"b"|"c"|"d"
'a'|'b'|'c'|'d'
"'a'"|"'b'"|"'c'"|"'d'"
"aa|aa"|"bb|bb"|"cc|cc"|"dd|dd"
"aa|"|"bb|"|"cc|"|"dd|"
"aa|||aa"|"bb|||bb"|"cc|||cc"|"dd|||dd"
"aa'|'||aa"|"bb'|'||bb"|"cc'|'||cc"|"dd'|'||dd"
aa"aa|bb"bb|cc"cc|dd"dd
"aa"aa"|"bb"bb"|"cc"cc"|"dd"dd"
"aa""aa"|"bb""bb"|"cc""cc"|"dd""dd"
"aa"""aa"|"bb"""bb"|"cc"""cc"|"dd"""dd"
"aa""""aa"|"bb""""bb"|"cc""""cc"|"dd""""dd"
"aa""|aa"|"bb""|bb"|"cc""|cc"|"dd""|dd"
"aa""""|aa"|"bb""""|bb"|"cc""""|cc"|"dd""""|dd"
|||
||||
""|""|""|
""""|""""|""""|""""
""""""|""""""|""""""|""""""
Create a table named t1 in MatrixOne:
mysql> drop table if exists t1;
Query OK, 0 rows affected (0.01 sec)
mysql> create table t1(
-> col1 char(225),
-> col2 varchar(225),
-> col3 text,
-> col4 varchar(225)
-> );
Query OK, 0 rows affected (0.02 sec)
Load the data file into table t1:
load data infile '<your-local-file-path>/char_varchar.csv' into table t1 fields terminated by '|';
The query result is as follows:
mysql> select * from t1;
+-----------+-----------+-----------+-----------+
| col1 | col2 | col3 | col4 |
+-----------+-----------+-----------+-----------+
| a | b | c | d |
| a | b | c | d |
| 'a' | 'b' | 'c' | 'd' |
| 'a' | 'b' | 'c' | 'd' |
| aa|aa | bb|bb | cc|cc | dd|dd |
| aa| | bb| | cc| | dd| |
| aa|||aa | bb|||bb | cc|||cc | dd|||dd |
| aa'|'||aa | bb'|'||bb | cc'|'||cc | dd'|'||dd |
| aa"aa | bb"bb | cc"cc | dd"dd |
| aa"aa | bb"bb | cc"cc | dd"dd |
| aa"aa | bb"bb | cc"cc | dd"dd |
| aa""aa | bb""bb | cc""cc | dd""dd |
| aa""aa | bb""bb | cc""cc | dd""dd |
| aa"|aa | bb"|bb | cc"|cc | dd"|dd |
| aa""|aa | bb""|bb | cc""|cc | dd""|dd |
| | | | |
| | | | |
| | | | |
| " | " | " | " |
| "" | "" | "" | "" |
+-----------+-----------+-----------+-----------+
20 rows in set (0.00 sec)
Example with additional conditions
Following the example above, you can modify the LOAD DATA statement and add LINES STARTING BY 'aa' ignore 10 lines; at the end of the statement to see the difference:
delete from t1;
load data infile '<your-local-file-path>/char_varchar.csv' into table t1 fields terminated by '|' LINES STARTING BY 'aa' ignore 10 lines;
The query result is as follows:
mysql> select * from t1;
+---------+---------+---------+---------+
| col1 | col2 | col3 | col4 |
+---------+---------+---------+---------+
| aa"aa | bb"bb | cc"cc | dd"dd |
| aa""aa | bb""bb | cc""cc | dd""dd |
| aa""aa | bb""bb | cc""cc | dd""dd |
| aa"|aa | bb"|bb | cc"|cc | dd"|dd |
| aa""|aa | bb""|bb | cc""|cc | dd""|dd |
| | | | |
| | | | |
| | | | |
| " | " | " | " |
| "" | "" | "" | "" |
+---------+---------+---------+---------+
10 rows in set (0.00 sec)
As you can see, the query result ignores the first 10 lines and ignores the common prefix aa.
For more information on loading csv, see Import the .csv data.
Example 2: LOAD JSONLines
Simple example
The data in the local file named jsonline_array.jl is as follows:
[true,1,"var","2020-09-07","2020-09-07 00:00:00","2020-09-07 00:00:00","18",121.11,["1",2,null,false,true,{"q":1}],"1qaz",null,null]
["true","1","var","2020-09-07","2020-09-07 00:00:00","2020-09-07 00:00:00","18","121.11",{"c":1,"b":["a","b",{"q":4}]},"1aza",null,null]
Create a table named t1 in MatrixOne:
mysql> drop table if exists t1;
Query OK, 0 rows affected (0.01 sec)
mysql> create table t1(col1 bool,col2 int,col3 varchar(100), col4 date,col5 datetime,col6 timestamp,col7 decimal,col8 float,col9 json,col10 text,col11 json,col12 bool);
Query OK, 0 rows affected (0.03 sec)
Load the data file into table t1:
load data infile {'filepath'='<your-local-file-path>/jsonline_array.jl','format'='jsonline','jsondata'='array'} into table t1;
The query result is as follows:
mysql> select * from t1;
+------+------+------+------------+---------------------+---------------------+------+--------+---------------------------------------+-------+-------+-------+
| col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 | col11 | col12 |
+------+------+------+------------+---------------------+---------------------+------+--------+---------------------------------------+-------+-------+-------+
| true | 1 | var | 2020-09-07 | 2020-09-07 00:00:00 | 2020-09-07 00:00:00 | 18 | 121.11 | ["1", 2, null, false, true, {"q": 1}] | 1qaz | NULL | NULL |
| true | 1 | var | 2020-09-07 | 2020-09-07 00:00:00 | 2020-09-07 00:00:00 | 18 | 121.11 | {"b": ["a", "b", {"q": 4}], "c": 1} | 1aza | NULL | NULL |
+------+------+------+------------+---------------------+---------------------+------+--------+---------------------------------------+-------+-------+-------+
2 rows in set (0.00 sec)
Example with additional conditions
Following the example above, you can modify the LOAD DATA statement and add ignore 1 lines at the end of the statement to see the difference:
delete from t1;
load data infile {'filepath'='<your-local-file-path>/jsonline_array.jl','format'='jsonline','jsondata'='array'} into table t1 ignore 1 lines;
The query result is as follows:
mysql> select * from t1;
+------+------+------+------------+---------------------+---------------------+------+--------+-------------------------------------+-------+-------+-------+
| col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 | col11 | col12 |
+------+------+------+------------+---------------------+---------------------+------+--------+-------------------------------------+-------+-------+-------+
| true | 1 | var | 2020-09-07 | 2020-09-07 00:00:00 | 2020-09-07 00:00:00 | 18 | 121.11 | {"b": ["a", "b", {"q": 4}], "c": 1} | 1aza | NULL | NULL |
+------+------+------+------------+---------------------+---------------------+------+--------+-------------------------------------+-------+-------+-------+
1 row in set (0.00 sec)
As you can see, the query result ignores the first line.
For more information on loading JSONLines, see Import the JSONLines data.
Constraints
- The REPLACE and IGNORE modifiers, which control handling of new (input) rows that duplicate existing table rows on unique key values (PRIMARY KEY or UNIQUE index values), are not supported in MatrixOne yet.
- Input preprocessing with SET has very limited support. Only SET column_name=nullif(expr1,expr2) is supported.
- When parallel loading is enabled, each row of data in the file must not contain the specified line terminator, such as '\n'; otherwise, data errors will occur during file loading.
- Parallel loading requires the file to be uncompressed; parallel loading of compressed files is not currently supported.
- When you use load data local, you need to connect to the MatrixOne service host from the command line: mysql -h <mo-host-ip> -P 6001 -uroot -p111 --local-infile.
- MatrixOne does not currently support ESCAPED BY. Writing or reading special characters differs from MySQL to some extent.