Sequence Column
The sequence column is currently supported only on the Uniq model. The Uniq model is mainly for scenarios that require a unique primary key and can guarantee the uniqueness constraint of that key. However, because it uses REPLACE aggregation, the replacement order is not guaranteed for rows imported in the same batch. If the replacement order is not guaranteed, the specific data that finally ends up in the table cannot be determined, leaving uncertainty.
To solve this problem, Doris supports a sequence column: the user specifies the sequence column when importing, and under the same key column, columns of the REPLACE aggregate type are replaced according to the value of the sequence column. A larger value can replace a smaller value, but not the reverse. In this way the replacement order is determined by the user, who takes control of it.
Principle
The feature is implemented by adding a hidden column, __DORIS_SEQUENCE_COL__. The type of this column is specified by the user when creating the table, its value is determined for each row on import, and the REPLACE columns are then replaced according to that value.
Create Table
When you create a Uniq table, a hidden column __DORIS_SEQUENCE_COL__ is automatically added, with the type specified by the user.
Import
When importing, the FE sets the value of the hidden column during parsing to the value of the ORDER BY expression (Broker Load and Routine Load) or the value of the function_column.sequence_col expression (Stream Load), and the value columns are replaced according to this value. The value of the hidden column __DORIS_SEQUENCE_COL__ can be set to either a column in the source data or a column in the table structure.
Read
A query that reads value columns also reads the additional __DORIS_SEQUENCE_COL__ column. It determines the replacement order of the REPLACE aggregation under the same key column: the row with the larger sequence value replaces the one with the smaller value, never the reverse.
Cumulative Compaction
Cumulative Compaction works in the same way as the reading process.
Base Compaction
Base Compaction works in the same way as the reading process.
Syntax
On the table-creation side, the syntax adds a property that identifies the type of __DORIS_SEQUENCE_COL__. On the import side, the syntax mainly adds a mapping from the sequence column to other columns; the settings for each import mode are described below.
Create Table
When you create the Uniq table, you can specify the sequence column type
PROPERTIES (
    "function_column.sequence_type" = "Date"
);
The sequence_type is used to specify the type of the sequence column, which can be an integer type or a date/time type.
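For context, a complete Uniq-model CREATE TABLE statement that sets this property might look like the sketch below. It reuses the k1/k2/v1/v2 column names that appear in the load examples further down; the column types, key choice, bucketing, and replication settings are illustrative assumptions, not part of the original example:
-- Uniq-model table; "function_column.sequence_type" adds the hidden
-- __DORIS_SEQUENCE_COL__ column of type Date.
CREATE TABLE example_db.example_tbl
(
    k1 BIGINT,
    k2 VARCHAR(64),
    v1 INT,
    v2 INT
)
UNIQUE KEY(k1, k2)
DISTRIBUTED BY HASH(k1) BUCKETS 8
PROPERTIES (
    "function_column.sequence_type" = "Date",
    "replication_num" = "1"
);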
stream load
For Stream Load, add the mapping from the hidden column to source_sequence via the function_column.sequence_col field in the header, for example:
curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load
broker load
For Broker Load, map the hidden column to the source_sequence field in the ORDER BY clause:
LOAD LABEL db1.label1
(
DATA INFILE("hdfs://host:port/user/data/*/test.txt")
INTO TABLE `tbl1`
COLUMNS TERMINATED BY ","
(k1,k2,source_sequence,v1,v2)
ORDER BY source_sequence
)
WITH BROKER 'broker'
(
"username"="user",
"password"="pass"
)
PROPERTIES
(
"timeout" = "3600"
);
routine load
The mapping method for Routine Load is the same as above, as shown below:
CREATE ROUTINE LOAD example_db.test1 ON example_tbl
[WITH MERGE|APPEND|DELETE]
COLUMNS(k1, k2, source_sequence, v1, v2),
WHERE k1 > 100 and k2 like "%doris%"
[ORDER BY source_sequence]
PROPERTIES
(
"desired_concurrent_number"="3",
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
"kafka_topic" = "my_topic",
"kafka_partitions" = "0,1,2,3",
"kafka_offsets" = "101,0,0,200"
);
Enable sequence column support
If function_column.sequence_type is set when the table is created, the sequence column is supported on that table. For an existing table that does not support the sequence column, enable the feature with the following statement: ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date"). To check whether a table supports the sequence column, display hidden columns by setting the session variable SET show_hidden_columns=true and then run desc tablename: if the output contains the column __DORIS_SEQUENCE_COL__, the sequence column is supported; otherwise it is not.
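Putting those statements together as a quick check (example_db.my_table stands in for your own table name):
-- Enable the sequence column on an existing table that was created without it
ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD"
WITH PROPERTIES ("function_column.sequence_type" = "Date");

-- Make hidden columns visible for this session, then inspect the table
SET show_hidden_columns=true;
desc example_db.my_table;
-- The sequence column is supported if __DORIS_SEQUENCE_COL__ appears in the output.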
Usage example
Let's take Stream Load as an example to show how to use it.
- Create a table that supports the sequence column.
The table structure is shown below; a CREATE TABLE statement that produces it is sketched after the table.
MySQL > desc test_table;
+-------------+--------------+------+-------+---------+---------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-------+---------+---------+
| user_id | BIGINT | No | true | NULL | |
| date | DATE | No | true | NULL | |
| group_id | BIGINT | No | true | NULL | |
| modify_date | DATE | No | false | NULL | REPLACE |
| keyword | VARCHAR(128) | No | false | NULL | REPLACE |
+-------------+--------------+------+-------+---------+---------+
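A CREATE TABLE statement along the following lines produces that structure. This is only a sketch; the bucketing and replication settings in particular are assumptions to adapt to your cluster:
-- Uniq-model table with a Date-typed sequence column, matching the desc output above
CREATE TABLE test.test_table
(
    user_id BIGINT,
    `date` DATE,
    group_id BIGINT,
    modify_date DATE,
    keyword VARCHAR(128)
)
UNIQUE KEY(user_id, `date`, group_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 1
PROPERTIES (
    "function_column.sequence_type" = "Date",
    "replication_num" = "1"
);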
- Import data normally:
Import the following data
1 2020-02-22 1 2020-02-22 a
1 2020-02-22 1 2020-02-22 b
1 2020-02-22 1 2020-03-05 c
1 2020-02-22 1 2020-02-26 d
1 2020-02-22 1 2020-02-22 e
1 2020-02-22 1 2020-02-22 b
Take Stream Load as an example here and map the sequence column to the modify_date column:
curl --location-trusted -u root: -H "function_column.sequence_col: modify_date" -T testData http://host:port/api/test/test_table/_stream_load
The result is:
MySQL > select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
| 1 | 2020-02-22 | 1 | 2020-03-05 | c |
+---------+------------+----------+-------------+---------+
In this import, 'c' is ultimately retained in the keyword column because its sequence column value (the value in modify_date), '2020-03-05', is the largest.
- Guarantee of replacement order
After the above steps are completed, import the following data
1 2020-02-22 1 2020-02-22 a
1 2020-02-22 1 2020-02-23 b
Query data
MySQL [test]> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
| 1 | 2020-02-22 | 1 | 2020-03-05 | c |
+---------+------------+----------+-------------+---------+
Because the sequence column values of the newly imported rows are all smaller than the value already in the table, the existing row is not replaced. Now try importing the following data:
1 2020-02-22 1 2020-02-22 a
1 2020-02-22 1 2020-03-23 w
Query data
MySQL [test]> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
| 1 | 2020-02-22 | 1 | 2020-03-23 | w |
+---------+------------+----------+-------------+---------+
This time the original row in the table is replaced, because the sequence column value of the new row, '2020-03-23', is larger.