When importing data, Doris assigns a table-unique value to rows that do not have specified values in the auto-increment column.
Functionality
For tables containing an auto-increment column, during data import:
- If the target columns don’t include the auto-increment column, Doris will populate the auto-increment column with generated values.
- If the target columns include the auto-increment column, null values in the imported data for that column will be replaced by values generated by Doris, while non-null values will remain unchanged. Note that non-null values can disrupt the uniqueness of the auto-increment column values.
Uniqueness
Doris ensures that values generated on the auto-increment column have table-wide uniqueness. However, it’s important to note that the uniqueness of the auto-increment column only guarantees uniqueness for values automatically filled by Doris and does not consider values provided by users. If a user explicitly inserts user-provided values for this table by specifying the auto-increment column, this uniqueness cannot be guaranteed.
Density
Doris ensures that the values generated on the auto-increment column are dense, but it cannot guarantee that the values automatically generated in the auto-increment column during an import will be entirely contiguous. Thus, there might be some jumps in the values generated by the auto-increment column during an import. This is because, for performance consideration, each BE caches a portion of pre-allocated auto-increment column values, and these cached values do not intersect between different BEs. Additionally, due to this caching mechanism, Doris cannot guarantee that the values automatically generated on the auto-increment column in a later import on the physical timeline will be larger than those from the previous import. Therefore, the values allocated by the auto-increment column cannot be used to determine the chronological order of imports.
Syntax
To use auto-increment columns, you need to add the AUTO_INCREMENT
attribute to the corresponding column during table creation (CREATE-TABLE). To manually specify the starting value for an auto-increment column, you can do so by using the AUTO_INCREMENT(start_value)
statement when creating the table. If not specified, the default starting value is 1.
Examples
Creating a Duplicate table with one key column as an auto-increment column:
CREATE TABLE `demo`.`tbl` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
Creating a Duplicate table with one key column as an auto-increment column, and set start value is 100:
CREATE TABLE `demo`.`tbl` (
`id` BIGINT NOT NULL AUTO_INCREMENT(100),
`value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
Creating a Duplicate table with one value column as an auto-increment column:
CREATE TABLE `demo`.`tbl` (
`uid` BIGINT NOT NULL,
`name` BIGINT NOT NULL,
`id` BIGINT NOT NULL AUTO_INCREMENT,
`value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`uid`, `name`)
DISTRIBUTED BY HASH(`uid`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
Creating a Unique tbl table with one key column as an auto-increment column:
CREATE TABLE `demo`.`tbl` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`name` varchar(65533) NOT NULL,
`value` int(11) NOT NULL
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"enable_unique_key_merge_on_write" = "true"
);
Creating a Unique tbl table with one value column as an auto-increment column:
CREATE TABLE `demo`.`tbl` (
`text` varchar(65533) NOT NULL,
`id` BIGINT NOT NULL AUTO_INCREMENT,
) ENGINE=OLAP
UNIQUE KEY(`text`)
DISTRIBUTED BY HASH(`text`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"enable_unique_key_merge_on_write" = "true"
);
Constraints and Limitations
- Only Duplicate model tables and Unique model tables can contain auto-increment columns.
- A table can contain at most one auto-increment column.
- The type of the auto-increment column must be BIGINT and must be NOT NULL.
- The manually specified starting value for an auto-increment column must be greater than or equal to 0.
Usage
Import
Consider the following table:
CREATE TABLE `demo`.`tbl` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`name` varchar(65533) NOT NULL,
`value` int(11) NOT NULL
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"enable_unique_key_merge_on_write" = "true"
);
When using the insert into statement to import data without specifying the auto-increment column id
, the id
column will automatically be filled with generated values.
mysql> insert into tbl(name, value) values("Bob", 10), ("Alice", 20), ("Jack", 30);
Query OK, 3 rows affected (0.09 sec)
{'label':'label_183babcb84ad4023_a2d6266ab73fb5aa', 'status':'VISIBLE', 'txnId':'7'}
mysql> select * from tbl order by id;
+------+-------+-------+
| id | name | value |
+------+-------+-------+
| 1 | Bob | 10 |
| 2 | Alice | 20 |
| 3 | Jack | 30 |
+------+-------+-------+
3 rows in set (0.05 sec)
Similarly, using stream load to import the file test.csv without specifying the auto-increment column id
will result in the id column being automatically filled with generated values.
test.csv:
Tom, 40
John, 50
curl --location-trusted -u user:passwd -H "columns:name,value" -H "column_separator:," -T ./test1.csv http://{host}:{port}/api/{db}/tbl/_stream_load
mysql> select * from tbl order by id;
+------+-------+-------+
| id | name | value |
+------+-------+-------+
| 1 | Bob | 10 |
| 2 | Alice | 20 |
| 3 | Jack | 30 |
| 4 | Tom | 40 |
| 5 | John | 50 |
+------+-------+-------+
5 rows in set (0.04 sec)
When importing using insert into statement while specifying the auto-increment column id
, null values in the imported data for that column will be replaced by generated values.
mysql> insert into tbl(id, name, value) values(null, "Doris", 60), (null, "Nereids", 70);
Query OK, 2 rows affected (0.07 sec)
{'label':'label_9cb0c01db1a0402c_a2b8b44c11ce4703', 'status':'VISIBLE', 'txnId':'10'}
mysql> select * from tbl order by id;
+------+---------+-------+
| id | name | value |
+------+---------+-------+
| 1 | Bob | 10 |
| 2 | Alice | 20 |
| 3 | Jack | 30 |
| 4 | Tom | 40 |
| 5 | John | 50 |
| 6 | Doris | 60 |
| 7 | Nereids | 70 |
+------+---------+-------+
7 rows in set (0.04 sec)
Partial Update
When performing a partial update on a merge-on-write Unique table containing an auto-increment column:
If the auto-increment column is a key column, during partial updates, as users must explicitly specify the key column, the target columns for partial column updates must include the auto-increment column. In this scenario, the import behavior is similar to regular partial updates.
mysql> CREATE TABLE `demo`.`tbl2` (
-> `id` BIGINT NOT NULL AUTO_INCREMENT,
-> `name` varchar(65533) NOT NULL,
-> `value` int(11) NOT NULL DEFAULT "0"
-> ) ENGINE=OLAP
-> UNIQUE KEY(`id`)
-> DISTRIBUTED BY HASH(`id`) BUCKETS 10
-> PROPERTIES (
-> "replication_allocation" = "tag.location.default: 3",
-> "enable_unique_key_merge_on_write" = "true"
-> );
Query OK, 0 rows affected (0.03 sec)
mysql> insert into tbl2(id, name, value) values(1, "Bob", 10), (2, "Alice", 20), (3, "Jack", 30);
Query OK, 3 rows affected (0.14 sec)
{'label':'label_5538549c866240b6_bce75ef323ac22a0', 'status':'VISIBLE', 'txnId':'1004'}
mysql> select * from tbl2 order by id;
+------+-------+-------+
| id | name | value |
+------+-------+-------+
| 1 | Bob | 10 |
| 2 | Alice | 20 |
| 3 | Jack | 30 |
+------+-------+-------+
3 rows in set (0.08 sec)
mysql> set enable_unique_key_partial_update=true;
Query OK, 0 rows affected (0.01 sec)
mysql> set enable_insert_strict=false;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into tbl2(id, name) values(1, "modified"), (4, "added");
Query OK, 2 rows affected (0.06 sec)
{'label':'label_3e68324cfd87457d_a6166cc0a878cfdc', 'status':'VISIBLE', 'txnId':'1005'}
mysql> select * from tbl2 order by id;
+------+----------+-------+
| id | name | value |
+------+----------+-------+
| 1 | modified | 10 |
| 2 | Alice | 20 |
| 3 | Jack | 30 |
| 4 | added | 0 |
+------+----------+-------+
4 rows in set (0.04 sec)
When the auto-increment column is a non-key column and users haven’t specified the value for the auto-increment column, the value will be filled from existing data rows in the table. If users specify the auto-increment column, null values in the imported data for that column will be replaced by generated values, while non-null values will remain unchanged, and then these data will be loaded with the semantics of partial updates.
mysql> CREATE TABLE `demo`.`tbl3` (
-> `id` BIGINT NOT NULL,
-> `name` varchar(100) NOT NULL,
-> `score` BIGINT NOT NULL,
-> `aid` BIGINT NOT NULL AUTO_INCREMENT
-> ) ENGINE=OLAP
-> UNIQUE KEY(`id`)
-> DISTRIBUTED BY HASH(`id`) BUCKETS 1
-> PROPERTIES (
-> "replication_allocation" = "tag.location.default: 3",
-> "enable_unique_key_merge_on_write" = "true"
-> );
Query OK, 0 rows affected (0.16 sec)
mysql> insert into tbl3(id, name, score) values(1, "Doris", 100), (2, "Nereids", 200), (3, "Bob", 300);
Query OK, 3 rows affected (0.28 sec)
{'label':'label_c52b2c246e244dda_9b91ee5e27a31f9b', 'status':'VISIBLE', 'txnId':'2003'}
mysql> select * from tbl3 order by id;
+------+---------+-------+------+
| id | name | score | aid |
+------+---------+-------+------+
| 1 | Doris | 100 | 0 |
| 2 | Nereids | 200 | 1 |
| 3 | Bob | 300 | 2 |
+------+---------+-------+------+
3 rows in set (0.13 sec)
mysql> set enable_unique_key_partial_update=true;
Query OK, 0 rows affected (0.00 sec)
mysql> set enable_insert_strict=false;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into tbl3(id, score) values(1, 999), (2, 888);
Query OK, 2 rows affected (0.07 sec)
{'label':'label_dfec927d7a4343ca_9f9ade581391de97', 'status':'VISIBLE', 'txnId':'2004'}
mysql> select * from tbl3 order by id;
+------+---------+-------+------+
| id | name | score | aid |
+------+---------+-------+------+
| 1 | Doris | 999 | 0 |
| 2 | Nereids | 888 | 1 |
| 3 | Bob | 300 | 2 |
+------+---------+-------+------+
3 rows in set (0.06 sec)
mysql> insert into tbl3(id, aid) values(1, 1000), (3, 500);
Query OK, 2 rows affected (0.07 sec)
{'label':'label_b26012959f714f60_abe23c87a06aa0bf', 'status':'VISIBLE', 'txnId':'2005'}
mysql> select * from tbl3 order by id;
+------+---------+-------+------+
| id | name | score | aid |
+------+---------+-------+------+
| 1 | Doris | 999 | 1000 |
| 2 | Nereids | 888 | 1 |
| 3 | Bob | 300 | 500 |
+------+---------+-------+------+
3 rows in set (0.06 sec)
Usage Scenarios
Dictionary Encoding
Using bitmaps for audience analysis in user profile requires building a user dictionary where each user corresponds to a unique integer dictionary value. Aggregating these dictionary values can improve the performance of bitmap.
Taking the offline UV and PV analysis scenario as an example, assuming there’s a detailed user behavior table:
CREATE TABLE `demo`.`dwd_dup_tbl` (
`user_id` varchar(50) NOT NULL,
`dim1` varchar(50) NOT NULL,
`dim2` varchar(50) NOT NULL,
`dim3` varchar(50) NOT NULL,
`dim4` varchar(50) NOT NULL,
`dim5` varchar(50) NOT NULL,
`visit_time` DATE NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
Using the auto-incrementa column to create the following dictionary table:
CREATE TABLE `demo`.`dictionary_tbl` (
`user_id` varchar(50) NOT NULL,
`aid` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"enable_unique_key_merge_on_write" = "true"
);
Import the value of user_id
from existing data into the dictionary table, establishing the mapping of user_id
to integer values:
insert into dictionary_tbl(user_id)
select user_id from dwd_dup_tbl group by user_id;
Or import only the value of user_id
in incrementa data into the dictionary table alternatively:
insert into dictionary_tbl(user_id)
select dwd_dup_tbl.user_id from dwd_dup_tbl left join dictionary_tbl
on dwd_dup_tbl.user_id = dictionary_tbl.user_id where dwd_dup_tbl.visit_time > '2023-12-10' and dictionary_tbl.user_id is NULL;
In real-world scenarios, Flink connectors can also be employed to write data into Doris.
Assuming dim1
, dim3
, dim5
represent statistical dimensions of interest to us, create the following table to store aggregated results:
CREATE TABLE `demo`.`dws_agg_tbl` (
`dim1` varchar(50) NOT NULL,
`dim3` varchar(50) NOT NULL,
`dim5` varchar(50) NOT NULL,
`user_id_bitmap` BITMAP BITMAP_UNION NOT NULL,
`pv` BIGINT SUM NOT NULL
) ENGINE=OLAP
AGGREGATE KEY(`dim1`,`dim3`,`dim5`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
Store the result of the data aggregation operations into the aggregation result table:
insert into dws_agg_tbl
select dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5, BITMAP_UNION(TO_BITMAP(dictionary_tbl.aid)), COUNT(1)
from dwd_dup_tbl INNER JOIN dictionary_tbl on dwd_dup_tbl.user_id = dictionary_tbl.user_id;
Perform UV and PV queries using the following statement:
select dim1, dim3, dim5, user_id_bitmap as uv, pv from dws_agg_tbl;
Efficient Pagination
When displaying data on a page, pagination is often necessary. Traditional pagination typically involves using limit
, offset
, and order by
in SQL queries. For instance, consider the following business table intended for display:
CREATE TABLE `demo`.`records_tbl` (
`key` int(11) NOT NULL COMMENT "",
`name` varchar(26) NOT NULL COMMENT "",
`address` varchar(41) NOT NULL COMMENT "",
`city` varchar(11) NOT NULL COMMENT "",
`nation` varchar(16) NOT NULL COMMENT "",
`region` varchar(13) NOT NULL COMMENT "",
`phone` varchar(16) NOT NULL COMMENT "",
`mktsegment` varchar(11) NOT NULL COMMENT ""
) DUPLICATE KEY (`key`, `name`)
DISTRIBUTED BY HASH(`key`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
Assuming 100 records are displayed per page in pagination. To fetch the first page’s data, the following SQL query can be used:
select * from records_tbl order by `key`, `name` limit 100;
Fetching the data for the second page can be accomplished by:
select * from records_tbl order by `key`, `name` limit 100, offset 100;
However, when performing deep pagination queries (with large offsets), even if the actual required data rows are few, this method still reads all data into memory for full sorting before subsequent processing, which is quite inefficient. Using an auto-incrementa column assigns a unique value to each row, allowing the use of where unique_value
> x limit y to filter a significant amount of data beforehand, making pagination more efficient.
Continuing with the aforementioned business table, an auto-increment column is added to the table to give each row a unique identifier:
CREATE TABLE `demo`.`records_tbl2` (
`key` int(11) NOT NULL COMMENT "",
`name` varchar(26) NOT NULL COMMENT "",
`address` varchar(41) NOT NULL COMMENT "",
`city` varchar(11) NOT NULL COMMENT "",
`nation` varchar(16) NOT NULL COMMENT "",
`region` varchar(13) NOT NULL COMMENT "",
`phone` varchar(16) NOT NULL COMMENT "",
`mktsegment` varchar(11) NOT NULL COMMENT "",
`unique_value` BIGINT NOT NULL AUTO_INCREMENT
) DUPLICATE KEY (`key`, `name`)
DISTRIBUTED BY HASH(`key`) BUCKETS 10
PROPERTIES (
"replication_num" = "3"
);
For pagination displaying 100 records per page, to fetch the first page’s data, the following SQL query can be used:
select * from records_tbl2 order by unique_value limit 100;
By recording the maximum value of unique_value in the returned results, let’s assume it’s 99. The following query can then fetch data for the second page:
select * from records_tbl2 where unique_value > 99 order by unique_value limit 100;
If directly querying contents from a later page and it’s inconvenient to directly obtain the maximum value of unique_value
from the preceding page’s data (for instance, directly obtaining contents from the 101st page), the following query can be used:
select key, name, address, city, nation, region, phone, mktsegment
from records_tbl2, (select unique_value as max_value from records_tbl2 order by unique_value limit 1 offset 9999) as previous_data
where records_tbl2.unique_value > previous_data.max_value
order by records_tbl2.unique_value limit 100;