Partial Update

Partial Update

Overview

To perform partial column updates on your table, Doris provides two models: the Unique Key model and the Aggregate Key model.

Unique Key Model

Doris’s Unique Key Model provides default whole-row Upsert semantics. Before version 2.0, if users wanted to update certain columns of some rows, they could only use the UPDATE command. However, due to the granularity of locks in read-write transactions, the UPDATE command is not suitable for high-frequency data write scenarios. Therefore, in version 2.0, we introduced support for partial column updates in the Unique Key model.

Note:

Partial update are only supported in the Unique Key Merge-on-Write implementation in version 2.0.0.
Version 2.0.2 introduces support for performing partial column updates using INSERT INTO.
Version 2.1.0 will offer more flexible column updates, as described in the “Usage Limitations” section below.

Aggregate Key Model

Aggregate Key tables are primarily used in pre-aggregation scenarios rather than data update scenarios. However, column updates can also be achieved by setting the aggregation function to REPLACE_IF_NOT_NULL.

Applicable scenarios

Real-time dynamic column updates. Where certain field values in the table need to be updated in real-time at high frequencies. For example, in a user label table generated by T+1, some fields containing information about the latest user behaviors need to be updated in real-time for real-time analysis and decision-making in advertising/recommendation systems.
Combining multiple source tables into one large wide table.
Data correction.

Fundamentals

For more information about the principles of the Unique Key model and Aggregate Key model, please refer to the Data Model introduction.

Unique Key Model

The Unique Key model currently supports column updates only in the Merge-on-Write implementation.

Users write data for certain columns into Doris’s Memtable through the regular load methods. At this point, the Memtable does not contain complete rows of data. When flushing the Memtable, Doris searches for historical data, fills in the entire row, then writes it to the data file, and marks the data rows with the same key in the historical data file as deleted.

In the case of concurrent loads, Doris uses the MVCC mechanism to ensure data correctness. If two batches of loaded data update different columns with the same key, the load task with the higher system version will recomplete the data rows with the same key written by the lower version load task after the lower version load task succeeds.

Aggregate Key Model

You can achieve partial column updates by setting the aggregation function to REPLACE_IF_NOT_NULL. For detailed usage examples, please refer to the following section.

Concurrent Writes and Data Visibility

Partial column updates support high-frequency concurrent writes, and once the write is successful, the data becomes visible. The system automatically ensures the correctness of concurrent writes through the MVCC mechanism.

Performance

Usage Recommendations:

For users with high write performance requirements and low query performance requirements, it is recommended to use the Aggregate Key Model.
For users with high query performance requirements and lower write performance requirements (e.g., data writes and updates are mainly completed during off-peak hours in the early morning), or for users with low write frequency, it is recommended to use the Unique Key Merge-on-Write implementation.

Unique Key Merge-on-Write Implementation

Since the Merge-on-Write implementation requires filling in entire rows of data during data writes to ensure optimal query performance, performing partial column updates using the Merge-on-Write implementation may lead to noticeable load performance degradation.

Write Performance Optimization Recommendations:

Use SSDs equipped with NVMe or high-speed SSD cloud disks. Because reading historical data in large quantities occurs when filling in row data, might results in higher read IOPS and read throughput.
Enabling row storage can significantly reduce the IOPS generated when filling in row data, leading to a noticeable improvement in load performance. Users can enable row storage when creating tables using the following property:

"store_row_column" = "true"

Aggregate Key Model

The Aggregate Key model does not perform any additional processing during the write process, so its write performance is not affected and is the same as regular data ingestion. However, there is a significant cost associated with aggregation during query execution, with typical aggregation query performance being 5-10 times lower compared to the Merge-on-Write implementation of the Unique Key model.

Usage Instructions and Examples

Unique Key Model

Table Creation

When creating a table, you need to specify the following property to enable the Merge-on-Write implementation:

enable_unique_key_merge_on_write = true

StreamLoad/BrokerLoad/RoutineLoad

If you are using StreamLoad/BrokerLoad/RoutineLoad, add the following header when loading:

partial_columns:true

Also, specify the columns to be loaded in the columns header (it must include all key columns, or else updates cannot be performed).

Flink Connector

If you are using the Flink Connector, you need to add the following configuration:

'sink.properties.partial_columns' = 'true',

Also, specify the columns to be loaded in sink.properties.column (it must include all key columns, or else updates cannot be performed).

INSERT INTO

In all data models, by default, when you use INSERT INTO with a given set of columns, the default behavior is to insert the entire row. To enable partial column updates in the Merge-on-Write implementation, you need to set the following session variable:

set enable_unique_key_partial_update=true

Please note that the default value of the session variable enable_insert_strict, which controls whether the insert statement operates in strict mode, is true. In other words, the insert statement is in strict mode by default, and in this mode, updating non-existing keys in partial column updates is not allowed. Therefore, when using the insert statement for partial columns update and wishing to insert non-existing keys, you need to set enable_unique_key_partial_update to true and simultaneously set enable_insert_strict to false.

Example

Suppose there is an order table order_tbl in Doris, where the order ID is the Key column, and the order status and order amount are Value columns. The data status is as follows:

Order ID	Order Amount	Order Status
1	100	Pending Payment

+----------+--------------+-----------------+
| order_id | order_amount | order_status    |
+----------+--------------+-----------------+
| 1        | 100          | Pending Payment |
+----------+--------------+-----------------+
1 row in set (0.01 sec)

Now, when a user clicks to make a payment, he needs to change the order status of the order with Order ID ‘1’ to ‘Pending Delivery’.

If you are using StreamLoad, you can perform the update as follows:

$ cat update.csv
1,Pending Delivery
$ curl  --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:48037/api/db1/order_tbl/_stream_load

If you are using INSERT INTO, you can perform the update as follows:

set enable_unique_key_partial_update=true;
INSERT INTO order_tbl (order_id, order_status) values (1,'Pending Delivery');

The updated result is as follows:

+----------+--------------+------------------+
| order_id | order_amount | order_status     |
+----------+--------------+------------------+
| 1        | 100          | Pending Delivery |
+----------+--------------+------------------+
1 row in set (0.01 sec)

Aggregate Key Model

Table Creation

Set the aggregation function for the columns that need column updates to REPLACE_IF_NOT_NULL as follows:

CREATE TABLE `order_tbl` (
  `order_id` int(11) NULL,
  `order_amount` int(11) REPLACE_IF_NOT_NULL NULL,
  `order_status` varchar(100) REPLACE_IF_NOT_NULL NULL
) ENGINE=OLAP
AGGREGATE KEY(`order_id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`order_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

Write Data

Whether through load tasks or using INSERT INTO, you can directly write the data for the columns you want to update.

Example

Similar to the previous example, the corresponding Stream Load command (no additional header required) is:

curl  --location-trusted -u root: -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:48037/api/db1/order_tbl/_stream_load

The corresponding INSERT INTO statement (no need to set additional session variables) is:

INSERT INTO order_tbl (order_id, order_status) values (1,'Pending Delivery');

Usage Limitations

Unique Key Merge-on-Write Implementation

In version 2.0, all rows in the same batch of data write tasks (whether load tasks or INSERT INTO) can only update the same columns. If you need to update different columns of data, you will need to write them in separate batches.

In version 2.1, we will support more flexible column updates, allowing users to update different columns for each row within the same batch load.

Aggregate Key Model

Users cannot change a field from non-NULL to NULL; any NULL values written with the REPLACE_IF_NOT_NULL aggregation function will be automatically ignored during processing.