Dear community members, Apache Doris 3.0.1 was officially released on August 23, 2024, featuring updates and improvements in compute-storage decoupling, lakehouse, semi-structured data analysis, asynchronous materialized views, and more.
Quick Download: https://doris.apache.org/download/
GitHub Release: https://github.com/apache/doris/releases
Behavior Changes
Query Optimizer
Added the variable use_max_length_of_varchar_in_ctas to control the length behavior of the VARCHAR type when executing CREATE TABLE AS SELECT (CTAS) operations. #37069
This variable is set to true by default.
When set to true, if the VARCHAR type column originates from a table, the derived length is used; otherwise, the maximum length is used.
When set to false, the VARCHAR type will always use the derived length.
All data types will now be displayed in lowercase to maintain compatibility with MySQL format. #38012
Multiple query statements in the same query request must now be separated by semicolons. #38670
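The rule for use_max_length_of_varchar_in_ctas described above can be read as a small decision function. The following is a minimal Python sketch of that rule, not Doris code: the function name, its parameters, and the MAX_VARCHAR_LENGTH constant (assumed here to be 65533) are illustrative assumptions.

```python
# Assumed maximum VARCHAR length, used for illustration only.
MAX_VARCHAR_LENGTH = 65533


def ctas_varchar_length(derived_length: int, from_table_column: bool,
                        use_max_length_of_varchar_in_ctas: bool = True) -> int:
    """Return the VARCHAR length a CTAS target column gets under the rule above."""
    if not use_max_length_of_varchar_in_ctas:
        # false: the derived length is always used
        return derived_length
    # true (default): the derived length is kept only for columns that
    # originate from a table; everything else gets the maximum length
    return derived_length if from_table_column else MAX_VARCHAR_LENGTH
```

For example, with the default setting a column computed by an expression would get the maximum length, while a column selected directly from a table keeps its derived length.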
Query Execution
- The default number of parallel tasks after shuffle operations in the cluster is set to 100, which will improve query stability and concurrent processing capability in large clusters. #38196
Storage
The default value of trash_file_expire_time_sec has been changed from 86400 seconds to 0 seconds, which means that if files are deleted by mistake and the FE trash is cleared, the data cannot be recovered.
The table attribute enable_mow_delete_on_delete_predicate (introduced in version 3.0.0) has been renamed to enable_mow_light_delete.
Explicit transactions are now prohibited from performing delete operations on tables with written data.
Heavy schema change operations are prohibited on tables with auto-increment fields.
New Features
Job Scheduling
- Optimized the execution logic of internal scheduling jobs, decoupling the strong association between start time and immediate execution parameters. Now, tasks can be created with a specified start time or selected for immediate execution, without conflict, enhancing scheduling flexibility. #36805
Compute-Storage Decoupled
Supports dynamic modification of the upper limit for file cache usage. #37484
Recycler now supports object storage rate limiting and server-side rate limiting retry functionality. #37663 #37680
Lakehouse
Added the session variable serde_dialect to set the output format for complex types. #37039
SQL interception now supports external tables.
- For more information, refer to the documentation on SQL Interception.
- Insert overwrite now supports Iceberg tables. #37191
Asynchronous Materialized Views
Supports partition roll-up and build at the hourly level. #37678
Supports atomic replacement of asynchronous materialized view definition statements. #36749
Transparent rewriting now supports Insert statements. #38115
Transparent rewriting now supports the VARIANT type. #37929
Query Execution
- The group concat function now supports DISTINCT and ORDER BY options. #38744
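The effect of the DISTINCT and ORDER BY options can be sketched outside the database. The following is a minimal Python model of the semantics, not the Doris implementation: the function name and parameters are hypothetical, the separator is passed explicitly, ordering is ascending by the value itself, and NULL values (None here) are assumed to be skipped.

```python
from typing import Iterable, Optional


def group_concat(values: Iterable[Optional[str]], separator: str,
                 distinct: bool = False, order_by: bool = False) -> str:
    """Concatenate non-NULL values, optionally de-duplicated and sorted ascending."""
    items = [v for v in values if v is not None]  # assumption: NULLs are skipped
    if distinct:
        items = list(dict.fromkeys(items))  # drop duplicates, keep first-seen order
    if order_by:
        items = sorted(items)
    return separator.join(items)
```

With both options enabled, the input values b, a, b, NULL, c and the separator "," produce a,b,c.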
Semi-Structured Data Management
The ES Catalog now maps nested or object types in Elasticsearch to the JSON type in Doris. #37101
Added the MULTI_MATCH function, which supports matching keywords across multiple fields and can leverage inverted indexes to accelerate searches. #37722
Added the explode_json_object function, which can unfold objects in JSON data into multiple rows. #36887
Inverted indexes now support memtable advancement, requiring index construction only once during multi-replica writes, reducing CPU consumption and improving performance. #35891
Added MATCH_PHRASE support for positive slop, e.g., msg MATCH_PHRASE 'a b 2+' can match instances containing words a and b with a slop of no more than two and with a preceding b; regular slop without the final + does not guarantee this order. #36356
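The difference between regular slop and positive slop in MATCH_PHRASE can be illustrated with a toy two-term matcher. This is a simplified Python model, not the algorithm used by the Doris inverted index: the function name is hypothetical, tokenization is a plain whitespace split, and slop is modeled as the number of words allowed between the two terms, which may differ from how the engine counts it.

```python
def phrase_match(text: str, first: str, second: str, slop: int, ordered: bool) -> bool:
    """Simplified two-term slop check: at most `slop` words between the terms."""
    tokens = text.lower().split()
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue  # the same token cannot serve as both terms
            if ordered and j < i:
                continue  # the trailing "+" form requires `first` before `second`
            if abs(j - i) - 1 <= slop:
                return True
    return False
```

In this model, the text "b x a" matches the terms a and b with a slop of two only when ordering is not enforced, while "a x y b" matches in both modes.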
Other
Added the FE parameter skip_audit_user_list; operations by the users listed in this configuration are not recorded in the audit log. #38310
- For more information, refer to the documentation on Audit Plugin.
Improvements
Storage
Reduced the likelihood of write failures caused by disk balancing within a single BE. #38000
Decreased memory consumption by the memtable limiter. #37511
Moved old partitions to the FE trash during partition replacement operations. #36361
Optimized memory consumption during compaction. #37099
Added a session variable that controls whether JDBC PreparedStatement executions are written to the audit log; by default they are not. #38419
Optimized the logic for selecting BEs for group commits. #35558
Improved the performance of column updates. #38487
Optimized the use of delete bitmap cache. #38761
Added a configuration to control query affinity during hot and cold tiering. #37492
Compute-Storage Decoupled
Implemented automatic retries when encountering object storage server rate limiting. #37199
Adapted the number of threads for memtable flush in the compute-storage decoupled mode. #38789
Added Azure as a compile option to support compilation in environments without Azure support.
Optimized the observability of object storage access rate limiting. #38294
Allowed the file cache TTL queue to perform LRU eviction, enhancing TTL queue usability. #37312
Reduced the number of edit log write I/O operations issued by balance in the compute-storage decoupled mode. #37787
Improved table creation speed in the compute-storage decoupled mode by sending tablet creation requests in batches. #36786
Mitigated read failures caused by potential inconsistencies in the local file cache by retrying with backoff. #38645
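Several items in this section rely on retrying an operation after a transient failure, such as object storage rate limiting or a file cache read error. The following is a generic Python sketch of retrying with exponential backoff. It is not taken from the Doris code base, and the function name, parameters, and default delays are all illustrative assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                       base_delay_s: float = 0.1, max_delay_s: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # wait base, 2*base, 4*base, ... capped at max_delay_s
            sleep(min(max_delay_s, base_delay_s * (2 ** attempt)))
    raise ValueError("max_attempts must be at least 1")
```

The sleep function is injectable so that the backoff schedule can be checked without real waiting. A production version would typically retry only on specific error types and add jitter to the delays.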
Lakehouse
Optimized memory statistics for Parquet/ORC format read and write operations. #37234
Trino Connector Catalog now supports predicate pushdown. #37874
Added a session variable enable_count_push_down_for_external_table to control whether to enable count(*) pushdown optimization for external tables. #37046
Optimized the read logic for Hudi snapshot reads, returning an empty set when the snapshot is empty, consistent with Spark behavior. #37702
Improved the read performance of partition columns for Hive tables. #37377
Asynchronous Materialized Views
Improved transparent rewrite plan speed by 20%. #37197
Eliminated roll-up during transparent rewrite if the group key satisfies data uniqueness for better nested matching. #38387
Transparent rewrite now performs better aggregation elimination to improve the matching success rate of nested materialized views. #36888
MySQL Compatibility
Now correctly populates the database name, table name, and original name in the MySQL protocol result columns. #38126
Supported the hint format /*+ func(value) */. #37720
Query Optimizer
Significantly improved the plan speed for complex queries. #38317
Adaptively chose whether to perform bucket shuffle based on the number of data buckets to avoid performance degradation in extreme cases. #36784
Optimized the cost estimation logic for SEMI / ANTI JOIN. #37951 #37060
Supported pushing Limit down to the first stage of aggregation to improve performance. #34853
Partition pruning now supports filter conditions containing the date_trunc or date function. #38025 #38743
SQL cache now supports query scenarios that include user variables. #37915
Optimized error messages for invalid aggregation semantics. #38122
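To see why a filter on a truncated date can still prune partitions, consider a filter that truncates the partition column to the month and compares the result with 2024-08-01. Such a filter is equivalent to a range condition on the raw column, so only partitions overlapping that range need to be scanned. The following Python sketch illustrates the idea. It is a conceptual model rather than the optimizer's actual logic, and the partition layout, type alias, and function names are hypothetical.

```python
from datetime import date
from typing import List, Tuple

Partition = Tuple[str, date, date]  # (name, inclusive start, exclusive end)


def month_range(month_start: date) -> Tuple[date, date]:
    """Range of dates whose month-truncated value equals `month_start` (day 1 assumed)."""
    if month_start.month == 12:
        next_month = date(month_start.year + 1, 1, 1)
    else:
        next_month = date(month_start.year, month_start.month + 1, 1)
    return month_start, next_month


def prune_partitions(partitions: List[Partition], month_start: date) -> List[str]:
    """Keep only partitions whose range overlaps the month selected by the filter."""
    lo, hi = month_range(month_start)
    return [name for name, start, end in partitions if start < hi and end > lo]
```

Given monthly or half-monthly partitions around August 2024, only the partitions covering August survive pruning for the filter above.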
Query Execution
Adapted AggState compatibility from 2.1 to 3.x and fixed Coredump issues. #37104
Refactored the strategy selection for local shuffle without Join. #37282
Modified the scanner for internal table queries to be asynchronous to prevent stalling during such queries. #38403
Optimized the block merge process during Hash table construction for Join operators. #37471
Optimized the duration of lock holding for MultiCast. #37462
Optimized gRPC keepAliveTime and added link monitoring to reduce the probability of query failure due to RPC errors. #37304
Cleaned up all dirty pages in jemalloc when memory limits were exceeded. #37164
Optimized the processing performance of aes_encrypt / decrypt functions for constant types. #37194
Optimized the processing performance of the json_extract function for constant data. #36927
Optimized the processing performance of the ParseUrl function for constant data. #36882
Semi-Structured Data Management
Bitmap indexes now default to using inverted indexes, with enable_create_bitmap_index_as_inverted_index set to true by default. #36692
In the compute-storage decoupled mode, DESC can now view sub-columns of VARIANT type. #38143
Removed the step of checking file existence during inverted index queries to reduce access latency to remote storage. #36945
Complex types ARRAY / MAP / STRUCT now support replace_if_not_null for AGG tables. #38304
Escape characters for JSON data are now supported. #37176 #37251
Inverted index queries now behave consistently on MOW tables and DUP tables. #37428
Optimized the performance of inverted index acceleration for IN queries. #37395
Reduced unnecessary memory allocation during TOPN queries to improve performance. #37429
When creating an inverted index with tokenization, the support_phrase option is now automatically enabled to accelerate match_phrase series phrase queries. #37949
Other
Audit log now can record SQL types. #37790
Added support for information_schema.processlist to show all FE. #38701
Cached Ranger’s datamask and rowpolicy to accelerate query efficiency. #37723
Optimized metadata management in job manager to release locks immediately after modifying metadata, reducing lock holding time. #38162
Bug Fixes
Upgrade
Fix the issue where mtmv load fails during upgrade from version 2.1. #38799
Resolve the issue where null_type cannot be found during the upgrade to version 2.1. #39373
Address the compatibility issue with permission persistence during the upgrade from version 2.1 to 3.0. #39288
Load
Fix the issue where parsing fails when the newline character is surrounded by delimiters in CSV format parsing. #38347
Resolve potential exception issues when FE forwards group commit. #38228 #38265
Group commit now supports the new optimizer. #37002
Fix the issue where group commit reports data errors when JDBC setNull is used. #38262
Optimize the retry logic for group commit when encountering delete bitmap lock errors. #37600
Resolve the issue where routine load cannot use CSV delimiters and escape characters. #38402
Fix the issue where routine load job names with mixed case cannot be displayed. #38523
Optimize the logic for actively recovering routine load during FE master-slave switching. #37876
Resolve the issue where routine load pauses when all data in Kafka is expired. #37288
Fix the issue where show routine load returns empty results. #38199
Resolve the memory leak issue during multi-table stream import in routine load. #38255
Fix the issue where stream load does not return the error URL. #38325
Fix the issue where no error may be reported when importing fewer segments than expected. #36753
Resolve the load stream leak issue. #38912
Optimize the impact of offline nodes on import operations. #38198
Fix the issue where transactions do not end when inserting into empty data. #38991
Storage
01 Backup and Restoration
Fix the issue where tables cannot be written after backup and restoration. #37089
Resolve the issue where view database names are incorrect after backup and restoration. #37412
02 Compaction
Fix the issue where cumulative compaction handles delete errors incorrectly during ordered data compaction. #38742
Resolve the issue of duplicate keys in aggregate tables caused by the sequential compaction optimization. #38224
Fix the issue where compaction causes coredump on large wide tables. #37960
Resolve the compaction starvation issue caused by inaccurate concurrency statistics of compaction tasks. #37318
03 MOW Unique Key
Resolve the issue of inconsistent data between replicas caused by cumulative compaction removing the delete sign. #37950
MOW delete now uses partial column updates with the new optimizer. #38751
Fix the potential duplicate key issue in MOW tables under compute-storage decoupled. #39018
Resolve the issue where MOW unique and duplicate tables cannot modify column order. #37067
Fix the potential data correctness issue caused by segcompaction. #37760
Resolve the potential memory leak issue during column updates. #37706
04 Other
Fix the small probability of exceptions in TOPN queries. #39119 #39199
Resolve the issue where auto-increment IDs may duplicate during FE restart. #37306
Fix the potential queuing issue in the delete operation priority queue. #37169
Optimize the delete retry logic. #37363
Resolve the issue with bucket = 0 in table creation statements under the new optimizer. #38971
Fix the issue where FE reports success incorrectly when image generation fails. #37508
Resolve the issue where using the wrong nodename during FE offline nodes may cause inconsistent FE members. #37987
Fix the issue where CCR partition addition may fail. #37295
Resolve the int32 overflow issue in inverted index files. #38891
Fix the issue where TRUNCATE TABLE failure may cause BE to fail to go offline. #37334
Resolve the issue where publish cannot continue due to null pointers. #37724 #37531
Fix the potential coredump issue when manually triggering disk migration. #37712
Compute-Storage Decoupled
Fixed the issue where show create table might display the file_cache_ttl_seconds attribute twice. #38052
Fixed the issue where segment Footer TTL was not set correctly after setting file cache TTL. #37485
Fixed the issue where file cache might cause coredump due to massive conversion of cache types. #38518
Fixed the potential file descriptor (fd) leak in file cache. #38051
Fixed the issue where schema change Job overwriting compaction Job prevented base tablet compaction from completing normally. #38210
Fixed the potential inaccuracy of base compaction score due to data race. #38006
Fixed the issue where error messages from imports might not be uploaded correctly to object storage. #38359
Fixed the inconsistency in return information between compute-storage decoupled mode and storage and compute integration mode for 2PC imports. #38076
Fixed the issue where an incorrect file size setting during file cache warm-up led to coredump. #38939
Fixed the issue where partial column updates did not correctly dequeue delete operations. #37151
Fixed compatibility issues with permission persistence in compute-storage decoupled mode. #38136 #37708
Fixed the issue where observer did not retry correctly when encountering a -230 error. #37625
Fixed the issue where show load with conditions did not perform correct analysis. #37656
Fixed the issue where show streamload in compute-storage decoupled mode caused BE coredump. #37903
Fixed the issue where copy into did not correctly verify column names in strict mode. #37650
Fixed the issue where multi-stream imports into a single table lacked permissions. #38878
Fixed the potential overflow issue in getVersionUpdateTimeMs. #38074
Fixed the issue where FE azure blob list was not implemented correctly. #37986
Fixed the issue where inaccurate azure blob recycling time calculation prevented recycling. #37535
Fixed the issue where inverted index files were not deleted in compute-storage decoupled mode. #38306
Lakehouse
Fixed the issue with reading binary data from Oracle Catalog. #37078
Fixed the potential deadlock issue when acquiring external table metadata in multi-FE scenarios. #37756
Fixed the issue where JNI scanner failure caused BE nodes to crash. #37697
Fixed the issue with slow reading of date types from Trino Connector Catalog. #37266
Optimized kerberos authentication logic for Hive Catalog. #37301
Fixed the issue where region attributes might be parsed incorrectly when parsing MinIO properties. #37249
Fixed the issue where creating too many FileSystems by FE caused memory leaks. #36954
Fixed the issue with reading incorrect time zone information from Paimon. #37716
Fixed the potential thread leak issue caused by Hive write-back operations. #36990
Fixed the null pointer issue caused by enabling Hive metastore event synchronization. #38421
Fixed the issue where error messages were unclear or caused stalling when creating catalogs. #37551
Fixed the issue where reading Hive text format tables behaved differently from Hive. #37638
Fixed the logic error when switching between catalogs and databases. #37828
MySQL Compatibility
- Fixed the issue where certain flags in the MySQL protocol were set incorrectly when SSL was enabled. #38086
Asynchronous Materialized Views
Fixed the issue where construction might fail when the base table had a very large number of partitions. #37589
Fixed the issue where nested materialized views incorrectly performed full table refreshes even when partition refreshes were possible. #38698
Fixed the issue where partition refresh could not handle the simultaneous existence of valid and invalid dependencies when analyzing partition dependencies. #38367
Fixed the issue where the final result containing NULL type might cause asynchronous materialized views to fail. #37019
Fixed the planning error that might occur during transparent rewriting when both synchronous and asynchronous materialized views with the same name were present. #37311
Synchronous Materialized Views
The rewritten synchronous materialized views now can correctly perform partition pruning. #38527
When rewriting synchronous materialized views, those with unready data are no longer selected. #38148
Query Optimizer
Fixed the deadlock issue that might occur when queries and delete operations are performed simultaneously. #38660
Fixed the issue where bucket pruning might incorrectly prune on decimal column buckets. #37889
Fixed the issue where planning might be incorrect when mark join participates in join reorder. #39152
Fixed the issue where the result is incorrect when the correlation condition of a correlated subquery is not a simple column. #37644
Fixed the issue where partition pruning cannot correctly handle or expressions. #38897
Fixed the planning error that might occur when optimizing the execution order of JOIN and AGG. #37343
Fixed the issue where str_to_date performs incorrect constant folding calculations on datev1 types. #37360
Fixed the issue where the ACOS function’s constant folding returns non-NaN values. #37932
Fixed the occasional planning error: “The children format needs to be [WhenClause+, DefaultValue?]”. #38491
Fixed the issue where planning might be incorrect when the projection includes window functions and there is both the original column and its alias. #38166
Fixed the issue where planning might report an error when the aggregation parameter contains a lambda expression. #37109
Fixed the insert error that might occur in extreme cases: “MultiCastDataSink cannot be cast to DataStreamSink”. #38526
Fixed the issue where the new optimizer does not correctly handle char(0)/varchar(0) when creating a table. #38427
Fixed the incorrect behavior of char(255) toSql. #37340
Fixed the issue where the nullable attribute within the agg_state type might lead to planning errors. #37489
Fixed the issue where row count statistics are inaccurate during mark Join. #38270
Query Execution
Fixed issues where the Pipeline execution engine was stuck, causing queries to not end, in multiple scenarios. #38657, #38206, #38885, #38151, #37297
Fixed the coredump issue caused by NULL and non-NULL columns during set difference calculations. #38750
Fixed the error when using the DECIMAL type with pure decimals in delete statements. #37801
Fixed the issue where the width_bucket function returned incorrect results. #37892
Fixed the query error when a single row of data was very large and the result set was also large (exceeding 2GB). #37990
Fixed the coredump issue caused by incorrect release of rpc connections during single-replica imports. #38087
Fixed the coredump issue caused by processing NULL values with the foreach function. #37349
Fixed the issue where stddev returned incorrect results for DECIMALV2 types. #38731
Fixed the slow performance of bitmap union calculations. #37816
Fixed the issue where RowsProduced for aggregation operators was not set in the profile. #38271
Fixed the overflow issue when calculating the number of buckets for the hash table under hash join. #37193, #37493
Fixed the inaccurate recording of the jemalloc cache memory tracker. #37464
Added the enable_stacktrace configuration option, allowing users to control whether exception stacks are output in BE logs. #37713
Fixed the issue where Arrow Flight SQL did not work correctly when enable_parallel_result_sink was set to false. #37779
Fixed the calculation overflow issue of the round function on DECIMAL128 types. #37733, #38106
Fixed the coredump issue when passing a const string to the sleep function. #37681
Increased the queue length for audit logs, solving the issue where audit logs could not be recorded normally under high concurrency scenarios with thousands of concurrent connections. #37786
Fixed the issue where creating a workload group caused too many threads, leading to BE coredump. #38096
Fixed the coredump issue caused by the MULTI_MATCH_ANY function. #37959
Fixed the transaction rollback issue caused by insert overwrite auto partition. #38103
Fixed the issue where the TimeUtils formatter did not use the correct time zone. #37465
Fixed the issue where results were incorrect under constant folding scenarios for week/yearweek. #37376
Fixed the issue where the convert_tz function returned incorrect results. #37358, #38764
Fixed the coredump issue when using the collect_set function with window functions. #38234
Fixed the coredump issue caused by percentile_approx during rolling upgrades. #39321
Fixed the coredump issue caused by the mod function when encountering abnormal input. #37999
Fixed the issue where the hash table was not fully built when the broadcast join probe started running. #37643
Fixed the issue where executing the same expression in multithreaded environments might lead to incorrect results for Java UDFs. #38612
Fixed the overflow issue caused by incorrect return types of the conv function. #38001
Fixed the issue where the json_replace function returned incorrect types. #3701
Fixed the issue where the nullable attribute setting was unreasonable for the percentile aggregation function. #37330
Fixed the issue where the results of the histogram function were unstable. #38608
Fixed the issue where task state was displayed incorrectly in the profile. #38082
Fixed the issue where some queries were incorrectly canceled when the system just started. #37662
Semi-Structured Data Management
Fix the issue of incorrect index size statistics during compaction. #37232
Fix the potential incorrect matching of ultra-long strings without tokenization in inverted indexes. #37679 #38218
Fix the high memory usage issue of array_range and array_with_const functions when dealing with large data volumes. #38284 #37495
Fix the potential coredump issue when selecting columns of ARRAY / MAP / STRUCT types. #37936
Fix the import failure issue caused by simdjson parsing errors when specifying jsonpath in Stream Load. #38490
Fix the exception handling issue when there are duplicate keys in JSON data. #38146
Fix the potential query error after DROP INDEX. #37646
Fix the error return issue in row merging checks during index compaction. #38732
Inverted index v2 format now supports renaming columns. #38079
Fix the coredump issue when the MATCH function matches an empty string without an index. #37947
Fix the handling of NULL values in inverted indexes. #37921 #37842 #38741
Fix the incorrect row_store_page_size after FE restart. #38240
Other
Fix the timezone configuration issue. The default timezone is no longer fixed at UTC+8 and is now obtained from system configuration. #37294
Fix the class conflict issue when using ranger due to multiple JSR specification implementations. #37575
Fix the potential uninitialized field issue in some BE code. #37403
Fix the error in delete statements for random distributed tables. #37985
Fix the incorrect requirement for alter_priv permission on the base table when creating a synchronized materialized view. #38011
Fix the issue of not authenticating resources when used in TVF. #36928
Credits
Thanks to everyone who contributed to this release:
@133tosakarin, @924060929, @AshinGau, @Baymine, @BePPPower, @BiteTheDDDDt, @ByteYue, @CalvinKirs, @Ceng23333, @DarvenDuan, @FreeOnePlus, @Gabriel39, @HappenLee, @JNSimba, @Jibing-Li, @KassieZ, @Lchangliang, @LiBinfeng-01, @Mryange, @SWJTU-ZhangLei, @TangSiyang2001, @Tech-Circle-48, @Vallishp, @Yukang-Lian, @Yulei-Yang, @airborne12, @amorynan, @bobhan1, @cambyzju, @cjj2010, @csun5285, @dataroaring, @deardeng, @eldenmoon, @englefly, @feiniaofeiafei, @felixwluo, @freemandealer, @gavinchou, @ghkang98, @hello-stephen, @hubgeter, @hust-hhb, @jacktengg, @kaijchen, @kaka11chen, @keanji-x, @liaoxin01, @liutang123, @luwei16, @luzhijing, @lxr599, @morningman, @morrySnow, @mrhhsg, @mymeiyi, @platoneko, @qidaye, @qzsee, @seawinde, @shuke987, @sollhui, @starocean999, @suxiaogang223, @w41ter, @wangbo, @wangshuo128, @whutpencil, @wsjz, @wuwenchi, @wyxxxcat, @xiaokang, @xiedeyantu, @xinyiZzz, @xy720, @xzj7019, @yagagagaga, @yiguolei, @yujun777, @z404289981, @zclllyybb, @zddr, @zfr9527, @zhangbutao, @zhangstar333, @zhannngchen, @zhiqiang-hhhh, @zjj, @zy-kkk, @zzzxl1993