Dear community members, Apache Doris 3.0.1 was officially released on August 23, 2024. This release brings updates and improvements to compute-storage decoupling, the lakehouse, semi-structured data analysis, asynchronous materialized views, and more.

Quick Download: https://doris.apache.org/download/

GitHub Release: https://github.com/apache/doris/releases

Behavior Changes

Query Optimizer

  • Added the variable use_max_length_of_varchar_in_ctas to control the length behavior of VARCHAR type when executing CREATE TABLE AS SELECT (CTAS) operations. #37069

    • This variable is set to true by default.

    • When set to true, if the VARCHAR type column originates from a table, the derived length is used; otherwise, the maximum length is used.

    • When set to false, the VARCHAR type will always use the derived length.

  • All data types will now be displayed in lowercase to maintain compatibility with MySQL format. #38012

  • Multiple query statements in the same query request must now be separated by semicolons. #38670
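
As a rough illustration of the CTAS variable and the semicolon rule above, the following sketch assumes a hypothetical source table src_orders with a VARCHAR column note. The table names and the replication_num property are placeholders, not part of the release.

```sql
-- Statements sent in one request must each end with a semicolon.
-- With false, every VARCHAR column uses its derived length.
-- With true (the default), only VARCHAR columns that come from a table
-- use the derived length; other VARCHAR columns use the maximum length.
SET use_max_length_of_varchar_in_ctas = false;

-- src_orders and orders_copy are hypothetical names used only for illustration.
CREATE TABLE orders_copy
PROPERTIES ("replication_num" = "1")
AS
SELECT order_id, note, CONCAT(note, '-x') AS note_tagged
FROM src_orders;

-- Inspect the VARCHAR lengths chosen for note (a table column)
-- and note_tagged (a computed expression).
DESC orders_copy;
```

Running the same CTAS under both settings and comparing the DESC output is one way to see which columns are affected.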

Query Execution

  • The default number of parallel tasks after shuffle operations in the cluster is set to 100, which will improve query stability and concurrent processing capability in large clusters. #38196

Storage

  • The default value of trash_file_expire_time_sec has been changed from 86400 seconds to 0 seconds, which means that if files are deleted by mistake and the FE trash is cleared, the data cannot be recovered.

  • The table attribute enable_mow_delete_on_delete_predicate (introduced in version 3.0.0) has been renamed to enable_mow_light_delete.

  • Explicit transactions are now prohibited from performing delete operations on tables with written data.

  • Heavy schema change operations are prohibited on tables with auto-increment fields.
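
For the renamed table attribute, a minimal sketch of a merge-on-write table that sets it at creation time might look like the following. The table name, columns, and bucket settings are hypothetical, and it is an assumption here that the attribute is accepted in the CREATE TABLE PROPERTIES clause.

```sql
-- events_mow is a hypothetical table name.
-- enable_mow_light_delete was named enable_mow_delete_on_delete_predicate in 3.0.0.
CREATE TABLE events_mow (
    id BIGINT,
    payload VARCHAR(64)
)
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
    "replication_num" = "1",
    "enable_unique_key_merge_on_write" = "true",
    "enable_mow_light_delete" = "true"
);
```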

New Features

Job Scheduling

  • Optimized the execution logic of internal scheduling jobs, decoupling the strong association between start time and immediate execution parameters. Now, tasks can be created with a specified start time or selected for immediate execution, without conflict, enhancing scheduling flexibility. #36805

Compute-Storage Decoupled

  • Supports dynamic modification of the upper limit for file cache usage. #37484

  • Recycler now supports object storage rate limiting and server-side rate limiting retry functionality. #37663 #37680

Lakehouse

  • Added the session variable serde_dialect to set the output format for complex types. #37039

  • SQL interception now supports external tables.

  • Insert overwrite now supports Iceberg tables. #37191
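
The two Lakehouse items above can be sketched as follows. The catalog, database, and table names are hypothetical, and "presto" is assumed to be one of the accepted values of serde_dialect; check the session variable documentation for the exact list.

```sql
-- Switch the output format of complex types for the current session.
-- "presto" is an assumed value; the default is assumed to be "doris".
SET serde_dialect = "presto";

-- iceberg_ctl, db1, sales, and staging_sales are hypothetical names.
INSERT OVERWRITE TABLE iceberg_ctl.db1.sales
SELECT * FROM internal.db1.staging_sales;
```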

Asynchronous Materialized Views

  • Supports partition roll-up and build at the hourly level. #37678

  • Supports atomic replacement of asynchronous materialized view definition statements. #36749

  • Transparent rewriting now supports Insert statements. #38115

  • Transparent rewriting now supports the VARIANT type. #37929
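
A possible shape for an hourly-partitioned asynchronous materialized view is sketched below. It assumes a hypothetical base table page_views that is range-partitioned on the DATETIME column event_time; the view name, refresh settings, and properties are illustrative choices only.

```sql
-- page_views and mv_page_views_hourly are hypothetical names.
CREATE MATERIALIZED VIEW mv_page_views_hourly
BUILD DEFERRED REFRESH AUTO ON MANUAL
-- Roll the base table partitions up to hourly partitions.
PARTITION BY (date_trunc(event_time, 'hour'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES ("replication_num" = "1")
AS
SELECT event_time, page_id, COUNT(*) AS views
FROM page_views
GROUP BY event_time, page_id;
```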

Query Execution

  • The group concat function now supports DISTINCT and ORDER BY options. #38744
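
As a brief sketch of the new options, assuming a hypothetical table employees with columns dept and name:

```sql
-- Concatenate each distinct name once per department.
SELECT dept, GROUP_CONCAT(DISTINCT name) AS distinct_names
FROM employees
GROUP BY dept;

-- Concatenate names in descending order, separated by a comma and a space.
SELECT dept, GROUP_CONCAT(name, ', ' ORDER BY name DESC) AS ordered_names
FROM employees
GROUP BY dept;
```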

Semi-Structured Data Management

  • The ES Catalog now maps nested or object types in Elasticsearch to the JSON type in Doris. #37101

  • Added the MULTI_MATCH function, which supports matching keywords across multiple fields and can leverage inverted indexes to accelerate searches. #37722

  • Added the explode_json_object function, which can unfold objects in JSON data into multiple rows. #36887

  • Inverted indexes now support memtable advancement, requiring index construction only once during multi-replica writes, reducing CPU consumption and improving performance. #35891

  • Added MATCH_PHRASE support for positive slop, e.g., msg MATCH_PHRASE 'a b 2+' matches text in which the words a and b appear within a slop of at most two and a precedes b; regular slop without the trailing + does not guarantee this order. #36356
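
To make the explode_json_object item more concrete, the sketch below assumes a hypothetical table json_docs with an id column and a JSON column doc, and assumes the function is used as a table function through LATERAL VIEW, producing one key column and one value column.

```sql
-- json_docs, doc, k, and v are hypothetical names.
-- Each key/value pair of the JSON object in doc becomes its own output row.
SELECT id, k, v
FROM json_docs
LATERAL VIEW explode_json_object(doc) tmp AS k, v;
```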

Other

  • Added the FE parameter skip_audit_user_list, where user operations specified in this configuration will not be recorded in the audit log. #38310

    • For more information, refer to the documentation on Audit Plugin.

Improvements

Storage

  • Reduced the likelihood of write failures caused by disk balancing within a single BE. #38000

  • Decreased memory consumption by the memtable limiter. #37511

  • Moved old partitions to the FE trash during partition replacement operations. #36361

  • Optimized memory consumption during compaction. #37099

  • Added a session variable to control audit logs for JDBC PreparedStatement, with default setting to not print. #38419

  • Optimized the logic for selecting BEs for group commits. #35558

  • Improved the performance of column updates. #38487

  • Optimized the use of delete bitmap cache. #38761

  • Added a configuration to control query affinity during hot and cold tiering. #37492

Compute-Storage Decoupled

  • Implemented automatic retries when encountering object storage server rate limiting. #37199

  • Adapted the number of threads for memtable flush in the compute-storage decoupled mode. #38789

  • Added Azure as a compile option to support compilation in environments without Azure support.

  • Optimized the observability of object storage access rate limiting. #38294

  • Allowed the file cache TTL queue to perform LRU eviction, enhancing TTL queue usability. #37312

  • Reduced the number of edit log write I/O operations triggered by balancing in the compute-storage decoupled mode. #37787

  • Improved table creation speed in the compute-storage decoupled mode by sending tablet creation requests in batches. #36786

  • Mitigated read failures caused by potential inconsistencies in the local file cache by adding backoff retries. #38645

Lakehouse

  • Optimized memory statistics for Parquet/ORC format read and write operations. #37234

  • Trino Connector Catalog now supports predicate pushdown. #37874

  • Added a session variable enable_count_push_down_for_external_table to control whether to enable count(*) pushdown optimization for external tables. #37046

  • Optimized the read logic for Hudi snapshot reads, returning an empty set when the snapshot is empty, consistent with Spark behavior. #37702

  • Improved the read performance of partition columns for Hive tables. #37377
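
The new session variable can be toggled like any other session variable. In the sketch below the external table name is hypothetical.

```sql
-- Enable count(*) pushdown for external tables in this session.
SET enable_count_push_down_for_external_table = true;

-- hive_ctl.db1.orders is a hypothetical external table.
SELECT COUNT(*) FROM hive_ctl.db1.orders;
```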

Asynchronous Materialized Views

  • Improved transparent rewrite plan speed by 20%. #37197

  • Eliminated roll-up during transparent rewrite if the group key satisfies data uniqueness for better nested matching. #38387

  • Transparent rewrite now performs better aggregation elimination to improve the matching success rate of nested materialized views. #36888

MySQL Compatibility

  • Now correctly populates the database name, table name, and original name in the MySQL protocol result columns. #38126

  • Supported the hint format /*+ func(value) */. #37720

Query Optimizer

  • Significantly improved the plan speed for complex queries. #38317

  • Adaptively chose whether to perform bucket shuffle based on the number of data buckets to avoid performance degradation in extreme cases. #36784

  • Optimized the cost estimation logic for SEMI / ANTI JOIN. #37951 #37060

  • Supported pushing Limit down to the first stage of aggregation to improve performance. #34853

  • Partition pruning now supports filter conditions containing the date_trunc or date function. #38025 #38743

  • SQL cache now supports query scenarios that include user variables. #37915

  • Optimized error messages for invalid aggregation semantics. #38122
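
Two of the optimizer items above can be tried with queries of the following shape. The table sales, partitioned on the DATETIME column sale_time, and the amount column are hypothetical.

```sql
-- A filter wrapped in date_trunc can now take part in partition pruning;
-- EXPLAIN shows which partitions the scan actually selects.
EXPLAIN
SELECT COUNT(*)
FROM sales
WHERE date_trunc(sale_time, 'month') = '2024-08-01 00:00:00';

-- Queries that reference user variables are now eligible for the SQL cache.
SET @min_amount = 100;
SELECT COUNT(*) FROM sales WHERE amount > @min_amount;
```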

Query Execution

  • Adapted AggState compatibility from 2.1 to 3.x and fixed Coredump issues. #37104

  • Refactored the strategy selection for local shuffle without Join. #37282

  • Modified the scanner for internal table queries to be asynchronous to prevent stalling during such queries. #38403

  • Optimized the block merge process during Hash table construction for Join operators. #37471

  • Optimized the duration of lock holding for MultiCast. #37462

  • Optimized gRPC keepAliveTime and added link monitoring to reduce the probability of query failure due to RPC errors. #37304

  • Cleaned up all dirty pages in jemalloc when memory limits were exceeded. #37164

  • Optimized the processing performance of aes_encrypt/decrypt functions for constant types. #37194

  • Optimized the processing performance of the json_extract function for constant data. #36927

  • Optimized the processing performance of the ParseUrl function for constant data. #36882

Semi-Structured Data Management

  • Bitmap indexes now default to using inverted indexes, with enable_create_bitmap_index_as_inverted_index set to true by default. #36692

  • In the compute-storage decoupled mode, DESC can now view sub-columns of VARIANT type. #38143

  • Removed the step of checking file existence during inverted index queries to reduce access latency to remote storage. #36945

  • Complex types ARRAY / MAP / STRUCT now support replace_if_not_null for AGG tables. #38304

  • Escape characters for JSON data are now supported. #37176 #37251

  • Inverted index queries now behave consistently on MOW tables and DUP tables. #37428

  • Optimized the performance of inverted index acceleration for IN queries. #37395

  • Reduced unnecessary memory allocation during TOPN queries to improve performance. #37429

  • When creating an inverted index with tokenization, the support_phrase option is now automatically enabled to accelerate match_phrase series phrase queries. #37949
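
As a minimal sketch of the tokenized inverted index case, assuming a hypothetical log table and the english parser:

```sql
-- logs and idx_msg are hypothetical names.
-- Because a parser is specified, support_phrase is enabled automatically in this release.
CREATE TABLE logs (
    ts DATETIME,
    msg STRING,
    INDEX idx_msg (msg) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY HASH(ts) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Phrase query that can use the inverted index.
SELECT * FROM logs WHERE msg MATCH_PHRASE 'connection timeout';
```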

Other

  • The audit log can now record SQL types. #37790

  • Added support for information_schema.processlist to show sessions from all FEs. #38701

  • Cached Ranger's datamask and rowpolicy to speed up queries. #37723

  • Optimized metadata management in job manager to release locks immediately after modifying metadata, reducing lock holding time. #38162

Bug Fixes

Upgrade

  • Fix the issue where mtmv load fails during upgrade from version 2.1. #38799

  • Resolve the issue where null_type cannot be found during the upgrade to version 2.1. #39373

  • Address the compatibility issue with permission persistence during the upgrade from version 2.1 to 3.0. #39288

Load

  • Fix the issue where parsing fails when the newline character is surrounded by delimiters in CSV format parsing. #38347

  • Resolve potential exception issues when FE forwards group commit. #38228 #38265

  • Group commit now supports the new optimizer. #37002

  • Fix the issue where group commit reports data errors when JDBC setNull is used. #38262

  • Optimize the retry logic for group commit when encountering delete bitmap lock errors. #37600

  • Resolve the issue where routine load cannot use CSV delimiters and escape characters. #38402

  • Fix the issue where routine load job names with mixed case cannot be displayed. #38523

  • Optimize the logic for actively recovering routine load during FE master-slave switching. #37876

  • Resolve the issue where routine load pauses when all data in Kafka is expired. #37288

  • Fix the issue where show routine load returns empty results. #38199

  • Resolve the memory leak issue during multi-table stream import in routine load. #38255

  • Fix the issue where stream load does not return the error URL. #38325

  • Resolve potential load channel leak issues. #38031 #37500

  • Fix the issue where no error may be reported when importing fewer segments than expected. #36753

  • Resolve the load stream leak issue. #38912

  • Optimize the impact of offline nodes on import operations. #38198

  • Fix the issue where transactions do not end when inserting into empty data. #38991

Storage

01 Backup and Restoration

  • Fix the issue where tables cannot be written after backup and restoration. #37089

  • Resolve the issue where view database names are incorrect after backup and restoration. #37412

02 Compaction

  • Fix the issue where cumulative compaction handles deletes incorrectly during ordered data compaction. #38742

  • Resolve the issue of duplicate keys in aggregate tables caused by the ordered data compaction optimization. #38224

  • Fix the issue where compaction causes a coredump on large wide tables. #37960

  • Resolve the compaction starvation issue caused by inaccurate statistics on concurrent compaction tasks. #37318

03 MOW Unique Key

  • Resolve the issue of inconsistent data between replicas caused by cumulative compaction removing the delete sign. #37950

  • MOW delete now uses partial column updates with the new optimizer. #38751

  • Fix the potential duplicate key issue in MOW tables under compute-storage decoupled. #39018

  • Resolve the issue where MOW unique and duplicate tables cannot modify column order. #37067

  • Fix the potential data correctness issue caused by segcompaction. #37760

  • Resolve the potential memory leak issue during column updates. #37706

04 Other

  • Fix the small probability of exceptions in TOPN queries. #39119 #39199

  • Resolve the issue where auto-increment IDs may duplicate during FE restart. #37306

  • Fix the potential queuing issue in the delete operation priority queue. #37169

  • Optimize the delete retry logic. #37363

  • Resolve the issue with bucket = 0 in table creation statements under the new optimizer. #38971

  • Fix the issue where FE reports success incorrectly when image generation fails. #37508

  • Resolve the issue where using the wrong node name when taking an FE node offline may leave the FE membership inconsistent. #37987

  • Fix the issue where CCR partition addition may fail. #37295

  • Resolve the int32 overflow issue in inverted index files. #38891

  • Fix the issue where TRUNCATE TABLE failure may cause BE to fail to go offline. #37334

  • Resolve the issue where publish cannot continue due to null pointers. #37724 #37531

  • Fix the potential coredump issue when manually triggering disk migration. #37712

Compute-Storage Decoupled

  • Fixed the issue where show create table might display the file_cache_ttl_seconds attribute twice. #38052

  • Fixed the issue where segment Footer TTL was not set correctly after setting file cache TTL. #37485

  • Fixed the issue where file cache might cause coredump due to massive conversion of cache types. #38518

  • Fixed the potential file descriptor (fd) leak in file cache. #38051

  • Fixed the issue where schema change Job overwriting compaction Job prevented base tablet compaction from completing normally. #38210

  • Fixed the potential inaccuracy of base compaction score due to data race. #38006

  • Fixed the issue where error messages from imports might not be uploaded correctly to object storage. #38359

  • Fixed the inconsistency in returned information between the compute-storage decoupled mode and the compute-storage integrated mode for 2PC imports. #38076

  • Fixed the issue where an incorrect file size setting during file cache warm-up led to a coredump. #38939

  • Fixed the issue where partial column updates did not correctly dequeue delete operations. #37151

  • Fixed compatibility issues with permission persistence in compute-storage decoupled mode. #38136 #37708

  • Fixed the issue where observer did not retry correctly when encountering a -230 error. #37625

  • Fixed the issue where show load with conditions did not perform correct analysis. #37656

  • Fixed the issue where show streamload in compute-storage decoupled mode caused BE coredump. #37903

  • Fixed the issue where copy into did not correctly verify column names in strict mode. #37650

  • Fixed the issue where multi-stream imports into a single table lacked permissions. #38878

  • Fixed the potential overflow issue in getVersionUpdateTimeMs. #38074

  • Fixed the issue where FE azure blob list was not implemented correctly. #37986

  • Fixed the issue where inaccurate azure blob recycling time calculation prevented recycling. #37535

  • Fixed the issue where inverted index files were not deleted in compute-storage decoupled mode. #38306

Lakehouse

  • Fixed the issue with reading binary data from Oracle Catalog. #37078

  • Fixed the potential deadlock issue when acquiring external table metadata in multi-FE scenarios. #37756

  • Fixed the issue where JNI scanner failure caused BE nodes to crash. #37697

  • Fixed the issue with slow reading of date types from Trino Connector Catalog. #37266

  • Optimized kerberos authentication logic for Hive Catalog. #37301

  • Fixed the issue where region attributes might be parsed incorrectly when parsing MinIO properties. #37249

  • Fixed the issue where creating too many FileSystems by FE caused memory leaks. #36954

  • Fixed the issue with reading incorrect time zone information from Paimon. #37716

  • Fixed the potential thread leak issue caused by Hive write-back operations. #36990

  • Fixed the null pointer issue caused by enabling Hive metastore event synchronization. #38421

  • Fixed the issue where error messages were unclear or caused stalling when creating catalogs. #37551

  • Fixed the issue where reading Hive text format tables behaved differently from Hive. #37638

  • Fixed the logic error when switching between catalogs and databases. #37828

MySQL Compatibility

  • Fixed the issue where certain flags in the MySQL protocol were set incorrectly when SSL was enabled. #38086

Asynchronous Materialized Views

  • Fixed the issue where construction might fail when the base table had a very large number of partitions. #37589

  • Fixed the issue where nested materialized views incorrectly performed full table refreshes even when partition refreshes were possible. #38698

  • Fixed the issue where partition refresh could not handle the simultaneous existence of valid and invalid dependencies when analyzing partition dependencies. #38367

  • Fixed the issue where the final result containing NULL type might cause asynchronous materialized views to fail. #37019

  • Fixed the planning error that might occur during transparent rewriting when both synchronous and asynchronous materialized views with the same name were present. #37311

Synchronous Materialized Views

  • Queries rewritten to use synchronous materialized views can now correctly perform partition pruning. #38527

  • When rewriting synchronous materialized views, those with unready data are no longer selected. #38148

Query Optimizer

  • Fixed the deadlock issue that might occur when queries and delete operations are performed simultaneously. #38660

  • Fixed the issue where bucket pruning might incorrectly prune on decimal column buckets. #37889

  • Fixed the issue where planning might be incorrect when mark join participates in join reorder. #39152

  • Fixed the issue where the result is incorrect when the correlation condition of a correlated subquery is not a simple column. #37644

  • Fixed the issue where partition pruning cannot correctly handle or expressions. #38897

  • Fixed the planning error that might occur when optimizing the execution order of JOIN and AGG. #37343

  • Fixed the issue where str_to_date performs incorrect constant folding calculations on datev1 types. #37360

  • Fixed the issue where the ACOS function’s constant folding returns non-NaN values. #37932

  • Fixed the occasional planning error: “The children format needs to be [WhenClause+, DefaultValue?]”. #38491

  • Fixed the issue where planning might be incorrect when the projection includes window functions and there is both the original column and its alias. #38166

  • Fixed the issue where planning might report an error when the aggregation parameter contains a lambda expression. #37109

  • Fixed the insert error that might occur in extreme cases: “MultiCastDataSink cannot be cast to DataStreamSink”. #38526

  • Fixed the issue where the new optimizer does not correctly handle char(0)/varchar(0) when creating a table. #38427

  • Fixed the incorrect behavior of char(255) toSql. #37340

  • Fixed the issue where the nullable attribute within the agg_state type might lead to planning errors. #37489

  • Fixed the issue where row count statistics are inaccurate during mark Join. #38270

Query Execution

  • Fixed issues where the Pipeline execution engine was stuck, causing queries to not end, in multiple scenarios. #38657, #38206, #38885, #38151, #37297

  • Fixed the coredump issue caused by NULL and non-NULL columns during set difference calculations. #38750

  • Fixed the error when using the DECIMAL type with pure decimals in delete statements. #37801

  • Fixed the issue where the width_bucket function returned incorrect results. #37892

  • Fixed the query error when a single row of data was very large and the result set was also large (exceeding 2GB). #37990

  • Fixed the coredump issue caused by incorrect release of rpc connections during single-replica imports. #38087

  • Fixed the coredump issue caused by processing NULL values with the foreach function. #37349

  • Fixed the issue where stddev returned incorrect results for DECIMALV2 types. #38731

  • Fixed the slow performance of bitmap union calculations. #37816

  • Fixed the issue where RowsProduced for aggregation operators was not set in the profile. #38271

  • Fixed the overflow issue when calculating the number of buckets for the hash table under hash join. #37193, #37493

  • Fixed the inaccurate recording of the jemalloc cache memory tracker. #37464

  • Added the enable_stacktrace configuration option, allowing users to control whether exception stacks are output in BE logs. #37713

  • Fixed the issue where Arrow Flight SQL did not work correctly when enable_parallel_result_sink was set to false. #37779

  • Fixed the incorrect use of colocate Join. #37361, #37729

  • Fixed the calculation overflow issue of the round function on DECIMAL128 types. #37733, #38106

  • Fixed the coredump issue when passing a const string to the sleep function. #37681

  • Increased the queue length for audit logs, solving the issue where audit logs could not be recorded normally under high concurrency scenarios with thousands of concurrent connections. #37786

  • Fixed the issue where creating a workload group caused too many threads, leading to BE coredump. #38096

  • Fixed the coredump issue caused by the MULTI_MATCH_ANY function. #37959

  • Fixed the transaction rollback issue caused by insert overwrite auto partition. #38103

  • Fixed the issue where the TimeUtils formatter did not use the correct time zone. #37465

  • Fixed the issue where results were incorrect under constant folding scenarios for week/yearweek. #37376

  • Fixed the issue where the convert_tz function returned incorrect results. #37358, #38764

  • Fixed the coredump issue when using the collect_set function with window functions. #38234

  • Fixed the coredump issue caused by percentile_approx during rolling upgrades. #39321

  • Fixed the coredump issue caused by the mod function when encountering abnormal input. #37999

  • Fixed the issue where the hash table was not fully built when the broadcast join probe started running. #37643

  • Fixed the issue where executing the same expression in multithreaded environments might lead to incorrect results for Java UDFs. #38612

  • Fixed the overflow issue caused by incorrect return types of the conv function. #38001

  • Fixed the issue where the json_replace function returned incorrect types. #3701

  • Fixed the issue where the nullable attribute setting was unreasonable for the percentile aggregation function. #37330

  • Fixed the issue where the results of the histogram function were unstable. #38608

  • Fixed the issue where task state was displayed incorrectly in the profile. #38082

  • Fixed the issue where some queries were incorrectly canceled when the system just started. #37662

Semi-Structured Data Management

  • Fix some issues with time series compaction. #39170 #39176

  • Fix the issue of incorrect index size statistics during compaction. #37232

  • Fix the potential incorrect matching of ultra-long strings without tokenization in inverted indexes. #37679 #38218

  • Fix the high memory usage issue of array_range and array_with_const functions when dealing with large data volumes. #38284 #37495

  • Fix the potential coredump issue when selecting columns of ARRAY / MAP / STRUCT types. #37936

  • Fix the import failure issue caused by simdjson parsing errors when specifying jsonpath in Stream Load. #38490

  • Fix the exception handling issue when there are duplicate keys in JSON data. #38146

  • Fix the potential query error after DROP INDEX. #37646

  • Fix the incorrect error return in row merging checks during index compaction. #38732

  • Inverted index v2 format now supports renaming columns. #38079

  • Fix the coredump issue when the MATCH function matches an empty string without an index. #37947

  • Fix the handling of NULL values in inverted indexes. #37921 #37842 #38741

  • Fix the incorrect row_store_page_size after FE restart. #38240

Other

  • Fix the timezone configuration issue. The default timezone is no longer fixed at UTC+8 and is now obtained from system configuration. #37294

  • Fix the class conflict issue when using ranger due to multiple JSR specification implementations. #37575

  • Fix the potential uninitialized field issue in some BE code. #37403

  • Fix the error in delete statements for random distributed tables. #37985

  • Fix the incorrect requirement for alter_priv permission on the base table when creating a synchronized materialized view. #38011

  • Fix the issue of not authenticating resources when used in TVF. #36928

Credits

Thanks to everyone who contributed to this release:

@133tosakarin, @924060929, @AshinGau, @Baymine, @BePPPower, @BiteTheDDDDt, @ByteYue, @CalvinKirs, @Ceng23333, @DarvenDuan, @FreeOnePlus, @Gabriel39, @HappenLee, @JNSimba, @Jibing-Li, @KassieZ, @Lchangliang, @LiBinfeng-01, @Mryange, @SWJTU-ZhangLei, @TangSiyang2001, @Tech-Circle-48, @Vallishp, @Yukang-Lian, @Yulei-Yang, @airborne12, @amorynan, @bobhan1, @cambyzju, @cjj2010, @csun5285, @dataroaring, @deardeng, @eldenmoon, @englefly, @feiniaofeiafei, @felixwluo, @freemandealer, @gavinchou, @ghkang98, @hello-stephen, @hubgeter, @hust-hhb, @jacktengg, @kaijchen, @kaka11chen, @keanji-x, @liaoxin01, @liutang123, @luwei16, @luzhijing, @lxr599, @morningman, @morrySnow, @mrhhsg, @mymeiyi, @platoneko, @qidaye, @qzsee, @seawinde, @shuke987, @sollhui, @starocean999, @suxiaogang223, @w41ter, @wangbo, @wangshuo128, @whutpencil, @wsjz, @wuwenchi, @wyxxxcat, @xiaokang, @xiedeyantu, @xinyiZzz, @xy720, @xzj7019, @yagagagaga, @yiguolei, @yujun777, @z404289981, @zclllyybb, @zddr, @zfr9527, @zhangbutao, @zhangstar333, @zhannngchen, @zhiqiang-hhhh, @zjj, @zy-kkk, @zzzxl1993