Dear community, Apache Doris version 2.1.6 was officially released on September 10, 2024. This version brings continuous upgrades and improvements to the Lakehouse, Async Materialized Views, and Semi-Structured Data Management. Additionally, several fixes have been implemented in areas such as the query optimizer, execution engine, storage management, and permission management.

Quick Download: https://doris.apache.org/download/

GitHub Release: https://github.com/apache/doris/releases

Behavior changes

  • Removed the delete_if_exists option from create repository. #38192

  • Added the enable_prepared_stmt_audit_log session variable to control whether JDBC prepared statements record audit logs, with the default being no recording. #38624 #39009

  • Implemented fd limit and memory constraints for segment cache. #39689

  • When the FE configuration item sys_log_mode is set to BRIEF, file location information is added to the logs. #39571

  • Changed the default value of the session variable max_allowed_packet to 16MB. #38697

  • When a single request contains multiple statements, semicolons must be used to separate them. #38670

  • Added support for statements to begin with a semicolon. #39399

  • Aligned type formatting with MySQL in statements such as show create table. #38012

  • When planning in the new optimizer times out, it no longer falls back to the old optimizer, which avoids the even longer planning time the old optimizer might need. #39499
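
The two semicolon rules above (separators are required between statements, and a statement may begin with a semicolon) can be pictured with a small sketch. This is an illustration only, not Doris's parser: the function name split_statements is hypothetical, dropping empty pieces is an assumption that models the leading-semicolon behavior, and escaped quotes inside string literals are not handled.

```python
def split_statements(request: str) -> list[str]:
    """Split a multi-statement request on semicolons outside quoted strings.

    Hypothetical helper for illustration; not the Doris implementation.
    """
    statements = []
    current = []
    quote = None  # the quote character we are currently inside, if any
    for ch in request:
        if quote:
            current.append(ch)
            if ch == quote:
                quote = None  # closing quote reached
        elif ch in ("'", '"', "`"):
            quote = ch
            current.append(ch)
        elif ch == ";":
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    statements.append("".join(current).strip())
    # Assumption: empty pieces (from leading, trailing, or doubled
    # semicolons) are simply ignored.
    return [s for s in statements if s]


if __name__ == "__main__":
    print(split_statements("; select 1; select 'a;b'"))
```

In this sketch a semicolon inside a quoted literal does not end the statement, and the leading semicolon produces no statement of its own.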

New features

Lakehouse

  • Supported writeback for Iceberg tables.

  • SQL interception rules now support external tables.

  • Added the system table file_cache_statistics to view BE data cache metrics.

Async Materialized View

  • Supported transparent rewriting during inserts. #38115

  • Supported transparent rewriting when variant types exist in queries. #37929

Semi-Structured Data Management

  • Supported casting ARRAY and MAP to JSON type. #36548

  • Supported the json_keys function. #36411

  • Supported specifying the JSON path $. when importing JSON. #38213

  • ARRAY, MAP, STRUCT types now support replace_if_not_null. #38304

  • ARRAY, MAP, STRUCT types now support adjusting column order. #39210

  • Added the multi_match function to match keywords across multiple fields, with support for inverted index acceleration. #37722

Query Optimizer

  • Filled in the original database name, table name, column name, and alias for returned columns in the MySQL protocol. #38126

  • Supported the aggregation function group_concat with both order by and distinct simultaneously. #38080

  • SQL cache now supports reusing cached results for queries with different comments. #40049

  • In partition pruning, supported including date_trunc and date functions in filter conditions. #38025 #38743

  • Allowed using the database name where the table resides as a qualifier prefix for table aliases. #38640

  • Supported hint-style comments. #39113
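
One way to picture SQL cache reuse across queries that differ only in comments is a cache key computed after the comments are removed. The sketch below is an assumption about the general idea, not the Doris implementation: the name cache_key is hypothetical, keeping hint-style comments (/*+ ... */) in the key is an assumption, and comment markers inside string literals are not handled.

```python
import re

# Block comments, except hint-style ones that start with "/*+".
_BLOCK_COMMENT = re.compile(r"/\*(?!\+).*?\*/", re.DOTALL)
# Line comments introduced by "--" up to the end of the line.
_LINE_COMMENT = re.compile(r"--[^\n]*")


def cache_key(sql: str) -> str:
    """Return a normalized form of the query text to use as a cache key.

    Hypothetical helper for illustration only.
    """
    without_block = _BLOCK_COMMENT.sub(" ", sql)
    without_line = _LINE_COMMENT.sub(" ", without_block)
    # Collapse whitespace so removed comments leave no trace in the key.
    return " ".join(without_line.split())


if __name__ == "__main__":
    a = cache_key("select 1 /* from dashboard A */ from t")
    b = cache_key("select 1 -- from dashboard B\nfrom t")
    print(a == b)
```

Under these assumptions, two queries that differ only in ordinary comments map to the same key, while a query carrying a hint keeps a distinct key.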

Others

  • Added the system table table_properties for viewing table properties.

  • Introduced deadlock and slow lock detection in FE.

Improvements

Lakehouse

  • Reimplemented the external table metadata caching mechanism.

  • Added the session variable keep_carriage_return with a default value of false. By default, reading Hive Text format tables treats both \r\n and \n as newline characters. #38099

  • Optimized memory statistics for Parquet/ORC file read/write operations. #37257

  • Supported pushing down IN/NOT IN predicates for Paimon tables. #38390

  • Enhanced the optimizer to support Time Travel syntax for Hudi tables. #38591

  • Optimized Kerberos authentication-related processes. #37301

  • Enabled reading Hive tables after renaming column operations. #38809

  • Optimized the reading performance of partition columns for external tables. #38810

  • Improved the data shard merging strategy during external table query planning to avoid performance degradation caused by a large number of small shards. #38964

  • Added attributes such as location to SHOW CREATE DATABASE/TABLE. #39644

  • Supported complex types in MaxCompute Catalog. #39822

  • Optimized the file cache loading strategy by using asynchronous loading to avoid long BE startup times. #39036

  • Improved the file cache eviction strategy, such as evicting locks held for extended periods. #39721
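
The effect of keep_carriage_return on Hive Text rows can be sketched as follows. The default behavior (both \r\n and \n end a row) comes from the note above; the behavior when the variable is true (only \n ends a row and a trailing \r stays in the data) is an assumption inferred from the variable name, and split_text_rows is a hypothetical helper, not a Doris function.

```python
def split_text_rows(data: str, keep_carriage_return: bool = False) -> list[str]:
    """Split text-format data into rows.

    Hypothetical helper for illustration only.
    """
    rows = data.split("\n")
    if rows and rows[-1] == "":
        rows.pop()  # a trailing newline does not start a new row
    if not keep_carriage_return:
        # Default: treat "\r\n" like "\n" by dropping the trailing "\r".
        rows = [r[:-1] if r.endswith("\r") else r for r in rows]
    return rows


if __name__ == "__main__":
    sample = "a,1\r\nb,2\n"
    print(split_text_rows(sample))
    print(split_text_rows(sample, keep_carriage_return=True))
```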

Async Materialized View

  • Supported hourly, weekly, and quarterly partition roll-up construction. #37678

  • For materialized views based on Hive external tables, the metadata cache is now updated before refresh to ensure the latest data is obtained during each refresh. #38212

  • Improved the performance of transparent rewrite planning in storage-compute decoupled mode by batch fetching metadata. #39301

  • Enhanced the performance of transparent rewrite planning by prohibiting duplicate enumerations. #39541

  • Improved the performance of transparent rewrite for refreshing materialized views based on Hive external table partitions. #38525
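
Partition roll-up at hourly, weekly, or quarterly granularity amounts to mapping each base-table partition value to the start of its enclosing period. The sketch below shows that truncation in plain Python; it is not Doris code, the function name truncate is hypothetical, and starting weeks on Monday is an assumption.

```python
from datetime import datetime, timedelta


def truncate(ts: datetime, unit: str) -> datetime:
    """Return the start of the hour, week, or quarter containing ts.

    Hypothetical helper for illustration only.
    """
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "week":
        # Assumption: weeks start on Monday (weekday() == 0).
        return day - timedelta(days=day.weekday())
    if unit == "quarter":
        first_month = 3 * ((ts.month - 1) // 3) + 1
        return day.replace(month=first_month, day=1)
    raise ValueError(f"unsupported unit: {unit}")


if __name__ == "__main__":
    ts = datetime(2024, 9, 10, 14, 35)
    for unit in ("hour", "week", "quarter"):
        print(unit, truncate(ts, unit))
```

Base partitions that truncate to the same value would land in the same rolled-up partition.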

Semi-Structured Data Management

  • Optimized memory allocation for TOPN queries to improve performance. #37429

  • Enhanced the performance of string processing in inverted indexes. #37395

  • Optimized the performance of inverted indexes in MOW tables. #37428

  • Supported specifying the row-store page_size during table creation to control compression effectiveness. #37145

Query Optimizer

  • Adjusted the row count estimation algorithm for mark joins, resulting in more accurate cardinality estimates for mark joins. #38270

  • Optimized the cost estimation algorithm for semi/anti joins, enabling more accurate selection of semi/anti join orders. #37951

  • Adjusted the filter estimation algorithm for cases where some columns have no statistical information, leading to more accurate cardinality estimates. #39592

  • Modified the instance calculation logic for set operation operators to prevent insufficient parallelism in extreme cases. #39999

  • Adjusted the usage strategy of bucket shuffle, achieving better performance when data is not sufficiently shuffled. #36784

  • Enabled early filtering of window function data, supporting multiple window functions in a single projection. #38393

  • When a NullLiteral exists in a filter condition, it can now be folded to false and then converted to an EmptySet, reducing unnecessary data scanning and computation. #38135

  • Expanded the scope of predicate derivation, reducing data scanning in queries with specific patterns. #37314

  • Supported partial short-circuit evaluation logic in partition pruning to improve partition pruning performance, achieving over 100% improvement in specific scenarios. #38191

  • Enabled the computation of arbitrary scalar functions within user variables. #39144

  • Maintained error messages consistent with MySQL when alias conflicts exist in queries. #38104
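
The NullLiteral folding mentioned above relies on a standard SQL property: a filter keeps only rows for which the predicate is true, so a conjunct that is NULL rejects every row just as false does. A minimal sketch of that reasoning, with conjuncts modeled as Python values (None for NULL, True/False for boolean literals, strings for non-constant expressions), is shown below. The names fold_filter and EMPTY_SET are hypothetical and this is not the optimizer's actual rule.

```python
from typing import List, Union

Conjunct = Union[bool, None, str]
EMPTY_SET = "EmptySet"  # placeholder for a plan node that returns no rows


def fold_filter(conjuncts: List[Conjunct]) -> Union[str, List[str]]:
    """Fold constant conjuncts of an AND-ed filter.

    Hypothetical helper for illustration only.
    """
    remaining = []
    for c in conjuncts:
        if c is None or c is False:
            # NULL behaves like false in a filter: no row can pass.
            return EMPTY_SET
        if c is True:
            continue  # a true conjunct never filters anything out
        remaining.append(c)
    return remaining


if __name__ == "__main__":
    print(fold_filter(["a > 1", None]))
    print(fold_filter([True, "a > 1"]))
```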

Query Execution

  • Adapted AggState for compatibility from 2.1 to 3.x and fixed coredump issues. #37104

  • Refactored the strategy selection for local shuffle when no joins are involved. #37282

  • Modified the scanner for internal table queries to an asynchronous approach to prevent blocking during internal table queries. #38403

  • Optimized the block merge process when building hash tables in Join operators. #37471

  • Reduced the lock holding time for MultiCast operations. #37462

  • Optimized gRPC’s keepAliveTime and added a connection monitoring mechanism, reducing the probability of query failures due to RPC errors during query execution. #37304

  • Cleaned up all dirty pages in jemalloc when memory limits are exceeded. #37164

  • Improved the performance of aes_encrypt/decrypt functions when handling constant types. #37194

  • Optimized the performance of json_extract functions when processing constant data. #36927

  • Optimized the performance of ParseURL functions when processing constant data. #36882

Backup & Recovery / CCR

  • Restore now supports deleting redundant tablets and partition options. #39363

  • Check storage connectivity when creating a repository. #39538

  • Enables binlog to support DROP TABLE, allowing CCR to incrementally synchronize DROP TABLE operations. #38541

Compaction

  • Fixes the issue where high-priority compaction tasks were not subject to task concurrency control limits. #38189

  • Automatically reduces compaction memory consumption based on data characteristics. #37486

  • Fixes an issue where the sequential data optimization strategy could lead to incorrect data in aggregate tables or MOR UNIQUE tables. #38299

  • Optimizes the rowset selection strategy for compaction during replica replenishment to avoid triggering -235 errors. #39262

MOW (Merge-On-Write)

  • Optimizes slow column updates caused by concurrent column updates and compactions. #38682

  • Fixes an issue where segcompaction during bulk data imports could lead to incorrect MOW data. #38992 #39707

  • Fixes data loss in column updates that may occur after BE restarts. #39035

Storage Management

  • Adds FE configuration to control whether queries under hot-cold tiering prefer local data replicas. #38322

  • Optimizes expired BE report messages to include newly created tablets. #38839 #39605

  • Optimizes replica scheduling priority strategy to prioritize replicas with missing data. #38884

  • Prevents tablets with unfinished ALTER jobs from being balanced. #39202

  • Enables modifying the number of buckets for tables with list partitioning. #39688

  • Prefers querying from online disk services. #39654

  • Improves error messages for materialized view base tables that do not support deletion during synchronization. #39857

  • Improves error messages for single columns exceeding 4GB. #39897

  • Fixes an issue where aborted transactions were omitted when plan errors occurred during INSERT statements. #38260

  • Fixes exceptions during SSL connection closure. #38677

  • Fixes an issue where table locks were not held when aborting transactions using labels. #38842

  • Fixes an issue where gson pretty printing caused oversized images. #39135

  • Fixes an issue where the new optimizer did not check for bucket values of 0 in CREATE TABLE statements. #38999

  • Fixes errors when Chinese column names are included in DELETE condition predicates. #39500

  • Fixes frequent tablet balancing issues in partition balancing mode. #39606

  • Fixes an issue where partition storage policy attributes were lost. #39677

  • Fixes incorrect statistics when importing multiple tables within a transaction. #39548

  • Fixes errors when deleting random bucket tables. #39830

  • Fixes issues where FE fails to start due to non-existent UDFs. #39868

  • Fixes inconsistencies in the last failed version between FE master and slave. #39947

  • Fixes an issue where related tablets may still be in schema change state when schema change jobs are canceled. #39327

  • Fixes errors when modifying type and column order in a single statement schema change (SC). #39107

Data Loading

  • Improves error messages for -238 errors during imports. #39182

  • Allows importing to other partitions while restoring a partition. #39915

  • Optimizes the strategy for FE to select BEs during group commit. #37830 #39010

  • Avoids printing stack traces for some common streamload error messages. #38418

  • Improves handling of cases where offline BEs may cause import errors. #38256

Permissions

  • Optimizes access performance after enabling the Ranger authentication plugin. #38575

  • Optimizes permission strategies for Refresh Catalog/Database/Table operations, allowing users to perform these operations with only SHOW permissions. #39008

Bug fixes

Lakehouse

  • Fixes the issue where switching catalogs may result in an error of not finding the database. #38114

  • Addresses exceptions caused by attempting to read non-existent data on S3. #38253

  • Resolves the issue where specifying an abnormal path during export operations may lead to incorrect export locations. #38602

  • Fixes the timezone issue for time columns in Paimon tables. #37716

  • Temporarily disables the Parquet PageIndex feature to avoid certain erroneous behaviors.

  • Corrects the selection of Backend nodes in the blacklist during external table queries. #38984

  • Resolves errors caused by missing subcolumns in Parquet Struct column types. #39192

  • Addresses several issues with predicate pushdown in JDBC Catalog. #39082

  • Fixes issues where some historical Parquet formats led to incorrect query results. #39375

  • Improves compatibility with ojdbc6 drivers for Oracle JDBC Catalog. #39408

  • Resolves potential FE memory leaks caused by Refresh Catalog/Database/Table operations. #39186 #39871

  • Fixes thread leaks in JDBC Catalog under certain conditions. #39666 #39582

  • Addresses potential event processing failures after enabling Hive Metastore event subscription. #39239

  • Disables reading Hive Text format tables with custom escape characters and null formats to prevent data errors. #39869

  • Resolves issues accessing Iceberg tables created via the Iceberg API under certain conditions. #39203

  • Fixes the inability to read Paimon tables stored on HDFS clusters with high availability enabled. #39876

  • Addresses errors that may occur when reading Paimon table deletion vectors after enabling file caching. #39875

  • Resolves potential deadlocks when reading Parquet files under certain conditions. #39945

Async Materialized View

  • Fixes the inability to use SHOW CREATE MATERIALIZED VIEW on follower FEs. #38794

  • Unifies the object type of asynchronous materialized views in metadata as tables to enable proper display in data tools. #38797

  • Resolves the issue where nested asynchronous materialized views always perform full refreshes. #38698

  • Fixes the issue where canceled tasks may show as running after restarting FEs. #39424

  • Addresses incorrect use of contexts, which may lead to unexpected failures of materialized view refresh tasks. #39690

  • Resolves issues that may cause varchar type write failures due to unreasonable lengths when creating asynchronous materialized views based on external tables. #37668

  • Fixes the potential invalidation of asynchronous materialized views based on external tables after FE restarts or catalog rebuilds. #39355

  • Prohibits the use of partition rollup for materialized views with list partitions to prevent the generation of incorrect data. #38124

  • Fixes incorrect results when literals exist in the select list during transparent rewriting for aggregation rollup. #38958

  • Addresses potential errors during transparent rewriting when queries contain filters like a = a. #39629

  • Fixes issues where transparent rewriting for direct external table queries fails. #39041

Semi-Structured Data Management

  • Removes support for prepared statements in the old optimizer. #39465

  • Fixes issues with JSON escape character handling. #37251

  • Resolves issues with duplicate processing of JSON fields. #38490

  • Fixes issues with some ARRAY and MAP functions. #39307 #39699 #39757

  • Resolves issues with complex combinations of inverted index queries and LIKE queries. #36687

Query Optimizer

  • Fixed the potential partition pruning error issue when the ‘OR’ condition exists in partition filter conditions. #38897

  • Fixed the potential partition pruning error issue when complex expressions are involved. #39298

  • Fixed the issue where nullable in agg_state subtypes might be planned incorrectly, leading to execution errors. #37489

  • Fixed the issue where nullable in set operation operators might be planned incorrectly, leading to execution errors. #39109

  • Fixed the incorrect execution priority issue of intersect operator. #39095

  • Fixed the NPE issue that may occur when the maximum valid date literal exists in the query. #39482

  • Fixed the occasional planning error that results in an illegal slot error during execution. #39640

  • Fixed the issue where repeatedly referencing columns in cte may lead to missing data in some columns in the result. #39850

  • Fixed the occasional planning error issue when ‘case when’ exists in the query. #38491

  • Fixed the issue where IP types cannot be implicitly converted to string types. #39318

  • Fixed the potential planning error issue when using multi-dimensional aggregation and the same column and its alias exist in the select list. #38166

  • Fixed the issue where boolean types might be handled incorrectly when using BE constant folding. #39019

  • Fixed the planning error issue caused by default_cluster: as a prefix for the database name in expressions. #39114

  • Fixed the potential deadlock issue caused by insert into. #38660

  • Fixed the potential planning error issue caused by not holding table locks throughout the planning process. #38950

  • Fixed the issue where CHAR(0), VARCHAR(0) are not handled correctly when creating tables. #38427

  • Fixed the issue where show create table may incorrectly display hidden columns. #38796

  • Fixed the issue where columns with the same name as hidden columns are not prohibited when creating tables. #38796

  • Fixed the occasional planning error issue when executing insert into as select with CTEs. #38526

  • Fixed the issue where insert into values cannot automatically fill null default values. #39122

  • Fixed the NPE issue caused by using cte in delete without using it. #39379

  • Fixed the issue where deleting from a randomly distributed aggregation model table fails. #37985

Query Execution

  • Fixed the issue where the pipeline execution engine gets stuck in multiple scenarios, causing queries not to end. #38657 #38206 #38885

  • Fixed the coredump issue caused by null and non-null columns in set difference calculations. #38737

  • Fixed the incorrect result issue of the width_bucket function. #37892

  • Fixed the query error issue when a single row of data is large and the result set is also large (exceeding 2GB). #37990

  • Fixed the incorrect result issue of stddev with DecimalV2 type. #38731

  • Fixed the coredump issue caused by the MULTI_MATCH_ANY function. #37959

  • Fixed the issue where insert overwrite auto partition causes transaction rollback. #38103

  • Fixed the incorrect result issue of the convert_tz function. #37358 #38764

  • Fixed the coredump issue when using the collect_set function with window functions. #38234

  • Fixed the coredump issue caused by the mod function with abnormal input. #37999

  • Fixed the issue where executing the same expression in multiple threads may lead to incorrect Java UDF results. #38612

  • Fixed the overflow issue caused by the incorrect return type of the conv function. #38001

  • Fixed the unstable result issue of the histogram function. #38608

Backup & Recovery / CCR

  • Fixed the issue where the data version after backup and recovery may be incorrect, leading to unreadability. #38343

  • Fixed the issue of using restore version across versions. #38396

  • Fixed the issue where the job is not canceled when backup fails. #38993

  • Fixed the NPE issue in ccr during the upgrade from 2.1.4 to 2.1.5, which caused the FE to fail to start. #39910

  • Fixed the issue where views and materialized views cannot be used after restoration. #38072 #39848

Storage Management

  • Fixed possible memory leaks in routine load when loading multiple tables from a single stream. #38824

  • Fixed the issue where delimiters and escape characters in routine load were not effective. #38825

  • Fixed incorrect show routine load results when the routine load task name contained uppercase letters. #38826

  • Fixed the issue where the offset cache was not reset when changing the routineload topic. #38474

  • Fixed the potential exception triggered by show routineload under concurrent scenarios. #39525

  • Fixed the issue where routine load might import data repeatedly. #39526

  • Fixed the data error caused by setNull when enabling group commit via JDBC. #38276

  • Fixed the potential NPE issue when enabling group commit insert to a non-master FE. #38345

  • Fixed incorrect error handling during internal data writing in group commit. #38997

  • Fixed the coredump that might be triggered when the group commit execution plan failed. #39396

  • Fixed the issue where concurrent imports into auto partition tables might report non-existent tablets. #38793

  • Fixed potential load stream leakage issues. #39039

  • Fixed the issue where transactions were opened for insert into select with no data. #39108

  • Ignored the single-replica import configuration when using memtable prefetching. #39154

  • Fixed the issue where background imports of stream load records might be abnormally aborted upon encountering db deletion. #39527

  • Fixed inaccurate error messages when data errors occurred in strict mode. #39587

  • Fixed the issue where streamload did not return an error URL upon encountering erroneous data. #38417

  • Fixed the issue with the combined use of insert overwrite and auto partition. #38442

  • Fixed parsing errors when CSV encountered data where the line delimiter was enclosed by the enclosing character. #38445

Data Exporting

  • Fixed the issue where enabling the delete_existing_files property during export operations might result in duplicate deletion of exported data. #39304

Permissions

  • Fixed the incorrect requirement of ALTER TABLE permission when creating a materialized view. #38011

  • Fixed the issue where the db was explicitly displayed as empty when showing routine load. #38365

  • Fixed the incorrect requirement of CREATE permission on the original table when using CREATE TABLE LIKE. #37879

  • Fixed the issue where grant operations did not check if the object existed. #39597

Upgrade suggestions

When upgrading Doris, please follow the principle of not skipping two minor versions and upgrade sequentially.

For example, if you are upgrading from version 0.15.x to 2.0.x, it is recommended to first upgrade to the latest version of 1.1, then upgrade to the latest version of 1.2, and finally upgrade to the latest version of 2.0.
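
The sequential path in the example above can be computed mechanically. The sketch below is a conservative illustration that steps through every release line between the current and the target version; the RELEASE_LINES list is an assumption assembled from the versions named in this note, and upgrade_path is a hypothetical helper, not a Doris tool.

```python
# Assumption: ordered release lines, taken from the versions named in
# these release notes. Adjust to the lines that actually apply.
RELEASE_LINES = ["0.15", "1.1", "1.2", "2.0", "2.1"]


def upgrade_path(current: str, target: str, lines=RELEASE_LINES) -> list[str]:
    """List the release lines to upgrade through, in order.

    Each entry means "upgrade to the latest version of this line".
    Hypothetical helper for illustration only.
    """
    start = lines.index(current)
    end = lines.index(target)
    if end < start:
        raise ValueError("target is older than the current version")
    return lines[start + 1 : end + 1]


if __name__ == "__main__":
    print(upgrade_path("0.15", "2.0"))
```

For the example above, the sketch yields 1.1, then 1.2, then 2.0.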

For more upgrade information, see the documentation: Cluster Upgrade