Thanks to our community users and developers, about 1000 improvements and bug fixes have been made in Doris 2.0.3, covering optimizer statistics, inverted index, complex datatypes, data lake, and replica management.

1 Behavior change

  • The output format of the complex data types ARRAY/MAP/STRUCT has been changed to be consistent with the input format and the JSON specification. The main changes from the previous version are that DATE/DATETIME and STRING/VARCHAR values are enclosed in double quotes, and null values inside ARRAY/MAP are displayed as null instead of NULL.
  • SHOW_VIEW permission is supported. Users with SELECT or LOAD permission will no longer be able to execute the ‘SHOW CREATE VIEW’ statement and must be granted the SHOW_VIEW permission separately.
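
To make the complex-type format change concrete, here is a minimal Python sketch. It assumes only what is stated above: that the new output follows JSON conventions, with strings and dates in double quotes and lowercase null. The sample values are hypothetical, and the exact whitespace Doris emits is not claimed here.

```python
import json

# Hypothetical ARRAY value: date-like strings plus one null element.
array_value = ["2023-11-01", None, "doris"]

# Hypothetical MAP value: string keys with a datetime-like value and a null.
map_value = {"created": "2023-11-01 10:00:00", "deleted": None}


def render(value):
    """Render a nested value as JSON text (strings quoted, None -> null)."""
    return json.dumps(value)


array_text = render(array_value)
map_text = render(map_value)

print(array_text)  # ["2023-11-01", null, "doris"]
print(map_text)    # {"created": "2023-11-01 10:00:00", "deleted": null}
```

Because the text is valid JSON, a client can parse it back with a standard JSON parser instead of a custom one.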

2 New features

2.1 Support automatic statistics collection for the optimizer

Collecting statistics helps the optimizer understand data distribution and choose better plans, which can greatly improve query performance. Automatic collection is officially supported starting from version 2.0.3 and is enabled all day by default.

See more: https://doris.apache.org/docs/query-acceleration/statistics/
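
As a toy illustration of why row-count statistics matter to a planner, the Python sketch below picks the smaller input as the build side of a hash join. This is not Doris's cost model; the `TableStats` type, the `choose_build_side` helper, the table names, and the row counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    row_count: int  # estimated rows, as collected statistics would supply


def choose_build_side(left: TableStats, right: TableStats) -> TableStats:
    """Pick the input with fewer estimated rows as the hash-join build side."""
    return left if left.row_count <= right.row_count else right


# Hypothetical tables with made-up row counts.
orders = TableStats("orders", 50_000_000)
regions = TableStats("regions", 25)

build = choose_build_side(orders, regions)
print(build.name)  # regions
```

Without statistics, a planner has to guess which side is smaller; with them, even this one-line rule avoids building a hash table over the large input.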

2.2 Support complex datatypes for more data lake sources

2.3 Add more builtin functions

3 Improvements and optimizations

3.1 Performance optimizations

  • When an inverted index MATCH condition with a high filter rate is combined with an ordinary WHERE condition with a low filter rate, the I/O on the index column is greatly reduced.
  • Optimized the efficiency of random data access after the WHERE filter.
  • Optimized the performance of the old get_json_xx functions on JSON data types by 2~4x.
  • Added a configuration option to lower the priority of data read threads, reserving CPU resources for real-time writes.
  • Added the uuid_numeric function, which returns LARGEINT and is 20 times faster than the uuid function, which returns a string.
  • Optimized the performance of CASE WHEN by 3x.
  • Removed unnecessary predicate calculations in storage engine execution.
  • Accelerated count performance by pushing the count operator down to the storage tier.
  • Optimized the computation performance of nullable types in AND and OR expressions.
  • Supported rewriting the limit operator before join in more scenarios to improve query performance.
  • Eliminated useless ORDER BY operators from inline views to improve query performance.
  • Improved the accuracy of cardinality estimates and cost models in some cases.
  • Optimized JDBC catalog predicate pushdown logic.
  • Optimized the read efficiency of the file cache when it is enabled for the first time.
  • Optimized the Hive table SQL cache policy, using the partition update time stored in HMS to improve the cache hit ratio.
  • Optimized MOW (merge-on-write) compaction efficiency.
  • Optimized thread allocation logic for external table queries to reduce memory usage.
  • Optimized memory usage for the column reader.
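
The numeric UUID item in the list above can be pictured with Python's standard `uuid` module: the same 128 bits can be carried as a 36-character string or as a single integer. This is only a sketch of the two representations. It does not show how Doris generates the value or maps it into LARGEINT, and it makes no claim about the speedup.

```python
import uuid

u = uuid.uuid4()

as_string = str(u)  # canonical text form, e.g. 8-4-4-4-12 hex digits
as_number = u.int   # the same 128 bits as one integer

# The text form is always 36 characters; the integer fits in 128 bits.
print(len(as_string))
print(as_number.bit_length() <= 128)

# The integer form round-trips back to the same UUID.
restored = uuid.UUID(int=as_number)
```

An integer form avoids formatting and comparing 36-character strings, which is one plausible reason a numeric variant can be cheaper to produce and store.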

3.2 Distributed replica management improvements

Distributed replica management improvements include skipping partition deletion, colocate group deletion, balance failures caused by continuous writes, and balancing of hot and cold separation tables.

3.3 Security enhancement

4 Bugfix and stability

4.1 Complex datatypes

4.2 Inverted index

4.3 Materialized View

4.4 Table sample

4.5 Unique with merge on write

4.6 Load and compaction

4.7 Data Lake compatibility

4.8 JDBC external table compatibility

4.9 SQL Planner and Optimizer

Others

See the complete list of improvements and bug fixes on GitHub: dev/2.0.3-merged.