Statistics

Collecting statistics helps the optimizer understand data distribution characteristics. When performing Cost-Based Optimization (CBO), the optimizer uses these statistics to calculate the selectivity of predicates and estimate the cost of each execution plan. This allows for the selection of more optimal plans, significantly improving query efficiency.

Currently, the following information is collected for each column:

InformationDescription
row_countTotal number of rows
data_sizeTotal data size
avg_size_byteAverage length of values
ndvNumber of distinct values
minMinimum value
maxMaximum value
null_countNumber of null values

1. Collecting Statistics


1.1 Manual Collection Using ANALYZE Statement

Doris allows users to manually trigger the collection and update of statistics by submitting the ANALYZE statement.

Syntax:

  1. ANALYZE < TABLE table_name | DATABASE db_name >
  2. [ (column_name [, ...]) ]
  3. [ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] ];

Where:

  • table_name: The specified target table. It can be in the format db_name.table_name.
  • column_name: The specified target column. It must be an existing column in table_name. You can specify multiple column names separated by commas.
  • sync: Collect statistics synchronously. Returns after collection. If not specified, it executes asynchronously and returns a JOB ID.
  • sample percent | rows: Collect statistics with sampling. You can specify a sampling percentage or a number of sampling rows.

By default (WITH SAMPLE is not specified), a table will be fully analyzed. For relatively large tables (e.g. above 5GiB), from the perspective of system resources, we recommend sampling analyze, and the number of sampled rows is recommended to be no less than 4 million rows. Here are some examples:

Collect statistics for a table with full analyze:

  1. ANALYZE TABLE lineitem;

Collect statistics for a table with a 10% sampling rate:

  1. ANALYZE TABLE lineitem WITH SAMPLE PERCENT 10;

Collect statistics for a table with a sample of 100,000 rows:

  1. ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;

1.2 Automatic Collection

This feature has been officially supported since 2.0.3 and is enabled by default. The basic operation logic is described below. After each import transaction commit, Doris records the number of rows updated by the import transaction to estimate the health of the existing table’s statistics data (for tables that have not collected statistics, their health is 0). When the health of a table is below 60 (adjustable through the table_stats_health_threshold parameter), Doris considers the statistics for that table outdated and triggers statistics collection jobs for that table in subsequent operations. For tables with a health value above 60, no repeated collection is performed.

The collection jobs for statistics themselves consume a certain amount of system resources. To minimize the overhead, Doris automatically uses sampling to collect statistics. Automatic sampling defaults to sample 4,194,304 (2^22) rows to reduce the system’s burden and complete the collection job as quickly as possible. If you want to sample more rows to obtain a more accurate data distribution, you can increase the sampling row count by adjusting the huge_table_default_sample_rows parameter. You can also control the full collection of small tables and the collection interval of large tables through session variables. For detailed configuration, please refer to 3.1.

If you are concerned about automatic collection jobs interfering with your business, you can specify a time frame for the automatic collection jobs to run during low business loads by setting the auto_analyze_start_time and auto_analyze_end_time parameters according to your needs. You can also completely disable this feature by setting the enable_auto_analyze parameter to false.

External catalogs do not participate in automatic collection by default. Because external catalogs often contain massive historical data, if they participate in automatic collection, it may occupy too many resources. You can turn on and off the automatic collection of external catalogs by setting the catalog’s properties.

  1. ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true'); // Turn on
  2. ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false'); // Turn off

2. Job Management


2.1 View Analyze Jobs

Use SHOW ANALYZE to view information about statistics collection jobs.

Syntax:

  1. SHOW [AUTO] ANALYZE < table_name | job_id >
  2. [ WHERE [ STATE = [ "PENDING" | "RUNNING" | "FINISHED" | "FAILED" ] ] ];
  • AUTO: Show historical information for automatic collection jobs only. Note that, by default, the status of only the last 20,000 completed automatic collection jobs is retained.
  • table_name: Table name, specify to view statistics job information for that table. It can be in the format db_name.table_name. When not specified, it returns information for all statistics jobs.
  • job_id: Job ID for statistics collection, obtained when executing ANALYZE. When not specified, this command returns information for all statistics jobs.

Output:

Column NameDescription
job_idJob ID
catalog_nameCatalog Name
db_nameDatabase Name
tbl_nameTable Name
col_nameColumn Name List
job_typeJob Type
analysis_typeAnalysis Type
messageJob Information
last_exec_time_in_msLast Execution Time
stateJob Status
schedule_typeScheduling Method

Here’s an example:

  1. mysql> show analyze 245073\G;
  2. *************************** 1. row ***************************
  3. job_id: 245073
  4. catalog_name: internal
  5. db_name: default_cluster:tpch
  6. tbl_name: lineitem
  7. col_name: [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
  8. job_type: MANUAL
  9. analysis_type: FUNDAMENTALS
  10. message:
  11. last_exec_time_in_ms: 2023-11-07 11:00:52
  12. state: FINISHED
  13. progress: 16 Finished | 0 Failed | 0 In Progress | 16 Total
  14. schedule_type: ONCE

2.2 View Column Statistics Collection Status

Each collection job can contain one or more tasks, with each task corresponding to the collection of a column. Users can use the following command to view the completion status of statistics collection for each column.

Syntax:

  1. SHOW ANALYZE TASK STATUS [job_id]

Here’s an example:

  1. mysql> show analyze task status 20038 ;
  2. +---------+----------+---------+----------------------+----------+
  3. | task_id | col_name | message | last_exec_time_in_ms | state |
  4. +---------+----------+---------+----------------------+----------+
  5. | 20039 | col4 | | 2023-06-01 17:22:15 | FINISHED |
  6. | 20040 | col2 | | 2023-06-01 17:22:15 | FINISHED |
  7. | 20041 | col3 | | 2023-06-01 17:22:15 | FINISHED |
  8. | 20042 | col1 | | 2023-06-01 17:22:15 | FINISHED |
  9. +---------+----------+---------+----------------------+----------+

2.3 View Column Statistics

Use SHOW COLUMN STATS to view various statistics data for columns.

Syntax:

  1. SHOW COLUMN [cached] STATS table_name [ (column_name [, ...]) ];

Where:

  • cached: Show statistics information in the current FE memory cache.
  • table_name: The target table for collecting statistics. It can be in the format db_name.table_name.
  • column_name: Specifies the target column, which must be an existing column in table_name. You can specify multiple column names separated by commas.

Here’s an example:

  1. mysql> show column stats lineitem(l_tax)\G;
  2. *************************** 1. row ***************************
  3. column_name: l_tax
  4. count: 6001215.0
  5. ndv: 9.0
  6. num_null: 0.0
  7. data_size: 4.800972E7
  8. avg_size_byte: 8.0
  9. min: 0.00
  10. max: 0.08
  11. method: FULL
  12. type: FUNDAMENTALS
  13. trigger: MANUAL
  14. query_times: 0
  15. updated_time: 2023-11-07 11:00:46

2.4 Table Collection Overview

Use SHOW TABLE STATS to view an overview of statistics collection for a table.

Syntax:

  1. SHOW TABLE STATS table_name;

Where:

  • table_name: The target table name. It can be in the format db_name.table_name.

Output:

Column NameDescription
updated_rowsUpdated rows since the last ANALYZE
query_timesReserved column for recording the number of times the table was queried in future versions
row_countNumber of rows (does not reflect the exact number of rows at the time of command execution)
updated_timeLast update time
columnsColumns for which statistics information has been collected

Here’s an example:

  1. mysql> show table stats lineitem \G;
  2. *************************** 1. row ***************************
  3. updated_rows: 0
  4. query_times: 0
  5. row_count: 6001215
  6. updated_time: 2023-11-07
  7. columns: [l_returnflag, l_receiptdate, l_tax, l_shipmode, l_suppkey, l_shipdate, l_commitdate, l_partkey, l_orderkey, l_quantity, l_linestatus, l_comment, l_extendedprice, l_linenumber, l_discount, l_shipinstruct]
  8. trigger: MANUAL

2.5 Terminate Statistics Jobs

Use KILL ANALYZE to terminate running statistics jobs.

Syntax:

  1. KILL ANALYZE job_id;

Where:

  • job_id: Job ID for statistics collection. Obtained when performing asynchronous statistics collection using the ANALYZE statement, and it can also be obtained through the SHOW ANALYZE statement.

Example:

  • Terminate statistics job with ID 52357.
  1. mysql> KILL ANALYZE 52357;

3. Session Variables and Configuration Options


3.1 Session Variables

Session VariableDescriptionDefault Value
auto_analyze_start_timeStart time for automatic statistics collection00:00:00
auto_analyze_end_timeEnd time for automatic statistics collection23:59:59
enable_auto_analyzeEnable automatic collection functionalitytrue
huge_table_default_sample_rowsSampling rows for large tables4194304
huge_table_lower_bound_size_in_bytesTables with size greater than this value will be automatically sampled during collection of statistics0
huge_table_auto_analyze_interval_in_millisControls the minimum time interval for automatic ANALYZE on large tables. Tables with sizes greater than huge_table_lower_bound_size_in_bytes * 5 will be ANALYZEed only once within this time interval.0
table_stats_health_thresholdRanges from 0 to 100. If data updates since the last statistics collection exceed (100 - table_stats_health_threshold)%, the table’s statistics are considered outdated.60
analyze_timeoutControls the timeout for synchronous ANALYZE in seconds43200
auto_analyze_table_width_thresholdControls the maximum width of table that will be auto analyzed. Table with more columns than this value will not be auto analyzed.100

3.2 FE Configuration Options

The following FE configuration options are typically not a major concern:

FE Configuration OptionDescriptionDefault Value
analyze_record_limitControls the persistence of statistics job execution records20000
stats_cache_sizeFE-side statistics cache entries500,000
statistics_simultaneously_running_task_numNumber of asynchronous jobs that can run simultaneously3
statistics_sql_mem_limit_in_bytesControls the amount of BE memory each statistics SQL can use2,147,483,648 bytes (2 GiB)

4. Common Issues

4.1 ANALYZE Submission Error: Stats table not available…

When ANALYZE is executed, statistics data is written to the internal table __internal_schema.column_statistics. FE checks the tablet status of this table before executing ANALYZE. If there are unavailable tablets, the task is rejected. Please check the BE cluster status if this error occurs.

Users can use SHOW BACKENDS\G to verify the BE (Backend) status. If the BE status is normal, you can use the command SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db] to check the tablet status within this database, ensuring that the tablet status is also normal.

4.2 Failure of ANALYZE on Large Tables

Due to resource limitations, ANALYZE on some large tables may timeout or exceed BE memory limits. In such cases, it is recommended to use ANALYZE ... WITH SAMPLE....