Statistics
Collecting statistics helps the optimizer understand data distribution characteristics. When performing Cost-Based Optimization (CBO), the optimizer uses these statistics to calculate the selectivity of predicates and estimate the cost of each execution plan. This allows for the selection of more optimal plans, significantly improving query efficiency.
Currently, the following information is collected for each column:
Information | Description |
---|---|
row_count | Total number of rows |
data_size | Total data size |
avg_size_byte | Average length of values |
ndv | Number of distinct values |
min | Minimum value |
max | Maximum value |
null_count | Number of null values |
1. Collecting Statistics
1.1 Manual Collection Using ANALYZE Statement
Doris allows users to manually trigger the collection and update of statistics by submitting the ANALYZE statement.
Syntax:
ANALYZE < TABLE table_name | DATABASE db_name >
[ (column_name [, ...]) ]
[ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] ];
Where:
table_name
: The specified target table. It can be in the formatdb_name.table_name
.column_name
: The specified target column. It must be an existing column intable_name
. You can specify multiple column names separated by commas.sync
: Collect statistics synchronously. Returns after collection. If not specified, it executes asynchronously and returns a JOB ID.sample percent | rows
: Collect statistics with sampling. You can specify a sampling percentage or a number of sampling rows.
By default (WITH SAMPLE is not specified), a table will be fully analyzed. For relatively large tables (e.g. above 5GiB), from the perspective of system resources, we recommend sampling analyze, and the number of sampled rows is recommended to be no less than 4 million rows. Here are some examples:
Collect statistics for a table with full analyze:
ANALYZE TABLE lineitem;
Collect statistics for a table with a 10% sampling rate:
ANALYZE TABLE lineitem WITH SAMPLE PERCENT 10;
Collect statistics for a table with a sample of 100,000 rows:
ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;
1.2 Automatic Collection
This feature has been officially supported since 2.0.3 and is enabled by default. The basic operation logic is described below. After each import transaction commit, Doris records the number of rows updated by the import transaction to estimate the health of the existing table’s statistics data (for tables that have not collected statistics, their health is 0). When the health of a table is below 60 (adjustable through the table_stats_health_threshold
parameter), Doris considers the statistics for that table outdated and triggers statistics collection jobs for that table in subsequent operations. For tables with a health value above 60, no repeated collection is performed.
The collection jobs for statistics themselves consume a certain amount of system resources. To minimize the overhead, Doris automatically uses sampling to collect statistics. Automatic sampling defaults to sample 4,194,304 (2^22) rows to reduce the system’s burden and complete the collection job as quickly as possible. If you want to sample more rows to obtain a more accurate data distribution, you can increase the sampling row count by adjusting the huge_table_default_sample_rows
parameter. You can also control the full collection of small tables and the collection interval of large tables through session variables. For detailed configuration, please refer to 3.1.
If you are concerned about automatic collection jobs interfering with your business, you can specify a time frame for the automatic collection jobs to run during low business loads by setting the auto_analyze_start_time
and auto_analyze_end_time
parameters according to your needs. You can also completely disable this feature by setting the enable_auto_analyze
parameter to false
.
External catalogs do not participate in automatic collection by default. Because external catalogs often contain massive historical data, if they participate in automatic collection, it may occupy too many resources. You can turn on and off the automatic collection of external catalogs by setting the catalog’s properties.
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true'); // Turn on
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false'); // Turn off
2. Job Management
2.1 View Analyze Jobs
Use SHOW ANALYZE
to view information about statistics collection jobs.
Syntax:
SHOW [AUTO] ANALYZE < table_name | job_id >
[ WHERE [ STATE = [ "PENDING" | "RUNNING" | "FINISHED" | "FAILED" ] ] ];
- AUTO: Show historical information for automatic collection jobs only. Note that, by default, the status of only the last 20,000 completed automatic collection jobs is retained.
- table_name: Table name, specify to view statistics job information for that table. It can be in the format
db_name.table_name
. When not specified, it returns information for all statistics jobs. - job_id: Job ID for statistics collection, obtained when executing
ANALYZE
. When not specified, this command returns information for all statistics jobs.
Output:
Column Name | Description |
---|---|
job_id | Job ID |
catalog_name | Catalog Name |
db_name | Database Name |
tbl_name | Table Name |
col_name | Column Name List |
job_type | Job Type |
analysis_type | Analysis Type |
message | Job Information |
last_exec_time_in_ms | Last Execution Time |
state | Job Status |
schedule_type | Scheduling Method |
Here’s an example:
mysql> show analyze 245073\G;
*************************** 1. row ***************************
job_id: 245073
catalog_name: internal
db_name: default_cluster:tpch
tbl_name: lineitem
col_name: [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
job_type: MANUAL
analysis_type: FUNDAMENTALS
message:
last_exec_time_in_ms: 2023-11-07 11:00:52
state: FINISHED
progress: 16 Finished | 0 Failed | 0 In Progress | 16 Total
schedule_type: ONCE
2.2 View Column Statistics Collection Status
Each collection job can contain one or more tasks, with each task corresponding to the collection of a column. Users can use the following command to view the completion status of statistics collection for each column.
Syntax:
SHOW ANALYZE TASK STATUS [job_id]
Here’s an example:
mysql> show analyze task status 20038 ;
+---------+----------+---------+----------------------+----------+
| task_id | col_name | message | last_exec_time_in_ms | state |
+---------+----------+---------+----------------------+----------+
| 20039 | col4 | | 2023-06-01 17:22:15 | FINISHED |
| 20040 | col2 | | 2023-06-01 17:22:15 | FINISHED |
| 20041 | col3 | | 2023-06-01 17:22:15 | FINISHED |
| 20042 | col1 | | 2023-06-01 17:22:15 | FINISHED |
+---------+----------+---------+----------------------+----------+
2.3 View Column Statistics
Use SHOW COLUMN STATS
to view various statistics data for columns.
Syntax:
SHOW COLUMN [cached] STATS table_name [ (column_name [, ...]) ];
Where:
- cached: Show statistics information in the current FE memory cache.
- table_name: The target table for collecting statistics. It can be in the format
db_name.table_name
. - column_name: Specifies the target column, which must be an existing column in
table_name
. You can specify multiple column names separated by commas.
Here’s an example:
mysql> show column stats lineitem(l_tax)\G;
*************************** 1. row ***************************
column_name: l_tax
count: 6001215.0
ndv: 9.0
num_null: 0.0
data_size: 4.800972E7
avg_size_byte: 8.0
min: 0.00
max: 0.08
method: FULL
type: FUNDAMENTALS
trigger: MANUAL
query_times: 0
updated_time: 2023-11-07 11:00:46
2.4 Table Collection Overview
Use SHOW TABLE STATS
to view an overview of statistics collection for a table.
Syntax:
SHOW TABLE STATS table_name;
Where:
- table_name: The target table name. It can be in the format
db_name.table_name
.
Output:
Column Name | Description |
---|---|
updated_rows | Updated rows since the last ANALYZE |
query_times | Reserved column for recording the number of times the table was queried in future versions |
row_count | Number of rows (does not reflect the exact number of rows at the time of command execution) |
updated_time | Last update time |
columns | Columns for which statistics information has been collected |
Here’s an example:
mysql> show table stats lineitem \G;
*************************** 1. row ***************************
updated_rows: 0
query_times: 0
row_count: 6001215
updated_time: 2023-11-07
columns: [l_returnflag, l_receiptdate, l_tax, l_shipmode, l_suppkey, l_shipdate, l_commitdate, l_partkey, l_orderkey, l_quantity, l_linestatus, l_comment, l_extendedprice, l_linenumber, l_discount, l_shipinstruct]
trigger: MANUAL
2.5 Terminate Statistics Jobs
Use KILL ANALYZE
to terminate running statistics jobs.
Syntax:
KILL ANALYZE job_id;
Where:
- job_id: Job ID for statistics collection. Obtained when performing asynchronous statistics collection using the
ANALYZE
statement, and it can also be obtained through theSHOW ANALYZE
statement.
Example:
- Terminate statistics job with ID 52357.
mysql> KILL ANALYZE 52357;
3. Session Variables and Configuration Options
3.1 Session Variables
Session Variable | Description | Default Value |
---|---|---|
auto_analyze_start_time | Start time for automatic statistics collection | 00:00:00 |
auto_analyze_end_time | End time for automatic statistics collection | 23:59:59 |
enable_auto_analyze | Enable automatic collection functionality | true |
huge_table_default_sample_rows | Sampling rows for large tables | 4194304 |
huge_table_lower_bound_size_in_bytes | Tables with size greater than this value will be automatically sampled during collection of statistics | 0 |
huge_table_auto_analyze_interval_in_millis | Controls the minimum time interval for automatic ANALYZE on large tables. Tables with sizes greater than huge_table_lower_bound_size_in_bytes * 5 will be ANALYZEed only once within this time interval. | 0 |
table_stats_health_threshold | Ranges from 0 to 100. If data updates since the last statistics collection exceed (100 - table_stats_health_threshold)% , the table’s statistics are considered outdated. | 60 |
analyze_timeout | Controls the timeout for synchronous ANALYZE in seconds | 43200 |
auto_analyze_table_width_threshold | Controls the maximum width of table that will be auto analyzed. Table with more columns than this value will not be auto analyzed. | 100 |
3.2 FE Configuration Options
The following FE configuration options are typically not a major concern:
FE Configuration Option | Description | Default Value |
---|---|---|
analyze_record_limit | Controls the persistence of statistics job execution records | 20000 |
stats_cache_size | FE-side statistics cache entries | 500,000 |
statistics_simultaneously_running_task_num | Number of asynchronous jobs that can run simultaneously | 3 |
statistics_sql_mem_limit_in_bytes | Controls the amount of BE memory each statistics SQL can use | 2,147,483,648 bytes (2 GiB) |
4. Common Issues
4.1 ANALYZE Submission Error: Stats table not available…
When ANALYZE is executed, statistics data is written to the internal table __internal_schema.column_statistics
. FE checks the tablet status of this table before executing ANALYZE. If there are unavailable tablets, the task is rejected. Please check the BE cluster status if this error occurs.
Users can use SHOW BACKENDS\G
to verify the BE (Backend) status. If the BE status is normal, you can use the command SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db]
to check the tablet status within this database, ensuring that the tablet status is also normal.
4.2 Failure of ANALYZE on Large Tables
Due to resource limitations, ANALYZE on some large tables may timeout or exceed BE memory limits. In such cases, it is recommended to use ANALYZE ... WITH SAMPLE...
.