ANALYZE Statements
ANALYZE statements are used to collect statistics for existing tables and store the results into the catalog. Only ANALYZE TABLE statements are supported at the moment, and they need to be triggered manually rather than automatically.
Attention Currently, ANALYZE TABLE is only supported in batch mode. Only existing tables are supported; an exception will be thrown if the table is a view or does not exist.
Run an ANALYZE TABLE statement
Java
ANALYZE TABLE statements can be executed with the executeSql() method of the TableEnvironment.
The following examples show how to run an ANALYZE TABLE statement in TableEnvironment.
Scala
ANALYZE TABLE statements can be executed with the executeSql() method of the TableEnvironment.
The following examples show how to run an ANALYZE TABLE statement in TableEnvironment.
Python
ANALYZE TABLE statements can be executed with the execute_sql() method of the TableEnvironment.
The following examples show how to run an ANALYZE TABLE statement in TableEnvironment.
SQL CLI
ANALYZE TABLE statements can be executed in the SQL CLI.
The following examples show how to run an ANALYZE TABLE statement in the SQL CLI.
Java
TableEnvironment tableEnv = TableEnvironment.create(...);
// register a non-partition table named "Store"
tableEnv.executeSql(
"CREATE TABLE Store (" +
" `id` BIGINT NOT NULL," +
" `location` VARCHAR(32)," +
" `owner` VARCHAR(32)" +
") with (...)");
// register a partition table named "Orders"
tableEnv.executeSql(
"CREATE TABLE Orders (" +
" `id` BIGINT NOT NULL," +
" `product` VARCHAR(32)," +
" `amount` INT," +
" `sold_year` BIGINT," +
" `sold_month` BIGINT," +
" `sold_day` BIGINT" +
") PARTITIONED BY (`sold_year`, `sold_month`, `sold_day`) " +
"with (...)");
// Non-partition table, collect row count.
tableEnv.executeSql("ANALYZE TABLE Store COMPUTE STATISTICS");
// Non-partition table, collect row count and statistics for all columns.
tableEnv.executeSql("ANALYZE TABLE Store COMPUTE STATISTICS FOR ALL COLUMNS");
// Non-partition table, collect row count and statistics for column `location`.
tableEnv.executeSql("ANALYZE TABLE Store COMPUTE STATISTICS FOR COLUMNS location");
// Suppose table "Orders" has 4 partitions with specs:
// Partition1 : (sold_year='2022', sold_month='1', sold_day='10')
// Partition2 : (sold_year='2022', sold_month='1', sold_day='11')
// Partition3 : (sold_year='2022', sold_month='2', sold_day='10')
// Partition4 : (sold_year='2022', sold_month='2', sold_day='11')
// Partition table, collect row count for Partition1.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS");
// Partition table, collect row count for Partition1 and Partition2.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS");
// Partition table, collect row count for all partitions.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS");
// Partition table, collect row count and statistics for all columns on partition1.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR ALL COLUMNS");
// Partition table, collect row count and statistics for all columns on partition1 and partition2.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR ALL COLUMNS");
// Partition table, collect row count and statistics for all columns on all partitions.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR ALL COLUMNS");
// Partition table, collect row count and statistics for column `amount` on partition1.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR COLUMNS amount");
// Partition table, collect row count and statistics for `amount` and `product` on partition1 and partition2.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product");
// Partition table, collect row count and statistics for column `amount` and `product` on all partitions.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product");
Scala
val tableEnv = TableEnvironment.create(...)
// register a non-partition table named "Store"
tableEnv.executeSql(
"CREATE TABLE Store (" +
" `id` BIGINT NOT NULL," +
" `location` VARCHAR(32)," +
" `owner` VARCHAR(32)" +
") with (...)");
// register a partition table named "Orders"
tableEnv.executeSql(
"CREATE TABLE Orders (" +
" `id` BIGINT NOT NULL," +
" `product` VARCHAR(32)," +
" `amount` INT," +
" `sold_year` BIGINT," +
" `sold_month` BIGINT," +
" `sold_day` BIGINT" +
") PARTITIONED BY (`sold_year`, `sold_month`, `sold_day`) " +
"with (...)");
// Non-partition table, collect row count.
tableEnv.executeSql("ANALYZE TABLE Store COMPUTE STATISTICS");
// Non-partition table, collect row count and statistics for all columns.
tableEnv.executeSql("ANALYZE TABLE Store COMPUTE STATISTICS FOR ALL COLUMNS");
// Non-partition table, collect row count and statistics for column `location`.
tableEnv.executeSql("ANALYZE TABLE Store COMPUTE STATISTICS FOR COLUMNS location");
// Suppose table "Orders" has 4 partitions with specs:
// Partition1 : (sold_year='2022', sold_month='1', sold_day='10')
// Partition2 : (sold_year='2022', sold_month='1', sold_day='11')
// Partition3 : (sold_year='2022', sold_month='2', sold_day='10')
// Partition4 : (sold_year='2022', sold_month='2', sold_day='11')
// Partition table, collect row count for Partition1.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS");
// Partition table, collect row count for Partition1 and Partition2.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS");
// Partition table, collect row count for all partitions.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS");
// Partition table, collect row count and statistics for all columns on partition1.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR ALL COLUMNS");
// Partition table, collect row count and statistics for all columns on partition1 and partition2.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR ALL COLUMNS");
// Partition table, collect row count and statistics for all columns on all partitions.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR ALL COLUMNS");
// Partition table, collect row count and statistics for column `amount` on partition1.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR COLUMNS amount");
// Partition table, collect row count and statistics for `amount` and `product` on partition1 and partition2.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product");
// Partition table, collect row count and statistics for column `amount` and `product` on all partitions.
tableEnv.executeSql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product");
Python
table_env = TableEnvironment.create(...)
# register a non-partition table named "Store"
table_env.execute_sql(
"CREATE TABLE Store (" +
" `id` BIGINT NOT NULL," +
" `location` VARCHAR(32)," +
" `owner` VARCHAR(32)" +
") with (...)")
# register a partition table named "Orders"
table_env.execute_sql(
"CREATE TABLE Orders (" +
" `id` BIGINT NOT NULL," +
" `product` VARCHAR(32)," +
" `amount` INT," +
" `sold_year` BIGINT," +
" `sold_month` BIGINT," +
" `sold_day` BIGINT" +
") PARTITIONED BY (`sold_year`, `sold_month`, `sold_day`) " +
"with (...)")
# Non-partition table, collect row count.
table_env.execute_sql("ANALYZE TABLE Store COMPUTE STATISTICS")
# Non-partition table, collect row count and statistics for all columns.
table_env.execute_sql("ANALYZE TABLE Store COMPUTE STATISTICS FOR ALL COLUMNS")
# Non-partition table, collect row count and statistics for column `location`.
table_env.execute_sql("ANALYZE TABLE Store COMPUTE STATISTICS FOR COLUMNS location")
# Suppose table "Orders" has 4 partitions with specs:
# Partition1 : (sold_year='2022', sold_month='1', sold_day='10')
# Partition2 : (sold_year='2022', sold_month='1', sold_day='11')
# Partition3 : (sold_year='2022', sold_month='2', sold_day='10')
# Partition4 : (sold_year='2022', sold_month='2', sold_day='11')
# Partition table, collect row count for Partition1.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS")
# Partition table, collect row count for Partition1 and Partition2.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS")
# Partition table, collect row count for all partitions.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS")
# Partition table, collect row count and statistics for all columns on partition1.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR ALL COLUMNS")
# Partition table, collect row count and statistics for all columns on partition1 and partition2.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR ALL COLUMNS")
# Partition table, collect row count and statistics for all columns on all partitions.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR ALL COLUMNS")
# Partition table, collect row count and statistics for column `amount` on partition1.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR COLUMNS amount")
# Partition table, collect row count and statistics for `amount` and `product` on partition1 and partition2.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product")
# Partition table, collect row count and statistics for column `amount` and `product` on all partitions.
table_env.execute_sql("ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product")
SQL CLI
Flink SQL> CREATE TABLE Store (
> `id` BIGINT NOT NULL,
> `location` VARCHAR(32),
> `owner` VARCHAR(32)
> ) with (
> ...
> );
[INFO] Table has been created.
Flink SQL> CREATE TABLE Orders (
> `id` BIGINT NOT NULL,
> `product` VARCHAR(32),
> `amount` INT,
> `sold_year` BIGINT,
> `sold_month` BIGINT,
> `sold_day` BIGINT
> ) PARTITIONED BY (`sold_year`, `sold_month`, `sold_day`)
> with (
> ...
> );
[INFO] Table has been created.
Flink SQL> ANALYZE TABLE Store COMPUTE STATISTICS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Store COMPUTE STATISTICS FOR ALL COLUMNS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Store COMPUTE STATISTICS FOR COLUMNS location;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR ALL COLUMNS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR ALL COLUMNS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR ALL COLUMNS;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year='2022', sold_month='1', sold_day='10') COMPUTE STATISTICS FOR COLUMNS amount;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION (sold_year='2022', sold_month='1', sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product;
[INFO] Execute statement succeed.
Flink SQL> ANALYZE TABLE Orders PARTITION(sold_year, sold_month, sold_day) COMPUTE STATISTICS FOR COLUMNS amount, product;
[INFO] Execute statement succeed.
Syntax
ANALYZE TABLE [catalog_name.][db_name.]table_name PARTITION(partcol1[=val1] [, partcol2[=val2], ...]) COMPUTE STATISTICS [FOR COLUMNS col1 [, col2, ...] | FOR ALL COLUMNS]
PARTITION(partcol1[=val1] [, partcol2[=val2], …]) is required for partitioned tables
- If no partition is specified, statistics will be gathered for all partitions
- If a certain partition is specified, statistics will be gathered only for that partition
- If the table is a non-partitioned table but a partition is specified, an exception will be thrown
- If a certain partition is specified but the partition does not exist, an exception will be thrown
FOR COLUMNS col1 [, col2, …] or FOR ALL COLUMNS is optional
- If no column is specified, only table-level statistics will be gathered
- If a column does not exist, or the column is not a physical column, an exception will be thrown
- If any column is specified, column-level statistics will also be gathered
- the column-level statistics include:
- ndv: the number of distinct values
- nullCount: the number of nulls
- avgLen: the average length of column values
- maxLen: the max length of column values
- minValue: the min value of column values
- maxValue: the max value of column values
- valueCount: the value count (only for BOOLEAN type)
- the supported types and their corresponding column-level statistics are listed in the following table ("Y" means supported, "N" means unsupported):
Types | ndv | nullCount | avgLen | maxLen | maxValue | minValue | valueCount |
---|---|---|---|---|---|---|---|
BOOLEAN | N | Y | N | N | N | N | Y |
TINYINT | Y | Y | N | N | Y | Y | N |
SMALLINT | Y | Y | N | N | Y | Y | N |
INTEGER | Y | Y | N | N | Y | Y | N |
FLOAT | Y | Y | N | N | Y | Y | N |
DATE | Y | Y | N | N | Y | Y | N |
TIME_WITHOUT_TIME_ZONE | Y | Y | N | N | Y | Y | N |
BIGINT | Y | Y | N | N | Y | Y | N |
DOUBLE | Y | Y | N | N | Y | Y | N |
DECIMAL | Y | Y | N | N | Y | Y | N |
TIMESTAMP_WITH_LOCAL_TIME_ZONE | Y | Y | N | N | Y | Y | N |
TIMESTAMP_WITHOUT_TIME_ZONE | Y | Y | N | N | Y | Y | N |
CHAR | Y | Y | Y | Y | N | N | N |
VARCHAR | Y | Y | Y | Y | N | N | N |
other types | N | Y | N | N | N | N | N |
NOTE: For fixed-length types (such as BOOLEAN, INTEGER, DOUBLE, etc.), avgLen and maxLen need not be collected from the original records.
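To make the column-level metrics above concrete, here is a minimal, Flink-independent Python sketch of how ndv, nullCount, avgLen, and maxLen could be derived from the raw values of a string column. The function name and return shape are illustrative only and do not reflect Flink's internal implementation.

```python
from typing import Any, Dict, List, Optional

def column_stats(values: List[Optional[str]]) -> Dict[str, Any]:
    """Illustrative computation of the statistics ANALYZE TABLE
    gathers for a string column (ndv, nullCount, avgLen, maxLen)."""
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "ndv": len(set(non_null)),                 # number of distinct values
        "nullCount": len(values) - len(non_null),  # number of nulls
        "avgLen": sum(lengths) / len(lengths) if lengths else 0.0,
        "maxLen": max(lengths, default=0),
    }

# e.g. a sample of the `location` column of the Store table above
stats = column_stats(["berlin", "paris", "berlin", None])
```

For the sample input, ndv is 2, nullCount is 1, and maxLen is 6; minValue/maxValue are omitted here because, per the table above, they are not collected for CHAR/VARCHAR columns.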