For a partitioned table, GPORCA uses statistics of the table root partition to generate query plans. These statistics are used for determining the join order, for splitting and joining aggregate nodes, and for costing the query steps. In contrast, the Postgres Planner uses the statistics of each leaf partition.
If you run queries on partitioned tables, you should collect statistics on the root partition and periodically update those statistics to ensure that GPORCA can generate optimal query plans. If the root partition statistics are not up-to-date or do not exist, GPORCA still performs dynamic partition elimination for queries against the table. However, the query plan might not be optimal.
Parent topic: About GPORCA
Running ANALYZE
By default, running the ANALYZE
command on the root partition of a partitioned table samples the leaf partition data in the table, and stores the statistics for the root partition. ANALYZE
collects statistics on the root and leaf partitions, including HyperLogLog (HLL) statistics on the leaf partitions. ANALYZE ROOTPARTITION
collects statistics only on the root partition. The server configuration parameter optimizer_analyze_root_partition controls whether the ROOTPARTITION
keyword is required to collect root statistics for the root partition of a partitioned table. See the ANALYZE command for information about collecting statistics on partitioned tables.
Keep in mind that ANALYZE
always scans the entire table before updating the root partition statistics. If your table is very large, this operation can take a significant amount of time. ANALYZE ROOTPARTITION
also uses an ACCESS SHARE
lock that prevents certain operations, such as TRUNCATE
and VACUUM
operations, during runtime. For these reasons, you should schedule ANALYZE
operations periodically, or when there are significant changes to leaf partition data.
Follow these best practices for running ANALYZE
or ANALYZE ROOTPARTITION
on partitioned tables in your system:
- Run
ANALYZE <root_partition_table_name>
on a new partitioned table after adding initial data. RunANALYZE <leaf_partition_table_name>
on a new leaf partition or a leaf partition where data has changed. By default, running the command on a leaf partition updates the root partition statistics if the other leaf partitions have statistics. - Update root partition statistics when you observe query performance regression in
EXPLAIN
plans against the table, or after significant changes to leaf partition data. For example, if you add a new leaf partition at some point after generating root partition statistics, consider runningANALYZE
orANALYZE ROOTPARTITION
to update root partition statistics with the new tuples inserted from the new leaf partition. - For very large tables, run
ANALYZE
orANALYZE ROOTPARTITION
only weekly, or at some interval longer than daily. - Avoid running
ANALYZE
with no arguments, because doing so runs the command on all database tables including partitioned tables. With large databases, these globalANALYZE
operations are difficult to monitor, and it can be difficult to predict the time needed for completion. - Consider running multiple
ANALYZE <table_name>
orANALYZE ROOTPARTITION <table_name>
operations in parallel to speed the operation of statistics collection, if your I/O throughput can support the load. - You can also use the Greenplum Database utility
analyzedb
to update table statistics. Usinganalyzedb
ensures that tables that were previously analyzed are not re-analyzed if no modifications were made to the leaf partition.
GPORCA and Leaf Partition Statistics
Although creating and maintaining root partition statistics is crucial for GPORCA query performance with partitioned tables, maintaining leaf partition statistics is also important. If GPORCA cannot generate a plan for a query against a partitioned table, then the Postgres Planner is used and leaf partition statistics are needed to produce the optimal plan for that query.
GPORCA itself also uses leaf partition statistics for any queries that access leaf partitions directly, instead of using the root partition with predicates to eliminate partitions. For example, if you know which partitions hold necessary tuples for a query, you can directly query the leaf partition table itself; in this case GPORCA uses the leaf partition statistics.
Deactivating Automatic Root Partition Statistics Collection
If you do not intend to run queries on partitioned tables with GPORCA (setting the server configuration parameter optimizer to off
), then you can deactivate the automatic collection of statistics on the root partition of the partitioned table. The server configuration parameter optimizer_analyze_root_partition controls whether the ROOTPARTITION
keyword is required to collect root statistics for the root partition of a partitioned table. The default setting for the parameter is on
, the ANALYZE
command can collect root partition statistics without the ROOTPARTITION
keyword. You can deactivate automatic collection of root partition statistics by setting the parameter to off
. When the value is off
, you must run ANALZYE ROOTPARTITION
to collect root partition statistics.
- Log into the Greenplum Database master host as
gpadmin
, the Greenplum Database administrator. Set the values of the server configuration parameters. These Greenplum Database
gpconfig
utility commands sets the value of the parameters tooff
:$ gpconfig -c optimizer_analyze_root_partition -v off --masteronly
Restart Greenplum Database. This Greenplum Database
gpstop
utility command reloads thepostgresql.conf
files of the master and segments without shutting down Greenplum Database.gpstop -u