5.3. Cost based optimizations
- Join Enumeration
- Connector Implementations

5.3. Cost based optimizations

Presto supports several cost based optimizations, described below.

Join Enumeration

The order in which joins are executed in a query can have a significant impacton the query’s performance. The aspect of join ordering that has the largestimpact on performance is the size of the data being processed and transferredover the network. If a join that produces a lot of data is performed early inthe execution, then subsequent stages will need to process large amounts ofdata for longer than necessary, increasing the time and resources needed forthe query.

With cost based join enumeration, Presto usescdoc:/optimizer/statistics provided by connectors to estimatethe costs for different join orders and automatically pick thejoin order with the lowest computed costs.

The join enumeration strategy is governed by the join_reordering_strategysession property, with the optimizer.join-reordering-strategyconfiguration property providing the default value.

The valid values are:
- AUTOMATIC - full automatic join enumeration enabled
- ELIMINATE_CROSS_JOINS (default) - eliminate unnecessary cross joins
- NONE - purely syntactic join order

If using AUTOMATIC and statistics are not available, or if for any otherreason a cost could not be computed, the ELIMINATE_CROSS_JOINS strategy isused instead. ## Join Distribution Selection Presto uses a hash based join algorithm. That implies that for each joinoperator a hash table must be created from one join input (called build side).The other input (probe side) is then iterated and for each row the hash table isqueried to find matching rows.

There are two types of join distributions:
- Partitioned: each node participating in the query builds a hash tablefrom only a fraction of the data
- Broadcast: each node participating in the query builds a hash tablefrom all of the data (data is replicated to each node)

Each type have its trade offs. Partitioned joins require redistributing bothtables using a hash of the join key. This can be slower (sometimessubstantially) than broadcast joins, but allows much larger joins. Inparticular, broadcast joins will be faster if the build side is much smallerthan the probe side. However, broadcast joins require that the tables on thebuild side of the join after filtering fit in memory on each node, whereasdistributed joins only need to fit in distributed memory across all nodes. With cost based join distribution selection, Presto automatically chooses whether touse a partitioned or broadcast join. With both cost based join enumeration and cost based join distribution, Prestoautomatically chooses which side is the probe and which is the build. The join distribution type is governed by the join_distribution_typesession property, with the join-distribution-type configurationproperty providing the default value.

The valid values are:
- AUTOMATIC - join distribution type is determined automaticallyfor each join
- BROADCAST - broadcast join distribution is used for all joins
- PARTITIONED (default) - partitioned join distribution is used for all join

Connector Implementations

In order for the Presto optimizer to use the cost based strategies,the connector implementation must provide Table Statistics.