Hints

Description

Hints give users a way to suggest how Spark SQL to use specific approaches to generate its execution plan.

Syntax

  1. /*+ hint [ , ... ] */

Partitioning Hints

Partitioning hints allow users to suggest a partitioning strategy that Spark should follow. COALESCE, REPARTITION, and REPARTITION_BY_RANGE hints are supported and are equivalent to coalesce, repartition, and repartitionByRange Dataset APIs, respectively. The REBALANCE can only be used as a hint .These hints give users a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer.

Partitioning Hints Types

  • COALESCE

    The COALESCE hint can be used to reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.

  • REPARTITION

    The REPARTITION hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.

  • REPARTITION_BY_RANGE

    The REPARTITION_BY_RANGE hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.

  • REBALANCE

    The REBALANCE hint can be used to rebalance the query result output partitions, so that every partition is of a reasonable size (not too small and not too big). It can take column names as parameters, and try its best to partition the query result by these columns. This is a best-effort: if there are skews, Spark will split the skewed partitions, to make these partitions not too big. This hint is useful when you need to write the result of this query to a table, to avoid too small/big files. This hint is ignored if AQE is not enabled.

Examples

  1. SELECT /*+ COALESCE(3) */ * FROM t;
  2. SELECT /*+ REPARTITION(3) */ * FROM t;
  3. SELECT /*+ REPARTITION(c) */ * FROM t;
  4. SELECT /*+ REPARTITION(3, c) */ * FROM t;
  5. SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;
  6. SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
  7. SELECT /*+ REBALANCE */ * FROM t;
  8. SELECT /*+ REBALANCE(3) */ * FROM t;
  9. SELECT /*+ REBALANCE(c) */ * FROM t;
  10. SELECT /*+ REBALANCE(3, c) */ * FROM t;
  11. -- multiple partitioning hints
  12. EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;
  13. == Parsed Logical Plan ==
  14. 'UnresolvedHint REPARTITION, [100]
  15. +- 'UnresolvedHint COALESCE, [500]
  16. +- 'UnresolvedHint REPARTITION_BY_RANGE, [3, 'c]
  17. +- 'Project [*]
  18. +- 'UnresolvedRelation [t]
  19. == Analyzed Logical Plan ==
  20. name: string, c: int
  21. Repartition 100, true
  22. +- Repartition 500, false
  23. +- RepartitionByExpression [c#30 ASC NULLS FIRST], 3
  24. +- Project [name#29, c#30]
  25. +- SubqueryAlias spark_catalog.default.t
  26. +- Relation[name#29,c#30] parquet
  27. == Optimized Logical Plan ==
  28. Repartition 100, true
  29. +- Relation[name#29,c#30] parquet
  30. == Physical Plan ==
  31. Exchange RoundRobinPartitioning(100), false, [id=#121]
  32. +- *(1) ColumnarToRow
  33. +- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet,
  34. Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
  35. PushedFilters: [], ReadSchema: struct<name:string>

Join Hints

Join hints allow users to suggest the join strategy that Spark should use. Prior to Spark 3.0, only the BROADCAST Join Hint was supported. MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL Joint Hints support was added in 3.0. When different join strategy hints are specified on both sides of a join, Spark prioritizes hints in the following order: BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL. When both sides are specified with the BROADCAST hint or the SHUFFLE_HASH hint, Spark will pick the build side based on the join type and the sizes of the relations. Since a given strategy may not support all join types, Spark is not guaranteed to use the join strategy suggested by the hint.

Join Hints Types

  • BROADCAST

    Suggests that Spark use broadcast join. The join side with the hint will be broadcast regardless of autoBroadcastJoinThreshold. If both sides of the join have the broadcast hints, the one with the smaller size (based on stats) will be broadcast. The aliases for BROADCAST are BROADCASTJOIN and MAPJOIN.

  • MERGE

    Suggests that Spark use shuffle sort merge join. The aliases for MERGE are SHUFFLE_MERGE and MERGEJOIN.

  • SHUFFLE_HASH

    Suggests that Spark use shuffle hash join. If both sides have the shuffle hash hints, Spark chooses the smaller side (based on stats) as the build side.

  • SHUFFLE_REPLICATE_NL

    Suggests that Spark use shuffle-and-replicate nested loop join.

Examples

  1. -- Join Hints for broadcast join
  2. SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
  3. SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;
  4. SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;
  5. -- Join Hints for shuffle sort merge join
  6. SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
  7. SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
  8. SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
  9. -- Join Hints for shuffle hash join
  10. SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
  11. -- Join Hints for shuffle-and-replicate nested loop join
  12. SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
  13. -- When different join strategy hints are specified on both sides of a join, Spark
  14. -- prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint
  15. -- over the SHUFFLE_REPLICATE_NL hint.
  16. -- Spark will issue Warning in the following example
  17. -- org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge)
  18. -- is overridden by another hint and will not take effect.
  19. SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;