MT_DOP Query Option

Sets the degree of intra-node parallelism used for certain operations that can benefit from multithreaded execution. You can specify values higher than zero to find the ideal balance of response time, memory usage, and CPU usage during statement processing.

Note:

The Impala execution engine is being revamped incrementally to add parallelism within a single host for certain statements and kinds of operations. The setting MT_DOP=0 uses the “old” code path with limited intra-node parallelism.

Currently, MT_DOP support varies by statement type:

  • COMPUTE [INCREMENTAL] STATS. Impala automatically sets MT_DOP=4 for COMPUTE STATS and COMPUTE INCREMENTAL STATS statements on Parquet tables.

  • SELECT statements. MT_DOP is 0 by default for SELECT statements but can be set to a value greater than 0 to control intra-node parallelism. This may be useful to tune query performance and in particular to reduce execution time of long-running, CPU-intensive queries.

  • DML statements. MT_DOP values greater than zero are not currently supported for DML statements, which produce an error if MT_DOP is set to a non-zero value.

  • In Impala 3.4 and earlier, not all SELECT statements support setting MT_DOP. Specifically, only scan and aggregation operators, and local joins that do not need data exchanges (such as for nested types) are supported. Other SELECT statements produce an error if MT_DOP is set to a non-zero value.

Type: integer

Default: 0

Because COMPUTE STATS and COMPUTE INCREMENTAL STATS statements for Parquet tables benefit substantially from extra intra-node parallelism, Impala automatically sets MT_DOP=4 when computing stats for Parquet tables.

Range: 0 to 64

Examples:

Note:

Any timing figures in the following examples were measured on a small, lightly loaded development cluster, so expect different numbers on your own system. Speedups depend on many factors, including the number of rows, columns, and partitions within each table.

The following example shows how to run a COMPUTE STATS statement against a Parquet table with or without an explicit MT_DOP setting:

  -- Explicitly setting MT_DOP to 0 selects the old code path.
  set mt_dop = 0;
  MT_DOP set to 0
  -- The analysis for the billion rows is distributed among hosts,
  -- but uses only a single core on each host.
  compute stats billion_rows_parquet;
  +-----------------------------------------+
  | summary                                 |
  +-----------------------------------------+
  | Updated 1 partition(s) and 2 column(s). |
  +-----------------------------------------+
  drop stats billion_rows_parquet;
  -- Using 4 logical processors per host is faster.
  set mt_dop = 4;
  MT_DOP set to 4
  compute stats billion_rows_parquet;
  +-----------------------------------------+
  | summary                                 |
  +-----------------------------------------+
  | Updated 1 partition(s) and 2 column(s). |
  +-----------------------------------------+
  drop stats billion_rows_parquet;
  -- Unsetting the option reverts back to its default.
  -- Which for COMPUTE STATS and a Parquet table is 4,
  -- so again it uses the fast path.
  unset MT_DOP;
  Unsetting option MT_DOP
  compute stats billion_rows_parquet;
  +-----------------------------------------+
  | summary                                 |
  +-----------------------------------------+
  | Updated 1 partition(s) and 2 column(s). |
  +-----------------------------------------+

The following example shows the effects of setting MT_DOP for a query on a Parquet table:

  set mt_dop = 0;
  MT_DOP set to 0
  -- COUNT(DISTINCT) for a unique column is CPU-intensive.
  select count(distinct id) from billion_rows_parquet;
  +--------------------+
  | count(distinct id) |
  +--------------------+
  | 1000000000         |
  +--------------------+
  Fetched 1 row(s) in 67.20s
  set mt_dop = 16;
  MT_DOP set to 16
  -- Introducing more intra-node parallelism for the aggregation
  -- speeds things up, and potentially reduces memory overhead by
  -- reducing the number of scanner threads.
  select count(distinct id) from billion_rows_parquet;
  +--------------------+
  | count(distinct id) |
  +--------------------+
  | 1000000000         |
  +--------------------+
  Fetched 1 row(s) in 17.19s
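
To look for the balance of response time, memory usage, and CPU usage mentioned at the top of this topic, one approach is to rerun the same query at a few intermediate settings and inspect each run. The following is a minimal sketch for an impala-shell session, not a tuning recipe. The values 4 and 8 are arbitrary choices for illustration, and the sketch assumes that the impala-shell in use provides the SUMMARY command to report per-operator timing and memory for the most recent query. No output is shown because results depend entirely on the cluster.

```sql
-- Illustrative values only; choose steps that suit the core count of your hosts.
set mt_dop = 4;
select count(distinct id) from billion_rows_parquet;
-- Assumption: SUMMARY is available and reports on the query that just ran.
summary;

set mt_dop = 8;
select count(distinct id) from billion_rows_parquet;
summary;

-- Return to the default when finished experimenting.
unset mt_dop;
```

Comparing the elapsed time and peak memory between runs, rather than jumping straight to a high value, makes it easier to see where extra parallelism stops paying off.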

The following example shows how a statement that is not compatible with non-zero MT_DOP settings, in this case a DML statement, produces an error when MT_DOP is set to a non-zero value:

  set mt_dop=1;
  MT_DOP set to 1
  insert into a1
  select * from a2;
  ERROR: NotImplementedException: MT_DOP not supported for DML statements.
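
One way to avoid this error, in releases where DML statements reject non-zero MT_DOP as described in this topic, is to return MT_DOP to 0 before running the DML statement. The following is a minimal sketch that assumes the same session runs both a tuned query and an INSERT. The tables a1, a2, and billion_rows_parquet are the ones used in the examples in this topic, and the value 8 is arbitrary.

```sql
-- Use extra intra-node parallelism for a CPU-intensive query.
set mt_dop = 8;
select count(distinct id) from billion_rows_parquet;

-- DML rejects a non-zero MT_DOP, so go back to 0 first.
set mt_dop = 0;
insert into a1
select * from a2;
```

Setting the option back explicitly keeps the session's state obvious when queries and DML statements are interleaved in the same script.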

Related information:

COMPUTE STATS Statement, Impala Aggregate Functions
