OceanBase Database currently supports three join algorithms: Nested Loop Join, Merge Join, and Hash Join. Hash Join and Merge Join apply only to equality join conditions, whereas Nested Loop Join can be used with any join condition.

Nested Loop Join

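A Nested Loop Join scans one table (the outer table) and, for each row read from it, "scans" the other table (the inner table) to find the rows that satisfy the join condition. Here, the "scan" can be either a fast index lookup or a full table scan. A full table scan for every outer row usually performs poorly, so if no index exists on the join-condition columns, the optimizer generally does not choose Nested Loop Join. The OceanBase Database execution plan shows whether the inner table can be located quickly through an index.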
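
The two ways of "scanning" the inner table can be sketched as follows. This is a minimal illustration in Python, not OceanBase's implementation: the function names, the dictionary-based `build_index` helper (a stand-in for a secondary index), and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict

def nested_loop_join(outer, inner, predicate):
    """Fully scan `inner` once per outer row; works with any predicate."""
    result = []
    for o in outer:              # scan the outer table
        for i in inner:          # full scan of the inner table for this outer row
            if predicate(o, i):
                result.append((o, i))
    return result

def build_index(rows, column):
    """Hypothetical stand-in for a secondary index: column value -> rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index

def index_nested_loop_join(outer, inner_index, outer_column):
    """Probe an index on the inner table instead of scanning it in full."""
    result = []
    for o in outer:
        # Index lookup: only the matching inner rows are touched.
        for i in inner_index.get(o[outer_column], []):
            result.append((o, i))
    return result

t1 = [{"a": 1, "b": 10, "c": 7}, {"a": 2, "b": 20, "c": 8}]
t2 = [{"a": 1, "b": 10, "c": 8}, {"a": 2, "b": 10, "c": 9}]

# Join on c without an index: t2 is scanned once per row of t1.
by_c = nested_loop_join(t1, t2, lambda o, i: o["c"] == i["c"])
# Join on b through an index on t2.b.
by_b = index_nested_loop_join(t1, build_index(t2, "b"), "b")
# A non-equality condition also works with the plain nested loop.
less_than = nested_loop_join(t1, t2, lambda o, i: o["c"] < i["c"])
```

The plain version accepts an arbitrary predicate, which is why Nested Loop Join is not limited to equality conditions, while the index-probing version depends on an index over the inner table's join column.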

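In the following example, the first plan performs a full table scan on the inner table, because the join condition is t1.c = t2.c and t2 has no index on c. The second plan can use an index to quickly find the matching rows in the inner table: the join condition is t1.b = t2.b, and t2 uses the index k1 created on column b as its access path. As a result, for the b value of each row in t1, t2 can quickly locate the matching rows through the index. In the plan output, the first plan applies the condition as filter([? = t2.c]), while the second plan lists it as range_cond([? = t2.b]) on t2(k1).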

    OceanBase (root@test)> create table t1(a int primary key, b int, c int, key k1(b));
    Query OK, 0 rows affected (0.24 sec)
    OceanBase (root@test)> create table t2(a int primary key, b int, c int, key k1(b));
    Query OK, 0 rows affected (0.29 sec)
    OceanBase (root@test)> explain extended_noaddr select/*+use_nl(t1 t2)*/ * from t1, t2 where t1.c = t2.c;
    ===========================================
    |ID|OPERATOR        |NAME|EST. ROWS|COST  |
    -------------------------------------------
    |0 |NESTED-LOOP JOIN|    |1980     |623742|
    |1 | TABLE SCAN     |t1  |1000     |455   |
    |2 | TABLE SCAN     |t2  |2        |622   |
    ===========================================
    Outputs & filters:
    -------------------------------------
      0 - output([t1.a], [t1.b], [t1.c], [t2.a], [t2.b], [t2.c]), filter(nil),
          conds(nil), nl_params_([t1.c])
      1 - output([t1.c], [t1.a], [t1.b]), filter(nil),
          access([t1.c], [t1.a], [t1.b]), partitions(p0),
          is_index_back=false,
          range_key([t1.a]), range(MIN ; MAX)always true
      2 - output([t2.c], [t2.a], [t2.b]), filter([? = t2.c]),
          access([t2.c], [t2.a], [t2.b]), partitions(p0),
          is_index_back=false, filter_before_indexback[false],
          range_key([t2.a]), range(MIN ; MAX)
    OceanBase (root@test)> explain extended_noaddr select/*+use_nl(t1 t2)*/ * from t1, t2 where t1.b = t2.b;
    ============================================
    |ID|OPERATOR        |NAME  |EST. ROWS|COST |
    --------------------------------------------
    |0 |NESTED-LOOP JOIN|      |1980     |94876|
    |1 | TABLE SCAN     |t1    |1000     |455  |
    |2 | TABLE SCAN     |t2(k1)|2        |94   |
    ============================================
    Outputs & filters:
    -------------------------------------
      0 - output([t1.a], [t1.b], [t1.c], [t2.a], [t2.b], [t2.c]), filter(nil),
          conds(nil), nl_params_([t1.b])
      1 - output([t1.b], [t1.a], [t1.c]), filter(nil),
          access([t1.b], [t1.a], [t1.c]), partitions(p0),
          is_index_back=false,
          range_key([t1.a]), range(MIN ; MAX)always true
      2 - output([t2.b], [t2.a], [t2.c]), filter(nil),
          access([t2.b], [t2.a], [t2.c]), partitions(p0),
          is_index_back=true,
          range_key([t2.b], [t2.a]), range(MIN ; MAX),
          range_cond([? = t2.b])

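A Nested Loop Join may perform multiple full table scans on the inner table. Each scan has to iterate from the storage layer again, which is relatively expensive. OceanBase Database therefore supports scanning the inner table once and materializing the result in memory, so that subsequent scans read the data directly from memory instead of going back to the storage layer. Materializing in memory has its own cost, however, so the OceanBase Database optimizer decides based on cost whether to materialize the inner table.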
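
The effect of materialization on the number of storage-layer scans can be sketched as follows. The `StorageTable` class and both join functions are hypothetical names introduced only for this illustration. They model the idea described above and do not reflect how OceanBase implements its MATERIAL operator.

```python
class StorageTable:
    """Hypothetical storage-layer table that counts how often it is scanned."""

    def __init__(self, rows):
        self._rows = rows
        self.scan_count = 0

    def scan(self):
        # Each call models one full iteration from the storage layer.
        self.scan_count += 1
        return list(self._rows)

def nl_join_rescan(outer, inner_table, predicate):
    """Re-scan the inner table from storage for every outer row."""
    result = []
    for o in outer:
        for i in inner_table.scan():
            if predicate(o, i):
                result.append((o, i))
    return result

def nl_join_materialized(outer, inner_table, predicate):
    """Scan the inner table once, keep the rows in memory, and reuse them."""
    materialized = inner_table.scan()   # single storage-layer scan
    result = []
    for o in outer:
        for i in materialized:          # later passes read from memory only
            if predicate(o, i):
                result.append((o, i))
    return result

outer_rows = [1, 2, 3]
rescanned = StorageTable([2, 3, 4])
cached = StorageTable([2, 3, 4])

pairs_rescan = nl_join_rescan(outer_rows, rescanned, lambda o, i: o == i)
pairs_cached = nl_join_materialized(outer_rows, cached, lambda o, i: o == i)
```

Both functions return the same pairs, but the first scans storage once per outer row and the second only once. The memory held by `materialized` is the price paid for that saving, which is why the choice is left to a cost-based decision.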

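An optimized variant of Nested Loop Join is Blocked Nested Loop Join. The difference is that it reads a block of rows from the outer table each time and then scans the inner table to find the rows that match any row in the block. The benefit is that the inner table is read fewer times.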
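
A minimal sketch of the blocked variant is shown below. The function name, the `block_size` parameter, and the `stats` counter are assumptions made for the example. The counter is there only to make the reduced number of inner-table passes visible.

```python
def blocked_nested_loop_join(outer, inner_scan, predicate, block_size):
    """Read the outer table block by block; scan the inner table once per block."""
    result = []
    for start in range(0, len(outer), block_size):
        block = outer[start:start + block_size]   # one block of outer rows
        for i in inner_scan():                    # one inner pass serves the whole block
            for o in block:
                if predicate(o, i):
                    result.append((o, i))
    return result

outer = [1, 2, 3, 4, 5]
inner = [2, 4, 6]
stats = {"inner_passes": 0}

def inner_scan():
    """Hypothetical inner-table scan that records how many passes were made."""
    stats["inner_passes"] += 1
    return list(inner)

pairs = blocked_nested_loop_join(outer, inner_scan, lambda o, i: o == i, block_size=2)
```

With five outer rows and a block size of 2, the inner table is read three times rather than five. A larger block reduces the number of passes further at the cost of holding more outer rows in memory.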

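Nested Loop Join usually performs well when the outer table has few rows and the inner table has an index on the join-condition columns, because each row of the outer table can then use the index to quickly locate its matching rows.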

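OceanBase Database also provides the hint /*+ use_nl(table_name_list) */ to explicitly select Nested Loop Join for a multi-table join. For example, in the following scenario the optimizer chooses Hash Join; if you want Nested Loop Join instead, you can use this hint.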

    OceanBase(TEST@TEST)>create table t1(c1 int, c2 int);
    Query OK, 0 rows affected (0.97 sec)
    OceanBase(TEST@TEST)>create table t2(c1 int, c2 int);
    Query OK, 0 rows affected (0.29 sec)
    OceanBase(TEST@TEST)>explain select * from t1,t2 where t1.c1 = t2.c1;
    ========================================
    |ID|OPERATOR   |NAME|EST. ROWS|COST    |
    ----------------------------------------
    |0 |HASH JOIN  |    |98010000 |66774608|
    |1 | TABLE SCAN|T1  |100000   |68478   |
    |2 | TABLE SCAN|T2  |100000   |68478   |
    ========================================
    Outputs & filters:
    -------------------------------------
      0 - output([T1.C1], [T1.C2], [T2.C1], [T2.C2]), filter(nil),
          equal_conds([T1.C1 = T2.C1]), other_conds(nil)
      1 - output([T1.C1], [T1.C2]), filter(nil),
          access([T1.C1], [T1.C2]), partitions(p0)
      2 - output([T2.C1], [T2.C2]), filter(nil),
          access([T2.C1], [T2.C2]), partitions(p0)
    OceanBase(TEST@TEST)>explain select /*+use_nl(t1, t2)*/* from t1, t2 where t1.c1 = t2.c1;
    ===============================================
    |ID|OPERATOR        |NAME|EST. ROWS|COST      |
    -----------------------------------------------
    |0 |NESTED-LOOP JOIN|    |98010000 |4595346207|
    |1 | TABLE SCAN     |T1  |100000   |68478     |
    |2 | MATERIAL       |    |100000   |243044    |
    |3 |  TABLE SCAN    |T2  |100000   |68478     |
    ===============================================
    Outputs & filters:
    -------------------------------------
      0 - output([T1.C1], [T1.C2], [T2.C1], [T2.C2]), filter(nil),
          conds([T1.C1 = T2.C1]), nl_params_(nil)
      1 - output([T1.C1], [T1.C2]), filter(nil),
          access([T1.C1], [T1.C2]), partitions(p0)
      2 - output([T2.C1], [T2.C2]), filter(nil)
      3 - output([T2.C1], [T2.C2]), filter(nil),
          access([T2.C1], [T2.C2]), partitions(p0)

Merge Join

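Merge Join first sorts both tables on the join columns (an external sort is needed if memory is insufficient) and then scans the two sorted tables and merges them. The merge starts by taking one row from each table and comparing them. If the rows satisfy the join condition, they are added to the result set. Otherwise, the row with the smaller join-column value is discarded, and the next row is taken from the same table to continue matching, until the loop ends.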

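Merging two tables in a many-to-many relationship usually requires temporary space. For example, when A join B uses Merge Join and, for some value of the join columns, both A and B contain multiple rows A1, A2…An and B1, B2…Bn, then each of A1, A2…An must be matched against all the equal rows B1, B2…Bn. The pointer therefore has to move from B1 to Bn repeatedly, reading B1…Bn each time. Reading B1…Bn in advance into an in-memory temporary table is faster than reading them from the original data pages or from disk. In some scenarios, if a usable index exists on the join columns and its order matches the required sort order, the sort can be skipped entirely.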
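
The sort, merge, and many-to-many buffering steps can be sketched as follows. This is an illustrative Python version under simplifying assumptions: both inputs fit in memory (no external sort), the `presorted` flag is a hypothetical parameter that models inputs already ordered by an index, and the `group` list plays the role of the in-memory temporary table.

```python
def merge_join(left, right, key, presorted=False):
    """Equality join of two row lists on key(row) using sort and merge."""
    if not presorted:
        # Sort both inputs on the join key; skipped when an index already
        # delivers the rows in join-key order.
        left = sorted(left, key=key)
        right = sorted(right, key=key)
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1                      # discard the smaller row, advance that side
        elif lk > rk:
            j += 1
        else:
            # Buffer every right row with this key (B1..Bn) once ...
            group = []
            while j < len(right) and key(right[j]) == lk:
                group.append(right[j])
                j += 1
            # ... then match each left row with the same key (A1..An) against it.
            while i < len(left) and key(left[i]) == lk:
                for r in group:
                    result.append((left[i], r))
                i += 1
    return result

A = [("a3", 2), ("a1", 1), ("a2", 2)]
B = [("b1", 2), ("b3", 3), ("b2", 2)]
joined = merge_join(A, B, key=lambda row: row[1])
```

Because the right-hand rows for a key are buffered once, the right input is never rewound. Each side is read in a single forward pass after sorting.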

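Generally, Merge Join suits cases where both input tables are already sorted; otherwise Hash Join is usually the better choice. The following example shows two Merge Join plans. The first requires sorting. The second does not, because both tables use their index k1 as the access path, and these indexes are already ordered by b.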

    OceanBase (root@test)> create table t1(a int primary key, b int, c int, key k1(b));
    Query OK, 0 rows affected (0.24 sec)
    OceanBase (root@test)> create table t2(a int primary key, b int, c int, key k1(b));
    Query OK, 0 rows affected (0.29 sec)
    OceanBase (root@test)> explain select/*+use_merge(t1 t2)*/ * from t1, t2 where t1.c = t2.c;
    =====================================
    |ID|OPERATOR    |NAME|EST. ROWS|COST|
    -------------------------------------
    |0 |MERGE JOIN  |    |1980     |6011|
    |1 | SORT       |    |1000     |2198|
    |2 |  TABLE SCAN|t1  |1000     |455 |
    |3 | SORT       |    |1000     |2198|
    |4 |  TABLE SCAN|t2  |1000     |455 |
    =====================================
    Outputs & filters:
    -------------------------------------
      0 - output([t1.a], [t1.b], [t1.c], [t2.a], [t2.b], [t2.c]), filter(nil),
          equal_conds([t1.c = t2.c]), other_conds(nil)
      1 - output([t1.a], [t1.b], [t1.c]), filter(nil), sort_keys([t1.c, ASC])
      2 - output([t1.c], [t1.a], [t1.b]), filter(nil),
          access([t1.c], [t1.a], [t1.b]), partitions(p0)
      3 - output([t2.a], [t2.b], [t2.c]), filter(nil), sort_keys([t2.c, ASC])
      4 - output([t2.c], [t2.a], [t2.b]), filter(nil),
          access([t2.c], [t2.a], [t2.b]), partitions(p0)
    OceanBase (root@test)> explain select/*+use_merge(t1 t2),index(t1 k1),index(t2 k1)*/ * from t1, t2 where t1.b = t2.b;
    =======================================
    |ID|OPERATOR   |NAME  |EST. ROWS|COST |
    ---------------------------------------
    |0 |MERGE JOIN |      |1980     |12748|
    |1 | TABLE SCAN|t1(k1)|1000     |5566 |
    |2 | TABLE SCAN|t2(k1)|1000     |5566 |
    =======================================
    Outputs & filters:
    -------------------------------------
      0 - output([t1.a], [t1.b], [t1.c], [t2.a], [t2.b], [t2.c]), filter(nil),
          equal_conds([t1.b = t2.b]), other_conds(nil)
      1 - output([t1.b], [t1.a], [t1.c]), filter(nil),
          access([t1.b], [t1.a], [t1.c]), partitions(p0)
      2 - output([t2.b], [t2.a], [t2.c]), filter(nil),
          access([t2.b], [t2.a], [t2.c]), partitions(p0)

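OceanBase Database also provides the hint /*+ use_merge(table_name_list) */ to explicitly select Merge Join for a multi-table join. For example, in the following scenario the optimizer chooses Hash Join; if you want Merge Join instead, you can use this hint.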

    OceanBase(TEST@TEST)>create table t1(c1 int, c2 int);
    Query OK, 0 rows affected (0.97 sec)
    OceanBase(TEST@TEST)>create table t2(c1 int, c2 int);
    Query OK, 0 rows affected (0.29 sec)
    OceanBase(TEST@TEST)>explain select * from t1,t2 where t1.c1 = t2.c1;
    ========================================
    |ID|OPERATOR   |NAME|EST. ROWS|COST    |
    ----------------------------------------
    |0 |HASH JOIN  |    |98010000 |66774608|
    |1 | TABLE SCAN|T1  |100000   |68478   |
    |2 | TABLE SCAN|T2  |100000   |68478   |
    ========================================
    Outputs & filters:
    -------------------------------------
      0 - output([T1.C1], [T1.C2], [T2.C1], [T2.C2]), filter(nil),
          equal_conds([T1.C1 = T2.C1]), other_conds(nil)
      1 - output([T1.C1], [T1.C2]), filter(nil),
          access([T1.C1], [T1.C2]), partitions(p0)
      2 - output([T2.C1], [T2.C2]), filter(nil),
          access([T2.C1], [T2.C2]), partitions(p0)
    OceanBase(TEST@TEST)>explain select /*+use_merge(t1,t2)*/* from t1, t2 where t1.c1 = t2.c1;
    =========================================
    |ID|OPERATOR    |NAME|EST. ROWS|COST    |
    -----------------------------------------
    |0 |MERGE JOIN  |    |98010000 |67488837|
    |1 | SORT       |    |100000   |563680  |
    |2 |  TABLE SCAN|T1  |100000   |68478   |
    |3 | SORT       |    |100000   |563680  |
    |4 |  TABLE SCAN|T2  |100000   |68478   |
    =========================================
    Outputs & filters:
    -------------------------------------
      0 - output([T1.C1], [T1.C2], [T2.C1], [T2.C2]), filter(nil),
          equal_conds([T1.C1 = T2.C1]), other_conds(nil)
      1 - output([T1.C1], [T1.C2]), filter(nil), sort_keys([T1.C1, ASC])
      2 - output([T1.C1], [T1.C2]), filter(nil),
          access([T1.C1], [T1.C2]), partitions(p0)
      3 - output([T2.C1], [T2.C2]), filter(nil), sort_keys([T2.C1, ASC])
      4 - output([T2.C1], [T2.C2]), filter(nil),
          access([T2.C1], [T2.C2]), partitions(p0)

Hash Join

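Hash Join builds a hash table on the join condition from the smaller of the two tables (usually called the build table), then scans the larger table (usually called the probe table) row by row and probes the hash table to find the matching rows. If the build table is so large that the hash table does not fit in memory, OceanBase Database splits both the build table and the probe table into multiple partitions by the join condition. Each partition contains an independent, matched pair of build-table and probe-table rows, so one large hash join becomes multiple independent hash joins that do not affect each other, each of which can be completed in memory. In most cases, Hash Join is more efficient than the other join methods.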
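
The build and probe phases, and the partitioned fallback, can be sketched as follows. The function names and the `num_partitions` parameter are assumptions for this illustration. In a real system the partitions would be written to disk and processed one at a time, whereas this sketch simply keeps them in Python lists.

```python
from collections import defaultdict

def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on the smaller input, then probe it with the larger one."""
    table = defaultdict(list)
    for b in build:                              # build phase
        table[build_key(b)].append(b)
    result = []
    for p in probe:                              # probe phase, row by row
        for b in table.get(probe_key(p), []):
            result.append((b, p))
    return result

def partitioned_hash_join(build, probe, build_key, probe_key, num_partitions):
    """Split both inputs by a hash of the join key, then join partition pairs."""
    build_parts = [[] for _ in range(num_partitions)]
    probe_parts = [[] for _ in range(num_partitions)]
    for b in build:
        build_parts[hash(build_key(b)) % num_partitions].append(b)
    for p in probe:
        probe_parts[hash(probe_key(p)) % num_partitions].append(p)
    result = []
    # Rows with equal keys land in the same partition number on both sides,
    # so each pair can be joined independently of the others.
    for build_part, probe_part in zip(build_parts, probe_parts):
        result.extend(hash_join(build_part, probe_part, build_key, probe_key))
    return result

small = [(1, "x"), (2, "y")]
large = [(1, "p"), (1, "q"), (3, "r"), (2, "s")]
first = lambda row: row[0]

direct = hash_join(small, large, first, first)
split = partitioned_hash_join(small, large, first, first, num_partitions=2)
```

Both calls produce the same set of joined pairs, though the partitioned version may emit them in a different order. Since matching relies on key equality through the hash, the technique applies only to equality join conditions.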

The following is an example of a Hash Join plan:

    OceanBase (root@test)> create table t1(a int primary key, b int, c int, key k1(b));
    Query OK, 0 rows affected (0.24 sec)
    OceanBase (root@test)> create table t2(a int primary key, b int, c int, key k1(b));
    Query OK, 0 rows affected (0.29 sec)
    OceanBase (root@test)> explain select/*+use_hash(t1 t2)*/ * from t1, t2 where t1.c = t2.c;
    ====================================
    |ID|OPERATOR   |NAME|EST. ROWS|COST|
    ------------------------------------
    |0 |HASH JOIN  |    |1980     |4093|
    |1 | TABLE SCAN|t1  |1000     |455 |
    |2 | TABLE SCAN|t2  |1000     |455 |
    ====================================
    Outputs & filters:
    -------------------------------------
      0 - output([t1.a], [t1.b], [t1.c], [t2.a], [t2.b], [t2.c]), filter(nil),
          equal_conds([t1.c = t2.c]), other_conds(nil)
      1 - output([t1.c], [t1.a], [t1.b]), filter(nil),
          access([t1.c], [t1.a], [t1.b]), partitions(p0)
      2 - output([t2.c], [t2.a], [t2.b]), filter(nil),
          access([t2.c], [t2.a], [t2.b]), partitions(p0)

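OceanBase Database also provides the hint /*+ use_hash(table_name_list) */ to explicitly select Hash Join for a multi-table join. For example, in the following scenario the optimizer chooses Merge Join; if you want Hash Join instead, you can use this hint.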

    OceanBase(TEST@TEST)>create table t1(c1 int, c2 int, primary key(c1));
    Query OK, 0 rows affected (0.31 sec)
    OceanBase(TEST@TEST)>create table t2(c1 int, c2 int, primary key(c1));
    Query OK, 0 rows affected (0.33 sec)
    OceanBase(TEST@TEST)>explain select * from t1, t2 where t1.c1 = t2.c1;
    ======================================
    |ID|OPERATOR   |NAME|EST. ROWS|COST  |
    --------------------------------------
    |0 |MERGE JOIN |    |100001   |219005|
    |1 | TABLE SCAN|T1  |100000   |61860 |
    |2 | TABLE SCAN|T2  |100000   |61860 |
    ======================================
    Outputs & filters:
    -------------------------------------
      0 - output([T1.C1], [T1.C2], [T2.C1], [T2.C2]), filter(nil),
          equal_conds([T1.C1 = T2.C1]), other_conds(nil)
      1 - output([T1.C1], [T1.C2]), filter(nil),
          access([T1.C1], [T1.C2]), partitions(p0)
      2 - output([T2.C1], [T2.C2]), filter(nil),
          access([T2.C1], [T2.C2]), partitions(p0)
    OceanBase(TEST@TEST)>explain select /*+use_hash(t1, t2)*/ * from t1, t2 where t1.c1 = t2.c1;
    ======================================
    |ID|OPERATOR   |NAME|EST. ROWS|COST  |
    --------------------------------------
    |0 |HASH JOIN  |    |100001   |495180|
    |1 | TABLE SCAN|T1  |100000   |61860 |
    |2 | TABLE SCAN|T2  |100000   |61860 |
    ======================================
    Outputs & filters:
    -------------------------------------
      0 - output([T1.C1], [T1.C2], [T2.C1], [T2.C2]), filter(nil),
          equal_conds([T1.C1 = T2.C1]), other_conds(nil)
      1 - output([T1.C1], [T1.C2]), filter(nil),
          access([T1.C1], [T1.C2]), partitions(p0)
      2 - output([T2.C1], [T2.C2]), filter(nil),
          access([T2.C1], [T2.C2]), partitions(p0)