如何分析RT突然抖动的SQL?

    首先推荐使用外部诊断工具tars进行问题分析,如果没有tars则可使用以下方式:在线上如果出现RT抖动,但RT并不是持续很高的情况,可以考虑在抖动出现后,立刻将sql audit关闭(alter system set ob_enable_sql_audit = 0),从而确保该抖动的SQL请求在sql audit中存在;然后通过sql audit查询抖动附近那段时间RT TOP N的请求,分析有异常的SQL。如果在sql audit中找到了对应的RT异常请求,则可以分析该请求在sql audit中记录:

    • 查看retry次数是否很多(RETRY_CNT, 如果次数很多,则是否考虑是否有锁冲突或切主等情况)

    • 查看queue time是不是很大(QUEUE_TIME字段)

    • 查看获取执行计划时间(GET_PLAN_TIME), 如果时间很长,一般会伴随IS_HIT_PLAN = 0, 表示没有命中plan cache)

    • 查看EXECUTE_TIME是否很长,如果很长,则a. 查看是否有很长等待事件耗时。b. 分析逻辑读次数是否异常多(突然有大账户时可能会出现)。

    1. 逻辑读次数 = 2 * ROW_CACHE_HIT
    2. + 2 * BLOOM_FILTER_CACHE_HIT
    3. + BLOCK_INDEX_CACHE_HIT
    4. + BLOCK_CACHE_HIT + DISK_READS

    如果在sql audit中RT抖动的请求数据已淘汰,则需要查看observer中抖动时间点是否有慢查询的trace日志,如果有则分析trace日志;

    如何通过sql audit分析查询中等待事件?

    sql audit记录了等待事件如下相关信息:

    1. 记录了4大类等待事件分别的耗时(APPLICATION_WAIT_TIME, CONCURRENCY_WAIT_TIME, USER_IO_WAIT_TIME, SCHEDULE_TIME), 每类等待事件都涉及很多中具体的等待事件;

    2. 记录了耗时最多的等待事件名称(EVENT)及该等待事件耗时(WAIT_TIME_MICRO);

    3. 记录了所有等待事件的发生的次数(TOTAL_WAITS) 及所有等待事件总耗时(TOTAL_WAIT_TIME_MICRO)

    一般情况下,如果等待事件总耗时较多,我们通过查看耗时最多的等待事件名称(EVENT)能够基本确定是什么原因导致较慢, 如下面这个SQL, 等待事件主要耗时在IO等待上。

    1. select sql_id, elapsed_time, queue_time, get_plan_time, execute_time, application_wait_time, concurrency_wait_time, user_io_wait_time, schedule_time, event, wait_class, wait_time_micro, total_wait_time_micro
    2. from v$sql_audit
    3. where trace_id = 'YB420B84FE35-0005648A67211DC9'\G
    4. *************************** 1. row ***************************
    5. sql_id: 5316DBF96556040831142D61BBD9014F
    6. elapsed_time: 953
    7. queue_time: 18
    8. get_plan_time: 58
    9. execute_time: 867
    10. application_wait_time: 0
    11. concurrency_wait_time: 0
    12. user_io_wait_time: 550
    13. schedule_time: 0
    14. event: db file data index read
    15. wait_class: USER_IO
    16. wait_time_micro: 352
    17. total_wait_time_micro: 550

    找到所有SQL中平均执行时间排在TOP N的SQL

    1. select/*+ parallel(15)*/ avg_exe_usec, svr_ip, svr_port, sql_id, plan_id
    2. from oceanbase.gv$plan_cache_plan_stat
    3. where tenant_id = 1001
    4. order by avg_exe_usec desc
    5. limit 3\G;
    6. *************************** 1. row ***************************
    7. avg_exe_usec: 9795912
    8. svr_ip: 10.183.76.140
    9. svr_port: 2882
    10. sql_id: C5D91E6C772D1B87C32BB3C9ED1435E1
    11. plan_id: 4668689
    12. *************************** 2. row ***************************
    13. avg_exe_usec: 9435052
    14. svr_ip: 10.103.229.107
    15. svr_port: 2882
    16. sql_id: 3B6EFEEC8332EB2A0822A3EA7B769500
    17. plan_id: 4692858
    18. *************************** 3. row ***************************
    19. avg_exe_usec: 9335002
    20. svr_ip: 11.180.113.7
    21. svr_port: 2882
    22. sql_id: 3B6EFEEC8332EB2A0822A3EA7B769500
    23. plan_id: 4683085

    如何分析SQL查询一直比较慢的问题

    如果已知某条SQL查询一直比较慢,可以进行如下分析:1)类似问题1的步骤,分析该SQL执行是sql_audit统计数据;2) 分析下执行计划是否正确,见问题5

    如何查看执行计划形状并分析

    1. 通过Explain功能查看执行计划形状2)通过实时执行计划展示查看Plan Cache中缓存执行计划

    2. 通过1或2获得执行计划形状后,可分析索引选择连接顺序连接算法等选择是否合理正确;

    如何查看集群SQL请求流量是否均衡?

    1. select/*+ parallel(15)*/t2.zone, t1.svr_ip, count(*) as QPS, avg(t1.elapsed_time), avg(t1.queue_time)
    2. from oceanbase.gv$sql_audit t1, __all_server t2
    3. where t1.svr_ip = t2.svr_ip and IS_EXECUTOR_RPC = 0
    4. and request_time > (time_to_usec(now()) - 1000000)
    5. and request_time < time_to_usec(now())
    6. group by t1.svr_ip
    7. order by t2.zone;
    8. +--------+----------------+------+----------------------+--------------------+
    9. | zone | svr_ip | QPS | avg(t1.elapsed_time) | avg(t1.queue_time) |
    10. +--------+----------------+------+----------------------+--------------------+
    11. | ET2_1 | 10.103.224.119 | 379 | 5067.3034 | 33.7071 |
    12. | ET2_1 | 10.103.226.124 | 507 | 5784.1538 | 12.5878 |
    13. | ET2_1 | 10.103.228.177 | 370 | 5958.2162 | 10.9811 |
    14. | ET2_1 | 10.103.229.107 | 356 | 5730.9972 | 39.4185 |
    15. | ET2_1 | 10.103.229.94 | 369 | 5851.7886 | 64.9621 |
    16. | EU13_2 | 10.183.78.113 | 354 | 6182.6384 | 11.3107 |
    17. | EU13_2 | 10.183.78.86 | 349 | 5881.3209 | 10.7393 |
    18. | EU13_2 | 10.183.79.56 | 347 | 5936.0144 | 11.9049 |
    19. | EU13_2 | 10.183.85.152 | 390 | 5988.4846 | 12.0487 |
    20. | EU13_3 | 10.183.76.140 | 284 | 5657.2218 | 11.7993 |
    21. | EU13_3 | 10.183.78.165 | 372 | 5360.6989 | 11.6290 |
    22. | EU13_3 | 10.183.79.198 | 416 | 4154.2861 | 12.2524 |
    23. | EU13_3 | 10.183.86.65 | 446 | 6487.6009 | 24.5112 |
    24. | EU13_3 | 11.180.113.7 | 364 | 5444.4203 | 12.3462 |
    25. +--------+----------------+------+----------------------+--------------------+

    如何查看分布式计划rpc执行数是否均衡?

    1. select/*+ parallel(15)*/t2.zone, t1.svr_ip, count(*) as RPC_COUNT, avg(t1.elapsed_time), avg(t1.queue_time)
    2. from oceanbase.gv$sql_audit t1, __all_server t2
    3. where t1.svr_ip = t2.svr_ip and IS_EXECUTOR_RPC = 1
    4. and tenant_id = 1001
    5. and request_time > (time_to_usec(now()) - 1000000)
    6. and request_time < time_to_usec(now())
    7. group by t1.svr_ip
    8. order by t2.zone;
    9. +--------+----------------+-----------+----------------------+--------------------+
    10. | zone | svr_ip | RPC_COUNT | avg(t1.elapsed_time) | avg(t1.queue_time) |
    11. +--------+----------------+-----------+----------------------+--------------------+
    12. | ET2_1 | 10.103.224.119 | 2517 | 514.2241 | 13.5515 |
    13. | ET2_1 | 10.103.226.124 | 2786 | 1628.0948 | 13.2915 |
    14. | ET2_1 | 10.103.228.177 | 3068 | 1984.0238 | 12.9029 |
    15. | ET2_1 | 10.103.229.107 | 3216 | 538.7646 | 12.8629 |
    16. | ET2_1 | 10.103.229.94 | 2228 | 802.8577 | 13.4138 |
    17. | EU13_2 | 10.183.78.113 | 2000 | 805.0485 | 13.0610 |
    18. | EU13_2 | 10.183.78.86 | 3296 | 1115.0725 | 13.2700 |
    19. | EU13_2 | 10.183.79.56 | 2460 | 1129.4085 | 14.3293 |
    20. | EU13_2 | 10.183.85.152 | 2533 | 891.0683 | 13.8602 |
    21. | EU13_3 | 10.183.76.140 | 3045 | 677.6591 | 13.7209 |
    22. | EU13_3 | 10.183.78.165 | 2202 | 821.9496 | 12.8247 |
    23. | EU13_3 | 10.183.79.198 | 2825 | 1277.0375 | 13.3345 |
    24. | EU13_3 | 10.183.86.65 | 2142 | 746.0808 | 13.0121 |
    25. | EU13_3 | 11.180.113.7 | 2735 | 765.8731 | 12.4750 |
    26. +--------+----------------+-----------+----------------------+--------------------+

    查询SQL流量分布情况及QPS

    1. select/*+ parallel(15)*/t2.zone, t1.svr_ip, count(*) as RPC_COUNT, avg(t1.elapsed_time), avg(t1.queue_time)
    2. from oceanbase.gv$sql_audit t1, __all_server t2
    3. where t1.svr_ip = t2.svr_ip
    4. and tenant_id = 1001
    5. and SQL_ID = 'BF7AA13A28DF50BA5C33FF19F1DBD8A9'
    6. and IS_EXECUTOR_RPC = 0
    7. and request_time > (time_to_usec(now()) - 1000000)
    8. and request_time < time_to_usec(now())
    9. group by t1.svr_ip

    查询某段时间内请求次数排在TOP N的SQL

    1. select/*+ parallel(15)*/ SQL_ID, count(*) as QPS, avg(t1.elapsed_time) RT from oceanbase.gv$sql_audit t1 where tenant_id = 1001 and IS_EXECUTOR_RPC = 0 and request_time > (time_to_usec(now()) - 10000000) and request_time < time_to_usec(now()) group by t1.sql_id order by QPS desc limit 10;
    2. +----------------------------------+------+------------+
    3. | SQL_ID | QPS | RT |
    4. +----------------------------------+------+------------+
    5. | BF7AA13A28DF50BA5C33FF19F1DBD8A9 | 2523 | 4233.2085 |
    6. | CE7208ADDE365D0AB5E68EE24E5FD730 | 1268 | 5935.8683 |
    7. | E5C7494018989226E69AE7D08B3D0F15 | 1028 | 7275.7490 |
    8. | D0E8D8C937E44BC3BB9A5379AE1064C5 | 1000 | 12999.1640 |
    9. | 2D45D7BE4E459CFBEAE4803971F0C6F9 | 1000 | 8050.6360 |
    10. | C81CE9AA555BE59B088B379CC7AE5B40 | 1000 | 6865.4940 |
    11. | BDC4FE903B414203A04E41C7DDA6627D | 1000 | 12751.8960 |
    12. | B1B136047D7C3B6B9125F095363A9D23 | 885 | 13293.2237 |
    13. | 47993DD69888868E92A7CAB2FDE65380 | 880 | 7282.0557 |
    14. | 05C6279D767C7F212619BF4B659D3BAB | 844 | 11474.5438 |
    15. +----------------------------------+------+------------+

    查询某段时间内平均RT排在TOP N的SQL

    1. select/*+ parallel(15)*/ SQL_ID, count(*) as QPS, avg(t1.elapsed_time) RT
    2. from oceanbase.gv$sql_audit t1
    3. where tenant_id = 1001 and IS_EXECUTOR_RPC = 0
    4. and request_time > (time_to_usec(now()) - 10000000)
    5. and request_time < time_to_usec(now())
    6. group by t1.sql_id
    7. order by RT desc
    8. limit 10;
    9. +----------------------------------+------+------------+
    10. | SQL_ID | QPS | RT |
    11. +----------------------------------+------+------------+
    12. | 0A3D3DCB3343BBBB10E4B4B9777B77FC | 1 | 53618.0000 |
    13. | A3831961C337545AF5BD1219BE29867A | 1 | 50764.0000 |
    14. | F3DC5EF627DA63AE52044FCE7732267C | 1 | 48497.0000 |
    15. | 39C63F143FDDACAEC090F480789DBCA5 | 1 | 47035.0000 |
    16. | A3BF306B02FF86E76C96C9CEFADBDB7E | 1 | 45553.0000 |
    17. | 7942E8D29BAFBF23EF3E3D29D55F428A | 1 | 45285.0000 |
    18. | 20989A74CC1703664BDE9D6EA7830C24 | 1 | 39143.0000 |
    19. | 80F40791E76C79D3DCD46FEEFFAB338E | 1 | 37654.0000 |
    20. | 07E2FE351E3DD82843E81930B84D3DDE | 1 | 37231.0000 |
    21. | 11B19DB5A1393590ABBE08005C155B2E | 1 | 37139.0000 |
    22. +----------------------------------+------+------------+

    查询某段时间内执行时间排TOP N的请求

    1. select/*+ parallel(15)*/ sql_id, elapsed_time , trace_id
    2. from oceanbase.gv$sql_audit
    3. where tenant_id = 1001
    4. and IS_EXECUTOR_RPC = 0
    5. and request_time > (time_to_usec(now()) - 10000000)
    6. and request_time < time_to_usec(now())
    7. order by elapsed_time desc
    8. limit 10;
    9. +----------------------------------+--------------+-------------------------------+
    10. | sql_id | elapsed_time | trace_id |
    11. +----------------------------------+--------------+-------------------------------+
    12. | CFA269275E3BB270408747C01F64D837 | 87381 | YB420AB75598-0005634FBED5C5E8 |
    13. | 1979A5B4A27D5C3DBE08F80383FD6EB6 | 83465 | YB420AB74E56-0005634B4B87353B |
    14. | 51248E6C3BB5EF1FC4E8E79CA685723E | 82767 | YB420AB74E56-0005634B4B82E7E1 |
    15. | 249C40E669DFCCE80E3D11446272FA11 | 79919 | YB420A67E27C-00056349549A79D3 |
    16. | BEFAD568C3858D2C2E35F01558CBEC06 | 77210 | YB420A67E4B1-00056345B0F2E97E |
    17. | FB1A6A8BC4125C324A38F91B3808D364 | 75870 | YB420AB74E71-00056347074261E6 |
    18. | 0343A519C0C5BF31C68CB68F63721990 | 75666 | YB420BB47107-00056346A5A631FB |
    19. | B140BB0C671D9B8616FB048544F3B85B | 73902 | YB420A67E56B-00056342A5A4683E |
    20. | 4F9B1D0A3822A3E0DF69DB11ABFBE0EA | 72963 | YB420BB47107-00056346A5AAC5F5 |
    21. | 9963E8D252E6CBA72FBA45AC5790BA11 | 72354 | YB420A67E56B-00056342A5A66089 |
    22. +----------------------------------+--------------+-------------------------------+

    判断系统或某个SQL的执行是否出现大量请求不合理的使用了远程执行

    该SQL能够分析出某段时间内不同类型的计划执行次数,一般情况下,如果出现远程执行比较多时可能时出现切主或proxy客户端路由不准的情况。

    1. select count(*), plan_type
    2. from oceanbase.gv$sql_audit
    3. where tenant_id = 1001
    4. and IS_EXECUTOR_RPC = 0
    5. and request_time > (time_to_usec(now()) - 10000000)
    6. and request_time < time_to_usec(now())
    7. group by plan_type ;

    找出某个租户中全表扫描的SQL

    1. select query_sql
    2. from oceanbase.gv$sql_audit
    3. where table_scan = 1 and tenant_id = 1001
    4. group by sql_id;

    分布式计划如何分析查询问题

    1)先确定是否为分布式计划, (g)v$plan_cache_plan_stat, (g)v$sql_audit中均对执行计划类型有记录;2)分析该执行计划是否正确;3)通过trace_id关联查询gv$sql_audit, 查看所有执行的子计划耗时情况,每个子计划的RPC执行均对应一条sql_audit记录,分析该sql_audit记录来定位问题, 如下举例:is_executor_rpc = 1表示子计划执行在sql_audit中记录,主要记录执行相关信息。is_executor_rpc = 0表示接受sql请求的线程在sql_audit中的记录, 该记录含有SQL执行过程的信息,包括SQL信息,获取执行计划信息等。

    1. select/*+ parallel(15)*/ sql_id, is_executor_rpc, elapsed_time
    2. from oceanbase.gv$sql_audit
    3. where trace_id = 'YB420AB74FC6-00056349D323483A';
    4. +----------------------------------+-----------------+--------------+
    5. | sql_id | is_executor_rpc | elapsed_time |
    6. +----------------------------------+-----------------+--------------+
    7. | | 1 | 124 |
    8. | | 1 | 191 |
    9. | | 1 | 123447 |
    10. | | 1 | 125 |
    11. | 20172B18BC9EE3F806D4149895754CE0 | 0 | 125192 |
    12. | | 1 | 148 |
    13. | | 1 | 149 |
    14. | | 1 | 140 |
    15. +----------------------------------+-----------------+--------------+