背景

车联网场景,一个行程由若干个点组成,业务需要按行程查询,按。

但是问题来了:

一个行程包含多个点,那么一次查询意味着需要查询并返回多个点(多条记录)。

由于有多个设备(例如汽车),大概率可能同时活跃,同时上传点,因此以堆表存储的话,一个行程中的多条记录,实际上在数据库的表里面,存储在多个BLOCK里面。

如上分析,性能问题:IO放大。

如何优化?

1、行程记录,按行聚集存储。

类似cluster操作。

2、行程记录,合并到单条,聚集存储。

类似将一个行程多条记录聚合。

例子

下面分别测试几种优化方法带来的性能优化效果。

1、cluster

2、array 聚合带压缩

3、array 聚合不带压缩

4、jsonb 聚合带压缩

5、jsonb 聚合不带压缩

6、text 聚合带压缩

7、text 聚合不带压缩

1 原始状态

  1. create unlogged table t_sensor(id serial8 primary key, sid int8, att text, crt_time timestamp);
  2. create index idx_t_sensor_1 on t_sensor (sid, crt_time);
  1. vi test.sql
  2. \set sid random(1,10000)
  3. insert into t_sensor(sid,att,crt_time) values (:sid, md5(random()::text), now());
  4. pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 2000000

查看一个行程,需要访问11227个数据块。

  1. postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_sensor where sid=1 order by crt_time;
  2. QUERY PLAN
  3. --------------------------------------------------------------------------------------------------------------------------------------------
  4. Index Scan using idx_t_sensor_1 on public.t_sensor (cost=0.49..253.36 rows=11121 width=57) (actual time=0.020..47.591 rows=11189 loops=1)
  5. Output: id, sid, att, crt_time
  6. Index Cond: (t_sensor.sid = 1)
  7. Buffers: shared hit=3406 read=7821
  8. Planning Time: 0.092 ms
  9. Execution Time: 48.303 ms
  10. (6 rows)
  11. postgres=# select 119*11227*8/1024.0;
  12. ?column?
  13. --------------------
  14. 10437.601562500000
  15. (1 row)

压测性能

  1. vi test.sql
  2. \set sid random(1,10000)
  3. select * from t_sensor where sid=:sid order by crt_time ;
  4. pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 120
  1. transaction type: ./test.sql
  2. scaling factor: 1
  3. query mode: prepared
  4. number of clients: 56
  5. number of threads: 56
  6. duration: 120 s
  7. number of transactions actually processed: 14307
  8. latency average = 470.175 ms
  9. latency stddev = 43.500 ms
  10. tps = 119.037842 (including connections establishing)
  11. tps = 119.098221 (excluding connections establishing)
  12. statement latencies in milliseconds:
  13. 0.003 \set sid random(1,10000)
  14. 470.178 select * from t_sensor where sid=:sid order by crt_time ;

2 cluster

按行程,时间索引聚集。一次查询返回一条聚合后的记录。

  1. postgres=# cluster t_sensor USING idx_t_sensor_1 ;

查询一个行程,扫描174个数据块。IO 骤降。

  1. postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_sensor where sid=1 order by crt_time;
  2. QUERY PLAN
  3. -------------------------------------------------------------------------------------------------------------------------------------------
  4. Index Scan using idx_t_sensor_1 on public.t_sensor (cost=0.49..253.39 rows=11149 width=57) (actual time=0.018..2.212 rows=11189 loops=1)
  5. Output: id, sid, att, crt_time
  6. Index Cond: (t_sensor.sid = 1)
  7. Buffers: shared hit=174
  8. Planning Time: 0.094 ms
  9. Execution Time: 2.816 ms
  10. (6 rows)

压测性能,见末尾。

3 array 带压缩

  1. create unlogged table t_sensor_agg(sid int8 primary key, agg t_sensor[]);
  2. insert into t_sensor_agg select sid,array_agg(t_sensor order by crt_time) from t_sensor group by sid;

瓶颈:array类型的INPUT OUTPUT接口,以及压缩解压缩接口

  1. Samples: 3M of event 'cpu-clock', Event count (approx.): 559749813667
  2. Overhead Shared Object Symbol
  3. 24.59% postgres [.] array_out
  4. 20.70% postgres [.] record_out
  5. 6.74% postgres [.] pglz_decompress
  6. 3.78% libc-2.17.so [.] __memcpy_ssse3_back
  7. 2.72% postgres [.] pg_ltostr_zeropad
  8. 2.34% [kernel] [k] run_timer_softirq
  9. 2.23% postgres [.] pg_lltoa
  10. 2.13% postgres [.] 0x000000000035c350
  11. 1.76% postgres [.] heap_deform_tuple
  12. 1.49% libc-2.17.so [.] __strlen_sse2_pminub
  13. 1.36% postgres [.] appendStringInfoChar
  14. 1.36% [kernel] [k] copy_user_enhanced_fast_string
  15. 1.29% postgres [.] 0x000000000035c36c
  16. 1.28% postgres [.] 0x000000000035c362
  17. 1.17% postgres [.] FunctionCall1Coll
  18. 0.92% postgres [.] hash_search_with_hash_value
  19. 0.86% [kernel] [k] _raw_spin_unlock_irqrestore
  20. 0.84% postgres [.] j2date
  21. 0.82% postgres [.] 0x000000000035c357
  22. 0.76% postgres [.] palloc
  23. 0.76% postgres [.] lookup_type_cache
  24. 0.67% postgres [.] 0x000000000035c360
  25. 0.66% postgres [.] timestamp2tm
  26. 0.64% [kernel] [k] rcu_process_callbacks
  27. 0.64% [kernel] [k] __do_softirq
  1. vi test.sql
  2. \set sid random(1,10000)
  3. select * from t_sensor_agg4 where sid=:sid ;

4 array 不带压缩

  1. create unlogged table t_sensor_agg1(sid int8 primary key, agg t_sensor[]);
  2. alter table t_sensor_agg1 alter column agg set storage external;
  3. insert into t_sensor_agg1 select sid,array_agg(t_sensor order by crt_time) from t_sensor group by sid;

瓶颈:array类型的INPUT OUTPUT接口

  1. Samples: 1M of event 'cpu-clock', Event count (approx.): 310127790569
  2. Overhead Shared Object Symbol
  3. 26.06% postgres [.] array_out
  4. 21.44% postgres [.] record_out
  5. 4.20% libc-2.17.so [.] __memcpy_ssse3_back
  6. 2.86% [kernel] [k] run_timer_softirq
  7. 2.75% postgres [.] pg_ltostr_zeropad
  8. 2.65% postgres [.] heap_deform_tuple
  9. 2.28% postgres [.] pg_lltoa
  10. 2.14% postgres [.] 0x000000000035c350
  11. 1.87% [kernel] [k] copy_user_enhanced_fast_string
  12. 1.52% libc-2.17.so [.] __strlen_sse2_pminub
  13. 1.47% postgres [.] appendStringInfoChar
  14. 1.32% postgres [.] 0x000000000035c36c
  15. 1.30% postgres [.] 0x000000000035c362
  16. 1.20% postgres [.] FunctionCall1Coll
  17. 1.11% postgres [.] hash_search_with_hash_value
  18. 0.87% postgres [.] j2date
  19. 0.85% postgres [.] 0x000000000035c357
  20. 0.81% [kernel] [k] _raw_spin_unlock_irqrestore
  21. 0.76% postgres [.] lookup_type_cache
  22. 0.75% postgres [.] 0x000000000046d33b
  23. 0.74% postgres [.] palloc
  24. 0.72% [kernel] [k] rcu_process_callbacks
  25. 0.68% postgres [.] timestamp2tm
  26. 0.68% postgres [.] pfree

5 jsonb 带压缩

  1. create unlogged table t_sensor_agg2(sid int8 primary key, agg jsonb);
  2. insert into t_sensor_agg2 select sid,jsonb_agg(t_sensor order by crt_time) from t_sensor group by sid;

6 jsonb 不带压缩

  1. create unlogged table t_sensor_agg3(sid int8 primary key, agg jsonb);
  2. alter table t_sensor_agg3 alter column agg set storage external;
  3. insert into t_sensor_agg3 select sid,jsonb_agg(t_sensor order by crt_time) from t_sensor group by sid;

7 text 带压缩

  1. create unlogged table t_sensor_agg4(sid int8 primary key, agg text);
  2. insert into t_sensor_agg4 select sid,string_agg(t_sensor::text, '|' order by crt_time) from t_sensor group by sid;

8 text 不带压缩

  1. create unlogged table t_sensor_agg5(sid int8 primary key, agg text);
  2. alter table t_sensor_agg5 alter column agg set storage external;
  3. insert into t_sensor_agg5 select sid,string_agg(t_sensor::text, '|' order by crt_time) from t_sensor group by sid;

9 index only scan 类似聚集表效果

所有内容作为INDEX的KEY,类似聚集表的效果(相邻内容在同一个INDEX PAGE里面)。查询时走INDEX ONLY SCAN扫描方法,扫描的BLOCK最少。

注意:btree 索引内容不能超过1/3 PAGE (因为BTREE是双向链表,一个PAGE至少要有一条有效记录,所以有这个限制。)。

写入数据

  1. create table t_sensor (id serial8 primary key, sid int8, att text, crt_time timestamp);
  2. create index idx_t_sensor_1 on t_sensor (sid, crt_time, att, id);
  3. vi test.sql
  4. \set sid random(1,10000)
  5. insert into t_sensor(sid,att,crt_time) values (:sid, md5(random()::text), now());
  6. pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 2000000
  7. transaction type: ./test.sql
  8. scaling factor: 1
  9. query mode: prepared
  10. number of clients: 50
  11. number of threads: 50
  12. number of transactions per client: 2000000
  13. number of transactions actually processed: 100000000/100000000
  14. latency average = 0.193 ms
  15. latency stddev = 0.461 ms
  16. tps = 257995.418591 (including connections establishing)
  17. tps = 258024.212148 (excluding connections establishing)
  18. statement latencies in milliseconds:
  19. 0.001 \set sid random(1,10000)
  20. 0.192 insert into t_sensor(sid,att,crt_time) values (:sid, md5(random()::text), now());

生成VM文件(autovacuum触发时会自动生成,但是为了立马看效果,手工执行一下。)

  1. vacuum analyze t_sensor;

INDEX ONLY SCAN, IO减少效果如下:

  1. postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_sensor where sid=2 order by crt_time;
  2. QUERY PLAN
  3. ---------------------------------------------------------------------------------------------------------------------------------------------
  4. Index Only Scan using idx_t_sensor_1 on public.t_sensor (cost=0.60..70.41 rows=9960 width=57) (actual time=0.019..2.109 rows=9978 loops=1)
  5. Output: id, sid, att, crt_time
  6. Index Cond: (t_sensor.sid = 2)
  7. Heap Fetches: 0
  8. Buffers: shared hit=235
  9. Planning Time: 0.090 ms
  10. Execution Time: 2.652 ms
  11. (7 rows)

查询性能:

  1. vi test.sql
  2. \set sid random(1,10000)
  3. select * from t_sensor where sid=:sid order by crt_time;
  4. pgbench -M prepared -n -r -P 1 -f ./test.sql -c 28 -j 28 -T 120
  5. transaction type: ./test.sql
  6. scaling factor: 1
  7. query mode: prepared
  8. number of clients: 28
  9. number of threads: 28
  10. duration: 120 s
  11. number of transactions actually processed: 283638
  12. latency average = 11.844 ms
  13. latency stddev = 1.931 ms
  14. tps = 2363.410561 (including connections establishing)
  15. tps = 2363.913145 (excluding connections establishing)
  16. statement latencies in milliseconds:
  17. 0.002 \set sid random(1,10000)
  18. 11.842 select * from t_sensor where sid=:sid order by crt_time;

小结

目标数据分散在多个BLOCK中,引起IO放大的问题,通过聚集存储,或者通过聚合存储,可以很好的解决这个问题。

聚合后,另一个瓶颈则是聚合后的类型(array,jsonb,text)的IN OUT接口。

/表存储行程查询 qps
原始(IO 放大)8880 MB119
顺序(无IO 放大)8880 MB2057
index only scan(类似聚集表)(无IO 放大)8880 MB2363
聚合array(压缩)4523 MB2362
聚合array(不压缩)8714 MB2515
聚合json(压缩)5052 MB3102
聚合json(不压缩)13 GB3184
聚合text(压缩)4969 MB6057
聚合text(不压缩)7692 MB5997

从上面的测试,可以看到IN OUT函数接口的开销,text<jsonb<array(composite array)。

实际的优化例子,可参考末尾的几篇文章。例如:

1、按时间分区,旧的分区使用cluster,按行程整理数据,使用AB表切换,解决IO放大的问题。

2、异步聚合,将点数据准实时按行程ID,聚合到聚合后的行程表。

3、使用INDEX ONLY SCAN, 达到聚集表效果。对业务无任何侵入性。(例如按天分区,加全量(业务需要查询的字段)索引。),相当于两倍存储空间(一份在堆表,一份在索引中)。

参考

《PostgreSQL IoT,车联网 - 实时轨迹、行程实践》

《PostgreSQL AB表切换最佳实践 - 提高切换成功率,杜绝雪崩 - 珍藏级》

《PostgreSQL 时序最佳实践 - 证券交易系统数据库设计 - 阿里云RDS PostgreSQL最佳实践》

原文:http://mysql.taobao.org/monthly/2018/12/10/