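Background

Like most relational databases, PostgreSQL relies on REDO to guarantee reliability.

Someone in a chat group asked why PostgreSQL's REDO block size defaults to 8K rather than 512 bytes.

The reasoning behind the question: most block devices have 512-byte sectors, and a 512-byte write is guaranteed to be atomic, so a REDO block larger than 512 bytes may suffer a partial write.

So is it safe for PostgreSQL to use an 8KB redo (WAL) block? This article works through the analysis.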

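When can a partial write occur?

1. With a volatile cache enabled, if the block being written is larger than the disk's atomic write size (typically 512 bytes), a power failure can cause a partial write.

Take the disk cache as an example: it has no power-loss protection, and the operating system's fsync interface does not necessarily know about it. Even if fsync returns success, the data may still be sitting in the disk cache.

On power loss, the data in the disk cache is lost. If a program was writing 8K and the disk's atomic write unit is smaller than 8K, part of the 8K may have been written and part not: a partial write.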

pic1

(PS: some enterprise SSDs can use the residual charge in their capacitors to persist the disk cache on power loss, but do not assume every disk can do this.)
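The 8K case can be sketched in a few lines of Python. This is only a toy model, not a description of any real drive: it assumes the disk persists whole 512-byte sectors in order and that power fails after some of them. The function name `write_with_power_loss` and the constants are made up for illustration.

```python
SECTOR = 512   # assumed atomic write unit of the disk
BLOCK = 8192   # PostgreSQL's default WAL block size

def write_with_power_loss(old: bytes, new: bytes, sectors_persisted: int) -> bytes:
    """Return what is on disk if power fails after `sectors_persisted` sectors."""
    assert len(old) == len(new) == BLOCK
    cut = sectors_persisted * SECTOR
    # Sectors before the cut hold new data, the rest still hold old data.
    return new[:cut] + old[cut:]

old_block = b"A" * BLOCK
new_block = b"B" * BLOCK

torn = write_with_power_loss(old_block, new_block, sectors_persisted=5)
is_partial = torn != old_block and torn != new_block
print(BLOCK // SECTOR, "sectors per block; partial write:", is_partial)
```

With 16 sectors per block there are 15 intermediate states in this model, and every one of them is a block that is neither the old version nor the new one.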

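2. With a volatile cache enabled, a write whose size is less than or equal to the disk's atomic write size (nominally an "atomic write") can still be torn on power loss if it is not aligned to sector boundaries.

In MySQL, for example, REDO is written in 512-byte blocks, each containing a 12-byte header and 4 bytes of checksum.

Why does misalignment matter? A 512-byte write that does not start on a sector boundary straddles two sectors, so the disk has to perform two sector writes, and a power failure can persist one without the other.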

pic2

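Downsides of misalignment

1. As noted above, with a volatile cache enabled, a misaligned "atomic" write gives no protection: a partial write can still occur.

2. Misalignment also causes write amplification. A 512-byte write ends up as 1024 bytes on disk: both sectors are read, merged with the new data, and written back as two sectors.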
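Both downsides follow from simple arithmetic on offsets. The sketch below assumes 512-byte sectors; the helper names `sectors_touched` and `bytes_physically_written` are hypothetical and exist only for this illustration.

```python
SECTOR = 512  # assumed sector size

def sectors_touched(offset: int, length: int, sector: int = SECTOR) -> range:
    """Sector numbers covered by a write of `length` bytes at byte `offset`."""
    first = offset // sector
    last = (offset + length - 1) // sector
    return range(first, last + 1)

def bytes_physically_written(offset: int, length: int, sector: int = SECTOR) -> int:
    """Whole sectors must be rewritten, so count full sectors."""
    return len(sectors_touched(offset, length, sector)) * sector

aligned = sectors_touched(1024, 512)     # starts exactly on a sector boundary
unaligned = sectors_touched(1124, 512)   # starts 100 bytes into a sector

print(list(aligned), bytes_physically_written(1024, 512))
print(list(unaligned), bytes_physically_written(1124, 512))
```

The aligned write stays inside one sector, while the unaligned one covers two sectors and costs 1024 bytes of physical writes, either of which may be the one that fails to persist.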

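What risks does atomic write not protect against?

1. With a volatile cache enabled, data still in the cache is lost on power failure, atomic write or not.

2. When the write is not aligned, an "atomic write" is not actually atomic.

A database that relies solely on atomic REDO writes, without taking these two factors into account, cannot guarantee data reliability or consistency.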

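How PostgreSQL guarantees reliability

1. Before a dirty page in shared buffers is written out, the corresponding redo must already be persisted (that is, on non-volatile storage).

2. The first time a page is modified after a checkpoint, a full image of that page must be written to the redo (full page write).

These two rules guarantee the consistency of the data files.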
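Rule 1 (redo first, data page second) can be sketched as follows. This is a toy model, not PostgreSQL's code: the classes `Wal` and `Page` and the functions `modify_page` and `write_dirty_page` are invented for illustration, and `flush` merely stands in for a real write plus fsync.

```python
class Wal:
    def __init__(self):
        self.insert_lsn = 0  # end of WAL generated so far (still in memory)
        self.flush_lsn = 0   # end of WAL known to be on stable storage

    def append(self, nbytes: int) -> int:
        self.insert_lsn += nbytes
        return self.insert_lsn

    def flush(self, upto: int) -> None:
        # Stand-in for write() + fsync() of the WAL up to `upto`.
        self.flush_lsn = max(self.flush_lsn, min(upto, self.insert_lsn))

class Page:
    def __init__(self):
        self.lsn = 0        # LSN of the last WAL record that changed this page
        self.dirty = False

def modify_page(wal: Wal, page: Page, record_size: int) -> None:
    page.lsn = wal.append(record_size)
    page.dirty = True

def write_dirty_page(wal: Wal, page: Page) -> None:
    # The rule: the redo describing the page must be durable first.
    if wal.flush_lsn < page.lsn:
        wal.flush(page.lsn)
    assert wal.flush_lsn >= page.lsn
    page.dirty = False  # the data page itself would be written here

wal, page = Wal(), Page()
modify_page(wal, page, record_size=100)
write_dirty_page(wal, page)
print(wal.flush_lsn, page.lsn, page.dirty)
```

Because the page can never reach disk ahead of its redo, a crash always leaves enough redo to repair whatever state the data file is in.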

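3. Leaving standbys aside, a transaction using synchronous commit does not return from commit until the REDO it generated has been persisted (on non-volatile storage).

Reference

"PostgreSQL 9.6: Multiple Synchronous Standbys and the remote_apply Transaction Synchronization Level"

4. A transaction using asynchronous commit does not wait for its REDO to be persisted at commit time.

Because of rule 1, even with asynchronous commit, losing the contents of the REDO buffer does not leave the database inconsistent (for example half committed, half not). The only loss is the most recent transactions whose REDO had not yet been flushed from the redo buffer.

Consistency is guaranteed by PostgreSQL's MVCC mechanism; dirty data is never read.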

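Recommendations

1. On a copy-on-write file system (such as btrfs or zfs) you can turn off full page writes, because such a file system guarantees that partial writes do not occur.

2. Align writes, to avoid write amplification.

3. Do not use a volatile cache, unless it has power-loss protection.

PostgreSQL assumes that the fsync call provided by the system is reliable, meaning the data has reached persistent storage.

If even fsync cannot be trusted, nothing is reliable, atomic write or not.

The same applies to direct I/O (PostgreSQL supports direct I/O for REDO): it is not aware of the disk cache either, so use it with care.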

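Is it safe that PostgreSQL redo blocks are not written atomically?

First, as analysed above, atomic writes cannot prevent data loss caused by volatile storage.

1. PostgreSQL validates redo as it reads it: every WAL record carries a CRC and every WAL page has a header that is checked, so inconsistent redo is never applied.

2. With synchronous commit, PostgreSQL makes sure the REDO is persisted before returning to the user at commit.

So a transaction the user has received an acknowledgement for is certainly persisted and cannot contain a partial write.

Only transactions that were not acknowledged or had not finished can contain a partial write, which simplifies the question to:

Can a partial write in the REDO of such unacknowledged or unfinished transactions make the data inconsistent?

The answer is no. See "How PostgreSQL guarantees reliability" above; the MVCC mechanism takes care of this.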
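The argument can be illustrated with a toy log whose records each carry a checksum, where replay stops at the first record that fails validation. The record layout below is invented for this sketch and is not PostgreSQL's WAL format; `zlib.crc32` is used as a stand-in for the CRC PostgreSQL computes.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC of payload (toy format)

def encode(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def replay(log: bytes) -> set:
    """Replay records in order; return the xids whose COMMIT was replayed."""
    committed = set()
    pos = 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        payload = log[pos + HEADER.size: pos + HEADER.size + length]
        if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt record: treat it as the end of the log
        kind, xid = payload.split()
        if kind == b"COMMIT":
            committed.add(int(xid))
        pos += HEADER.size + length
    return committed

records = [b"UPDATE 1", b"COMMIT 1", b"UPDATE 2", b"COMMIT 2"]
log = b"".join(encode(r) for r in records)

# Simulate a partial write: only the first 10 bytes of the last record
# reached the disk, the rest of it is zeros.
last_start = len(log) - len(encode(records[-1]))
torn_log = log[:last_start + 10] + b"\x00" * (len(log) - last_start - 10)

print(replay(log), replay(torn_log))
```

In the torn log, transaction 2 has no valid commit record, so after recovery it simply counts as not committed; its earlier UPDATE record was replayed but remains invisible, which is the role MVCC plays in the real system.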

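Simulating a redo block partial write

Database parameters

    wal_level = logical    # makes the WAL easier to inspect and verify

Generate test data

    pgbench -i -s 100

Run a stress test

    pgbench -M prepared -n -r -P 2 -c 32 -j 32 -T 1000

Once some XLOG has been generated, about 200 seconds in, force-stop the database mid-test. The next start will go through recovery.

    pg_ctl stop -m immediate

Record the tail of the file preceding the REDO file we are about to tamper with, and the head of the file to be tampered with

The tail of the file just before the tampered one, used to identify records that were persisted.
Seeing a few commit records is enough.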

    pg_xlogdump -b 0000000100000116000000F7 0000000100000116000000F7 | tail -n 20
    rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F7FFFA80, prev 116/F7FFFA50, desc: CLEAN remxid 772066680
        blkref #0: rel 1663/13241/38254 fork main blk 90969
    # 772208346 persisted
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208346, lsn: 116/F7FFFAC0, prev 116/F7FFFA80, desc: COMMIT 2016-10-11 15:04:16.395000 CST
    rmgr: Heap len (rec/tot): 3/ 79, tx: 772208353, lsn: 116/F7FFFAF0, prev 116/F7FFFAC0, desc: INSERT off 130
        blkref #0: rel 1663/13241/38242 fork main blk 17723
    rmgr: Heap len (rec/tot): 14/ 163, tx: 772208368, lsn: 116/F7FFFB40, prev 116/F7FFFAF0, desc: HOT_UPDATE off 71 xmax 772208368 ; new off 76 xmax 0
        blkref #0: rel 1663/13241/38254 fork main blk 90969
    rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F7FFFBE8, prev 116/F7FFFB40, desc: CLEAN remxid 772061924
        blkref #0: rel 1663/13241/38254 fork main blk 37123
    rmgr: Heap len (rec/tot): 14/ 74, tx: 772208358, lsn: 116/F7FFFC28, prev 116/F7FFFBE8, desc: HOT_UPDATE off 22 xmax 772208358 ; new off 25 xmax 0
        blkref #0: rel 1663/13241/38251 fork main blk 34
    rmgr: Heap len (rec/tot): 14/ 78, tx: 772208360, lsn: 116/F7FFFC78, prev 116/F7FFFC28, desc: HOT_UPDATE off 121 xmax 772208360 ; new off 123 xmax 0
        blkref #0: rel 1663/13241/38245 fork main blk 124
    # 772208344 persisted
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208344, lsn: 116/F7FFFCC8, prev 116/F7FFFC78, desc: COMMIT 2016-10-11 15:04:16.395018 CST
    rmgr: Heap len (rec/tot): 14/ 163, tx: 772208369, lsn: 116/F7FFFCF8, prev 116/F7FFFCC8, desc: HOT_UPDATE off 67 xmax 772208369 ; new off 73 xmax 0
        blkref #0: rel 1663/13241/38254 fork main blk 37123
    rmgr: Heap len (rec/tot): 14/ 78, tx: 772208355, lsn: 116/F7FFFDA0, prev 116/F7FFFCF8, desc: HOT_UPDATE off 97 xmax 772208355 ; new off 110 xmax 0
        blkref #0: rel 1663/13241/38245 fork main blk 988
    # 772208351 and 772208352 persisted
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev 116/F7FFFDA0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208352, lsn: 116/F7FFFE20, prev 116/F7FFFDF0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
    rmgr: Heap len (rec/tot): 3/ 79, tx: 772208354, lsn: 116/F7FFFE50, prev 116/F7FFFE20, desc: INSERT off 117
        blkref #0: rel 1663/13241/38242 fork main blk 17727
    rmgr: Heap len (rec/tot): 7/ 53, tx: 772208357, lsn: 116/F7FFFEA0, prev 116/F7FFFE50, desc: LOCK off 133: xid 772208357 LOCK_ONLY EXCL_LOCK
        blkref #0: rel 1663/13241/38251 fork main blk 42
    # 772208353 persisted
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208353, lsn: 116/F7FFFED8, prev 116/F7FFFEA0, desc: COMMIT 2016-10-11 15:04:16.395037 CST
    rmgr: Heap len (rec/tot): 14/ 78, tx: 772208363, lsn: 116/F7FFFF08, prev 116/F7FFFED8, desc: HOT_UPDATE off 127 xmax 772208363 ; new off 186 xmax 0
        blkref #0: rel 1663/13241/38245 fork main blk 79
    # 772208345 persisted
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208345, lsn: 116/F7FFFF58, prev 116/F7FFFF08, desc: COMMIT 2016-10-11 15:04:16.395040 CST
    rmgr: Heap len (rec/tot): 7/ 53, tx: 772208349, lsn: 116/F7FFFF88, prev 116/F7FFFF58, desc: LOCK off 154: xid 772208349 LOCK_ONLY EXCL_LOCK
        blkref #0: rel 1663/13241/38251 fork main blk 38
    # Show the REDO of a single transaction
    pg_xlogdump -x 772208351 0000000100000116000000F7 0000000100000116000000F7
    rmgr: Heap len (rec/tot): 14/ 163, tx: 772208351, lsn: 116/F7FFD3B8, prev 116/F7FFD378, desc: HOT_UPDATE off 8 xmax 772208351 ; new off 73 xmax 0, blkref #0: rel 1663/13241/38254 blk 69436
    rmgr: Heap len (rec/tot): 14/ 78, tx: 772208351, lsn: 116/F7FFE6A0, prev 116/F7FFE660, desc: HOT_UPDATE off 17 xmax 772208351 ; new off 40 xmax 0, blkref #0: rel 1663/13241/38245 blk 117
    rmgr: Heap len (rec/tot): 14/ 74, tx: 772208351, lsn: 116/F7FFF048, prev 116/F7FFEFA0, desc: HOT_UPDATE off 165 xmax 772208351 ; new off 166 xmax 0, blkref #0: rel 1663/13241/38251 blk 35
    rmgr: Heap len (rec/tot): 3/ 79, tx: 772208351, lsn: 116/F7FFF7D8, prev 116/F7FFF788, desc: INSERT off 66, blkref #0: rel 1663/13241/38242 blk 17736
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev 116/F7FFFDA0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
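Picking the commit records out of such a dump by eye is tedious. A small script can list the committed xids instead. The sketch below assumes the line layout shown in the dump above (the spacing of real pg_xlogdump output may differ, which is why the pattern is whitespace-tolerant); the names `COMMIT_RE` and `committed_xids` are made up for this example.

```python
import re

# Matches "rmgr: Transaction ... tx: <xid>, lsn: <lsn>, ... desc: COMMIT ..."
COMMIT_RE = re.compile(
    r"rmgr:\s+Transaction\s.*\btx:\s*(\d+),\s*lsn:\s*([0-9A-Fa-f]+/[0-9A-Fa-f]+),"
    r".*\bdesc:\s*COMMIT\b"
)

def committed_xids(dump_text: str) -> list:
    """Return (xid, lsn) for every COMMIT record in pg_xlogdump-style text."""
    found = []
    for line in dump_text.splitlines():
        m = COMMIT_RE.match(line.strip())
        if m:
            found.append((int(m.group(1)), m.group(2)))
    return found

sample = """\
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208346, lsn: 116/F7FFFAC0, prev 116/F7FFFA80, desc: COMMIT 2016-10-11 15:04:16.395000 CST
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208353, lsn: 116/F7FFFAF0, prev 116/F7FFFAC0, desc: INSERT off 130
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208344, lsn: 116/F7FFFCC8, prev 116/F7FFFC78, desc: COMMIT 2016-10-11 15:04:16.395018 CST
"""

for xid, lsn in committed_xids(sample):
    print(xid, lsn)
```

The resulting xid list is what goes into the `xmin in (...)` query used for verification later on.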

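The head of the file to be tampered with, used to identify records that will not count as persisted.
Everything shown here is about to be overwritten. As far as PG is concerned these transactions were never persisted, and they will not be visible after recovery.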

    pg_xlogdump -b -n 20 0000000100000116000000F8 0000000100000116000000F9
    # after tampering, transaction 772208342 will not be visible
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208342, lsn: 116/F8000038, prev 116/F7FFFFC0, desc: COMMIT 2016-10-11 15:04:16.395055 CST
    rmgr: Heap len (rec/tot): 14/ 78, tx: 772208362, lsn: 116/F8000068, prev 116/F8000038, desc: HOT_UPDATE off 148 xmax 772208362 ; new off 154 xmax 0
        blkref #0: rel 1663/13241/38245 fork main blk 90
    rmgr: Heap len (rec/tot): 14/ 78, tx: 772208365, lsn: 116/F80000B8, prev 116/F8000068, desc: HOT_UPDATE off 85 xmax 772208365 ; new off 89 xmax 0
        blkref #0: rel 1663/13241/38245 fork main blk 68
    rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F8000108, prev 116/F80000B8, desc: CLEAN remxid 772208308
        blkref #0: rel 1663/13241/38254 fork main blk 146480
    rmgr: Heap len (rec/tot): 14/ 74, tx: 772208349, lsn: 116/F8000148, prev 116/F8000108, desc: HOT_UPDATE off 154 xmax 772208349 ; new off 155 xmax 772208349
        blkref #0: rel 1663/13241/38251 fork main blk 38
    rmgr: Heap len (rec/tot): 3/ 79, tx: 772208358, lsn: 116/F8000198, prev 116/F8000148, desc: INSERT off 101
        blkref #0: rel 1663/13241/38242 fork main blk 17730
    rmgr: Heap len (rec/tot): 14/ 74, tx: 772208359, lsn: 116/F80001E8, prev 116/F8000198, desc: HOT_UPDATE off 78 xmax 772208359 ; new off 85 xmax 0
        blkref #0: rel 1663/13241/38251 fork main blk 31
    rmgr: Heap len (rec/tot): 14/ 163, tx: 772208370, lsn: 116/F8000238, prev 116/F80001E8, desc: HOT_UPDATE off 25 xmax 772208370 ; new off 71 xmax 0
        blkref #0: rel 1663/13241/38254 fork main blk 146480
    # after tampering, transaction 772208354 will not be visible
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208354, lsn: 116/F80002E0, prev 116/F8000238, desc: COMMIT 2016-10-11 15:04:16.395071 CST
    rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F8000310, prev 116/F80002E0, desc: CLEAN remxid 772112027
        blkref #0: rel 1663/13241/38254 fork main blk 121847
    rmgr: Heap len (rec/tot): 14/ 74, tx: 772208355, lsn: 116/F8000350, prev 116/F8000310, desc: HOT_UPDATE off 82 xmax 772208355 ; new off 86 xmax 0
        blkref #0: rel 1663/13241/38251 fork main blk 31
    rmgr: Heap len (rec/tot): 14/ 78, tx: 772208366, lsn: 116/F80003A0, prev 116/F8000350, desc: HOT_UPDATE off 73 xmax 772208366 ; new off 104 xmax 0
        blkref #0: rel 1663/13241/38245 fork main blk 86
    rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F80003F0, prev 116/F80003A0, desc: CLEAN remxid 772176420
        blkref #0: rel 1663/13241/38254 fork main blk 162972
    rmgr: Heap len (rec/tot): 14/ 74, tx: 772208363, lsn: 116/F8000430, prev 116/F80003F0, desc: HOT_UPDATE off 23 xmax 772208363 ; new off 26 xmax 0
        blkref #0: rel 1663/13241/38251 fork main blk 30
    rmgr: Heap len (rec/tot): 14/ 74, tx: 772208360, lsn: 116/F8000480, prev 116/F8000430, desc: HOT_UPDATE off 164 xmax 772208360 ; new off 167 xmax 0
        blkref #0: rel 1663/13241/38251 fork main blk 35
    rmgr: Heap len (rec/tot): 14/ 163, tx: 772208371, lsn: 116/F80004D0, prev 116/F8000480, desc: HOT_UPDATE off 2 xmax 772208371 ; new off 72 xmax 0
        blkref #0: rel 1663/13241/38254 fork main blk 121847
    # after tampering, transaction 772208358 will not be visible
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208358, lsn: 116/F8000578, prev 116/F80004D0, desc: COMMIT 2016-10-11 15:04:16.395090 CST
    rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F80005A8, prev 116/F8000578, desc: CLEAN remxid 772172802
        blkref #0: rel 1663/13241/38254 fork main blk 120028
    rmgr: Heap len (rec/tot): 14/ 163, tx: 772208372, lsn: 116/F80005E8, prev 116/F80005A8, desc: HOT_UPDATE off 57 xmax 772208372 ; new off 71 xmax 0
        blkref #0: rel 1663/13241/38254 fork main blk 162972
    # after tampering, transaction 772208350 will not be visible
    rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208350, lsn: 116/F8000690, prev 116/F80005E8, desc: COMMIT 2016-10-11 15:04:16.395095 CST
    ...

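Tampering with the redo

Take the last two xlog files and overwrite part of each with zeros to simulate a partial write. Note that dd's skip= offsets the input (/dev/zero), not the output, and that without conv=notrunc dd truncates the output file. As written, these commands therefore turn each file into 10000 zero bytes starting at offset 0, which is why the recovery log below reports an invalid magic number at offset 0. To zero bytes in place starting at offset 100, seek=100 conv=notrunc would be needed instead.

    cd $PGDATA/pg_xlog
    dd if=/dev/zero of=./0000000100000116000000F8 bs=1 count=10000 skip=100
    dd if=/dev/zero of=./0000000100000116000000F9 bs=1 count=10000 skip=100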
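In dd, skip= moves the starting point in the input while seek= moves it in the output, and unless conv=notrunc is given the output file is truncated. The difference between the two behaviours can be reproduced on scratch files with plain Python file I/O. This is only an analogy to the dd invocations, run on temporary files; the function names are made up for this sketch.

```python
import os
import tempfile

def overwrite_truncating(path: str, nzeros: int) -> None:
    # Comparable to: dd if=/dev/zero of=path bs=1 count=nzeros
    # The file is truncated and rewritten from offset 0.
    with open(path, "wb") as f:
        f.write(b"\x00" * nzeros)

def overwrite_in_place(path: str, offset: int, nzeros: int) -> None:
    # Comparable to: dd if=/dev/zero of=path bs=1 count=nzeros seek=offset conv=notrunc
    # Only the chosen byte range changes; the rest of the file is kept.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\x00" * nzeros)

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "segment_a")
    b = os.path.join(tmp, "segment_b")
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(b"\xff" * 65536)  # stand-in for a WAL segment

    overwrite_truncating(a, 10000)
    overwrite_in_place(b, 100, 10000)

    size_a = os.path.getsize(a)
    size_b = os.path.getsize(b)
    with open(b, "rb") as f:
        head_b = f.read(100)

print(size_a, size_b, head_b == b"\xff" * 100)
```

In the first case the start of the file, where a WAL segment keeps its page header, is destroyed; in the second it survives and only the middle is damaged, so recovery would be expected to fail at a later point and with a different message.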

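Start the database. It enters recovery, and when it reads WAL that fails validation (here the zeroed page header, reported as an invalid magic number), it stops replaying; recovery ends at that point.

Changes made by transactions that were not recovered are invisible to users.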

    2016-10-11 15:10:49.909 CST,,,15039,,57fc9076.3abf,1,,2016-10-11 15:10:46 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,"PostmasterMain, postmaster.c:1223",""
    # because of the forced stop, the database enters recovery
    2016-10-11 15:10:49.910 CST,,,15042,,57fc9079.3ac2,1,,2016-10-11 15:10:49 CST,,0,LOG,00000,"database system was interrupted; last known up at 2016-10-11 15:03:14 CST",,,,,,,,"StartupXLOG, xlog.c:5934",""
    2016-10-11 15:10:49.991 CST,,,15042,,57fc9079.3ac2,2,,2016-10-11 15:10:49 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG, xlog.c:6414",""
    2016-10-11 15:10:49.992 CST,,,15042,,57fc9079.3ac2,3,,2016-10-11 15:10:49 CST,,0,LOG,00000,"redo starts at 116/9D8E4600",,,,,,,,"StartupXLOG, xlog.c:6669",""
    # recovery stops when it reaches the tampered REDO
    2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,4,,2016-10-11 15:10:49 CST,,0,LOG,00000,"invalid magic number 0000 in log segment 0000000100000116000000F8, offset 0",,,,,,,,"ReadRecord, xlog.c:3942",""
    2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,5,,2016-10-11 15:10:49 CST,,0,LOG,00000,"redo done at 116/F7FFFF88",,,,,,,,"StartupXLOG, xlog.c:6921",""
    2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,6,,2016-10-11 15:10:49 CST,,0,LOG,00000,"last completed transaction was at log time 2016-10-11 15:04:16.39504+08",,,,,,,,"StartupXLOG, xlog.c:6926",""
    2016-10-11 15:11:21.216 CST,,,15042,,57fc9079.3ac2,7,,2016-10-11 15:10:49 CST,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate",,,,,,,,"LogCheckpointStart, xlog.c:7949",""
    2016-10-11 15:11:23.223 CST,,,15042,,57fc9079.3ac2,8,,2016-10-11 15:10:49 CST,,0,LOG,00000,"checkpoint complete: wrote 215999 buffers (1.3%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=1.598 s, sync=0.405 s, total=2.006 s; sync files=20, longest=0.207 s, average=0.020 s; distance=1481838 kB, estimate=1481838 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8031",""
    2016-10-11 15:11:23.223 CST,,,15042,,57fc9079.3ac2,9,,2016-10-11 15:10:49 CST,,0,LOG,00000,"MultiXact member wraparound protections are now enabled",,,,,,,,"SetOffsetVacuumLimit, multixact.c:2628",""
    2016-10-11 15:11:23.405 CST,,,15039,,57fc9076.3abf,2,,2016-10-11 15:10:46 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,"reaper, postmaster.c:2792",""
    2016-10-11 15:11:23.405 CST,,,15083,,57fc909b.3aeb,1,,2016-10-11 15:11:23 CST,,0,LOG,00000,"autovacuum launcher started",,,,,,,,"AutoVacLauncherMain, autovacuum.c:416",""

Verification

    -- Transactions whose commit records sit in the untouched REDO: confirmed committed.
    postgres=# select xmin,* from pgbench_history where xmin in (772208346,772208344,772208351,772208352,772208353,772208345);
       xmin    | tid | bid |   aid   | delta |           mtime            | filler
    -----------+-----+-----+---------+-------+----------------------------+--------
     772208345 | 109 |  76 |   96685 |  4792 | 2016-10-11 15:04:16.394519 |
     772208353 | 657 |   1 | 7473886 |  1540 | 2016-10-11 15:04:16.394708 |
     772208344 | 146 |  58 | 2671263 | -2297 | 2016-10-11 15:04:16.394504 |
     772208352 |  55 |  57 | 9608997 |  2862 | 2016-10-11 15:04:16.39463  |
     772208351 | 531 |   8 | 4235604 |  1582 | 2016-10-11 15:04:16.394601 |
     772208346 | 105 |  83 | 5770382 |   590 | 2016-10-11 15:04:16.394542 |
    (6 rows)
    -- Transactions whose commit records were in the tampered REDO now show as not committed,
    -- so the partial write did not affect the consistency of the database.
    postgres=# select * from pgbench_history where xmin in (772208342,772208354,772208358,772208350);
     tid | bid | aid | delta | mtime | filler
    -----+-----+-----+-------+-------+--------
    (0 rows)

The check passes.

PostgreSQL's redo block size is configurable

The block size is chosen at build time, in kilobytes (the default is 8):

    ./configure --with-wal-blocksize=?
    Allowed values are 1,2,4,8,16,32,64.

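What the redo buffer is for, and how fsync is scheduled

Calling fsync for every redo record would be slow, so fsync calls are in fact scheduled.

The redo buffer exists to reduce the number of fsync calls.

1. When more time than the configured delay (wal_writer_delay, 200 ms by default and often tuned lower, for example to 10 ms) has passed since the wal writer last flushed, it triggers an fsync that persists the completely filled blocks in the redo buffer to the REDO file.

2. When the amount of WAL the wal writer has written (asynchronously) since the last flush exceeds the configured threshold (wal_writer_flush_after), it triggers an fsync that persists the completely filled blocks in the redo buffer to the REDO file.

3. When a transaction ends, it checks the shared WAL write state to see whether its LSN has already been flushed; if it has not reached disk, the transaction triggers an fsync.

4. In case 3, if group commit is enabled, multiple transactions that are committing at the same time issue only one fsync request.

5. When the redo log file is switched, an fsync is triggered to make sure the file is persisted.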
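Cases 1 and 2 amount to a small decision function. The sketch below mirrors the branches of XLogBackgroundFlush quoted in the next section, but it is a simplified Python model rather than PostgreSQL code: the function name `should_flush` is made up, times are plain milliseconds, and the defaults shown (200 ms, 128 blocks of 8 kB, i.e. 1 MB) are the stock values of wal_writer_delay and wal_writer_flush_after.

```python
from typing import Optional

def should_flush(now_ms: int,
                 last_flush_ms: Optional[int],
                 unflushed_blocks: int,
                 wal_writer_delay_ms: int = 200,
                 wal_writer_flush_after: int = 128) -> bool:
    """Decide whether the wal writer should fsync on this cycle."""
    if wal_writer_flush_after == 0 or last_flush_ms is None:
        return True   # first call, or block-based limit disabled
    if now_ms - last_flush_ms >= wal_writer_delay_ms:
        return True   # case 1: the delay since the last flush has elapsed
    if unflushed_blocks >= wal_writer_flush_after:
        return True   # case 2: enough WAL written since the last flush
    return False      # otherwise write without fsync this time round

print(should_flush(1000, None, 0))     # first call
print(should_flush(1000, 700, 0))      # 300 ms since last flush
print(should_flush(1000, 900, 10))     # too soon, too little
print(should_flush(1000, 900, 128))    # threshold reached
```

Commit-time flushes (cases 3 and 4) and segment switches (case 5) happen outside this function and are not modelled here.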

PostgreSQL code related to redo

src/backend/postmaster/walwriter.c

     * The WAL writer background process is new as of Postgres 8.3. It attempts
     * to keep regular backends from having to write out (and fsync) WAL pages.
     * Also, it guarantees that transaction commit records that weren't synced
     * to disk immediately upon commit (ie, were "asynchronously committed")
     * will reach disk within a knowable time --- which, as it happens, is at
     * most three times the wal_writer_delay cycle time.
     *
     * Note that as with the bgwriter for shared buffers, regular backends are
     * still empowered to issue WAL writes and fsyncs when the walwriter doesn't
     * keep up. This means that the WALWriter is not an essential process and
     * can shutdown quickly when requested.
     *
     * Because the walwriter's cycle is directly linked to the maximum delay
     * before async-commit transactions are guaranteed committed, it's probably
     * unwise to load additional functionality onto it. For instance, if you've
     * got a yen to create xlog segments further in advance, that'd be better done
     * in bgwriter than in walwriter.
     *
     * The walwriter is started by the postmaster as soon as the startup subprocess
     * finishes. It remains alive until the postmaster commands it to terminate.
     * Normal termination is by SIGTERM, which instructs the walwriter to exit(0).
     * Emergency termination is by SIGQUIT; like any backend, the walwriter will
     * simply abort and exit on SIGQUIT.
     *
     * If the walwriter exits unexpectedly, the postmaster treats that the same
     * as a backend crash: shared memory may be corrupted, so remaining backends
     * should be killed by SIGQUIT and then a recovery cycle started.
    ......
        /*
         * Loop forever
         */
        for (;;)
        {
    ......
            /*
             * Do what we're here for; then, if XLogBackgroundFlush() found useful
             * work to do, reset hibernation counter.
             */
            if (XLogBackgroundFlush())
                left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
            else if (left_till_hibernate > 0)
                left_till_hibernate--;
    ......

src/backend/access/transam/xlog.c

    /*
     * Write & flush xlog, but without specifying exactly where to.
     *
     * We normally write only completed blocks; but if there is nothing to do on
     * that basis, we check for unwritten async commits in the current incomplete
     * block, and write through the latest one of those. Thus, if async commits
     * are not being used, we will write complete blocks only.
     *
     * If, based on the above, there's anything to write we do so immediately. But
     * to avoid calling fsync, fdatasync et. al. at a rate that'd impact
     * concurrent IO, we only flush WAL every wal_writer_delay ms, or if there's
     * more than wal_writer_flush_after unflushed blocks.
     *
     * We can guarantee that async commits reach disk after at most three
     * wal_writer_delay cycles. (When flushing complete blocks, we allow XLogWrite
     * to write "flexibly", meaning it can stop at the end of the buffer ring;
     * this makes a difference only with very high load or long wal_writer_delay,
     * but imposes one extra cycle for the worst case for async commits.)
     *
     * This routine is invoked periodically by the background walwriter process.
     *
     * Returns TRUE if there was any work to do, even if we skipped flushing due
     * to wal_writer_delay/wal_flush_after.
     */
    bool
    XLogBackgroundFlush(void)
    {
        XLogwrtRqst WriteRqst;
        bool        flexible = true;
        static TimestampTz lastflush;
        TimestampTz now;
        int         flushbytes;
        /* XLOG doesn't need flushing during recovery */
        if (RecoveryInProgress())
            return false;
        /* read LogwrtResult and update local state */
        SpinLockAcquire(&XLogCtl->info_lck);
        LogwrtResult = XLogCtl->LogwrtResult;
        WriteRqst = XLogCtl->LogwrtRqst;
        SpinLockRelease(&XLogCtl->info_lck);
        /* back off to last completed page boundary */
        WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
        /* if we have already flushed that far, consider async commit records */
        if (WriteRqst.Write <= LogwrtResult.Flush)
        {
            SpinLockAcquire(&XLogCtl->info_lck);
            WriteRqst.Write = XLogCtl->asyncXactLSN;
            SpinLockRelease(&XLogCtl->info_lck);
            flexible = false;   /* ensure it all gets written */
        }
        /*
         * If already known flushed, we're done. Just need to check if we are
         * holding an open file handle to a logfile that's no longer in use,
         * preventing the file from being deleted.
         */
        if (WriteRqst.Write <= LogwrtResult.Flush)
        {
            if (openLogFile >= 0)
            {
                if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
                {
                    XLogFileClose();
                }
            }
            return false;
        }
        /*
         * Determine how far to flush WAL, based on the wal_writer_delay and
         * wal_writer_flush_after GUCs.
         */
        now = GetCurrentTimestamp();
        flushbytes =
            WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
        if (WalWriterFlushAfter == 0 || lastflush == 0)
        {
            /* first call, or block based limits disabled */
            WriteRqst.Flush = WriteRqst.Write;
            lastflush = now;
        }
        // delay-based scheduling: advance the Flush position
        else if (TimestampDifferenceExceeds(lastflush, now, WalWriterDelay))
        {
            /*
             * Flush the writes at least every WalWriteDelay ms. This is important
             * to bound the amount of time it takes for an asynchronous commit to
             * hit disk.
             */
            WriteRqst.Flush = WriteRqst.Write;
            lastflush = now;
        }
        // scheduling based on accumulated wal writer (asynchronous) writes: advance the Flush position
        else if (flushbytes >= WalWriterFlushAfter)
        {
            /* exceeded wal_writer_flush_after blocks, flush */
            WriteRqst.Flush = WriteRqst.Write;
            lastflush = now;
        }
        // otherwise, no fsync
        else
        {
            /* no flushing, this time round */
            WriteRqst.Flush = 0;
        }
    #ifdef WAL_DEBUG
        if (XLOG_DEBUG)
            elog(LOG, "xlog bg flush request write %X/%X; flush: %X/%X, current is write %X/%X; flush %X/%X",
                 (uint32) (WriteRqst.Write >> 32), (uint32) WriteRqst.Write,
                 (uint32) (WriteRqst.Flush >> 32), (uint32) WriteRqst.Flush,
                 (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
                 (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
    #endif
        START_CRIT_SECTION();
        /* now wait for any in-progress insertions to finish and get write lock */
        WaitXLogInsertionsToFinish(WriteRqst.Write);
        LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
        LogwrtResult = XLogCtl->LogwrtResult;
        if (WriteRqst.Write > LogwrtResult.Write ||
            WriteRqst.Flush > LogwrtResult.Flush)
        {
            XLogWrite(WriteRqst, flexible);
        }
        LWLockRelease(WALWriteLock);
        END_CRIT_SECTION();
        /* wake up walsenders now that we've released heavily contended locks */
        WalSndWakeupProcessRequests();
        /*
         * Great, done. To take some work off the critical path, try to initialize
         * as many of the no-longer-needed WAL buffers for future use as we can.
         */
        AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
        /*
         * If we determined that we need to write data, but somebody else
         * wrote/flushed already, it should be considered as being active, to
         * avoid hibernating too early.
         */
        return true;
    }
    /*
     * Write and/or fsync the log at least as far as WriteRqst indicates.
     *
     * If flexible == TRUE, we don't have to write as far as WriteRqst, but
     * may stop at any convenient boundary (such as a cache or logfile boundary).
     * This option allows us to avoid uselessly issuing multiple writes when a
     * single one would do.
     *
     * Must be called with WALWriteLock held. WaitXLogInsertionsToFinish(WriteRqst)
     * must be called before grabbing the lock, to make sure the data is ready to
     * write.
     */
    static void
    XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
    {
    ......
                // fsync is triggered when the log segment is switched
                if (finishing_seg)
                {
                    issue_xlog_fsync(openLogFile, openLogSegNo);
    ......
        // compare LogwrtResult.Flush with the requested Flush position to decide
        // whether fsync must be called, i.e. the scheduling described above
        /*
         * If asked to flush, do so
         */
        if (LogwrtResult.Flush < WriteRqst.Flush &&
            LogwrtResult.Flush < LogwrtResult.Write)
        {
            /*
             * Could get here without iterating above loop, in which case we might
             * have no open file or the wrong one. However, we do not need to
             * fsync more than one file.
             */
            if (sync_method != SYNC_METHOD_OPEN &&
                sync_method != SYNC_METHOD_OPEN_DSYNC)
            {
                if (openLogFile >= 0 &&
                    !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
                    XLogFileClose();
                if (openLogFile < 0)
                {
                    XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
                    openLogFile = XLogFileOpen(openLogSegNo);
                    openLogOff = 0;
                }
                issue_xlog_fsync(openLogFile, openLogSegNo);
            }
            /* signal that we need to wakeup walsenders later */
            WalSndWakeupRequest();
            LogwrtResult.Flush = LogwrtResult.Write;
        }
    ......

References

1. https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf

For a deeper look at the internals of PostgreSQL redo, see the document above and the source code.

pic3

Original article: http://mysql.taobao.org/monthly/2016/10/07/