说明
本文用查找一条数据库表记录的例子来分析InnoDB B+Tree的结构
先用sysbench插入一千万条记录:
sysbench --db-driver=mysql --mysql-user=username --mysql-password=password --mysql-db=sbtest \
--table_size=10000000 --tables=1 oltp_read_write --mysql-host=127.0.0.1 prepare
生成的sbtest1.ibd大小为2.3G,用hexdump来查看内容
随意选一条记录,比如id=3905000,通过手动找到这条记录来分析B+Tree的结构。
mysql> select * from sbtest1 where id=3905000 \G
*************************** 1. row ***************************
id: 3905000
k: 7152454
c: 33061989913-01978996152-96897051302-66804054532-36658200903-75265952777-90162670547-62113775555-84037309450-68725639441
pad: 93314475890-72810819110-74153294523-75348581725-15287112137
1 row in set (0.00 sec)
Page format
详细请查看: https://dev.mysql.com/doc/internals/en/innodb-page-overview.html
Name | Size(bytes) |
---|---|
File Header | 38 |
Page Header | 56 |
Infimum + Supremum Records | 16 |
User Records | 不定 |
Free Space | 不定 |
Page Directory | 不定 |
Fil Trailer | 8 |
查看B+Tree root page
page 3是root page,先用hexdump查看file header
page 3的offset是3*16*1024 = 49152,file header size是38 bytes
$hexdump -s 49152 -n 38 -C sbtest1.ibd
0000c000 58 0f 5c bd 00 00 00 03 ff ff ff ff ff ff ff ff |X.\.............|
0000c010 00 00 00 00 99 4a f4 5a 45 bf 00 00 00 00 00 00 |.....J.ZE.......|
0000c020 00 00 00 00 00 20 |..... |
page type是0x45bf,表示B+Tree节点
再看page header,offset是3161024 + 38 = 49190,page header size是56 bytes
$hexdump -s 49190 -n 56 sbtest1.ibd
0000c026 00 1d 06 4f 80 75 00 00 00 00 06 47 00 02 00 72 |...O.u.....G...r|
0000c036 00 73 00 00 00 00 00 00 00 00 00 02 00 00 00 00 |.s..............|
0000c046 00 00 00 38 00 00 00 20 00 00 00 02 00 f2 00 00 |...8... ........|
0000c056 00 20 00 00 00 02 00 32 |. .....2|
page directory slots数目为29(0x001d)
page level为2(0x0002),这个是根节点。叶子节点level总是为0,所以这棵B+Tree深度为3。
root page directory
每个slot占2 bytes,29个slot共58 bytes。 所以root page directory offset为 3(161024) - 58 - 8(trailer) = 65470
$hexdump -s 65470 -n 58 -C sbtest1.ibd
0000ffbe 00 70 05 ec 05 b8 05 84 05 50 05 1c 04 e8 04 b4 |.p.......P......|
0000ffce 04 80 04 4c 04 18 03 e4 03 b0 03 7c 03 48 03 14 |...L.......|.H..|
0000ffde 02 e0 02 ac 02 78 02 44 02 10 01 dc 01 a8 01 74 |.....x.D.......t|
0000ffee 01 40 01 0c 00 d8 00 a4 00 63 |.@.......c|
page directory按primary key逆序排列,0x0063是infimum record offset,0x0070是supremum record offset
$hexdump -s 49246 -n 12 -C sbtest1.ibd
0000c05e 01 00 02 00 1a 69 6e 66 69 6d 75 6d |.....infimum|
“01 00 02 00 1a” 是record header,最后1个byte 0x1a表示next record offset
3161024+0x63+0x1a=49277,找到第1条记录
$hexdump -s 49277 -n 8 sbtest1.ibd
0000c07d 80 00 00 01 00 00 00 24 |.......$|
slot | offset | record(hex) | primary key | page no |
---|---|---|---|---|
0 | 0x0063 | 69 6e 66 69 6d 75 6d 00 | 1 | 36 |
1 | 0x00a4 | 80 03 59 53 00 00 00 27 | 219475 | 39 |
2 | 0x00d8 | 80 08 b5 7f 00 00 00 2b | 570751 | 43 |
3 | 0x010c | 80 0e 11 ab 00 00 00 2f | 922027 | 47 |
4 | 0x0140 | 80 13 6d d7 00 00 00 33 | 1273303 | 51 |
5 | 0x0174 | 80 18 ca 03 00 00 00 37 | 1624579 | 55 |
6 | 0x01a8 | 80 1e 26 2f 00 00 00 3b | 1975855 | 59 |
7 | 0x01dc | 80 23 82 5b 00 00 00 3f | 2327131 | 63 |
8 | 0x0210 | 80 28 de 87 00 00 90 00 | 2678407 | 36864 |
9 | 0x0244 | 80 2e 3a b3 00 00 90 04 | 3029683 | 36868 |
10 | 0x0278 | 80 33 96 df 00 00 90 08 | 3380959 | 36872 |
11 | 0x02ac | 80 38 f3 0b 00 00 90 0c | 3732235 | 36876 |
12 | 0x02e0 | 80 3e 4f 37 00 00 90 10 | 4083511 | 36880 |
13 | 0x0314 | 80 43 ab 63 00 00 90 14 | 4434787 | 36884 |
14 | 0x0348 | 80 49 07 8f 00 00 90 18 | 4786063 | 36888 |
15 | 0x037c | 80 4e 63 bb 00 00 90 1c | 5137339 | 36892 |
16 | 0x03b0 | 80 53 bf e7 00 00 90 20 | 5488615 | 36896 |
17 | 0x03e4 | 80 59 1c 13 00 00 90 24 | 5839891 | 36900 |
18 | 0x0418 | 80 5e 78 3f 00 00 90 28 | 6191167 | 36904 |
19 | 0x044c | 80 63 d4 6b 00 00 90 2c | 6542443 | 36908 |
20 | 0x0480 | 80 69 30 97 00 00 90 30 | 6893719 | 36912 |
21 | 0x04b4 | 80 6e 8c c3 00 00 90 34 | 7244995 | 36916 |
22 | 0x04e8 | 80 73 e8 ef 00 00 90 38 | 7596271 | 36920 |
23 | 0x051c | 80 79 45 1b 00 00 90 3c | 7947547 | 36924 |
24 | 0x0550 | 80 7e a1 47 00 01 be 00 | 8298823 | 114176 |
25 | 0x0584 | 80 83 fd 73 00 01 be 04 | 8650099 | 114180 |
26 | 0x05b8 | 80 89 59 9f 00 01 be 08 | 9001375 | 114184 |
27 | 0x05ec | 80 8e b5 cb 00 01 be 0c | 9352651 | 114188 |
28 | 0x0070 | 73 75 70 72 65 6d 75 6d |
我们要找的记录primary key为3905000,在以上29个slots中做binary search,发现可能在slot 11和slot 12之间 slot 11的primary key为3732235,并不是我们要找的3905000,需要继续在slot 11的record list中查找。
查看page 0 slot 11的record list
record header共5个bytes,第5个byte记录了next record的相对于current record的offset, 即next record offset = current record offset + current record header的第5个byte 非叶子节点record body为8 bytes
offset | record(hex) | primary key | page no |
---|---|---|---|
684 | 80 38 f3 0b 00 00 90 0c | 3732235 | 36876 |
697 | 80 3a 4a 16 00 00 90 0d | 3820054 | 36877 |
710 | 80 3b a1 21 00 00 90 0e | 3907873 | 36878 |
723 | 80 3c f8 2c 00 00 90 0f | 3995692 | 36879 |
736 | 80 3e 4f 37 00 00 90 10 | 4083511 | 36880 |
… | … | … | … |
1529 | 80 90 0c d6 00 01 be 0d | 9440470 | 114189 |
1542 | 80 91 63 e1 00 01 be 0e | 9528289 | 114190 |
1555 | 80 92 ba ec 00 01 be 0f | 9616108 | 114191 |
1568 | 80 94 11 f7 00 01 be 10 | 9703927 | 114192 |
1581 | 80 95 69 02 00 01 be 11 | 9791746 | 114193 |
1594 | 80 96 c0 0d 00 01 be 12 | 9879565 | 114194 |
1607 | 80 98 17 18 00 01 be 13 | 9967384 | 114195 |
到此发现我们要找的记录(primary key 3905000)可能在offset 697所指向的page 36877
查看page 36877
先看file header, offset = 36877 * (16*1024) = 604192768, size为38 bytes
$hexdump -s 604192768 -n 38 -C sbtest1.ibd
24034000 d1 d6 61 2a 00 00 90 0d 00 00 90 0c 00 00 90 0e |..a*............|
24034010 00 00 00 00 3b ff 8b 03 45 bf 00 00 00 00 00 00 |....;...E.......|
24034020 00 00 00 00 00 20 |..... |
page type是0x45bf,确认了是B+Tree节点
再看page header, offset = 36877 * (16*1024) + 38 = 604192806, size为56 bytes
$hexdump -s 604192806 -n 56 -C sbtest1.ibd
24034026 01 2d 3d 8f 84 b5 00 00 00 00 3d 87 00 02 04 b2 |.-=.......=.....|
24034036 04 b3 00 00 00 00 00 00 00 00 00 01 00 00 00 00 |................|
24034046 00 00 00 38 00 00 00 00 00 00 00 00 00 00 00 00 |...8............|
24034056 00 00 00 00 00 00 00 00 |........|
page directory slots数目为301(0x012d)
page level为1(0x0001),还不是叶子节点。
再看page diretory 每个slot占2 bytes,301个slot共602 bytes。 所以page directory offset为 36877(161024) - 602 - 8(trailer) = 604208542
$hexdump -s 604208542 -n 602 -C sbtest1.ibd
24037d9e 00 70 3d 2c 3c f8 3c c4 3c 90 3c 5c 3c 28 3b f4 |.p=,<.<.<.<\<(;.|
24037dae 3b c0 3b 8c 3b 58 3b 24 3a f0 3a bc 3a 88 3a 54 |;.;.;X;$:.:.:.:T|
24037dbe 3a 20 39 ec 39 b8 39 84 39 50 39 1c 38 e8 38 b4 |: 9.9.9.9P9.8.8.|
24037dce 38 80 38 4c 38 18 37 e4 37 b0 37 7c 37 48 37 14 |8.8L8.7.7.7|7H7.|
24037dde 36 e0 36 ac 36 78 36 44 36 10 35 dc 35 a8 35 74 |6.6.6x6D6.5.5.5t|
24037dee 35 40 35 0c 34 d8 34 a4 34 70 34 3c 34 08 33 d4 |5@5.4.4.4p4<4.3.|
24037dfe 33 a0 33 6c 33 38 33 04 32 d0 32 9c 32 68 32 34 |3.3l383.2.2.2h24|
24037e0e 32 00 31 cc 31 98 31 64 31 30 30 fc 30 c8 30 94 |2.1.1.1d100.0.0.|
24037e1e 30 60 30 2c 2f f8 2f c4 2f 90 2f 5c 2f 28 2e f4 |0`0,/./././\/(..|
24037e2e 2e c0 2e 8c 2e 58 2e 24 2d f0 2d bc 2d 88 2d 54 |.....X.$-.-.-.-T|
24037e3e 2d 20 2c ec 2c b8 2c 84 2c 50 2c 1c 2b e8 2b b4 |- ,.,.,.,P,.+.+.|
24037e4e 2b 80 2b 4c 2b 18 2a e4 2a b0 2a 7c 2a 48 2a 14 |+.+L+.*.*.*|*H*.|
24037e5e 29 e0 29 ac 29 78 29 44 29 10 28 dc 28 a8 28 74 |).).)x)D).(.(.(t|
24037e6e 28 40 28 0c 27 d8 27 a4 27 70 27 3c 27 08 26 d4 |(@(.'.'.'p'<'.&.|
24037e7e 26 a0 26 6c 26 38 26 04 25 d0 25 9c 25 68 25 34 |&.&l&8&.%.%.%h%4|
24037e8e 25 00 24 cc 24 98 24 64 24 30 23 fc 23 c8 23 94 |%.$.$.$d$0#.#.#.|
24037e9e 23 60 23 2c 22 f8 22 c4 22 90 22 5c 22 28 21 f4 |#`#,"."."."\"(!.|
24037eae 21 c0 21 8c 21 58 21 24 20 f0 20 bc 20 88 20 54 |!.!.!X!$ . . . T|
24037ebe 20 20 1f ec 1f b8 1f 84 1f 50 1f 1c 1e e8 1e b4 | .......P......|
24037ece 1e 80 1e 4c 1e 18 1d e4 1d b0 1d 7c 1d 48 1d 14 |...L.......|.H..|
24037ede 1c e0 1c ac 1c 78 1c 44 1c 10 1b dc 1b a8 1b 74 |.....x.D.......t|
24037eee 1b 40 1b 0c 1a d8 1a a4 1a 70 1a 3c 1a 08 19 d4 |.@.......p.<....|
24037efe 19 a0 19 6c 19 38 19 04 18 d0 18 9c 18 68 18 34 |...l.8.......h.4|
24037f0e 18 00 17 cc 17 98 17 64 17 30 16 fc 16 c8 16 94 |.......d.0......|
24037f1e 16 60 16 2c 15 f8 15 c4 15 90 15 5c 15 28 14 f4 |.`.,.......\.(..|
24037f2e 14 c0 14 8c 14 58 14 24 13 f0 13 bc 13 88 13 54 |.....X.$.......T|
24037f3e 13 20 12 ec 12 b8 12 84 12 50 12 1c 11 e8 11 b4 |. .......P......|
24037f4e 11 80 11 4c 11 18 10 e4 10 b0 10 7c 10 48 10 14 |...L.......|.H..|
24037f5e 0f e0 0f ac 0f 78 0f 44 0f 10 0e dc 0e a8 0e 74 |.....x.D.......t|
24037f6e 0e 40 0e 0c 0d d8 0d a4 0d 70 0d 3c 0d 08 0c d4 |.@.......p.<....|
24037f7e 0c a0 0c 6c 0c 38 0c 04 0b d0 0b 9c 0b 68 0b 34 |...l.8.......h.4|
24037f8e 0b 00 0a cc 0a 98 0a 64 0a 30 09 fc 09 c8 09 94 |.......d.0......|
24037f9e 09 60 09 2c 08 f8 08 c4 08 90 08 5c 08 28 07 f4 |.`.,.......\.(..|
24037fae 07 c0 07 8c 07 58 07 24 06 f0 06 bc 06 88 06 54 |.....X.$.......T|
24037fbe 06 20 05 ec 05 b8 05 84 05 50 05 1c 04 e8 04 b4 |. .......P......|
24037fce 04 80 04 4c 04 18 03 e4 03 b0 03 7c 03 48 03 14 |...L.......|.H..|
24037fde 02 e0 02 ac 02 78 02 44 02 10 01 dc 01 a8 01 74 |.....x.D.......t|
24037fee 01 40 01 0c 00 d8 00 a4 00 63 |.@.......c|
解析page directory
slot | offset | record(hex) | primary key | page no |
---|---|---|---|---|
1 | 0x00a4 | 80 3a 4a f1 00 00 cd 8d | 3820273 | 52621 |
2 | 0x00d8 | 80 3a 4c 15 00 00 cd 91 | 3820565 | 52625 |
3 | 0x010c | 80 3a 4d 39 00 00 cd 95 | 3820857 | 52629 |
4 | 0x0140 | 80 3a 4e 5d 00 00 cd 99 | 3821149 | 52633 |
5 | 0x0174 | 80 3a 4f 81 00 00 cd 9d | 3821441 | 52637 |
6 | 0x01a8 | 80 3a 50 a5 00 00 cd a1 | 3821733 | 52641 |
7 | 0x01dc | 80 3a 51 c9 00 00 cd a5 | 3822025 | 52645 |
8 | 0x0210 | 80 3a 52 ed 00 00 cd a9 | 3822317 | 52649 |
9 | 0x0244 | 80 3a 54 11 00 00 cd ad | 3822609 | 52653 |
10 | 0x0278 | 80 3a 55 35 00 00 cd b1 | 3822901 | 52657 |
… | … | … | … | … |
288 | 0x3af0 | 80 3b 92 4d 00 00 d2 09 | 3904077 | 53769 |
289 | 0x3b24 | 80 3b 93 71 00 00 d2 0d | 3904369 | 53773 |
290 | 0x3b58 | 80 3b 94 95 00 00 d2 11 | 3904661 | 53777 |
291 | 0x3b8c | 80 3b 95 b9 00 00 d2 15 | 3904953 | 53781 |
292 | 0x3bc0 | 80 3b 96 dd 00 00 d2 19 | 3905245 | 53785 |
293 | 0x3bf4 | 80 3b 98 01 00 00 d2 1d | 3905537 | 53789 |
294 | 0x3c28 | 80 3b 99 25 00 00 d2 21 | 3905829 | 53793 |
295 | 0x3c5c | 80 3b 9a 49 00 00 d2 25 | 3906121 | 53797 |
296 | 0x3c90 | 80 3b 9b 6d 00 00 d2 29 | 3906413 | 53801 |
297 | 0x3cc4 | 80 3b 9c 91 00 00 d2 2d | 3906705 | 53805 |
298 | 0x3cf8 | 80 3b 9d b5 00 00 d2 31 | 3906997 | 53809 |
299 | 0x3d2c | 80 3b 9e d9 00 00 d2 35 | 3907289 | 53813 |
在以上slots中做binary search,发现我们要找的记录(primary key 3905000)可能在slot 291和slot 292之间 继续查看slot 291的第1个record所指向的page 53781
查看page 53781
先看file header, offset = 53781 * (16*1024) = 881147904, size为38 bytes
$hexdump -s 881147904 -n 38 -C sbtest1.ibd
34854000 09 d0 e4 a7 00 00 d2 15 00 00 d2 14 00 00 d2 16 |................|
34854010 00 00 00 00 3b f4 ac 4b 45 bf 00 00 00 00 00 00 |....;..KE.......|
34854020 00 00 00 00 00 20 |..... |
page type是0x45bf,确认了是B+Tree节点
再看page header, offset = 53781 * (16*1024) + 38 = 881147942, size为56 bytes
$hexdump -s 881147942 -n 56 -C sbtest1.ibd
34854026 00 13 3b c8 80 4b 00 00 00 00 3a ff 00 02 00 48 |..;..K....:....H|
34854036 00 49 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |.I..............|
34854046 00 00 00 38 00 00 00 00 00 00 00 00 00 00 00 00 |...8............|
34854056 00 00 00 00 00 00 00 00 |........|
page directory slots数目为19(0x0013)
page level为0(0x0000),这是叶子节点了,是真正存数据的page
再看page diretory 每个slot占2 bytes,19个slot共38 bytes。 所以page directory offset为 53781(161024) - 38 - 8(trailer) = 881164242
$hexdump -s 881164242 -n 38 -C sbtest1.ibd
34857fd2 00 70 36 ef 33 af 30 6f 2d 2f 29 ef 26 af 23 6f |.p6.3.0o-/).&.#o|
34857fe2 20 2f 1c ef 19 af 16 6f 13 2f 0f ef 0c af 09 6f | /.....o./.....o|
34857ff2 06 2f 02 ef 00 63 |./...c|
解析page directory
slot | offset | record(hex) | primary key | page no
----- | ------ | ------ | -------- | --------
1 | 0x02ef | 80 3b 95 bc 00 00 00 00 | 3904956 | 0
2 | 0x062f | 80 3b 95 c0 00 00 00 00 | 3904960 | 0
3 | 0x096f | 80 3b 95 c4 00 00 00 00 | 3904964 | 0
4 | 0x0caf | 80 3b 95 c8 00 00 00 00 | 3904968 | 0
5 | 0x0fef | 80 3b 95 cc 00 00 00 00 | 3904972 | 0
6 | 0x132f | 80 3b 95 d0 00 00 00 00 | 3904976 | 0
7 | 0x166f | 80 3b 95 d4 00 00 00 00 | 3904980 | 0
8 | 0x19af | 80 3b 95 d8 00 00 00 00 | 3904984 | 0
9 | 0x1cef | 80 3b 95 dc 00 00 00 00 | 3904988 | 0
10 | 0x202f | 80 3b 95 e0 00 00 00 00 | 3904992 | 0
11 | 0x236f | 80 3b 95 e4 00 00 00 00 | 3904996 | 0
12 | 0x26af | 80 3b 95 e8 00 00 00 00 | 3905000 | 0
13 | 0x29ef | 80 3b 95 ec 00 00 00 00 | 3905004 | 0
14 | 0x2d2f | 80 3b 95 f0 00 00 00 00 | 3905008 | 0
15 | 0x306f | 80 3b 95 f4 00 00 00 00 | 3905012 | 0
16 | 0x33af | 80 3b 95 f8 00 00 00 00 | 3905016 | 0
17 | 0x36ef | 80 3b 95 fc 00 00 00 00 | 3905020 | 0
OK,在slot 12找到了primary key 3905000
查看primary key 3905000的记录
mysql> select * from sbtest1 where id=3905000 \G
*************************** 1. row ***************************
id: 3905000
k: 7152454
c: 33061989913-01978996152-96897051302-66804054532-36658200903-75265952777-90162670547-62113775555-84037309450-68725639441
pad: 93314475890-72810819110-74153294523-75348581725-15287112137
1 row in set (0.00 sec)
打印256 bytes验证下是不是我们要找的记录 offset为 53781(161024)+0x26af=881157807 id = 3905000 = 0x3b95e8 k = 7152454 = 0x6d2346 剩下的是c和pad的字符串,内容是对的。
hexdump -s 881157807 -n 256 -C /home/ming.lin/one_key_env_57/run_primary/dbs/testdb/sbtest1.ibd
348566af 80 3b 95 e8 00 00 00 00 0e d3 cb 00 00 00 0f 24 |.;.............$|
348566bf ec 80 6d 23 46 33 33 30 36 31 39 38 39 39 31 33 |..m#F33061989913|
348566cf 2d 30 31 39 37 38 39 39 36 31 35 32 2d 39 36 38 |-01978996152-968|
348566df 39 37 30 35 31 33 30 32 2d 36 36 38 30 34 30 35 |97051302-6680405|
348566ef 34 35 33 32 2d 33 36 36 35 38 32 30 30 39 30 33 |4532-36658200903|
348566ff 2d 37 35 32 36 35 39 35 32 37 37 37 2d 39 30 31 |-75265952777-901|
3485670f 36 32 36 37 30 35 34 37 2d 36 32 31 31 33 37 37 |62670547-6211377|
3485671f 35 35 35 35 2d 38 34 30 33 37 33 30 39 34 35 30 |5555-84037309450|
3485672f 2d 36 38 37 32 35 36 33 39 34 34 31 20 39 33 33 |-68725639441 933|
3485673f 31 34 34 37 35 38 39 30 2d 37 32 38 31 30 38 31 |14475890-7281081|
3485674f 39 31 31 30 2d 37 34 31 35 33 32 39 34 35 32 33 |9110-74153294523|
3485675f 2d 37 35 33 34 38 35 38 31 37 32 35 2d 31 35 32 |-75348581725-152|
3485676f 38 37 31 31 32 31 33 37 20 3c 78 00 01 90 00 d0 |87112137 <x.....|
3485677f 80 3b 95 e9 00 00 00 00 0e d3 cb 00 00 00 0f 24 |.;.............$|
3485678f f9 80 57 82 ca 37 38 33 32 36 37 32 37 31 32 39 |..W..78326727129|
3485679f 2d 32 30 30 39 36 39 37 37 37 38 30 2d 38 34 31 |-20096977780-841
至此我们手动找到了primary key为3905000的记录。
小结
page directory很关键,先在page directory中做binary search,定位到某个slot,再遍历slot上的record list