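Preface

Redis's AOF persistence is essentially a redo log: every write command that has been executed is appended to the AOF file. As Redis keeps running, the AOF file grows steadily, and once a shrink condition is met an aofrewrite is needed.

Redis performs the aofrewrite in a forked child process. To keep the AOF continuous, the parent buffers the write commands that arrive during the rewrite and appends them to the new AOF file after reaping the child. If the write volume during that window is large, reaping the child involves a large amount of disk writes and performance drops.

To make aofrewrite more efficient, Redis sets up pipes between parent and child and streams the write commands issued during the rewrite to the child through them, so the job of appending them to disk is handed over to the child as well.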

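aofrewrite in detail

1. Basic implementation of aofrewrite

[Figure: basic aofrewrite flow]

The figure above shows the aofrewrite flow; the labels are the basic function call relationships.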

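  • 1 - First, a command or an event triggers aofrewrite, which calls rewriteAppendOnlyFileBackground()

    • This function forks a child process
  • 2 - The parent records the child's pid and starts buffering write commands

    • While the pid is not -1, aofRewriteBufferAppend() is called to buffer each write command
  • 3 - The child calls rewriteAppendOnlyFile(tmpfile) to create the new AOF file

    • It calls rewriteAppendOnlyFileRio() to iterate over the dataset and write every key-value pair into the new AOF file in command form
    • When done, it calls exitFromChild(0) to exit
  • 4 - After the child exits, the parent calls backgroundRewriteDoneHandler() to finish up

    • It calls aofRewriteBufferWrite() to write the buffered write commands into the temporary AOF file created by the child
    • Finally, rename() replaces the old AOF file with the new one

During an aofrewrite, if the dataset is large and the child runs for a long time, or if write traffic is high, the aof-rewrite-buffer accumulates a lot of data and the parent has to do a large amount of disk writes. For Redis this is clearly not efficient enough.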

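2. Optimizing with pipes

To make aofrewrite more efficient, Redis uses pipes. The parts marked in red in the figure below are the optimization:

[Figure: aofrewrite flow with the pipe optimization marked in red]

Optimization points: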

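  • 1 - The parent creates the pipes

    • Three pipes in total: one data pipe and two control pipes
    • The data pipe carries the data; the control pipes carry the parent-child handshake that decides when to stop the data transfer
  • 2 - The parent writes data into the pipe

    • It registers the write event handler aofChildWriteDiffData() to write data into the data pipe
  • 3 - The child reads data from the pipe

    • While generating the new AOF file, the child periodically calls aofReadDiffFromParent() to read from the pipe and buffer the data
  • 4 - Parent-child handshake

    • After generating the new AOF file, the child sends "!" to the parent over a control pipe to request that the data transfer stop
    • When the stop request arrives, the parent's read event handler aofChildPipeReadable() fires, sets server.aof_stop_sending_diff=1 to stop the transfer, and replies "!" to the child to agree
    • On receiving the parent's reply, the child calls rioWrite() to append the accumulated data to the new AOF file and then exits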
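
The handshake in the list above can be sketched as a small standalone program. This is an illustration only, not Redis code: the function name run_stop_handshake() and the pipe variable names are made up, and it uses blocking reads plus EOF on the data pipe instead of Redis's event loop and non-blocking descriptors.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo, not Redis code. Returns 0 if the child got the ack
 * and every diff byte, -1 otherwise. */
int run_stop_handshake(void) {
    const char *diff = "SET k v\n";      /* stand-in for buffered writes */
    int data[2], to_parent[2], to_child[2];
    if (pipe(data) == -1 || pipe(to_parent) == -1 || pipe(to_child) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) return -1;

    if (pid == 0) {                      /* child: the rewriting process */
        char byte = 0, buf[64];
        size_t total = 0;
        ssize_t n;
        close(data[1]); close(to_parent[0]); close(to_child[1]);
        if (write(to_parent[1], "!", 1) != 1) _exit(1);     /* ask to stop */
        if (read(to_child[0], &byte, 1) != 1 || byte != '!') _exit(1);
        /* Parent agreed: drain what is left in the data pipe until EOF. */
        while ((n = read(data[0], buf, sizeof(buf))) > 0) total += (size_t)n;
        _exit(total == strlen(diff) ? 0 : 1);
    }

    assert(pid > 0);                     /* parent: the main process */
    char byte = 0;
    int status = 0;
    close(data[0]); close(to_parent[1]); close(to_child[0]);
    int ok = write(data[1], diff, strlen(diff)) == (ssize_t)strlen(diff);
    ok = ok && read(to_parent[0], &byte, 1) == 1 && byte == '!';
    close(data[1]);                      /* stop sending diffs */
    ok = ok && write(to_child[1], "!", 1) == 1;             /* send the ack */
    close(to_parent[0]); close(to_child[1]);
    if (waitpid(pid, &status, 0) != pid) return -1;
    return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling run_stop_handshake() forks once and returns 0 when the child saw both the "!" reply and all of the data written before the stop.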

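Attentive readers will notice that the aofRewriteBufferAppend() / aofRewriteBufferWrite() pair is still there. Does the parent still have to write the aof-rewrite-buffer to disk? Yes. Parent and child run asynchronously, so there is always a small gap between them and the aof-rewrite-buffer has to be kept. By this point, though, the amount of data the parent writes to disk is small, almost negligible.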

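3. aofrewrite code walkthrough

Trigger conditions for aofrewrite

    1. The bgrewriteaof command is executed.
    2. The serverCron time event detects that the AOF file has grown past the limit.

The command trigger needs no elaboration; let's look at the serverCron trigger: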

    int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
        ...
        /* Trigger an AOF rewrite if needed */
        if (server.rdb_child_pid == -1 &&
            server.aof_child_pid == -1 &&
            server.aof_rewrite_perc &&
            server.aof_current_size > server.aof_rewrite_min_size)
        {
            long long base = server.aof_rewrite_base_size ?
                             server.aof_rewrite_base_size : 1;
            long long growth = (server.aof_current_size*100/base) - 100;
            if (growth >= server.aof_rewrite_perc) {
                serverLog(LL_NOTICE,"Starting automatic rewriting of AOF on %lld%% growth",growth);
                rewriteAppendOnlyFileBackground();
            }
        }
        ...
    }

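In other words, provided no RDB or AOF child is already running, a rewrite is triggered when the AOF file size exceeds server.aof_rewrite_min_size and the growth rate reaches server.aof_rewrite_perc (the base of the growth calculation, server.aof_rewrite_base_size, is the AOF file size right after the previous aofrewrite).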
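
As a minimal sketch of the size check just described: the helper name should_trigger_rewrite() and its plain parameters are made up for illustration. The real code reads these values from the server struct and additionally requires that no child process is running.

```c
#include <assert.h>

/* Hypothetical helper mirroring the size check in serverCron: returns 1
 * when an automatic rewrite should start, 0 otherwise. */
int should_trigger_rewrite(long long current_size, long long base_size,
                           long long min_size, long long perc) {
    assert(current_size >= 0 && base_size >= 0);
    if (perc == 0) return 0;                     /* automatic rewrite disabled */
    if (current_size <= min_size) return 0;      /* file is still too small */
    long long base = base_size ? base_size : 1;  /* avoid division by zero */
    long long growth = (current_size * 100 / base) - 100;
    return growth >= perc;
}
```

With perc = 100 and a base size of 64 (in any unit), the check first passes at a current size of 128, where growth is exactly 100, as long as min_size is below that.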

Cloud Redis currently sets server.aof_rewrite_min_size to 1/4 of the instance's memory size and server.aof_rewrite_perc to 100.

Pipe creation

Once aofrewrite is triggered, execution enters rewriteAppendOnlyFileBackground():

    int rewriteAppendOnlyFileBackground(void) {
        pid_t childpid;
        long long start;
        if (server.aof_child_pid != -1 || server.rdb_child_pid != -1) return C_ERR;
        if (aofCreatePipes() != C_OK) return C_ERR;
        openChildInfoPipe();
        start = ustime();
        if ((childpid = fork()) == 0) {
        ...

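Here is the key point: before fork, aofCreatePipes() is called to create the pipes (openChildInfoPipe() is only used to collect the amount of memory the child consumed through copy-on-write, so it is not covered in detail):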

    int aofCreatePipes(void) {
        int fds[6] = {-1, -1, -1, -1, -1, -1};
        int j;
        if (pipe(fds) == -1) goto error; /* parent -> children data: the pipe the parent uses to send data to the child */
        if (pipe(fds+2) == -1) goto error; /* children -> parent ack: the control pipe the child uses to request a stop */
        if (pipe(fds+4) == -1) goto error; /* parent -> children ack: the control pipe the parent uses to reply */
        /* Parent -> children data is non blocking. */
        if (anetNonBlock(NULL,fds[0]) != ANET_OK) goto error;
        if (anetNonBlock(NULL,fds[1]) != ANET_OK) goto error;
        // Register the read handler that processes the child's request to stop the data transfer
        if (aeCreateFileEvent(server.el, fds[2], AE_READABLE, aofChildPipeReadable, NULL) == AE_ERR) goto error;
        server.aof_pipe_write_data_to_child = fds[1];   // fd the parent writes data to
        server.aof_pipe_read_data_from_parent = fds[0]; // fd the child reads data from
        server.aof_pipe_write_ack_to_parent = fds[3];   // fd the child writes the stop request to
        server.aof_pipe_read_ack_from_child = fds[2];   // fd the parent reads the stop request from
        server.aof_pipe_write_ack_to_child = fds[5];    // fd the parent writes its reply to
        server.aof_pipe_read_ack_from_parent = fds[4];  // fd the child reads the reply from
        server.aof_stop_sending_diff = 0;               // flag: has the pipe transfer been stopped
        return C_OK;
        ...
    }

Parent process and pipe transfer

With the pipes in place, let's see how the parent and child work after fork, starting with the parent:

        /* Parent */
        server.stat_fork_time = ustime()-start;
        server.stat_fork_rate = (double) zmalloc_used_memory() * 1000000 / server.stat_fork_time / (1024*1024*1024); /* GB per second. */
        latencyAddSampleIfNeeded("fork",server.stat_fork_time/1000);
        ...
        server.aof_rewrite_scheduled = 0;
        server.aof_rewrite_time_start = time(NULL);
        server.aof_child_pid = childpid;
        updateDictResizePolicy();
        /* We set appendseldb to -1 in order to force the next call to the
         * feedAppendOnlyFile() to issue a SELECT command, so the differences
         * accumulated by the parent into server.aof_rewrite_buf will start
         * with a SELECT statement and it will be safe to merge. */
        server.aof_selected_db = -1;
        ...

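The parent does not do much here, mainly bookkeeping and setting a few flags:

  • It records the time spent in fork; the info command reports the duration of the last fork as latest_fork_usec, in microseconds
  • It sets server.aof_rewrite_scheduled = 0 so that serverCron does not trigger another aofrewrite
  • It sets server.aof_child_pid to the child's pid; only while this is not -1 does Redis buffer write commands into the aof-rewrite-buffer
  • updateDictResizePolicy() disables resizing of all hash tables, to keep the memory copying caused by copy-on-write in the child to a minimum
  • It sets server.aof_selected_db = -1 so that the next AOF entry is forced to start with a SELECT, which guarantees the commands are applied to the right db

Next come buffering write commands and the pipe communication. The entry point is feedAppendOnlyFile():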

    void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc) {
        ...
        if (server.aof_child_pid != -1)
            aofRewriteBufferAppend((unsigned char*)buf,sdslen(buf));
        ...
    }

This is where server.aof_child_pid takes effect and write commands start being buffered:

    void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
        listNode *ln = listLast(server.aof_rewrite_buf_blocks);
        aofrwblock *block = ln ? ln->value : NULL;
        while(len) {
            /* If we already got at least an allocated block, try appending
             * at least some piece into it. */
            if (block) {
                unsigned long thislen = (block->free < len) ? block->free : len;
                if (thislen) { /* The current block is not already full. */
                    memcpy(block->buf+block->used, s, thislen);
                    block->used += thislen;
                    block->free -= thislen;
                    s += thislen;
                    len -= thislen;
                }
            }
            if (len) { /* First block to allocate, or need another block. */
                int numblocks;
                block = zmalloc(sizeof(*block));
                block->free = AOF_RW_BUF_BLOCK_SIZE;
                block->used = 0;
                listAddNodeTail(server.aof_rewrite_buf_blocks,block);
                /* Log every time we cross more 10 or 100 blocks, respectively
                 * as a notice or warning. */
                numblocks = listLength(server.aof_rewrite_buf_blocks);
                if (((numblocks+1) % 10) == 0) {
                    int level = ((numblocks+1) % 100) == 0 ? LL_WARNING :
                                                             LL_NOTICE;
                    serverLog(level,"Background AOF buffer size: %lu MB",
                        aofRewriteBufferSize()/(1024*1024));
                }
            }
        }
        if (aeGetFileEvents(server.el,server.aof_pipe_write_data_to_child) == 0) {
            aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child,
                AE_WRITABLE, aofChildWriteDiffData, NULL);
        }
    }

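Redis uses the linked list server.aof_rewrite_buf_blocks to buffer write commands during aofrewrite; each node holds at most 10MB. The key part is the write-event registration at the end: if no event is registered on the fd server.aof_pipe_write_data_to_child, the write event handler aofChildWriteDiffData is registered: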

    void aofChildWriteDiffData(aeEventLoop *el, int fd, void *privdata, int mask) {
        listNode *ln;
        aofrwblock *block;
        ssize_t nwritten;
        ...
        while(1) {
            ln = listFirst(server.aof_rewrite_buf_blocks);
            block = ln ? ln->value : NULL;
            if (server.aof_stop_sending_diff || !block) {
                aeDeleteFileEvent(server.el,server.aof_pipe_write_data_to_child,
                                  AE_WRITABLE);
                return;
            }
            if (block->used > 0) {
                nwritten = write(server.aof_pipe_write_data_to_child,
                                 block->buf,block->used);
                if (nwritten <= 0) return;
                memmove(block->buf,block->buf+nwritten,block->used-nwritten);
                block->used -= nwritten;
                block->free += nwritten;
            }
            if (block->used == 0) listDelNode(server.aof_rewrite_buf_blocks,ln);
        }
    }

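Whenever the data pipe is writable, the handler sends the write commands accumulated in server.aof_rewrite_buf_blocks to the child. It keeps going until the list is empty, the non-blocking write() can make no more progress (the pipe is full), or server.aof_stop_sending_diff has been set.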
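
The two halves of this mechanism, appending into fixed-size blocks and draining them with partial writes, can be sketched in a simplified standalone form. The names blocklist, blocklist_append() and blocklist_drain() are hypothetical, the block size is deliberately tiny, and a plain singly linked list stands in for Redis's list type.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 16   /* tiny on purpose; the article cites 10MB per node */

typedef struct block {
    size_t used;
    struct block *next;
    char buf[BLOCK_SIZE];
} block;

typedef struct { block *head, *tail; size_t nblocks; } blocklist;

/* Append len bytes, filling the tail block first and allocating new
 * blocks as needed. Returns 0 on success, -1 on allocation failure. */
int blocklist_append(blocklist *l, const char *s, size_t len) {
    while (len) {
        block *b = l->tail;
        if (b == NULL || b->used == BLOCK_SIZE) {
            b = calloc(1, sizeof(*b));
            if (b == NULL) return -1;
            if (l->tail) l->tail->next = b; else l->head = b;
            l->tail = b;
            l->nblocks++;
        }
        assert(b->used < BLOCK_SIZE);
        size_t room = BLOCK_SIZE - b->used;
        size_t n = room < len ? room : len;
        memcpy(b->buf + b->used, s, n);
        b->used += n;
        s += n;
        len -= n;
    }
    return 0;
}

/* Write queued blocks to fd until the list is empty or write() makes no
 * progress (on a full non-blocking pipe it fails with EAGAIN).
 * Returns the number of bytes sent by this call. */
size_t blocklist_drain(blocklist *l, int fd) {
    size_t sent = 0;
    while (l->head) {
        block *b = l->head;
        ssize_t n = write(fd, b->buf, b->used);
        if (n <= 0) break;                     /* retry on the next event */
        memmove(b->buf, b->buf + n, b->used - (size_t)n); /* keep unsent part */
        b->used -= (size_t)n;
        sent += (size_t)n;
        if (b->used == 0) {                    /* block fully sent: free it */
            l->head = b->next;
            if (l->head == NULL) l->tail = NULL;
            l->nblocks--;
            free(b);
        }
    }
    return sent;
}
```

Because sent blocks are freed as they go, whatever is still in the list when the transfer stops is exactly the remainder the parent has to write itself later.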

Child process and pipe transfer

Now the child:

        ...
        /* Child */
        char tmpfile[256];
        closeListeningSockets(0);
        redisSetProcTitle("redis-aof-rewrite");
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
        if (rewriteAppendOnlyFile(tmpfile) == C_OK) {
            ...
            exitFromChild(0);
        } else {
            exitFromChild(1);
        }
        ...

The child first closes the listening sockets and then enters rewriteAppendOnlyFile():

    int rewriteAppendOnlyFile(char *filename) {
        ...
        snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
        fp = fopen(tmpfile,"w");
        ...
        server.aof_child_diff = sdsempty();
        ...
        if (rewriteAppendOnlyFileRio(&aof) == C_ERR) goto werr;
        ...

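It first opens a temporary AOF file and initializes the server.aof_child_diff buffer to receive data from the parent, then calls rewriteAppendOnlyFileRio() to write the AOF file and read data from the pipe: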

    int rewriteAppendOnlyFileRio(rio *aof) {
        ...
        if (aof->processed_bytes > processed+AOF_READ_DIFF_INTERVAL_BYTES) {
            processed = aof->processed_bytes;
            aofReadDiffFromParent();
        }
        ...
    }

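While iterating over the dataset and writing key-value pairs to the new AOF file, the child calls aofReadDiffFromParent() each time the new file has grown by another 10KB, reading from the pipe and appending to server.aof_child_diff: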

    ssize_t aofReadDiffFromParent(void) {
        char buf[65536]; /* Default pipe buffer size on most Linux systems. */
        ssize_t nread, total = 0;
        while ((nread =
                read(server.aof_pipe_read_data_from_parent,buf,sizeof(buf))) > 0) {
            server.aof_child_diff = sdscatlen(server.aof_child_diff,buf,nread);
            total += nread;
        }
        return total;
    }
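
The same read-until-empty pattern can be tried outside Redis. The sketch below is an illustration under stated assumptions: diffbuf, set_nonblocking() and read_available() are hypothetical names, and a realloc-grown buffer stands in for the sds string. The loop only terminates without blocking because the read end is in non-blocking mode, which is why aofCreatePipes() calls anetNonBlock() on the data pipe.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical growable buffer standing in for server.aof_child_diff. */
typedef struct { char *data; size_t len; } diffbuf;

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read everything currently available on fd and append it to d.
 * Stops on EAGAIN (nothing more right now), EOF, or error.
 * Returns the number of bytes appended by this call. */
ssize_t read_available(int fd, diffbuf *d) {
    char buf[4096];
    ssize_t nread, total = 0;
    assert(d != NULL);
    while ((nread = read(fd, buf, sizeof(buf))) > 0) {
        char *grown = realloc(d->data, d->len + (size_t)nread);
        if (grown == NULL) break;      /* out of memory: keep what we have */
        memcpy(grown + d->len, buf, (size_t)nread);
        d->data = grown;
        d->len += (size_t)nread;
        total += nread;
    }
    return total;
}
```

On an empty pipe read_available() returns 0 immediately instead of blocking, so it is cheap to call at regular intervals during a long-running loop.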

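Stopping the pipe transfer

Once the child has iterated over the dataset and produced the new AOF file, it is ready to exit. Before exiting it must tell the parent to stop the pipe transfer. Back to rewriteAppendOnlyFile():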

    int rewriteAppendOnlyFile(char *filename) {
        ...
        /* Ask the master to stop sending diffs. */
        if (write(server.aof_pipe_write_ack_to_parent,"!",1) != 1) goto werr;
        if (anetNonBlock(NULL,server.aof_pipe_read_ack_from_parent) != ANET_OK)
            goto werr;
        /* We read the ACK from the server using a 10 seconds timeout. Normally
         * it should reply ASAP, but just in case we lose its reply, we are sure
         * the child will eventually get terminated. */
        if (syncRead(server.aof_pipe_read_ack_from_parent,&byte,1,5000) != 1 ||
            byte != '!') goto werr;
        serverLog(LL_NOTICE,"Parent agreed to stop sending diffs. Finalizing AOF...");
        /* Read the final diff if any. */
        aofReadDiffFromParent();
        /* Write the received diff to the file. */
        serverLog(LL_NOTICE,
            "Concatenating %.2f MB of AOF diff received from parent.",
            (double) sdslen(server.aof_child_diff) / (1024*1024));
        if (rioWrite(&aof,server.aof_child_diff,sdslen(server.aof_child_diff)) == 0)
            goto werr;
        /* Make sure data will not remain on the OS's output buffers */
        if (fflush(fp) == EOF) goto werr;
        if (fsync(fileno(fp)) == -1) goto werr;
        if (fclose(fp) == EOF) goto werr;
        ...
    }

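The logic here is straightforward:

  • It uses write to send "!" on the control pipe to request a stop, then reads the reply. The code comment says 10 seconds, but the timeout actually passed to syncRead() is 5000 ms, i.e. 5s
  • On timeout it jumps to werr and exits abnormally; if "!" is read within the timeout it continues
  • It calls aofReadDiffFromParent() once more to read from the data pipe, making sure nothing is left behind in it
  • Finally, rioWrite() appends the data accumulated in server.aof_child_diff to the new AOF file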
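
The "wait for one ack byte, but give up after a timeout" step above can be sketched with poll(). This is a stand-in written for illustration, not how syncRead() is implemented; wait_for_ack() and its return convention are assumptions of the sketch.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for one byte on fd.
 * Returns 1 if the byte equals expected, 0 on timeout, and -1 on a
 * poll/read error, EOF, or an unexpected byte. */
int wait_for_ack(int fd, char expected, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char byte;
    assert(timeout_ms >= 0);
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) return 0;                 /* timed out */
    if (ready < 0) return -1;                 /* poll error */
    if (read(fd, &byte, 1) != 1) return -1;   /* EOF or read error */
    return byte == expected ? 1 : -1;
}
```

A caller in the child's position would treat anything other than 1 as a failure and abort the rewrite, which matches the goto werr branch in the excerpt.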

So how does the parent handle the "!"? Recall the read event handler aofChildPipeReadable() registered earlier; it fires when the child sends "!" on the control pipe:

    void aofChildPipeReadable(aeEventLoop *el, int fd, void *privdata, int mask) {
        char byte;
        ...
        if (read(fd,&byte,1) == 1 && byte == '!') {
            serverLog(LL_NOTICE,"AOF rewrite child asks to stop sending diffs.");
            server.aof_stop_sending_diff = 1;
            if (write(server.aof_pipe_write_ack_to_child,"!",1) != 1) {
                /* If we can't send the ack, inform the user, but don't try again
                 * since in the other side the children will use a timeout if the
                 * kernel can't buffer our write, or, the children was
                 * terminated. */
                serverLog(LL_WARNING,"Can't send ACK to AOF child: %s",
                    strerror(errno));
            }
        }
        /* Remove the handler since this can be called only one time during a
         * rewrite. */
        aeDeleteFileEvent(server.el,server.aof_pipe_read_ack_from_child,AE_READABLE);
    }

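Simple enough: it sets server.aof_stop_sending_diff=1, replies "!" to the child, and removes itself from the event loop. This completes the parent-child communication; what remains is for the parent to wait for the child to exit and do the finishing work.

Parent wrap-up

serverCron() calls wait3() to reap the child: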

    /* Check if a background saving or AOF rewrite in progress terminated. */
    if (server.rdb_child_pid != -1 || server.aof_child_pid != -1 ||
        ldbPendingChildren())
    {
        int statloc;
        pid_t pid;
        if ((pid = wait3(&statloc,WNOHANG,NULL)) != 0) {
            int exitcode = WEXITSTATUS(statloc);
            int bysignal = 0;
            if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);
            if (pid == -1) {
                serverLog(LL_WARNING,"wait3() returned an error: %s. "
                    "rdb_child_pid = %d, aof_child_pid = %d",
                    strerror(errno),
                    (int) server.rdb_child_pid,
                    (int) server.aof_child_pid);
            } else if (pid == server.rdb_child_pid) {
                backgroundSaveDoneHandler(exitcode,bysignal);
                if (!bysignal && exitcode == 0) receiveChildInfo();
            } else if (pid == server.aof_child_pid) {
                backgroundRewriteDoneHandler(exitcode,bysignal);
                if (!bysignal && exitcode == 0) receiveChildInfo();
            } else {
                if (!ldbRemoveChild(pid)) {
                    serverLog(LL_WARNING,
                        "Warning, detected child with unmatched pid: %ld",
                        (long)pid);
                }
            }
            updateDictResizePolicy();
            closeChildInfoPipe();
        }
        ...

If the reaped pid is server.aof_child_pid, backgroundRewriteDoneHandler() is entered:

    void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
        ...
        /* Flush the differences accumulated by the parent to the
         * rewritten AOF. */
        latencyStartMonitor(latency);
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof",
            (int)server.aof_child_pid);
        newfd = open(tmpfile,O_WRONLY|O_APPEND);
        if (newfd == -1) {
            serverLog(LL_WARNING,
                "Unable to open the temporary AOF produced by the child: %s", strerror(errno));
            goto cleanup;
        }
        if (aofRewriteBufferWrite(newfd) == -1) {
            serverLog(LL_WARNING,
                "Error trying to flush the parent diff to the rewritten AOF: %s", strerror(errno));
            close(newfd);
            goto cleanup;
        }
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-rewrite-diff-write",latency);

It first opens the new AOF file produced by the child and calls aofRewriteBufferWrite() to append the data remaining in server.aof_rewrite_buf_blocks to it.

        /* Rename the temporary file. This will not unlink the target file if
         * it exists, because we reference it with "oldfd". */
        latencyStartMonitor(latency);
        if (rename(tmpfile,server.aof_filename) == -1) {
            serverLog(LL_WARNING,
                "Error trying to rename the temporary AOF file %s into %s: %s",
                tmpfile,
                server.aof_filename,
                strerror(errno));
            close(newfd);
            if (oldfd != -1) close(oldfd);
            goto cleanup;
        }
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-rename",latency);

It then renames the new AOF file to the name recorded in server.aof_filename.

        /* Asynchronously close the overwritten AOF. */
        if (oldfd != -1) bioCreateBackgroundJob(BIO_CLOSE_FILE,(void*)(long)oldfd,NULL,NULL);

A bio background thread is used to close the old AOF file.

    cleanup:
        aofClosePipes();
        aofRewriteBufferReset();
        aofRemoveTempFile(server.aof_child_pid);
        server.aof_child_pid = -1;
        server.aof_rewrite_time_last = time(NULL)-server.aof_rewrite_time_start;
        server.aof_rewrite_time_start = -1;
        /* Schedule a new rewrite if we are waiting for it to switch the AOF ON. */
        if (server.aof_state == AOF_WAIT_REWRITE)
            server.aof_rewrite_scheduled = 1;

Last comes the cleanup: closing the pipes, resetting the aof-rewrite-buffer, resetting server.aof_child_pid to -1, and so on. At this point the aofrewrite is complete.
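
The core of the parent's wrap-up, appending the leftover diff to the temporary file and then renaming it over the live file, can be sketched as one small function. finish_rewrite() is a hypothetical name, and the fsync() before the rename is an addition of this sketch for durability; the excerpts above do not show where the handler syncs the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append the leftover diff to tmp_path, flush it to
 * disk, then atomically rename tmp_path over final_path.
 * Returns 0 on success, -1 on error. */
int finish_rewrite(const char *tmp_path, const char *final_path,
                   const char *diff, size_t len) {
    assert(tmp_path != NULL && final_path != NULL);
    int fd = open(tmp_path, O_WRONLY | O_APPEND);
    if (fd == -1) return -1;
    while (len) {                            /* handle short writes */
        ssize_t n = write(fd, diff, len);
        if (n <= 0) { close(fd); return -1; }
        diff += n;
        len -= (size_t)n;
    }
    /* Assumption of this sketch: sync before the file becomes visible. */
    if (fsync(fd) == -1) { close(fd); return -1; }
    if (close(fd) == -1) return -1;
    return rename(tmp_path, final_path) == -1 ? -1 : 0;
}
```

Since rename() replaces the target atomically on POSIX file systems, a reader of final_path sees either the complete old file or the complete new one.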

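Afterword

This article covered the basic implementation of Redis aofrewrite and its pipe-based optimization. Cloud Redis 4.0 is now available; you are welcome to try it.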