简介

在之前的文章中我们知道RocksDB每一次写入,都是先写WAL,然后写Memtable,这次我们就来分析下MemTable的实现.

在RocksDB中,每个ColumnFamily都有自己的Memtable,互不影响.而在RocksDB中Memtable有多种实现(SkipList/HashSkipList/HashLinkList/Vector),具体的区别可以看这里,我们这次主要来分析默认的实现skiplist(只有skiplist是可以并发插入的).

实现

首先从创建Memtable开始,Memtable的创建(ColumnFamilyData::CreateNewMemtable)是在创建ColumnFamily(VersionSet::CreateColumnFamily)的时候创建的.这里就是创建memtable,然后设置到ColumnFamilyData的mem_域中.

  1. MemTable* ColumnFamilyData::ConstructNewMemtable(
  2. const MutableCFOptions& mutable_cf_options, SequenceNumber earliest_seq) {
  3. return new MemTable(internal_comparator_, ioptions_, mutable_cf_options,
  4. write_buffer_manager_, earliest_seq, id_);
  5. }
  6. void ColumnFamilyData::CreateNewMemtable(
  7. const MutableCFOptions& mutable_cf_options, SequenceNumber earliest_seq) {
  8. if (mem_ != nullptr) {
  9. delete mem_->Unref();
  10. }
  11. SetMemtable(ConstructNewMemtable(mutable_cf_options, earliest_seq));
  12. mem_->Ref();
  13. }

上面所提及的,RocksDB有多种MemTable的实现,那么它是如何来做的呢,RocksDB通过memtable_factory来根据用户的设置来创建不同的memtable.这里要注意的是核心的memtable实现是在MemTable这个类的table_域中.

  1. MemTable::MemTable:
  2. table_(ioptions.memtable_factory->CreateMemTableRep(
  3. comparator_, &arena_, ioptions.prefix_extractor, ioptions.info_log,
  4. column_family_id)),
  5. class MemTableRepFactory {
  6. public:
  7. virtual ~MemTableRepFactory() {}
  8. virtual MemTableRep* CreateMemTableRep(const MemTableRep::KeyComparator&,
  9. Allocator*, const SliceTransform*,
  10. Logger* logger) = 0;
  11. virtual MemTableRep* CreateMemTableRep(
  12. const MemTableRep::KeyComparator& key_cmp, Allocator* allocator,
  13. const SliceTransform* slice_transform, Logger* logger,
  14. uint32_t /* column_family_id */) {
  15. return CreateMemTableRep(key_cmp, allocator, slice_transform, logger);
  16. }
  17. ........................

然后最后会调用对应的实现的CreateMemTableRep方法,这里我们就来看SkipList的实现.

  1. MemTableRep* SkipListFactory::CreateMemTableRep(
  2. const MemTableRep::KeyComparator& compare, Allocator* allocator,
  3. const SliceTransform* transform, Logger* /*logger*/) {
  4. return new SkipListRep(compare, allocator, transform, lookahead_);
  5. }

最终就是创建SkipListRep对象,在这个对象里面会创建SkipList(class InlineSkipList).

  1. class SkipListRep : public MemTableRep {
  2. InlineSkipList<const MemTableRep::KeyComparator&> skip_list_;
  3. ...................................
  4. public:
  5. explicit SkipListRep(const MemTableRep::KeyComparator& compare,
  6. Allocator* allocator, const SliceTransform* transform,
  7. const size_t lookahead)
  8. : MemTableRep(allocator),
  9. skip_list_(compare, allocator),
  10. cmp_(compare),
  11. transform_(transform),
  12. lookahead_(lookahead) {}

这里SkipList就不分析实现了,如果对这个数据结构不了解的,可以去WIKI看一下.这里我们只需要知道最终所有的memtable数据都是保存在SkipList中就可以了.

在之前的分析中我们知道Memtable的插入是通过WriteBatch然后遍历ColumnFamily来插入的,而最终则是会调用MemTable::Add这个函数.

  1. bool MemTable::Add(SequenceNumber s, ValueType type,
  2. const Slice& key, /* user key */
  3. const Slice& value, bool allow_concurrent,
  4. MemTablePostProcessInfo* post_process_info) {
  5. bool res = table->InsertKeyConcurrently(handle);
  6. if (UNLIKELY(!res)) {
  7. return res;
  8. }
  9. ..............................
  10. }

最终会调用InlineSkipList来对数据进行插入.

  1. template <class Comparator>
  2. bool InlineSkipList<Comparator>::InsertConcurrently(const char* key) {
  3. Node* prev[kMaxPossibleHeight];
  4. Node* next[kMaxPossibleHeight];
  5. Splice splice;
  6. splice.prev_ = prev;
  7. splice.next_ = next;
  8. return Insert<true>(key, &splice, false);
  9. }

看到这里或许会有疑问了,那就是skiplist里面只有key,而RocksDB是一个KV存储,那么这个KV是如何存储的呢,这里是这样的,RocksDB会将KV打包成一个key传递给SkipList, 对应的KEY的结构是这样的.

  1. // Format of an entry is concatenation of:
  2. // key_size : varint32 of internal_key.size()
  3. // key bytes : char[internal_key.size()]
  4. // value_size : varint32 of value.size()
  5. // value bytes : char[value.size()]

而数据的格式化就在之前的MemTable::Add中实现的.

  1. uint32_t key_size = static_cast<uint32_t>(key.size());
  2. uint32_t val_size = static_cast<uint32_t>(value.size());
  3. uint32_t internal_key_size = key_size + 8;
  4. const uint32_t encoded_len = VarintLength(internal_key_size) +
  5. internal_key_size + VarintLength(val_size) +
  6. val_size;
  7. char* buf = nullptr;
  8. std::unique_ptr<MemTableRep>& table =
  9. type == kTypeRangeDeletion ? range_del_table_ : table_;
  10. KeyHandle handle = table->Allocate(encoded_len, &buf);
  11. char* p = EncodeVarint32(buf, internal_key_size);
  12. memcpy(p, key.data(), key_size);
  13. Slice key_slice(p, key_size);
  14. p += key_size;
  15. uint64_t packed = PackSequenceAndType(s, type);
  16. EncodeFixed64(p, packed);
  17. p += 8;
  18. p = EncodeVarint32(p, val_size);
  19. memcpy(p, value.data(), val_size);

而对于真正的KEY的解析是在SkipList的Comparator中实现的(compare_).下面的代码片段可以看到会解析出来真正的key,然后再进行查找以及插入.

  1. bool InlineSkipList<Comparator>::Insert(const char* key, Splice* splice,
  2. bool allow_partial_splice_fix) {
  3. Node* x = reinterpret_cast<Node*>(const_cast<char*>(key)) - 1;
  4. const DecodedKey key_decoded = compare_.decode_key(key);
  5. ...............................
  6. }

原文:http://mysql.taobao.org/monthly/2018/08/08/