Placement

Note: The words placement and topology are used interchangeably throughout the M3DB documentation and codebase.

A M3DB cluster has exactly one Placement. That placement maps the cluster’s shard replicas to nodes. A cluster also has 0 or more namespaces (analogous to tables in other databases), and each node serves every namespace for the shards it owns. In other words, if the cluster topology states that node A owns shards 1, 2, and 3 then node A will own shards 1, 2, 3 for all configured namespaces in the cluster.

M3DB stores its placement (mapping of which NODES are responsible for which shards) in etcd. There are three possible states that each node/shard pair can be in:

  1. Initializing
  2. Available
  3. Leaving

Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard (achieved goal state). For example, in a new cluster all the nodes will begin with all of their shards in the Initializing state. Once all the nodes finish bootstrapping, they will mark all of their shards as Available. If all the M3DB nodes are stopped at the same time, the cluster placement will still show all of the shards for all of the nodes as Available.

Initializing State

The Initializing state is the state in which all new node/shard combinations begin. For example, upon creating a new placement all the node/shard pairs will begin in the Initializing state and only once they have successfully bootstrapped will they transition to the Available state.

The Initializing state is not limited to new placement, however, as it can also occur during placement changes. For example, during a node add/replace the new node will begin with all of its shards in the Initializing state until it can stream the data it is missing from its peers. During a node removal, all of the nodes who receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as Initializing until they can stream in the data from the node leaving the cluster, or one of its peers.

Available State

Once a node with a shard in the Initializing state successfully bootstraps all of the data for that shard, it will mark that shard as Available (for the single node) in the cluster placement.

Leaving State

The Leaving state indicates that a node has been marked for removal from the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for the nodes that are taking over its responsibilities to stream data from it.

Sample Cluster State Transitions - Node Add

Node adds are performed by adding the new node to the placement. Some portion of the existing shards will be assigned to the new node based on its weight, and they will begin in the Initializing state. Similarly, the shards will be marked as Leaving on the node that are destined to lose ownership of them. Once the new node finishes bootstrapping the shards, it will update the placement to indicate that the shards it owns are Available and that the Leaving node should no longer own that shard in the placement.

  1. Replication factor: 3
  2. ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
  3. Node A Node B Node C Node D
  4. ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐
  5. ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐│
  6. ││
  7. ││
  8. Shard 1: Available Shard 1: Available Shard 1: Available ││
  9. 1) Initial Placement Shard 2: Available Shard 2: Available Shard 2: Available ││
  10. Shard 3: Available Shard 3: Available Shard 3: Available ││
  11. ││
  12. ││
  13. └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘│
  14. ├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤
  15. ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐│┌──────────────────────┐
  16. │││
  17. │││
  18. Shard 1: Leaving Shard 1: Available Shard 1: Available │││Shard 1: Initializing
  19. 2) Begin Node Add Shard 2: Available Shard 2: Leaving Shard 2: Available │││Shard 2: Initializing
  20. Shard 3: Available Shard 3: Available Shard 3: Leaving │││Shard 3: Initializing
  21. │││
  22. │││
  23. └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘│└──────────────────────┘
  24. ├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤
  25. ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐│┌──────────────────────┐
  26. │││
  27. │││
  28. Shard 2: Available Shard 1: Available Shard 1: Available │││ Shard 1: Available
  29. 3) Complete Node Add Shard 3: Available Shard 3: Available Shard 2: Available │││ Shard 2: Available
  30. │││ Shard 3: Available
  31. │││
  32. │││
  33. └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘│└──────────────────────┘
  34. └──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘

Sample Cluster State Transitions - Node Remove

Node removes are performed by updating the placement such that all the shards on the node that will be removed from the cluster are marked as Leaving and those shards are distributed to the remaining nodes (based on their weight) and assigned a state of Initializing. Once the existing nodes that are taking ownership of the leaving nodes shards finish bootstrapping, they will update the placement to indicate that the shards that they just acquired are Available and that the leaving node should no longer own those shards in the placement.

  1. Replication factor: 3
  2. ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
  3. Node A Node B Node C Node D
  4. ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐
  5. ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐│┌──────────────────────┐
  6. │││
  7. │││
  8. Shard 2: Available Shard 1: Available Shard 1: Available │││ Shard 1: Available
  9. 1) Initial Placement Shard 3: Available Shard 3: Available Shard 2: Available │││ Shard 2: Available
  10. │││ Shard 3: Available
  11. │││
  12. │││
  13. └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘│└──────────────────────┘
  14. ├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤
  15. ┌─────────────────────────┐ ┌───────────────────────┐ │┌───────────────────────┐│┌──────────────────────┐
  16. ││ │││
  17. ││ │││
  18. Shard 1: Initializing Shard 1: Available ││ Shard 1: Available │││ Shard 1: Leaving
  19. 2) Begin Node Remove Shard 2: Available Shard 2: Initializing ││ Shard 2: Available │││ Shard 2: Leaving
  20. Shard 3: Available Shard 3: Available ││ Shard 3: Initializing│││ Shard 3: Leaving
  21. ││ │││
  22. ││ │││
  23. └─────────────────────────┘ └───────────────────────┘ │└───────────────────────┘│└──────────────────────┘
  24. ├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤
  25. ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐│
  26. ││
  27. ││
  28. Shard 1: Avaiable Shard 1: Available Shard 1: Available ││
  29. 3) Complete Node Remove Shard 2: Available Shard 2: Available Shard 2: Available ││
  30. Shard 3: Available Shard 3: Available Shard 3: Available ││
  31. ││
  32. ││
  33. └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘│
  34. └──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘

Sample Cluster State Transitions - Node Replace

Node replaces are performed by updating the placement such that all the shards on the node that will be removed from the cluster are marked as Leaving and those shards are all added to the node that is being added and assigned a state of Initializing. Once the replacement node finishes bootstrapping, it will update the placement to indicate that the shards that it acquired are Available and that the leaving node should no longer own those shards in the placement.

  1. Replication factor: 3
  2. ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
  3. Node A Node B Node C Node D
  4. ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐
  5. ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐│
  6. ││
  7. ││
  8. Shard 1: Available Shard 1: Available Shard 1: Available ││
  9. 1) Initial Placement Shard 2: Available Shard 2: Available Shard 2: Available ││
  10. Shard 3: Available Shard 3: Available Shard 3: Available ││
  11. ││
  12. ││
  13. └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘│
  14. ├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤
  15. ┌─────────────────────────┐ ┌───────────────────────┐ │┌───────────────────────┐│┌──────────────────────┐
  16. ││ │││
  17. ││ │││
  18. Shard 1: Available Shard 1: Available ││ Shard 1: Leaving │││Shard 1: Initializing
  19. 2) Begin Node Replace Shard 2: Available Shard 2: Available ││ Shard 2: Leaving │││Shard 2: Initializing
  20. Shard 3: Available Shard 3: Available ││ Shard 3: Leaving │││Shard 3: Initializing
  21. ││ │││
  22. ││ │││
  23. └─────────────────────────┘ └───────────────────────┘ │└───────────────────────┘│└──────────────────────┘
  24. ├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤
  25. ┌─────────────────────────┐ ┌───────────────────────┐ │┌──────────────────────┐
  26. ││
  27. ││
  28. Shard 1: Avaiable Shard 1: Available ││ Shard 1: Available
  29. 3) Complete Node Replace Shard 2: Available Shard 2: Available ││ Shard 2: Available
  30. Shard 3: Available Shard 3: Available ││ Shard 3: Available
  31. ││
  32. ││
  33. └─────────────────────────┘ └───────────────────────┘ │└──────────────────────┘
  34. └──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘

Cluster State Transitions - Placement Updates Initiation

The diagram below depicts the sequence of events that happen during a node replace and illustrates which entity is performing the placement update (in etcd) at each step.

  1. ┌────────────────────────────────┐
  2. Node A
  3. Shard 1: Available
  4. Shard 2: Available Operator performs node replace by
  5. Shard 3: Available updating placement in etcd such
  6. that shards on node A are marked
  7. └────────────────────────────────┤ Leaving and shards on node B are
  8. marked Initializing
  9. └─────────────────────────────────┐
  10. ┌────────────────────────────────┐
  11. Node A
  12. Shard 1: Leaving
  13. Shard 2: Leaving
  14. Shard 3: Leaving
  15. └────────────────────────────────┘
  16. ┌────────────────────────────────┐
  17. Node B
  18. Shard 1: Initializing
  19. ┌────────────────────────────────┐ Shard 2: Initializing
  20. Shard 3: Initializing
  21. Node A └────────────────────────────────┘
  22. └────────────────────────────────┘
  23. ┌────────────────────────────────┐
  24. Node B
  25. Shard 1: Available Node B completes bootstrapping and
  26. Shard 2: Available │◀────updates placement (via etcd) to
  27. Shard 3: Available indicate shard state is Available and
  28. that Node A should no longer own any shards
  29. └────────────────────────────────┘