Topology Service
This document describes the Topology Service, a key part of the Vitess architecture. This service is exposed to all Vitess processes, and is used to store small pieces of configuration data about the Vitess cluster, and provide cluster-wide locks. It also supports watches, and master election.
Vitess uses a plugin implementation to support multiple backend technologies for the Topology Service (etcd, ZooKeeper, Consul). Concretely, the Topology Service handles two functions: it is both a distributed lock manager and a repository for topology metadata. In earlier versions of Vitess, the Topology Service was also referred to as the Lock Service.
Requirements and usage
The Topology Service is used to store information about the Keyspaces, the Shards, the Tablets, the Replication Graph, and the Serving Graph. We store small data structures (a few hundred bytes) per object.
The main contract for the Topology Service is to be very highly available and consistent. It is understood that this comes at the cost of higher latency and very low throughput.
We never use the Topology Service as an RPC or queuing mechanism or as a storage system for logs. We never depend on the Topology Service being responsive and fast to serve every query.
The Topology Service must also support a Watch interface, to signal when certain conditions occur on a node. This is used, for instance, to know when the Keyspace topology changes (e.g. for resharding).
Global vs Local
We differentiate two instances of the Topology Service: the Global instance, and the per-cell Local instance:
- The Global instance is used to store global data about the topology that doesn’t change very often, e.g. information about Keyspaces and Shards. The data is independent of individual instances and cells, and needs to survive a cell going down entirely.
- There is one Local instance per cell, that contains cell-specific information, and also rolled-up data from the Global + Local cell to make it easier for clients to find the data. The Vitess local processes should not use the Global topology instance, but instead the rolled-up data in the Local topology server as much as possible.

The Global instance can go down for a while and not impact the local cells (an exception to that is if a reparent needs to be processed, it might not work). If a Local instance goes down, it only affects the local tablets in that instance (and then the cell is usually in bad shape, and should not be used).
Vitess will not use the global or local topology service as part of serving individual queries. The Topology Service is only used to get the topology information at startup and in the background.
Recovery
If a Local Topology Service dies and is not recoverable, it can be wiped out. All the tablets in that cell then need to be restarted so they re-initialize their topology records (but they won’t lose any MySQL data).
If the Global Topology Service dies and is not recoverable, this is more of a problem. All the Keyspace / Shard objects have to be recreated or restored. Then the cells should recover.
Global data
This section describes the data structures stored in the Global instance of the topology service.
Keyspace
The Keyspace object contains various information, mostly about sharding: how is this Keyspace sharded, what is the name of the sharding key column, is this Keyspace serving data yet, how to split incoming queries, …
An entire Keyspace can be locked. We use this during resharding for instance, when we change which Shard is serving what inside a Keyspace. That way we guarantee only one operation changes the Keyspace data concurrently.
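A Keyspace record can be inspected with vtctl; a minimal example, assuming a shell variable $TOPOLOGY holds the topology flags (as in the examples later in this document) and a hypothetical keyspace named user:
# Dump the Keyspace record stored in the global topology:
vtctl $TOPOLOGY GetKeyspace user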
Shard
A Shard contains a subset of the data for a Keyspace. The Shard record in the Global topology service contains:
- the Master tablet alias for this shard (that has the MySQL master).
- the sharding key range covered by this Shard inside the Keyspace.
- the tablet types this Shard is serving (master, replica, batch, …), per cell if necessary.
- if using filtered replication, the source shards this shard is replicating from.
- the list of cells that have tablets in this shard.
- shard-global tablet controls, like blacklisted tables no tablet should serve in this shard.

A Shard can be locked. We use this during operations that affect either the Shard record, or multiple tablets within a Shard (like reparenting), so multiple tasks cannot concurrently alter the data.
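The Shard record can be inspected the same way; a short example, again assuming the $TOPOLOGY flags and a hypothetical keyspace user with a shard -80:
# Dump the Shard record stored in the global topology:
vtctl $TOPOLOGY GetShard user/-80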
VSchema data
The VSchema data contains sharding and routing information for the VTGate V3 API.
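The stored VSchema can be read back with vtctl; a small example, using the same hypothetical keyspace name as above:
# Print the VSchema stored in the global topology for one keyspace:
vtctl $TOPOLOGY GetVSchema user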
Local data
This section describes the data structures stored in the Local instance (per cell) of the topology service.
Tablets
The Tablet record has a lot of information about each vttablet process making up each tablet (along with the MySQL process):
- the Tablet Alias (cell+unique id) that uniquely identifies the Tablet.
- the Hostname, IP address and port map of the Tablet.
- the current Tablet type (master, replica, batch, spare, …).
- which Keyspace / Shard the tablet is part of.
- the sharding Key Range served by this Tablet.
- the user-specified tag map (e.g. to store per-installation data).

A Tablet record is created before a tablet can be running (either by vtctl InitTablet or by passing the init_* parameters to the vttablet process). The only way a Tablet record will be updated is one of:
- The vttablet process itself owns the record while it is running, and can change it.
- At init time, before the tablet starts.
- After shutdown, when the tablet gets deleted.
- If a tablet becomes unresponsive, it may be forced to spare to make it unhealthy when it restarts.
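Tablet records can be listed and inspected with vtctl; a brief example, assuming the $TOPOLOGY flags, a cell named cell1 and a made-up tablet alias:
# List all tablets known in a cell:
vtctl $TOPOLOGY ListAllTablets cell1
# Dump the Tablet record for one tablet alias:
vtctl $TOPOLOGY GetTablet cell1-0000000100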
Replication graph
The Replication Graph allows us to find Tablets in a given Cell / Keyspace / Shard. It used to contain information about which Tablet is replicating from which other Tablet, but that was too complicated to maintain. Now it is just a list of Tablets.
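The per-cell ShardReplication object can also be read with vtctl; a sketch using the same hypothetical names as above:
# List the tablets recorded in the replication graph for one cell / keyspace / shard:
vtctl $TOPOLOGY GetShardReplication cell1 user/-80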
Serving graph
The Serving Graph is what the clients use to find the per-cell topology of a Keyspace. It is a roll-up of global data (Keyspace + Shard). vtgates only open a small number of these objects and get all the information they need quickly.
SrvKeyspace
It is the local representation of a Keyspace. It contains information on what shard to use for getting to the data (but not information about each individual shard):
- the partitions map is keyed by the tablet type (master, replica, batch, …) and the value is a list of shards to use for serving.
- it also contains the global Keyspace fields, copied for fast access.

It can be rebuilt by running vtctl RebuildKeyspaceGraph <keyspace>. It is automatically rebuilt when a tablet starts up in a cell and the SrvKeyspace for that cell / keyspace does not exist yet. It will also be changed during horizontal and vertical splits.
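The rolled-up record can be checked with vtctl; for example, with a hypothetical cell cell1 and keyspace user:
# Show the SrvKeyspace record for one cell / keyspace:
vtctl $TOPOLOGY GetSrvKeyspace cell1 user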
SrvVSchema
It is the local roll-up for the VSchema. It contains the VSchema for all keyspaces in a single object.
It can be rebuilt by running vtctl RebuildVSchemaGraph. It is automatically rebuilt when using vtctl ApplyVSchema (unless prevented by flags).
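It can be inspected in the same way; a minimal example, assuming a cell named cell1:
# Show the SrvVSchema record for one cell:
vtctl $TOPOLOGY GetSrvVSchema cell1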
Workflows involving the Topology Service
The Topology Service is involved in many Vitess workflows.
When a Tablet is initialized, we create the Tablet record, and add the Tablet to the Replication Graph. If it is the master for a Shard, we update the global Shard record as well.
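As an illustration of that first step, a Tablet record can be created explicitly with vtctl InitTablet. The hostname, port, keyspace, shard and alias below are all made up, and the exact flags may differ between Vitess versions (check the vtctl online help for the exact flags):
# Create the Tablet record for a new replica tablet in cell1:
vtctl $TOPOLOGY InitTablet \
-hostname host1.example.com \
-port 15101 \
-keyspace user \
-shard -80 \
cell1-0000000101 replica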
Administration tools need to find the tablets for a given Keyspace / Shard. To retrieve this:
- first we get the list of Cells that have Tablets for the Shard (the global topology Shard record has these),
- then we use the Replication Graph for that Cell / Keyspace / Shard to find all the tablets,
- then we can read each tablet record.

When a Shard is reparented, we need to update the global Shard record with the new master alias.
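vtctl can perform essentially this lookup in one step; the keyspace and shard names here are again hypothetical:
# List all tablets, across cells, for one keyspace / shard:
vtctl $TOPOLOGY ListShardTablets user/-80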
Finding a tablet to serve the data is done in two stages:
- vtgate maintains a health check connection to all possible tablets, and they report which Keyspace / Shard / Tablet type they serve.
- vtgate also reads the SrvKeyspace object, to find out the shard map.

With these two pieces of information, vtgate can route the query to the right vttablet.
During resharding events, we also change the topology significantly. A horizontal split will change the global Shard records, and the local SrvKeyspace records. A vertical split will change the global Keyspace records, and the local SrvKeyspace records.
Exploring the data in a Topology Service
We store the proto3 serialized binary data for each object.
We use the following paths for the data, in all implementations:
Global Cell:
- CellInfo path: cells/<cell name>/CellInfo
- Keyspace: keyspaces/<keyspace>/Keyspace
- Shard: keyspaces/<keyspace>/shards/<shard>/Shard
- VSchema: keyspaces/<keyspace>/VSchema

Local Cell:
- Tablet: tablets/<cell>-<uid>/Tablet
- Replication Graph: keyspaces/<keyspace>/shards/<shard>/ShardReplication
- SrvKeyspace: keyspaces/<keyspace>/SrvKeyspace
- SrvVSchema: SrvVSchema
The vtctl TopoCat utility can decode these files when using the -decode_proto option:
TOPOLOGY="-topo_implementation zk2 -topo_global_server_address global_server1,global_server2 -topo_global_root /vitess/global"
$ vtctl $TOPOLOGY TopoCat -decode_proto -long /keyspaces/*/Keyspace
path=/keyspaces/ks1/Keyspace version=53
sharding_column_name: "col1"
path=/keyspaces/ks2/Keyspace version=55
sharding_column_name: "col2"
The vtctld web tool also contains a topology browser (use the Topology tab on the left side). It will display the various proto files, decoded.
Implementations
The Topology Service interfaces are defined in our code in go/vt/topo/, specific implementations are in go/vt/topo/<name>, and we also have a set of unit tests for them in go/vt/topo/test.
This part describes the implementations we have, and their specific behavior.
If starting from scratch, please use the zk2, etcd2 or consul implementations. We deprecated the old zookeeper and etcd implementations. See the migration section below if you want to migrate.
Zookeeper zk2 implementation
This is the current implementation when using Zookeeper. (The old zookeeper implementation is deprecated.)
The global cell typically has around 5 servers, distributed one in each cell. The local cells typically have 3 or 5 servers, in different server racks / sub-networks for higher resilience. For our integration tests, we use a single ZK server that serves both global and local cells.
We provide the zk utility for easy access to the topology data in Zookeeper. It can list, read and write files inside any Zookeeper server. Just specify the -server parameter to point to the Zookeeper servers. Note the vtctld UI can also be used to see the contents of the topology data.
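For example, assuming the server addresses and root directories used later in this document and a made-up keyspace name, listing and reading a Keyspace file would look like this (remember the contents are binary proto3, so vtctl TopoCat -decode_proto is usually more convenient for reading):
# List the keyspaces directory in the global topology:
zk -server global_server1,global_server2 ls /vitess/global/keyspaces
# Read one Keyspace file (binary proto3 output):
zk -server global_server1,global_server2 cat /vitess/global/keyspaces/user/Keyspace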
To configure a Zookeeper installation, let’s start with the global cell service. It is described by the addresses of the servers (comma separated list), and by the root directory to put the Vitess data in. For instance, assuming we want to use servers global_server1,global_server2 in path /vitess/global:
# The root directory in the global server will be created
# automatically, same as when running this command:
# zk -server global_server1,global_server2 touch -p /vitess/global
# Set the following flags to let Vitess use this global server:
# -topo_implementation zk2
# -topo_global_server_address global_server1,global_server2
# -topo_global_root /vitess/global
Then to add a cell whose local topology service cell1_server1,cell1_server2 will store its data under the directory /vitess/cell1:
TOPOLOGY="-topo_implementation zk2 -topo_global_server_address global_server1,global_server2 -topo_global_root /vitess/global"
# Reference cell1 in the global topology service:
vtctl $TOPOLOGY AddCellInfo \
-server_address cell1_server1,cell1_server2 \
-root /vitess/cell1 \
cell1
If only one cell is used, the same Zookeeper instance can be used for both global and local data. A local cell record still needs to be created, just use the same server address, and very importantly a different root directory.
Zookeeper Observers can also be used to limit the load on the global Zookeeper. They are configured by specifying the addresses of the observers in the server address, after a |, for instance: global_server1:p1,global_server2:p2|observer1:po1,observer2:po2.
Implementation details
We use the following paths for Zookeeper-specific data, in addition to the regular files:
- Locks sub-directory: locks/ (for instance: keyspaces/<keyspace>/Keyspace/locks/ for a keyspace)
- Master election path: elections/<name>
Both locks and master election are implemented using ephemeral, sequential files which are stored in their respective directory.
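For example, pending lock files for a keyspace could be listed with the zk utility (server addresses and the keyspace name follow the hypothetical examples used elsewhere in this document):
# Show current lock files, if any, for a keyspace:
zk -server global_server1,global_server2 ls /vitess/global/keyspaces/user/Keyspace/locks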
etcd etcd2 implementation (new version of etcd)
This topology service plugin is meant to use etcd clusters as storage backend for the topology data. This topology service supports version 3 and up of the etcd server.
This implementation is named etcd2 because it supersedes our previous implementation etcd. Note that the storage format has been changed with the etcd2 implementation, i.e. existing data created by the previous etcd implementation must be migrated manually (see the migration section below).
To configure an etcd2 installation, let’s start with the global cell service. It is described by the addresses of the servers (comma separated list), and by the root directory to put the Vitess data in. For instance, assuming we want to use servers http://global_server1,http://global_server2 in path /vitess/global:
# Set the following flags to let Vitess use this global server,
# and simplify the example below:
# -topo_implementation etcd2
# -topo_global_server_address http://global_server1,http://global_server2
# -topo_global_root /vitess/global
TOPOLOGY="-topo_implementation etcd2 -topo_global_server_address http://global_server1,http://global_server2 -topo_global_root /vitess/global
Then to add a cell whose local topology service http://cell1_server1,http://cell1_server2 will store its data under the directory /vitess/cell1:
# Reference cell1 in the global topology service:
# (the TOPOLOGY variable is defined in the previous section)
vtctl $TOPOLOGY AddCellInfo \
-server_address http://cell1_server1,http://cell1_server2 \
-root /vitess/cell1 \
cell1
If only one cell is used, the same etcd instances can be used for both global and local data. A local cell record still needs to be created, just use the same server address and, very importantly, a different root directory.
Implementation details
For locks, we use a subdirectory named locks in the directory to lock, and an ephemeral file in that subdirectory (it is associated with a lease, whose TTL can be set with the -topo_etcd_lease_duration flag, and defaults to 30 seconds). The ephemeral file with the lowest ModRevision has the lock; the others wait for files with older ModRevisions to disappear.
Master elections also use a subdirectory, named after the election Name, and use a method similar to the locks, with ephemeral files.
We store the proto3 binary data for each object (as the v3 API allows us to store binary data). Note that this means that if you want to interact with etcd using the etcdctl tool, you will have to tell it to use the v3 API, e.g.:
ETCDCTL_API=3 etcdctl get / --prefix --keys-only
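For example, to only look at the Vitess data under the global root used in the examples above:
# List all keys under the Vitess global root:
ETCDCTL_API=3 etcdctl get /vitess/global/ --prefix --keys-only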
Consul consul implementation
This topology service plugin is meant to use Consul clusters as storage backend for the topology data.
To configure a consul installation, let’s start with the global cell service. It is described by the address of a server, and by the root node path to put the Vitess data in (it cannot start with /). For instance, assuming we want to use server global_server:global_port with node path vitess/global:
# Set the following flags to let Vitess use this global server,
# and simplify the example below:
# -topo_implementation consul
# -topo_global_server_address global_server:global_port
# -topo_global_root vitess/global
TOPOLOGY="-topo_implementation consul -topo_global_server_address global_server:global_port -topo_global_root vitess/global
Then to add a cell whose local topology service cell1_server1:cell1_port will store its data under the node path vitess/cell1:
# Reference cell1 in the global topology service:
# (the TOPOLOGY variable is defined in the previous section)
vtctl $TOPOLOGY AddCellInfo \
-server_address cell1_server1:cell1_port \
-root vitess/cell1 \
cell1
If only one cell is used, the same consul instances can be used for both global and local data. A local cell record still needs to be created, just use the same server address and, very importantly, a different root node path.
Implementation details
For locks, we use a file named Lock in the directory to lock, and the regular Consul Lock API.
Master elections use a single lock file (the Election path) and the regular Consul Lock API. The content of the lock file is the ID of the current master.
Watches use the Consul long polling Get call. They cannot be interrupted, so we use a long poll whose duration is set by the -topo_consul_watch_poll_duration flag. Canceling a watch may have to wait until the end of a polling cycle with that duration before returning.
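A sketch of tuning that behavior; the value below is arbitrary, and the flag is passed to the Vitess processes that use the consul topology:
# Bound how long canceling a watch can block (value is an example):
# -topo_consul_watch_poll_duration 15s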
Running in only one cell
The topology service is meant to be distributed across multiple cells, and survive single cell outages. However, one common usage is to run a Vitess cluster in only one cell / region. This part explains how to do this, and later on upgrade to multiple cells / regions.
If running in a single cell, the same topology service can be used for both global and local data. A local cell record still needs to be created, just use the same server address and, very importantly, a different root node path.
In that case, just running 3 servers for topology service quorum is probably sufficient; for instance, 3 etcd servers, whose address is used for the local cell as well. Let’s use a short cell name, like local, as the local data in that topology service will later on be moved to a different topology service, which will have the real cell name.
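A minimal sketch of such a single-cell setup with etcd2; the server addresses are hypothetical, and the key point is reusing the same servers with a different root for the local cell:
# Global and local data share the same 3 etcd servers, but use different roots:
TOPOLOGY="-topo_implementation etcd2 -topo_global_server_address http://server1:2379,http://server2:2379,http://server3:2379 -topo_global_root /vitess/global"
# Create the "local" cell record, pointing back at the same servers:
vtctl $TOPOLOGY AddCellInfo \
-server_address http://server1:2379,http://server2:2379,http://server3:2379 \
-root /vitess/local \
local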
Extending to more cells
To then run in multiple cells, the current topology service needs to be split into a global instance and one local instance per cell. Whereas the initial setup had 3 topology servers (used for global and local data), we recommend running 5 global servers across all cells (for global topology data) and 3 local servers per cell (for per-cell topology data).
To migrate to such a setup, start by adding the 3 local servers in the second cell and run vtctl AddCellInfo as was done for the first cell. Tablets and vtgates can now be started in the second cell, and used normally.
vtgate can then be configured with a list of cells to watch for tablets using the -cells_to_watch command line parameter. It can then use all tablets in all cells to route traffic. Note this is necessary to access the master in another cell.
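For example (cell names are illustrative), the flag is a comma-separated list added to the vtgate command line:
# Let vtgate route queries to tablets in both cells:
# -cells_to_watch local,cell2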
After the extension to two cells, the original topo service contains both the global topology data and the first cell topology data. The more symmetrical configuration we are after would be to split that original service into two: a global one that only contains the global data (spread across both cells), and a local one for the original cell. To achieve that split:
- Start up a new local topology service in that original cell (3 more local servers in that cell).
- Pick a name for that cell, different from local.
- Use vtctl AddCellInfo to configure it.
- Make sure all vtgates can see that new local cell (again, using -cells_to_watch).
- Restart all vttablets to be in that new cell, instead of the local cell name used before.
- Use vtctl RemoveKeyspaceCell to remove all mentions of the local cell in all keyspaces (see the sample commands below).
- Use vtctl RemoveCellInfo to remove the global configuration for that local cell.
- Remove all remaining data in the global topology service that is in the old local server root.

After this split, the configuration is completely symmetrical:
- a global topology service, with servers in all cells. Only contains global topology data about Keyspaces, Shards and VSchema. Typically it has 5 servers across all cells.
- a local topology service in each cell, with servers only in that cell. Only contains local topology data about Tablets, and roll-ups of global data for efficient access. Typically, it has 3 servers in each cell.
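Sample commands for those last cleanup steps, assuming the temporary cell was named local, a single keyspace named keyspace1 (both names are illustrative), and $TOPOLOGY flags pointing at the new global topology service:
# Remove all references to the old cell from each keyspace:
vtctl $TOPOLOGY RemoveKeyspaceCell keyspace1 local
# Then remove the old cell's global configuration:
vtctl $TOPOLOGY RemoveCellInfo local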
Migration between implementations
We provide the topo2topo utility to migrate from one implementation of the topology service to another.
The process to follow in that case is:
- Start from a stable topology, where no resharding or reparenting is ongoing.
- Configure the new topology service so it has at least all the cells of the source topology service. Make sure it is running.
- Run the topo2topo program with the right flags. -from_implementation, -from_root, -from_server describe the source (old) topology service. -to_implementation, -to_root, -to_server describe the destination (new) topology service.
- Run vtctl RebuildKeyspaceGraph for each keyspace using the new topology service flags.
- Run vtctl RebuildVSchemaGraph using the new topology service flags.
- Restart all vtgate processes using the new topology service flags. They will see the same Keyspaces / Shards / Tablets / VSchema as before, as the topology was copied over.
- Restart all vttablet processes using the new topology service flags. They may use the same ports or not, but they will update the new topology when they start up, and be visible from vtgate.
- Restart all vtctld processes using the new topology service flags, so that the UI also shows the new data.

Sample commands to migrate from the deprecated zookeeper topology to zk2 would be:
# Let's assume the zookeeper client config file is already
# exported in $ZK_CLIENT_CONFIG, and it contains a global record
# pointing to: global_server1,global_server2
# and a local cell cell1 pointing to cell1_server1,cell1_server2
#
# The existing directories created by Vitess are:
# /zk/global/vt/...
# /zk/cell1/vt/...
#
# The new zk2 implementation can use any root, so we will use:
# /vitess/global in the global topology service, and:
# /vitess/cell1 in the local topology service.
# Create the new topology service roots in global and local cell.
zk -server global_server1,global_server2 touch -p /vitess/global
zk -server cell1_server1,cell1_server2 touch -p /vitess/cell1
# Store the flags in a shell variable to simplify the example below.
TOPOLOGY="-topo_implementation zk2 -topo_global_server_address global_server1,global_server2 -topo_global_root /vitess/global"
# Reference cell1 in the global topology service:
vtctl $TOPOLOGY AddCellInfo \
-server_address cell1_server1,cell1_server2 \
-root /vitess/cell1 \
cell1
# Now copy the topology. Note the old zookeeper implementation does not need
# any server or root parameter, as it reads ZK_CLIENT_CONFIG.
topo2topo \
-from_implementation zookeeper \
-to_implementation zk2 \
-to_server global_server1,global_server2 \
-to_root /vitess/global
# Rebuild SrvKeyspace objects in the new service, for each keyspace.
vtctl $TOPOLOGY RebuildKeyspaceGraph keyspace1
vtctl $TOPOLOGY RebuildKeyspaceGraph keyspace2
# Rebuild SrvVSchema objects in new service.
vtctl $TOPOLOGY RebuildVSchemaGraph
# Now restart all vtgate, vttablet, vtctld processes replacing:
# -topo_implementation zookeeper
# With:
# -topo_implementation zk2
# -topo_global_server_address global_server1,global_server2
# -topo_global_root /vitess/global
#
# After this, the ZK_CLIENT_CONFIG file and environment variables are not needed
# any more.
Migration using the Tee implementation
If your migration is more complex, or has special requirements, we also support a ‘tee’ implementation of the topo service interface. It is defined in go/vt/topo/helpers/tee.go. It allows communicating with two topo services, and the migration uses multiple phases:
- Start with the old topo service implementation we want to replace.
- Bring up the new topo service, with the same cells.
- Use topo2topo to copy the current data from the old to the new topo.
- Configure a Tee topo implementation to maintain both services.
  - Note we do not expose a plugin for this, so a small code change is necessary.
  - All updates will go to both services.
  - The primary topo service is the one we will get errors from, if any.
  - The secondary topo service is just kept in sync.
  - At first, use the old topo service as primary, and the new one as secondary.
  - Then, change the configuration to use the new one as primary, and the old one as secondary. Reverse the lock order here.
  - Then roll out a configuration to just use the new service.