Replication

A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. This section introduces replication in MongoDB as well as the components and architecture of replica sets. The section also provides tutorials for common tasks related to replica sets.

Redundancy and Data Availability

Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication provides a level of fault tolerance against the loss of a single database server.

In some cases, replication can provide increased read capacity as clients can send read operations to different servers. Maintaining copies of data in different data centers can increase data locality and availability for distributed applications. You can also maintain additional copies for dedicated purposes, such as disaster recovery, reporting, or backup.

Replication in MongoDB

A replica set is a group of mongod instances that maintain the same data set. A replica set contains several data bearing nodes and optionally one arbiter node. Of the data bearing nodes, one and only one member is deemed the primary node, while the other nodes are deemed secondary nodes.

The primary node receives all write operations. A replica set can have only one primary capable of confirming writes with { w: "majority" } write concern; although in some circumstances, another mongod instance may transiently believe itself to also be primary.[1] The primary records all changes to its data sets in its operation log, i.e. oplog. For more information on primary node operation, see Replica Set Primary.
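
As a brief sketch of how a client interacts with the primary, the following pymongo snippet (host, database, and collection names are hypothetical) connects to a replica set and issues a write that is acknowledged only after a majority of data-bearing members have recorded it:

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    # Hypothetical replica set named "rs0" with three members.
    client = MongoClient(
        "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")

    # All writes are routed to the current primary; w="majority" makes the
    # acknowledgement wait until a majority of data-bearing members have
    # applied the write.
    orders = client.mydb.get_collection(
        "orders", write_concern=WriteConcern(w="majority"))
    orders.insert_one({"status": "new"})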

Diagram of default routing of reads and writes to the primary.

The secondaries replicate the primary's oplog and apply the operations to their data sets such that the secondaries' data sets reflect the primary's data set. If the primary is unavailable, an eligible secondary will hold an election to elect itself the new primary. For more information on secondary members, see Replica Set Secondary Members.

Diagram of a 3 member replica set that consists of a primary and two secondaries.

You may add an extra mongod instance to a replica set as an arbiter. Arbiters do not maintain a data set. The purpose of an arbiter is to maintain a quorum in a replica set by responding to heartbeat and election requests by other replica set members. Because they do not store a data set, arbiters can be a good way to provide replica set quorum functionality with a cheaper resource cost than a fully functional replica set member with a data set. If your replica set has an even number of members, add an arbiter to obtain a majority of votes in an election for primary. Arbiters do not require dedicated hardware. For more information on arbiters, see Replica Set Arbiter.

Diagram of a replica set that consists of a primary, a secondary, and an arbiter.

An arbiter will always be an arbiter, whereas a primary may step down and become a secondary and a secondary may become the primary during an election.
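
To see which role each member currently holds, you can inspect the replica set status. The following is a minimal pymongo sketch, assuming a client already connected to the replica set, that uses the replSetGetStatus command:

    # Ask a member for its current view of the replica set.
    status = client.admin.command("replSetGetStatus")
    for member in status["members"]:
        # stateStr is, for example, "PRIMARY", "SECONDARY", or "ARBITER".
        print(member["name"], member["stateStr"])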

Asynchronous Replication

Secondaries replicate the primary's oplog and apply the operations to their data sets asynchronously. By having the secondaries' data sets reflect the primary's data set, the replica set can continue to function despite the failure of one or more members.

For more information on replication mechanics, see Replica Set Oplog and Replica Set Data Synchronization.
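
For illustration only, the oplog is an ordinary capped collection (local.oplog.rs) that a member allows you to read. The pymongo sketch below fetches the most recent entry on a member; treat it as a diagnostic aid under the assumption that you have direct access to the member, not as an application API:

    # The oplog lives in the "local" database on each data-bearing member.
    oplog = client.local["oplog.rs"]

    # Fetch the newest entry in natural (insertion) order.
    last = oplog.find().sort("$natural", -1).limit(1).next()
    print(last["ts"], last["op"], last["ns"])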

Slow Operations

Starting in version 4.2 (also available starting in 4.0.6), secondary members of a replica set now log oplog entries that take longer than the slow operation threshold to apply. These slow oplog messages are logged for the secondaries in the diagnostic log under the REPL component with the text applied op: <oplog entry> took <num>ms. These slow oplog entries depend only on the slow operation threshold. They do not depend on the log levels (either at the system or component level), or the profiling level, or the slow operation sample rate. The profiler does not capture slow oplog entries.
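
The slow operation threshold is the ordinary slowms setting, which can be set at startup with --slowms or operationProfiling.slowOpThresholdMs. As a hedged sketch, the pymongo call below adjusts it at runtime with the profile command, assuming the member accepts the command with profiling level 0:

    # Lower the slow operation threshold to 50 ms on this member.
    # Level 0 leaves the profiler off; only the threshold changes.
    client.admin.command("profile", 0, slowms=50)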

Replication Lag and Flow Control

Replication lag refers to the amount of time that it takes to copy (i.e. replicate) a write operation on the primary to a secondary. Some small delay period may be acceptable, but significant problems emerge as replication lag grows, including building cache pressure on the primary.
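
One way to approximate the current lag is to compare member optimes reported by replSetGetStatus. This pymongo sketch, assuming an existing client, subtracts each secondary's last applied optime from the primary's:

    status = client.admin.command("replSetGetStatus")

    # optimeDate is the wall-clock time of the member's last applied oplog entry.
    primary_optime = next(m["optimeDate"] for m in status["members"]
                          if m["stateStr"] == "PRIMARY")

    for m in status["members"]:
        if m["stateStr"] == "SECONDARY":
            print(m["name"], "lag:", primary_optime - m["optimeDate"])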

Starting in MongoDB 4.2, administrators can limit the rate at which the primary applies its writes with the goal of keeping the majority committed lag under a configurable maximum value, flowControlTargetLagSeconds.

By default, flow control is enabled.

Note

For flow control to engage, the replica set/sharded cluster must have: featureCompatibilityVersion (FCV) of 4.2 and read concern majority enabled. That is, enabled flow control has no effect if FCV is not 4.2 or if read concern majority is disabled.

With flow control enabled, as the lag grows close to the flowControlTargetLagSeconds, writes on the primary must obtain tickets before taking locks to apply writes. By limiting the number of tickets issued per second, the flow control mechanism attempts to keep the lag under the target.
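
As a rough pymongo sketch (exact field names depend on your server version), you can read the configured target with getParameter and observe the mechanism through the flowControl section of serverStatus:

    # Read the flow control parameters currently in effect.
    params = client.admin.command(
        "getParameter", 1, enableFlowControl=1, flowControlTargetLagSeconds=1)

    # serverStatus exposes a flowControl section on 4.2+ members, including
    # whether the set is considered lagged and the current ticket rate limit.
    flow = client.admin.command("serverStatus")["flowControl"]
    print(params, flow.get("isLagged"), flow.get("targetRateLimit"))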

For more information, see Check the Replication Lag and Flow Control.

Automatic Failover

When a primary does not communicate with the other members of the set for more than the configured electionTimeoutMillis period (10 seconds by default), an eligible secondary calls for an election to nominate itself as the new primary. The cluster attempts to complete the election of a new primary and resume normal operations.

Diagram of an election of a new primary. In a three member replica set with two secondaries, the primary becomes unreachable. The loss of a primary triggers an election where one of the secondaries becomes the new primary

The replica set cannot process write operations until the election completes successfully. The replica set can continue to serve read queries if such queries are configured to run on secondaries while the primary is offline.

The median time before a cluster elects a new primary should not typically exceed 12 seconds, assuming default replica configuration settings. This includes time required to mark the primary as unavailable and call and complete an election. You can tune this time period by modifying the settings.electionTimeoutMillis replication configuration option. Factors such as network latency may extend the time required for replica set elections to complete, which in turn affects the amount of time your cluster may operate without a primary. These factors are dependent on your particular cluster architecture.

Lowering the electionTimeoutMillis replication configuration option from the default 10000 (10 seconds) can result in faster detection of primary failure. However, the cluster may call elections more frequently due to factors such as temporary network latency even if the primary is otherwise healthy. This can result in increased rollbacks for w : 1 write operations.
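
Adjusting the timeout is an ordinary replica set reconfiguration. A minimal pymongo sketch, assuming you are connected to the primary and accept the trade-off described above:

    # Fetch the current configuration, change the election timeout, and
    # submit the new configuration with an incremented version number.
    cfg = client.admin.command("replSetGetConfig")["config"]
    cfg.setdefault("settings", {})["electionTimeoutMillis"] = 5000  # 5 seconds
    cfg["version"] += 1
    client.admin.command("replSetReconfig", cfg)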

Your application connection logic should include tolerance for automatic failovers and the subsequent elections. Starting in MongoDB 3.6, MongoDB drivers can detect the loss of the primary and automatically retry certain write operations a single time, providing additional built-in handling of automatic failovers and elections (see the connection-string sketch after the following list):

  • MongoDB 4.2-compatible drivers enable retryable writes by default.
  • MongoDB 4.0 and 3.6-compatible drivers must explicitly enable retryable writes by including retryWrites=true in the connection string.
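
For example, with a 3.6- or 4.0-compatible driver such as an older pymongo release, the option is set in the connection string (host names here are placeholders):

    from pymongo import MongoClient

    # retryWrites=true asks the driver to retry eligible write operations a
    # single time if the primary becomes unavailable during the operation.
    client = MongoClient(
        "mongodb://host1:27017,host2:27017,host3:27017/"
        "?replicaSet=rs0&retryWrites=true")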

See Replica Set Elections for complete documentation on replica set elections.

To learn more about MongoDB’s failover process, see:

Read Operations

Read Preference

By default, clients read from the primary [1]; however, clients can specify a read preference to send read operations to secondaries.

Diagram of an application that uses read preference secondary.

Asynchronous replication to secondaries means that reads from secondaries may return data that does not reflect the state of the data on the primary.
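
As a brief pymongo illustration (database and collection names are hypothetical), a read preference can be set per collection; secondaryPreferred sends reads to a secondary when one is available:

    from pymongo import MongoClient, ReadPreference

    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

    # Reads through this handle go to a secondary when one is available,
    # accepting that the returned data may lag the primary.
    orders = client.mydb.get_collection(
        "orders", read_preference=ReadPreference.SECONDARY_PREFERRED)
    print(orders.count_documents({}))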

Multi-document transactions that contain read operations must use read preference primary. All operations in a given transaction must route to the same member.

For information on reading from replica sets, see Read Preference.

Data Visibility

Depending on the read concern, clients can see the results of writes before the writes are durable:

  • Regardless of a write's write concern, other clients using "local" or "available" read concern can see the result of a write operation before the write operation is acknowledged to the issuing client.
  • Clients using "local" or "available" read concern can read data which may be subsequently rolled back during replica set failovers (see the read concern sketch after this list).
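
A short pymongo sketch of the distinction (names are placeholders): the same collection read with "local" versus "majority" read concern may return different results while a write is still being replicated or could still roll back:

    from pymongo.read_concern import ReadConcern

    db = client.mydb

    # "local" returns the member's most recent data, which may later roll back.
    local_orders = db.get_collection(
        "orders", read_concern=ReadConcern("local"))

    # "majority" returns only data acknowledged by a majority of members.
    majority_orders = db.get_collection(
        "orders", read_concern=ReadConcern("majority"))

    print(local_orders.count_documents({}),
          majority_orders.count_documents({}))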

For operations in a multi-document transaction, when a transaction commits, all data changes made in the transaction are saved and visible outside the transaction. That is, a transaction will not commit some of its changes while rolling back others.

Until a transaction commits, the data changes made in the transaction are not visible outside the transaction.

However, when a transaction writes to multiple shards, not all outside read operations need to wait for the result of the committed transaction to be visible across the shards. For example, if a transaction is committed and write 1 is visible on shard A but write 2 is not yet visible on shard B, an outside read at read concern "local" can read the results of write 1 without seeing write 2.

For more information on read isolation, consistency, and recency for MongoDB, see Read Isolation, Consistency, and Recency.

Transactions

Starting in MongoDB 4.0, multi-document transactions are available for replica sets.

Multi-document transactions that contain read operations must use read preference primary. All operations in a given transaction must route to the same member.

Until a transaction commits, the data changes made in the transaction are not visible outside the transaction.

However, when a transaction writes to multiple shards, not all outside read operations need to wait for the result of the committed transaction to be visible across the shards. For example, if a transaction is committed and write 1 is visible on shard A but write 2 is not yet visible on shard B, an outside read at read concern "local" can read the results of write 1 without seeing write 2.
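
A minimal pymongo transaction sketch, assuming MongoDB 4.0+ on a replica set and hypothetical orders and inventory collections; both operations either commit together or roll back together:

    db = client.mydb

    with client.start_session() as session:
        # All operations in the transaction pass the session and, by default,
        # use read preference primary.
        with session.start_transaction():
            db.orders.insert_one({"sku": "abc123", "qty": 1}, session=session)
            db.inventory.update_one(
                {"sku": "abc123"}, {"$inc": {"qty": -1}}, session=session)
        # Leaving the block without an exception commits the transaction.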

Change Streams

Starting in MongoDB 3.6, change streams are available for replica sets and sharded clusters. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection or collections.
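
A small pymongo sketch (the collection name is hypothetical) that subscribes to changes on one collection and prints each event as it arrives:

    # watch() opens a change stream; iterating blocks until changes occur.
    with client.mydb.orders.watch() as stream:
        for change in stream:
            # Each event describes one insert, update, replace, delete, etc.
            print(change["operationType"], change.get("documentKey"))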

Additional Features

Replica sets provide a number of options to support application needs. For example, you may deploy a replica set with members in multiple data centers, or control the outcome of elections by adjusting the members[n].priority of some members. Replica sets also support dedicated members for reporting, disaster recovery, or backup functions.
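
Member priorities are changed with the same reconfiguration pattern shown earlier. This sketch (the member index and values are illustrative) makes one member ineligible to become primary and hides it from application reads:

    cfg = client.admin.command("replSetGetConfig")["config"]

    # A priority of 0 prevents this member from seeking election as primary;
    # hidden members also require priority 0.
    cfg["members"][2]["priority"] = 0
    cfg["members"][2]["hidden"] = True
    cfg["version"] += 1
    client.admin.command("replSetReconfig", cfg)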

See Priority 0 Replica Set Members, Hidden Replica Set Members, and Delayed Replica Set Members for more information.

[1] In some circumstances, two nodes in a replica set may transiently believe that they are the primary, but at most, one of them will be able to complete writes with { w: "majority" } write concern. The node that can complete { w: "majority" } writes is the current primary, and the other node is a former primary that has not yet recognized its demotion, typically due to a network partition. When this occurs, clients that connect to the former primary may observe stale data despite having requested read preference primary, and new writes to the former primary will eventually roll back.