Master leases
Some applications have strict requirements about the consistency of data read on a master site. Berkeley DB provides a mechanism called master leases to provide such consistency. Without master leases, unfortunate scheduling can sometimes cause Berkeley DB to return old data to an application when newer data is available, as illustrated below:
- Application on master site: Read data item foo via Berkeley DB DB->get() or DBC->get() call.
- Application on master site: sleep, get descheduled, etc.
- System: Master changes role, becomes a client.
- System: New site is elected master.
- System: New master modifies data item foo.
- Application: Berkeley DB returns old data for foo to application.
By using master leases, Berkeley DB can provide guarantees about the consistency of data read on a master site. The master site can be considered a recognized authority for the data and consequently can provide authoritative reads. Clients grant master leases to a master site. By doing so, clients acknowledge the right of that site to retain the role of master for a period of time. During that period of time, clients cannot elect a new master, become master, or grant their lease to another site.
By holding a collection of granted leases, a master site can guarantee to the application that the data returned is the current, authoritative value. As a master performs operations, it continually requests updated grants from the clients. When a read operation is required, the master guarantees that it holds a valid collection of lease grants from clients before returning data to the application. By holding leases, Berkeley DB provides several guarantees to the application:
- Authoritative reads: A guarantee that the data being read by the application is the current value.
- Durability from rollback: A guarantee that the data being written or read by the application is permanent across a majority of sites and will never be rolled back.
  The rollback guarantee also depends on the DB_TXN_NOSYNC flag. The guarantee is effective as long as half of the replication group does not fail while clients have granted leases but are still holding the updates in their cache. The application must weigh the performance impact of synchronous transactions against the risk of failure of at least half of the replication group. If clients grant a lease while holding updated data only in their cache and a failure occurs, the data is no longer present on those clients, and rollback can occur if a sufficient number of other sites also crash.
  The guarantee that data will not be rolled back applies only to data successfully committed on a master. Data read on a client, or read while ignoring leases, can be rolled back.
- Freshness: A guarantee that the data being read by the application on the master is up-to-date and has not been modified or removed during the read.
  The read authority resides only on the master. Read operations on a client always ignore leases and, consequently, can return stale data.
- Master viability: A guarantee that a current master with valid leases cannot encounter a duplicate master situation.
  Leases remove the possibility of a duplicate master situation that forces the current master to downgrade to a client. However, it is still possible for old masters with expired leases to discover a later master and return DB_REP_DUPMASTER to the application.
An application using leases must meet several requirements:
- Replication Manager applications must configure a majority (or larger) acknowledgement policy via the DB_ENV->repmgr_set_ack_policy() method. Base API applications must implement and enforce such a policy on their own.
- Base API applications must return an error from the send callback function when the majority acknowledgement policy is not met for permanent records marked with DB_REP_PERMANENT; a sketch of such a callback follows this list. Note that the Replication Manager fulfills this requirement automatically.
- Base API applications must set the number of sites in the group using the DB_ENV->rep_set_nsites() method before starting replication and cannot change it during operation.
- Using leases in a replication group is all or none. Behavior is undefined when some sites configure leases and others do not. Use the DB_ENV->rep_set_config() method to turn on leases; the configuration sketch after this list illustrates this call.
- The configured lease timeout value must be the same on all sites in a replication group, set via the DB_ENV->rep_set_timeout() method.
- The configured clock skew ratio must be the same on all sites in a replication group. This value defaults to no skew, but can be set via the DB_ENV->rep_set_clockskew() method.
- Applications that care about read guarantees must perform all read operations on the master. Reading on a client does not guarantee freshness.
- The application must use elections to choose a master site. It must never simply declare a master without having won an election (as is allowed without master leases).
- Unelectable (zero priority) sites never grant leases and cannot be used to guarantee data durability. A majority of sites in the replication group must be electable in order to meet the requirement of getting lease grants from a majority of sites. Minimizing the number of unelectable sites improves replication group availability.
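As a sketch only: the following Base API transport callback enforces a majority acknowledgement policy for permanent records. The broadcast_message() and wait_for_acks() helpers and the total_sites variable are hypothetical stand-ins for the application's own messaging layer; only the DB_REP_PERMANENT check and the nonzero return on failure reflect the Berkeley DB contract.

    #include <errno.h>
    #include <db.h>

    /* Hypothetical application messaging layer. */
    extern int broadcast_message(DB_ENV *, const DBT *, const DBT *,
        const DB_LSN *, int, u_int32_t);
    extern int wait_for_acks(const DB_LSN *);
    extern int total_sites;    /* Group size, matching rep_set_nsites(). */

    static int
    app_send(DB_ENV *dbenv, const DBT *control, const DBT *rec,
        const DB_LSN *lsnp, int envid, u_int32_t flags)
    {
        int acks;

        /* Hand the message to the application's own transport. */
        if (broadcast_message(dbenv, control, rec, lsnp, envid, flags) != 0)
            return (EIO);

        /*
         * For permanent records, wait for acknowledgements.  Counting
         * this master, acks + 1 sites must form a majority; otherwise
         * return nonzero so Berkeley DB does not consider the record
         * durable.
         */
        if (flags & DB_REP_PERMANENT) {
            acks = wait_for_acks(lsnp);
            if (acks + 1 <= total_sites / 2)
                return (DB_REP_UNAVAIL);
        }
        return (0);
    }

Such a callback would be registered with the DB_ENV->rep_set_transport() method.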
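The configuration calls themselves might look like the following sketch for a Base API application; the five-site group size, two-second lease timeout, and 2% clock skew are illustrative values only, and all of these calls must precede DB_ENV->rep_start(). A Replication Manager application would additionally set a majority acknowledgement policy with DB_ENV->repmgr_set_ack_policy().

    #include <db.h>

    int
    configure_leases(DB_ENV *dbenv)
    {
        int ret;

        /* Leases are all or none across the group: turn them on here,
         * before replication is started. */
        if ((ret = dbenv->rep_set_config(dbenv, DB_REP_CONF_LEASE, 1)) != 0)
            return (ret);

        /* Set the group size before starting replication; it cannot be
         * changed during operation. */
        if ((ret = dbenv->rep_set_nsites(dbenv, 5)) != 0)
            return (ret);

        /* The lease timeout (in microseconds) must be identical on
         * every site in the group. */
        if ((ret = dbenv->rep_set_timeout(dbenv,
            DB_REP_LEASE_TIMEOUT, 2000000)) != 0)
            return (ret);

        /* The clock skew ratio must also be identical on every site;
         * 102/100 allows the fastest clock to run 2% faster than the
         * slowest. */
        return (dbenv->rep_set_clockskew(dbenv, 102, 100));
    }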
Master leases are based on timeouts. Berkeley DB assumes that time always runs forward. Users who change the system clock on either client or master sites when leases are in use void all guarantees and can get undefined behavior. See the DB_ENV->rep_set_timeout() method for more information.
Applications using master leases should be prepared to handle DB_REP_LEASE_EXPIRED errors from read operations on a master and from the DB_TXN->commit() method.
Read operations on a master that should not be subject to leases can use the DB_IGNORE_LEASE flag to the DB->get() method. Read operations on a client always imply leases are ignored.
Master lease checks cannot succeed until a majority of sites have completed client synchronization. Read operations on a master performed before this condition is met can use the DB_IGNORE_LEASE flag to avoid errors.
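As a sketch (assuming an open database handle and a hypothetical key "foo"), a read on the master might first attempt a lease-checked get and then deliberately fall back to an unchecked read, accepting possibly stale data, when the lease check fails:

    #include <string.h>
    #include <db.h>

    int
    read_foo(DB *dbp)
    {
        DBT key, data;
        int ret;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = "foo";
        key.size = (u_int32_t)strlen("foo");

        /* Lease-checked read: succeeds only while this master holds a
         * valid collection of lease grants. */
        ret = dbp->get(dbp, NULL, &key, &data, 0);
        if (ret == DB_REP_LEASE_EXPIRED)
            /* The master cannot verify its leases (for example, before
             * a majority of sites have completed synchronization).
             * Reread without the lease check; the value may be stale. */
            ret = dbp->get(dbp, NULL, &key, &data, DB_IGNORE_LEASE);
        return (ret);
    }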
Clients are forbidden from participating in elections while they have an outstanding lease granted to a master. Therefore, if the DB_ENV->rep_elect() method is called, then Berkeley DB will block, waiting until its lease grant expires before participating in any election. While it waits, the client attempts to contact the current master. If the client finds a current master, then it returns from the DB_ENV->rep_elect() method. When leases are configured and the lease has never yet been granted (on start-up), clients must wait a full lease timeout before participating in an election.
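For example, a Base API client might run elections with a retry loop such as the following sketch; with leases configured, the DB_ENV->rep_elect() call itself may block until this site's outstanding lease grant has expired. The two-second retry delay is an arbitrary choice.

    #include <unistd.h>
    #include <db.h>

    int
    run_election(DB_ENV *dbenv)
    {
        int ret;

        /* Passing 0 for nsites and nvotes uses the group size from
         * rep_set_nsites() and requires a simple majority of votes. */
        while ((ret = dbenv->rep_elect(dbenv, 0, 0, 0)) == DB_REP_UNAVAIL)
            sleep(2);    /* Too few sites responded; wait and retry. */
        return (ret);
    }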
Changing group size
If you are using master leases and you change the size of your replication group, there is a remote possibility that you can lose some data previously thought to be durable. This is only true for users of the Base API.
The problem can arise if you are removing sites from your replication group. (You might even be increasing the size of your group overall, but if you remove the wrong sites in the process, you can lose data.)
Suppose you have a replication group with five sites (A, B, C, D and E) and you are using a quorum acknowledgement policy. Then:
- Master A replicates a transaction to replicas B and C. Those sites acknowledge the write activity.
- Sites D and E do not receive the transaction. However, B and C have acknowledged the transaction, which means the acknowledgement policy is met and so the transaction is considered durable.
- You shut down sites B and C. Now only A has the transaction.
- You decrease the size of your replication group to 3 using DB_ENV->rep_set_nsites().
- You shut down or otherwise lose site A.
- Sites D and E hold an election. Because the size of the replication group is 3, they have enough sites to successfully hold an election. However, neither site has the transaction in question. In this way, the transaction can become lost.
An alternative scenario exists where you do not change the size of your replication group, or you actually increase the size of your replication group, but in the process you happen to remove the exact wrong sites:
- Master A replicates a transaction to replicas B and C. Those sites acknowledge the write activity.
- Sites D and E do not receive the transaction. However, B and C have acknowledged the transaction, which means the acknowledgement policy is met and so the transaction is considered durable.
- You shut down sites B and C, removing them from the replication group. Now only A has the transaction.
- You add three new sites to your replication group (F, G and H), increasing the size of your replication group to 6 using DB_ENV->rep_set_nsites().
- You shut down or otherwise lose site A before F, G and H can be fully populated with data.
- Sites D, E, F, G and H hold an election. Because the size of the replication group is 6, they have enough sites to successfully hold an election. However, none of these sites has the transaction in question. In this way, the transaction can become lost.
This scenario represents a race condition that is highly unlikely to be seen outside of a lab environment. To reduce the chance of this race condition occurring, do one or more of the following when using master leases with the Base API:
- Require all sites to acknowledge transaction commits.
- Never change the size of your replication group unless all sites in the group are running and communicating normally with one another.
- Don't remove (or replace) a large percentage of the sites in your replication group unless all sites in the group are running and communicating normally with one another. If you must remove a large percentage of your sites, remove just one site at a time, pausing between removals to give the replication group a chance to fully distribute all writes before removing the next site.