A split brain is a condition that occurs when two different brokers are serving the same messages at the same time. When this happens, client applications that should all be sharing the same broker may instead become divided between the two split brain brokers. This is problematic because it can lead to:

  • Duplicate messages e.g. when multiple consumers on the same JMS queue split between both brokers and receive the same message(s)

  • Missed messages e.g. when multiple consumers on the same JMS topic split between both brokers and producers are only sending messages to one broker

Split brain most commonly happens when a pair of brokers in an HA replication configuration loses the replication connection linking them together. When this connection is lost, the backup assumes that the primary has died and therefore activates. At this point there are two brokers on the network which are isolated from each other, and since the backup has a copy of all the messages from the primary, they are each serving the same messages.
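To make the duplicate-delivery problem concrete, here is a small Python simulation. It is not broker code: the `deliver` helper and the consumer names are hypothetical, and round-robin distribution between consumers is assumed purely for illustration.

```python
from collections import deque
from typing import Dict, List


def deliver(messages: List[str], consumers: List[str]) -> Dict[str, List[str]]:
    """Queue semantics on a single broker: each message goes to exactly one consumer.

    Round-robin distribution is an assumption made for this illustration.
    """
    received: Dict[str, List[str]] = {c: [] for c in consumers}
    pending = deque(messages)
    turn = 0
    while pending:
        consumer = consumers[turn % len(consumers)]
        received[consumer].append(pending.popleft())
        turn += 1
    return received


messages = ["m1", "m2", "m3"]

# Healthy: both consumers share one broker, so no message is delivered twice.
healthy = deliver(messages, ["consumer-a", "consumer-b"])

# Split brain: the backup holds a replicated copy and activates, and the
# consumers end up on different brokers. Each broker delivers its full copy.
on_primary = deliver(list(messages), ["consumer-a"])
on_backup = deliver(list(messages), ["consumer-b"])

# Messages that were handed to both consumers.
duplicates = set(on_primary["consumer-a"]) & set(on_backup["consumer-b"])
```

In the healthy case the two consumers receive disjoint sets of messages. In the split case every message ends up in `duplicates`, because each broker independently serves its own copy of the queue.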

What about shared store configurations?

While it is technically possible for split brain to happen with a pair of brokers in an HA shared store configuration, it would require a failure in the file-locking mechanism of the storage device which the brokers are sharing.

One of the benefits of using a shared store is that the storage device itself acts as an arbiter to ensure consistency and mitigate split brain.

Recovering from a split brain may be as simple as stopping the broker which activated by mistake. However, this solution is only viable if no client application connected to it and performed messaging operations. The longer client applications are allowed to interact with split brain brokers the more difficult it will be to understand and remediate the resulting problems.

There are several different configurations you can choose from that will help mitigate split brain.

1. Pluggable Lock Manager

A pluggable lock manager configuration requires a 3rd party to establish a shared lock between primary and backup brokers. The shared lock ensures that only one of the two brokers, either the primary or the backup, is active at any given point in time, similar to how the file lock functions in the shared storage use-case.

The plugin decides what 3rd party implementation is used. It could be something as simple as a shared file on a network file system that supports locking (e.g. NFS) or it could be something more complex like etcd.

The broker ships with a reference plugin implementation based on Apache ZooKeeper - a common implementation used for this kind of task.

The main benefit of a pluggable lock manager is that it releases the broker from the responsibility of establishing a reliable vote. This means that a single HA pair of brokers can be reliably protected against split-brain.
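The idea of a shared lock arbitrating between primary and backup can be sketched in Python as follows. This is only an illustration of the concept: `SharedLock` is a hypothetical in-memory stand-in for the 3rd party lock service, not the broker's plugin API, and the broker names are placeholders.

```python
import threading
from typing import Optional


class SharedLock:
    """Hypothetical in-memory stand-in for the third-party lock service."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # makes acquire/release atomic in-process
        self._holder: Optional[str] = None

    def try_acquire(self, broker_id: str) -> bool:
        """Return True if broker_id now holds the lock and may be active."""
        with self._guard:
            if self._holder in (None, broker_id):
                self._holder = broker_id
                return True
            return False

    def release(self, broker_id: str) -> None:
        """Give up the lock; only the current holder can release it."""
        with self._guard:
            if self._holder == broker_id:
                self._holder = None


lock = SharedLock()
primary_active = lock.try_acquire("primary")  # primary wins the lock
backup_active = lock.try_acquire("backup")    # refused while the primary holds it
lock.release("primary")                       # e.g. the primary shuts down cleanly
backup_after_release = lock.try_acquire("backup")
```

Because a broker activates only when `try_acquire` succeeds, the two brokers can never be active at the same time in this sketch. A real lock service also has to deal with a holder that crashes without releasing the lock, which this sketch ignores.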

2. Quorum Voting

Quorum voting is a process by which one node in a cluster can determine whether another node in the cluster is active without directly communicating with that node. Then the broker initiating the vote can take action based on the result (e.g. shutting itself down to avoid split-brain).

Quorum voting requires the participation of the other active brokers in the cluster. This means there must, in fact, be other active brokers in the cluster, so quorum voting won’t work with a single HA pair of brokers. It also won’t work with just two HA pairs of brokers, because that’s still not enough for a legitimate quorum. There must be at least three HA pairs to establish a proper quorum with quorum voting.

2.1. Voting Mechanics

When the replication connection between a primary and backup is lost, the backup and/or the primary may initiate a vote.

For a vote to pass, a majority of affirmative responses is required. For example, in a 3 node cluster a vote will pass with 2 affirmatives. For a 4 node cluster this would be 3 affirmatives, and so on.
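The majority rule described above is simple arithmetic, sketched here in Python. The function names are hypothetical and this is not the broker's internal code; it only reproduces the 3 node and 4 node examples from the text.

```python
def affirmatives_required(cluster_size: int) -> int:
    """Smallest number of affirmative responses that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def vote_passes(affirmatives: int, cluster_size: int) -> bool:
    """A vote passes once the affirmatives reach the majority threshold."""
    return affirmatives >= affirmatives_required(cluster_size)


three_node = affirmatives_required(3)  # 2, as in the 3 node example
four_node = affirmatives_required(4)   # 3, as in the 4 node example
```

Note that an even-sized cluster needs more than half of its nodes to agree: 2 affirmatives out of 4 is a tie, not a majority.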

2.1.1. Backup Voting

By default, if a backup loses its replication connection to its primary it will activate automatically. However, it can be configured via the vote-on-replication-failure property to initiate a quorum vote in order to decide whether to activate or not. If this is done then the backup will keep voting until it either receives a vote allowing it to start or it detects that the primary is still active. In the latter case it will then restart as a backup.
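The backup's behavior when voting is enabled can be sketched as a small decision loop in Python. Everything here is an assumption made for illustration: the two callables stand in for "run a quorum vote" and "check whether the primary is still active", and `max_rounds` exists only so that the sketch terminates (the text above describes the backup as voting until one of the two outcomes occurs).

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    ACTIVATE = "activate"
    RESTART_AS_BACKUP = "restart as backup"


def backup_decision(
    vote_allows_start: Callable[[], bool],
    primary_is_active: Callable[[], bool],
    max_rounds: int = 10,
) -> Optional[Outcome]:
    """Keep voting until the backup may activate or the primary is seen alive.

    Returns None if neither happens within max_rounds (a guard that is
    hypothetical and only present to keep this sketch finite).
    """
    for _ in range(max_rounds):
        if vote_allows_start():
            return Outcome.ACTIVATE
        if primary_is_active():
            return Outcome.RESTART_AS_BACKUP
    return None


# The third vote succeeds and the primary is never detected: activate.
votes = iter([False, False, True])
result = backup_decision(lambda: next(votes), lambda: False)
```

If instead every vote fails and the primary is detected on a later round, the same function returns `Outcome.RESTART_AS_BACKUP`, matching the "restart as a backup" case.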

See the section on Replication Configuration for more details on configuration.

2.1.2. Primary Voting

By default, if the primary server loses its replication connection to the backup then it will just carry on and wait for a backup to reconnect and start replicating again. However, this may mean that the primary remains active even though the backup has activated. For this reason the behavior is configurable via the vote-on-replication-failure property.

See the section on Replication Configuration for more details on configuration.

3. Pinging the network

You may configure one or more addresses in broker.xml that will be pinged throughout the life of the server. The server will stop itself if it can’t ping any of the addresses in the list.

If you execute the create command using the --ping argument you will create a default XML configuration that is ready to be used with network checks:

  $ ./artemis create /myDir/myServer --ping 10.0.0.1

This XML will be added to your broker.xml:

  <!--
  You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
  <network-check-NIC>theNicName</network-check-NIC>
  -->
  <!--
  Use this to use an HTTP server to validate the network
  <network-check-URL-list>http://www.apache.org</network-check-URL-list> -->
  <network-check-period>10000</network-check-period>
  <network-check-timeout>1000</network-check-timeout>
  <!-- this is a comma separated list, no spaces, just DNS or IPs
  it should accept IPV6
  Warning: Make sure you understand your network topology as this is meant to check if your network is up.
  Using IPs that could eventually disappear or be partially visible may defeat the purpose.
  You can use a list of multiple IPs, any successful ping will make the server OK to continue running -->
  <network-check-list>10.0.0.1</network-check-list>
  <!-- use this to customize the ping used for ipv4 addresses -->
  <network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command>
  <!-- use this to customize the ping used for ipv6 addresses -->
  <network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>

Once you lose connectivity to 10.0.0.1 in the given example, the broker will log something like this:

  09:49:24,562 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable
  09:49:36,577 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0
  09:49:36,625 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds
  09:50:00,653 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host
  09:50:10,656 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down
      at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73]
      at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73]
      at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73]
      at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73]
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73]
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73]
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73]
      at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73]

Once you reestablish your network connections to the addresses in the configured check list, the broker will start again and log something like this:

  09:53:23,461 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl::
  09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221000: primary Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging)
  09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
  09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
  09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
  09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
  09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
  09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
  09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
  09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ
  09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue
  09:53:23,549 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
  09:53:23,550 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
  09:53:23,554 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP]
  09:53:23,555 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT]
  09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP]
  09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now active
  09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221001: Apache ActiveMQ Artemis Message Broker version 1.6.0 [0.0.0.0, nodeID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0]

Make sure you understand your network topology, as this is meant to validate your network. Using IPs that could eventually disappear or be partially visible may defeat the purpose. You can use a list of multiple IPs. Any successful ping will make the server OK to continue running.
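The "any successful ping" rule can be sketched in Python as follows. This is an illustration of the rule only, not the broker's implementation: the function names are hypothetical, and the `probe` callable stands in for whatever actually pings an address. The sketch assumes the list contains at least one address.

```python
from typing import Callable, List


def parse_check_list(raw: str) -> List[str]:
    """Split a comma-separated list (no spaces) into individual addresses."""
    return [address for address in raw.split(",") if address]


def network_is_healthy(addresses: List[str], probe: Callable[[str], bool]) -> bool:
    """Healthy as soon as any single address answers the probe."""
    return any(probe(address) for address in addresses)


reachable = {"10.0.0.2"}  # pretend only this host currently answers
addresses = parse_check_list("10.0.0.1,10.0.0.2")
healthy = network_is_healthy(addresses, lambda address: address in reachable)
```

Here `healthy` is true even though 10.0.0.1 is down, because 10.0.0.2 still answers. This is why the choice of addresses matters: an address that remains reachable from an isolated broker keeps that broker running.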