High Availability & Failover - Network Isolation - 《Apache ActiveMQ Artemis v2.32.0 User Manual》

1. Quorum Voting
- 1.1. Backup Voting
- 1.2. Primary Voting
2. Pinging the network

It is possible that if a primary or backup configured for replication becomes isolated in a network that failover will occur and you will end up with 2 active brokers serving messages. This we call split brain. There are different configurations you can choose from that will help mitigate this problem.

1. Quorum Voting

Quorum voting is used by both the primary and the backup to decide what to do if a replication connection is disconnected. Basically the server will request each active server in the cluster to vote as to whether it thinks the server it is replicating to or from is still alive. You can also configure the time for which the quorum manager will wait for the quorum vote response. The default time is 30 seconds you can configure like so for primary and also for the backup:

<ha-policy>
  <replication>
    <primary>
       <quorum-vote-wait>12</quorum-vote-wait>
    </primary>
  </replication>
</ha-policy>

This being the case the minimum number of live/backup pairs needed is 3. If less than 3 pairs are used then the only option is to use a Network Pinger which is explained later in this chapter or choose how you want each server to react which the following details:

1.1. Backup Voting

By default if a backup loses its replication connection to its primary it makes a decision as to whether to start or not with a quorum vote. This of course requires that there be at least 3 pairs of primary/backup nodes in the cluster. For a 3 node cluster it will start if it gets 2 votes back saying that its primary server is no longer available, for 4 nodes this would be 3 votes and so on. When a backup loses connection to the primary it will keep voting for a quorum until it either receives a vote allowing it to start or it detects that the primary is still active. for the latter it will then restart as a backup. How many votes and how long between each vote the backup should wait is configured like so:

<ha-policy>
  <replication>
    <backup>
       <vote-retries>12</vote-retries>
       <vote-retry-wait>5000</vote-retry-wait>
    </backup>
  </replication>
</ha-policy>

It’s also possible to statically set the quorum size that should be used for the case where the cluster size is known up front, this is done on the Replica Policy like so:

<ha-policy>
  <replication>
    <backup>
       <quorum-size>2</quorum-size>
    </backup>
  </replication>
</ha-policy>

In this example the quorum size is set to 2 so if you were using a single pair and the backup lost connectivity it would never start.

1.2. Primary Voting

By default, if the primary server loses its replication connection then it will just carry on and wait for a backup to reconnect and start replicating again. In the event of a possible split brain scenario this may mean that the primary stays active even though the backup has been activated. It is possible to configure the primary server to vote for a quorum if this happens, in this way if the primary server does not receive a majority vote then it will shutdown. This is done by setting the vote-on-replication-failure to true.

<ha-policy>
  <replication>
    <primary>
       <vote-on-replication-failure>true</vote-on-replication-failure>
       <quorum-size>2</quorum-size>
    </primary>
  </replication>
</ha-policy>

As in the backup policy it is also possible to statically configure the quorum size.

2. Pinging the network

You may configure one more addresses on the broker.xml that are part of your network topology, that will be pinged through the life cycle of the server.

The server will stop itself until the network is back on such case.

If you execute the create command passing a -ping argument, you will create a default xml that is ready to be used with network checks:

./artemis create /myDir/myServer --ping 10.0.0.1

This XML part will be added to your broker.xml:

<!--
You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
 <network-check-NIC>theNicName</network-check-NIC>
-->
<!--
Use this to use an HTTP server to validate the network
 <network-check-URL-list>http://www.apache.org</network-check-URL-list> -->
<network-check-period>10000</network-check-period>
<network-check-timeout>1000</network-check-timeout>
<!-- this is a comma separated list, no spaces, just DNS or IPs
   it should accept IPV6
   Warning: Make sure you understand your network topology as this is meant to check if your network is up.
            Using IPs that could eventually disappear or be partially visible may defeat the purpose.
            You can use a list of multiple IPs, any successful ping will make the server OK to continue running -->
<network-check-list>10.0.0.1</network-check-list>
<!-- use this to customize the ping used for ipv4 addresses -->
<network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command>
<!-- use this to customize the ping used for ipv addresses -->
<network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>

Once you lose connectivity towards 10.0.0.1 on the given example, you will see see this output at the server:

09:49:24,562 WARN  [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable
09:49:36,577 INFO  [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0
09:49:36,625 INFO  [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds
09:50:00,653 WARN  [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host
09:50:10,656 WARN  [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down
    at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73]
    at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73]
    at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73]
    at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
    at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
    at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
    at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
    at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73]

Once you re establish your network connections towards the configured check list:

09:53:23,461 INFO  [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl::
09:53:23,462 INFO  [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging)
09:53:23,462 INFO  [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
09:53:23,462 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
09:53:23,463 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
09:53:23,463 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
09:53:23,463 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
09:53:23,464 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
09:53:23,464 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
09:53:23,541 INFO  [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ
09:53:23,541 INFO  [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue
09:53:23,549 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
09:53:23,550 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
09:53:23,554 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP]
09:53:23,555 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT]
09:53:23,556 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP]
09:53:23,556 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
09:53:23,556 INFO  [org.apache.activemq.artemis.core.server] AMQ221001: Apache ActiveMQ Artemis Message Broker version 1.6.0 [0.0.0.0, nodeID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0]

Make sure you understand your network topology as this is meant to validate your network. Using IPs that could eventually disappear or be partially visible may defeat the purpose. You can use a list of multiple IPs. Any successful ping will make the server OK to continue running