Monitoring replication slave
Note: this recipe is working with ArangoDB 2.5, you need a collectd curl_json plugin with correct boolean type mapping.
Problem
How to monitor the slave status using the collectd curl_JSON
plugin.
Solution
Since arangodb reports the replication status in JSON, integrating it with the collectd curl_JSON plugin should be an easy exercise. However, only very recent versions of collectd will handle boolean flags correctly.
Our test master/slave setup runs with the master listening on tcp://127.0.0.1:8529
and the slave (which we query) listening on tcp://127.0.0.1:8530
. They replicate a database by the name testDatabase
.
Since replication appliers are active per database and our example doesn’t use the default _system
, we need to specify its name in the URL like this: _db/testDatabase
.
We need to parse a document from a request like this:
curl --dump - http://localhost:8530/_db/testDatabase/_api/replication/applier-state
If the replication is not running the document will look like that:
{
"state": {
"running": false,
"lastAppliedContinuousTick": null,
"lastProcessedContinuousTick": null,
"lastAvailableContinuousTick": null,
"safeResumeTick": null,
"progress": {
"time": "2015-11-02T13:24:07Z",
"message": "applier shut down",
"failedConnects": 0
},
"totalRequests": 1,
"totalFailedConnects": 0,
"totalEvents": 0,
"totalOperationsExcluded": 0,
"lastError": {
"time": "2015-11-02T13:24:07Z",
"errorMessage": "no start tick",
"errorNum": 1413
},
"time": "2015-11-02T13:31:53Z"
},
"server": {
"version": "2.7.0",
"serverId": "175584498800385"
},
"endpoint": "tcp://127.0.0.1:8529",
"database": "testDatabase"
}
A running replication will return something like this:
{
"state": {
"running": true,
"lastAppliedContinuousTick": "1150610894145",
"lastProcessedContinuousTick": "1150610894145",
"lastAvailableContinuousTick": "1151639153985",
"safeResumeTick": "1150610894145",
"progress": {
"time": "2015-11-02T13:49:56Z",
"message": "fetching master log from tick 1150610894145",
"failedConnects": 0
},
"totalRequests": 12,
"totalFailedConnects": 0,
"totalEvents": 2,
"totalOperationsExcluded": 0,
"lastError": {
"errorNum": 0
},
"time": "2015-11-02T13:49:57Z"
},
"server": {
"version": "2.7.0",
"serverId": "175584498800385"
},
"endpoint": "tcp://127.0.0.1:8529",
"database": "testDatabase"
}
We create a simple collectd configuration in /etc/collectd/collectd.conf.d/slave_testDatabase.conf
that matches our API:
TypesDB "/etc/collectd/collectd.conf.d/slavestate_types.db"
<Plugin curl_json>
# Adjust the URL so collectd can reach your arangod slave instance:
<URL "http://localhost:8530/_db/testDatabase/_api/replication/applier-state">
# Set your authentication to that database here:
# User "foo"
# Password "bar"
<Key "state/running">
Type "boolean"
</Key>
<Key "state/totalOperationsExcluded">
Type "counter"
</Key>
<Key "state/totalRequests">
Type "counter"
</Key>
<Key "state/totalFailedConnects">
Type "counter"
</Key>
</URL>
</Plugin>
To get nice metric names, we specify our own types.db
file in /etc/collectd/collectd.conf.d/slavestate_types.db
:
boolean value:ABSOLUTE:0:1
So, basically state/running
will give you 0
/1
if its (not / ) running through the collectd monitor.