Diving Deep, Use External Tools
Machine access allows operators to dive even deeper than logs and nodetool
allow. While every Cassandra operator may have their personal favorite toolsets for troubleshooting issues, this page contains some of the most common operator techniques and examples of those tools. Many of these commands work only on Linux, but if you are deploying on a different operating system you may have access to other substantially similar tools that assess similar OS level metrics and processes.
JVM Tooling
The JVM ships with a number of useful tools. Some of them are useful for debugging Cassandra issues, especially related to heap and execution stacks.
NOTE: There are two common gotchas with JVM tooling and Cassandra:
By default Cassandra ships with
-XX:+PerfDisableSharedMem
set to prevent long pauses (seeCASSANDRA-9242
andCASSANDRA-9483
for details). If you want to use JVM tooling you can instead have/tmp
mounted on an in memorytmpfs
which also effectively works aroundCASSANDRA-9242
.Make sure you run the tools as the same user as Cassandra is running as, e.g. if the database is running as
cassandra
the tool also has to be run ascassandra
, e.g. viasudo -u cassandra <cmd>
.
Garbage Collection State (jstat)
If you suspect heap pressure you can use jstat
to dive deep into the garbage collection state of a Cassandra process. This command is always safe to run and yields detailed heap information including eden heap usage (E), old generation heap usage (O), count of eden collections (YGC), time spend in eden collections (YGCT), old/mixed generation collections (FGC) and time spent in old/mixed generation collections (FGCT):
jstat -gcutil <cassandra pid> 500ms
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 0.00 81.53 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 82.36 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 82.36 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 83.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 83.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 84.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 84.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 85.03 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 85.03 31.16 93.07 88.20 12 0.151 3 0.257 0.408
0.00 0.00 85.94 31.16 93.07 88.20 12 0.151 3 0.257 0.408
In this case we see we have a relatively healthy heap profile, with 31.16% old generation heap usage and 83% eden. If the old generation routinely is above 75% then you probably need more heap (assuming CMS with a 75% occupancy threshold). If you do have such persistently high old gen that often means you either have under-provisioned the old generation heap, or that there is too much live data on heap for Cassandra to collect (e.g. because of memtables). Another thing to watch for is time between young garbage collections (YGC), which indicate how frequently the eden heap is collected. Each young gc pause is about 20-50ms, so if you have a lot of them your clients will notice in their high percentile latencies.
Thread Information (jstack)
To get a point in time snapshot of exactly what Cassandra is doing, run jstack
against the Cassandra PID. Note that this does pause the JVM for a very brief period (<20ms).:
$ jstack <cassandra pid> > threaddump
# display the threaddump
$ cat threaddump
# look at runnable threads
$grep RUNNABLE threaddump -B 1
"Attach Listener" #15 daemon prio=9 os_prio=0 tid=0x00007f829c001000 nid=0x3a74 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
--
"DestroyJavaVM" #13 prio=5 os_prio=0 tid=0x00007f82e800e000 nid=0x2a19 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
--
"JPS thread pool" #10 prio=5 os_prio=0 tid=0x00007f82e84d0800 nid=0x2a2c runnable [0x00007f82d0856000]
java.lang.Thread.State: RUNNABLE
--
"Service Thread" #9 daemon prio=9 os_prio=0 tid=0x00007f82e80d7000 nid=0x2a2a runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
--
"C1 CompilerThread3" #8 daemon prio=9 os_prio=0 tid=0x00007f82e80cc000 nid=0x2a29 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
--
# Note that the nid is the Linux thread id
Some of the most important information in the threaddumps are waiting/blocking threads, including what locks or monitors the thread is blocking/waiting on.
Basic OS Tooling
A great place to start when debugging a Cassandra issue is understanding how Cassandra is interacting with system resources. The following are all resources that Cassandra makes heavy uses of:
CPU cores. For executing concurrent user queries
CPU processing time. For query activity (data decompression, row merging, etc.)
CPU processing time (low priority). For background tasks (compaction, streaming, etc …)
RAM for Java Heap. Used to hold internal data-structures and by default the Cassandra memtables. Heap space is a crucial component of write performance as well as generally.
RAM for OS disk cache. Used to cache frequently accessed SSTable blocks. OS disk cache is a crucial component of read performance.
Disks. Cassandra cares a lot about disk read latency, disk write throughput, and of course disk space.
Network latency. Cassandra makes many internode requests, so network latency between nodes can directly impact performance.
Network throughput. Cassandra (as other databases) frequently have the so called “incast” problem where a small request (e.g.
SELECT * from foo.bar
) returns a massively large result set (e.g. the entire dataset). In such situations outgoing bandwidth is crucial.
Often troubleshooting Cassandra comes down to troubleshooting what resource the machine or cluster is running out of. Then you create more of that resource or change the query pattern to make less use of that resource.
High Level Resource Usage (top/htop)
Cassandra makes signifiant use of system resources, and often the very first useful action is to run top
or htop
(website)to see the state of the machine.
Useful things to look at:
System load levels. While these numbers can be confusing, generally speaking if the load average is greater than the number of CPU cores, Cassandra probably won’t have very good (sub 100 millisecond) latencies. See Linux Load Averages for more information.
CPU utilization.
htop
in particular can help break down CPU utilization intouser
(low and normal priority),system
(kernel), andio-wait
. Cassandra query threads execute as normal priorityuser
threads, while compaction threads execute as low priorityuser
threads. Highsystem
time could indicate problems like thread contention, and highio-wait
may indicate slow disk drives. This can help you understand what Cassandra is spending processing resources doing.Memory usage. Look for which programs have the most resident memory, it is probably Cassandra. The number for Cassandra is likely inaccurately high due to how Linux (as of 2018) accounts for memory mapped file memory.
IO Usage (iostat)
Use iostat to determine how data drives are faring, including latency distributions, throughput, and utilization:
$ sudo iostat -xdm 2
Linux 4.13.0-13-generic (hostname) 07/03/2018 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.28 0.32 5.42 0.01 0.13 48.55 0.01 2.21 0.26 2.32 0.64 0.37
sdb 0.00 0.00 0.00 0.00 0.00 0.00 79.34 0.00 0.20 0.20 0.00 0.16 0.00
sdc 0.34 0.27 0.76 0.36 0.01 0.02 47.56 0.03 26.90 2.98 77.73 9.21 1.03
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 2.00 32.00 0.01 4.04 244.24 0.54 16.00 0.00 17.00 1.06 3.60
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 24.50 0.00 114.00 0.00 11.62 208.70 5.56 48.79 0.00 48.79 1.12 12.80
In this case we can see that /dev/sdc1
is a very slow drive, having an await
close to 50 milliseconds and an avgqu-sz
close to 5 ios. The drive is not particularly saturated (utilization is only 12.8%), but we should still be concerned about how this would affect our p99 latency since 50ms is quite long for typical Cassandra operations. That being said, in this case most of the latency is present in writes (typically writes are more latent than reads), which due to the LSM nature of Cassandra is often hidden from the user.
Important metrics to assess using iostat:
Reads and writes per second. These numbers will change with the workload, but generally speaking the more reads Cassandra has to do from disk the slower Cassandra read latencies are. Large numbers of reads per second can be a dead giveaway that the cluster has insufficient memory for OS page caching.
Write throughput. Cassandra’s LSM model defers user writes and batches them together, which means that throughput to the underlying medium is the most important write metric for Cassandra.
Read latency (
r_await
). When Cassandra missed the OS page cache and reads from SSTables, the read latency directly determines how fast Cassandra can respond with the data.Write latency. Cassandra is less sensitive to write latency except when it syncs the commit log. This typically enters into the very high percentiles of write latency.
Note that to get detailed latency breakdowns you will need a more advanced tool such as bcc-tools.
OS page Cache Usage
As Cassandra makes heavy use of memory mapped files, the health of the operating system’s Page Cache is crucial to performance. Start by finding how much available cache is in the system:
$ free -g
total used free shared buff/cache available
Mem: 15 9 2 0 3 5
Swap: 0 0 0
In this case 9GB of memory is used by user processes (Cassandra heap) and 8GB is available for OS page cache. Of that, 3GB is actually used to cache files. If most memory is used and unavailable to the page cache, Cassandra performance can suffer significantly. This is why Cassandra starts with a reasonably small amount of memory reserved for the heap.
If you suspect that you are missing the OS page cache frequently you can use advanced tools like cachestat or vmtouch to dive deeper.
Network Latency and Reliability
Whenever Cassandra does writes or reads that involve other replicas, LOCAL_QUORUM
reads for example, one of the dominant effects on latency is network latency. When trying to debug issues with multi machine operations, the network can be an important resource to investigate. You can determine internode latency using tools like ping
and traceroute
or most effectively mtr
:
$ mtr -nr www.google.com
Start: Sun Jul 22 13:10:28 2018
HOST: hostname Loss% Snt Last Avg Best Wrst StDev
1.|-- 192.168.1.1 0.0% 10 2.0 1.9 1.1 3.7 0.7
2.|-- 96.123.29.15 0.0% 10 11.4 11.0 9.0 16.4 1.9
3.|-- 68.86.249.21 0.0% 10 10.6 10.7 9.0 13.7 1.1
4.|-- 162.141.78.129 0.0% 10 11.5 10.6 9.6 12.4 0.7
5.|-- 162.151.78.253 0.0% 10 10.9 12.1 10.4 20.2 2.8
6.|-- 68.86.143.93 0.0% 10 12.4 12.6 9.9 23.1 3.8
7.|-- 96.112.146.18 0.0% 10 11.9 12.4 10.6 15.5 1.6
9.|-- 209.85.252.250 0.0% 10 13.7 13.2 12.5 13.9 0.0
10.|-- 108.170.242.238 0.0% 10 12.7 12.4 11.1 13.0 0.5
11.|-- 74.125.253.149 0.0% 10 13.4 13.7 11.8 19.2 2.1
12.|-- 216.239.62.40 0.0% 10 13.4 14.7 11.5 26.9 4.6
13.|-- 108.170.242.81 0.0% 10 14.4 13.2 10.9 16.0 1.7
14.|-- 72.14.239.43 0.0% 10 12.2 16.1 11.0 32.8 7.1
15.|-- 216.58.195.68 0.0% 10 25.1 15.3 11.1 25.1 4.8
In this example of mtr
, we can rapidly assess the path that your packets are taking, as well as what their typical loss and latency are. Packet loss typically leads to between 200ms
and 3s
of additional latency, so that can be a common cause of latency issues.
Network Throughput
As Cassandra is sensitive to outgoing bandwidth limitations, sometimes it is useful to determine if network throughput is limited. One handy tool to do this is iftop which shows both bandwidth usage as well as connection information at a glance. An example showing traffic during a stress run against a local ccm
cluster:
$ # remove the -t for ncurses instead of pure text
$ sudo iftop -nNtP -i lo
interface: lo
IP address is: 127.0.0.1
MAC address is: 00:00:00:00:00:00
Listening on lo
# Host name (port/service if enabled) last 2s last 10s last 40s cumulative
--------------------------------------------------------------------------------------------
1 127.0.0.1:58946 => 869Kb 869Kb 869Kb 217KB
127.0.0.3:9042 <= 0b 0b 0b 0B
2 127.0.0.1:54654 => 736Kb 736Kb 736Kb 184KB
127.0.0.1:9042 <= 0b 0b 0b 0B
3 127.0.0.1:51186 => 669Kb 669Kb 669Kb 167KB
127.0.0.2:9042 <= 0b 0b 0b 0B
4 127.0.0.3:9042 => 3.30Kb 3.30Kb 3.30Kb 845B
127.0.0.1:58946 <= 0b 0b 0b 0B
5 127.0.0.1:9042 => 2.79Kb 2.79Kb 2.79Kb 715B
127.0.0.1:54654 <= 0b 0b 0b 0B
6 127.0.0.2:9042 => 2.54Kb 2.54Kb 2.54Kb 650B
127.0.0.1:51186 <= 0b 0b 0b 0B
7 127.0.0.1:36894 => 1.65Kb 1.65Kb 1.65Kb 423B
127.0.0.5:7000 <= 0b 0b 0b 0B
8 127.0.0.1:38034 => 1.50Kb 1.50Kb 1.50Kb 385B
127.0.0.2:7000 <= 0b 0b 0b 0B
9 127.0.0.1:56324 => 1.50Kb 1.50Kb 1.50Kb 383B
127.0.0.1:7000 <= 0b 0b 0b 0B
10 127.0.0.1:53044 => 1.43Kb 1.43Kb 1.43Kb 366B
127.0.0.4:7000 <= 0b 0b 0b 0B
--------------------------------------------------------------------------------------------
Total send rate: 2.25Mb 2.25Mb 2.25Mb
Total receive rate: 0b 0b 0b
Total send and receive rate: 2.25Mb 2.25Mb 2.25Mb
--------------------------------------------------------------------------------------------
Peak rate (sent/received/total): 2.25Mb 0b 2.25Mb
Cumulative (sent/received/total): 576KB 0B 576KB
============================================================================================
In this case we can see that bandwidth is fairly shared between many peers, but if the total was getting close to the rated capacity of the NIC or was focussed on a single client, that may indicate a clue as to what issue is occurring.
Advanced tools
Sometimes as an operator you may need to really dive deep. This is where advanced OS tooling can come in handy.
bcc-tools
Most modern Linux distributions (kernels newer than 4.1
) support bcc-tools for diving deep into performance problems. First install bcc-tools
, e.g. via apt
on Debian:
$ apt install bcc-tools
Then you can use all the tools that bcc-tools
contains. One of the most useful tools is cachestat
(cachestat examples) which allows you to determine exactly how many OS page cache hits and misses are happening:
$ sudo /usr/share/bcc/tools/cachestat -T 1
TIME TOTAL MISSES HITS DIRTIES BUFFERS_MB CACHED_MB
18:44:08 66 66 0 64 88 4427
18:44:09 40 40 0 75 88 4427
18:44:10 4353 45 4308 203 88 4427
18:44:11 84 77 7 13 88 4428
18:44:12 2511 14 2497 14 88 4428
18:44:13 101 98 3 18 88 4428
18:44:14 16741 0 16741 58 88 4428
18:44:15 1935 36 1899 18 88 4428
18:44:16 89 34 55 18 88 4428
In this case there are not too many page cache MISSES
which indicates a reasonably sized cache. These metrics are the most direct measurement of your Cassandra node’s “hot” dataset. If you don’t have enough cache, MISSES
will be high and performance will be slow. If you have enough cache, MISSES
will be low and performance will be fast (as almost all reads are being served out of memory).
You can also measure disk latency distributions using biolatency
(biolatency examples) to get an idea of how slow Cassandra will be when reads miss the OS page Cache and have to hit disks:
$ sudo /usr/share/bcc/tools/biolatency -D 10
Tracing block device I/O... Hit Ctrl-C to end.
disk = 'sda'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 12 |****************************************|
32 -> 63 : 9 |****************************** |
64 -> 127 : 1 |*** |
128 -> 255 : 3 |********** |
256 -> 511 : 7 |*********************** |
512 -> 1023 : 2 |****** |
disk = 'sdc'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 41 |************ |
128 -> 255 : 17 |***** |
256 -> 511 : 13 |*** |
512 -> 1023 : 2 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 56 |***************** |
8192 -> 16383 : 131 |****************************************|
16384 -> 32767 : 9 |** |
In this case most ios on the data drive (sdc
) are fast, but many take between 8 and 16 milliseconds.
Finally biosnoop
(examples) can be used to dive even deeper and see per IO latencies:
$ sudo /usr/share/bcc/tools/biosnoop | grep java | head
0.000000000 java 17427 sdc R 3972458600 4096 13.58
0.000818000 java 17427 sdc R 3972459408 4096 0.35
0.007098000 java 17416 sdc R 3972401824 4096 5.81
0.007896000 java 17416 sdc R 3972489960 4096 0.34
0.008920000 java 17416 sdc R 3972489896 4096 0.34
0.009487000 java 17427 sdc R 3972401880 4096 0.32
0.010238000 java 17416 sdc R 3972488368 4096 0.37
0.010596000 java 17427 sdc R 3972488376 4096 0.34
0.011236000 java 17410 sdc R 3972488424 4096 0.32
0.011825000 java 17427 sdc R 3972488576 16384 0.65
... time passes
8.032687000 java 18279 sdc R 10899712 122880 3.01
8.033175000 java 18279 sdc R 10899952 8192 0.46
8.073295000 java 18279 sdc R 23384320 122880 3.01
8.073768000 java 18279 sdc R 23384560 8192 0.46
With biosnoop
you see every single IO and how long they take. This data can be used to construct the latency distributions in biolatency
but can also be used to better understand how disk latency affects performance. For example this particular drive takes ~3ms to service a memory mapped read due to the large default value (128kb
) of read_ahead_kb
. To improve point read performance you may may want to decrease read_ahead_kb
on fast data volumes such as SSDs while keeping the a higher value like 128kb
value is probably right for HDs. There are tradeoffs involved, see queue-sysfs docs for more information, but regardless biosnoop
is useful for understanding how Cassandra uses drives.
vmtouch
Sometimes it’s useful to know how much of the Cassandra data files are being cached by the OS. A great tool for answering this question is vmtouch.
First install it:
$ git clone https://github.com/hoytech/vmtouch.git
$ cd vmtouch
$ make
Then run it on the Cassandra data directory:
$ ./vmtouch /var/lib/cassandra/data/
Files: 312
Directories: 92
Resident Pages: 62503/64308 244M/251M 97.2%
Elapsed: 0.005657 seconds
In this case almost the entire dataset is hot in OS page Cache. Generally speaking the percentage doesn’t really matter unless reads are missing the cache (per e.g. cachestat in which case having additional memory may help read performance.
CPU Flamegraphs
Cassandra often uses a lot of CPU, but telling what it is doing can prove difficult. One of the best ways to analyze Cassandra on CPU time is to use CPU Flamegraphs which display in a useful way which areas of Cassandra code are using CPU. This may help narrow down a compaction problem to a “compaction problem dropping tombstones” or just generally help you narrow down what Cassandra is doing while it is having an issue. To get CPU flamegraphs follow the instructions for Java Flamegraphs.
Generally:
Enable the
-XX:+PreserveFramePointer
option in Cassandra’sjvm.options
configuation file. This has a negligible performance impact but allows you actually see what Cassandra is doing.Run
perf
to get some data.Send that data through the relevant scripts in the FlameGraph toolset and convert the data into a pretty flamegraph. View the resulting SVG image in a browser or other image browser.
For example just cloning straight off github we first install the perf-map-agent
to the location of our JVMs (assumed to be /usr/lib/jvm
):
$ sudo bash
$ export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
$ cd /usr/lib/jvm
$ git clone --depth=1 https://github.com/jvm-profiling-tools/perf-map-agent
$ cd perf-map-agent
$ cmake .
$ make
Now to get a flamegraph:
$ git clone --depth=1 https://github.com/brendangregg/FlameGraph
$ sudo bash
$ cd FlameGraph
$ # Record traces of Cassandra and map symbols for all java processes
$ perf record -F 49 -a -g -p <CASSANDRA PID> -- sleep 30; ./jmaps
$ # Translate the data
$ perf script > cassandra_stacks
$ cat cassandra_stacks | ./stackcollapse-perf.pl | grep -v cpu_idle | \
./flamegraph.pl --color=java --hash > cassandra_flames.svg
The resulting SVG is searchable, zoomable, and generally easy to introspect using a browser.
Packet Capture
Sometimes you have to understand what queries a Cassandra node is performing right now to troubleshoot an issue. For these times trusty packet capture tools like tcpdump
and Wireshark can be very helpful to dissect packet captures. Wireshark even has native CQL support although it sometimes has compatibility issues with newer Cassandra protocol releases.
To get a packet capture first capture some packets:
$ sudo tcpdump -U -s0 -i <INTERFACE> -w cassandra.pcap -n "tcp port 9042"
Now open it up with wireshark:
$ wireshark cassandra.pcap
If you don’t see CQL like statements try telling to decode as CQL by right clicking on a packet going to 9042 → Decode as
→ select CQL from the dropdown for port 9042.
If you don’t want to do this manually or use a GUI, you can also use something like cqltrace to ease obtaining and parsing CQL packet captures.