Benchmarking Dragonfly
Dragonfly is a high-performance, distributed key-value store designed for scalability and low latency. It is a drop-in replacement for Redis 6.x and Memcached servers. This document outlines a benchmarking methodology and the results we achieved using Dragonfly with the memtier_benchmark load testing tool.
Although Redis offers the redis-benchmark tool in its repository, it is not as efficient as memtier_benchmark and often becomes the bottleneck instead of Dragonfly. We also developed our own tool, dfly_bench, which can be built from source in the Dragonfly repository. A prebuilt memtier_benchmark container is also available on Docker Hub.
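For example, assuming the Docker Hub image is redislabs/memtier_benchmark and that its entrypoint is the memtier_benchmark binary, the client can be run without a local build:
docker run --rm redislabs/memtier_benchmark:latest -s $SERVER_PRIVATE_IP --hide-histogram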
Methodology
- Remote Deployment: Dragonfly is a multi-threaded server designed to run remotely. Therefore, we recommend running the load testing client and server on separate machines for a more accurate representation of real-world performance.
- Minimizing Latency: Locate the client and server within the same Availability Zone and use private IPs for optimal network performance. If you benchmark in the AWS cloud, consider an AWS cluster placement group for the lowest possible latency (see the example after this list). The rationale is to remove any environmental factors that might skew the test results.
- Server vs. Client Resources: Use a more powerful instance for the load testing client to avoid client-side bottlenecks.
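For example, a cluster placement group can be created with the AWS CLI and referenced when launching both instances (a minimal sketch; the group name bench-pg and the AMI placeholder are arbitrary):
aws ec2 create-placement-group --group-name bench-pg --strategy cluster
aws ec2 run-instances --image-id <your-ami> --instance-type c7gn.16xlarge --placement GroupName=bench-pg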
The remainder of this document will discuss how to set up a benchmark in the AWS cloud to observe millions of QPS from a single instance.
Load testing configuration
We used Dragonfly v1.15.0 (the latest at the time of writing) with the following arguments: ./dragonfly --logtostderr --dbfilename= (the empty --dbfilename disables snapshotting). Note that Dragonfly uses all available vCPUs on the server by default. If you want to control the number of threads explicitly, you can add --proactor_threads=<N>. Both the client and server instances run the Ubuntu 23.04 OS with kernel version 6.2. In line with our recommendations above, we used internal IPs for connecting, and we used the stronger c7gn.16xlarge instance with 64 vCPUs for the load-testing program (i.e., the client).
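For example, to pin the server to 48 threads, matching the 48 vCPUs of the c6gn.12xlarge instance used below:
./dragonfly --logtostderr --dbfilename= --proactor_threads=48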
Dragonfly on c6gn.12xlarge
Write-only test
On the load test instance (c7gn.16xlarge with 64 vCPUs): memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
Here --ratio 1:0 means SET commands only; -t 60 and -c 20 open 60 threads with 20 connections each (1,200 connections in total), and -n 200000 sends 200,000 requests per connection.
The run ended with the following summary:
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 4195628.23 --- --- 0.39283 0.37500 0.68700 2.54300 323231.06
In this test, we reached almost 4.2M queries per second (QPS) with an average latency of 0.4ms between memtier_benchmark and Dragonfly. The P50 latency was 0.4ms, P99 was 0.7ms, and P99.9 was 2.54ms. It is a very short and simple test, but it still gives some perspective on Dragonfly's performance.
Read-only test
Without flushing the database: memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
Note that the ratio changed to “0:1”, meaning only GET commands and no SET commands.
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 4109802.84 4109802.84 0.00 0.40126 0.38300 0.67900 0.90300 296551.68
We can observe that Ops/sec and Hits/sec are identical, meaning every GET request from the load test hit an existing key. Dragonfly returned a value for each request, averaging 4.1M QPS with a P99.9 latency of 903us (less than one millisecond).
Read test with pipelining
Here is another way to load test Dragonfly: sending GETs with a pipeline (--pipeline) of batch size 10. Pipelining means that the client sends multiple commands (10 in this case) and only then waits for the responses. memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 200000 --distinct-client-seed --hide-histogram --pipeline=10
ALL STATS
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 7083583.57 7083583.57 0.00 0.45821 0.44700 0.69500 1.53500 511131.14
In pipelining mode, memtier_benchmark sends K requests (in this case 10) in a batch without waiting for them to complete. Pipelining reduces the CPU time spent in the networking stack, and as a result, Dragonfly can reach 7M QPS with sub-millisecond latency. Note that for real-world use cases, pipelining requires the cooperation of the client-side app, which must send multiple requests on a single connection before waiting for the server to respond.
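You can observe this behavior at the protocol level by writing several inline commands to a single connection before reading any replies (an illustrative sketch using netcat; it assumes the server listens on the default port 6379):
(printf 'PING\r\nPING\r\nPING\r\n'; sleep 1) | nc $SERVER_PRIVATE_IP 6379
All three +PONG replies come back together once the batch is processed.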
Some asynchronous client libraries, such as StackExchange.Redis or ioredis, allow multiplexing requests on a single connection. They can still provide a simplified synchronous interface to their users while benefiting from the performance improvements of pipelining.
Load testing Dragonfly on c7gn.12xlarge
Next, we tried running Dragonfly on the next-generation instance (c7gn) with the same number of vCPUs (48). We used the same c7gn.16xlarge for running memtier_benchmark, and we used the same commands to test writes, reads, and pipelined reads:
| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---|---|---|---|
| Write-Only | 5.2M | 250 | 631 |
| Read-Only | 6M | 271 | 623 |
| Pipelined Read | 8.9M | 323 | 839 |
Writes
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 5195097.56 --- --- 0.26012 0.24700 0.49500 0.63100 400230.15
Reads
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 6078632.89 6078632.89 0.00 0.27177 0.26300 0.49500 0.62300 438616.86
Pipelined Reads
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 200000 --distinct-client-seed --hide-histogram --pipeline=10
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Gets 8975121.86 8975121.86 0.00 0.32325 0.31100 0.52700 0.83900 647619.14
Comparison with Garnet
Microsoft Research recently released Garnet, a remote cache store. Due to interest within the Dragonfly community, we decided to compare Garnet’s performance with Dragonfly’s. This comparison focuses on performance results and does not delve into architectural differences or Redis compatibility implications.
Note: Unfortunately, Garnet does not provide an aarch64 build, so we ran both Garnet and Dragonfly on an x86_64 server, c6in.12xlarge. We ran Garnet via Docker with host networking enabled, using the command docker run --network=host ghcr.io/romange/garnet:latest --port=6379. The Docker container was built using the Garnet Docker build file for Ubuntu, located in their repository.
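For parity, Dragonfly can be launched the same way on that instance (a sketch based on the official image documented by Dragonfly, with the flags from our earlier runs):
docker run --network=host --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly --logtostderr --dbfilename=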
Garnet on c6in.12xlarge
As in the previous tests, we ran memtier_benchmark on c7gn.16xlarge with a cluster placement policy for both instances. For writes, we used the following command:
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 1:0 -t 60 -c 20 -n 200000
Similarly, for reads we used
memtier_benchmark -s $SERVER_PRIVATE_IP --distinct-client-seed --hide-histogram --ratio 0:1 -t 60 -c 20 -n 200000
and for pipelined reads we used
memtier_benchmark -s $SERVER_PRIVATE_IP --ratio 0:1 -t 60 -c 5 -n 2000000 --distinct-client-seed --hide-histogram --pipeline=10
Note that we increased the number of requests to 2000000 per client connection in the latter case.
Results:
| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---|---|---|---|
| Write-Only | 3.5M | 346 | 4287 |
| Read-Only | 3.7M | 327 | 2623 |
| Pipelined Read | 25.4M !!! | 119 | 375 |
The interesting part is the pipelined reads, where Garnet scaled linearly to more than 25M QPS, which is really impressive performance.
On the other hand, a curious and random finding: a single dbsize command took 3 seconds to run on Garnet.
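This is easy to reproduce by timing the command from any Redis-compatible client, for instance (a minimal sketch using redis-cli):
time redis-cli -h $SERVER_PRIVATE_IP dbsize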
Dragonfly on c6in.12xlarge
We ran Dragonfly on the same instances with the same test configurations. Below are the results for Dragonfly.
| Test | Ops/sec | Avg. Latency (us) | P99.9 Latency (us) |
|---|---|---|---|
| Write-Only | 3.6M | 291 | 6815 |
| Read-Only | 5.1M | 299 | 7615 |
| Pipelined Read | 6.9M | 358 | 1127 |
As you can see, Dragonfly shows comparable throughput for non-pipelined access, but its P99.9 latency was worse. For pipelined commands, Dragonfly had 3.7x less throughput than Garnet.