3. Introduction to HAProxy - 3.5. Sizing - 《HAProxy 2.0 Starter Guide》

3.5. Sizing

Typical CPU usage figures show 15% of the processing time spent in HAProxy
versus 85% in the kernel in TCP or HTTP close mode, and about 30% for HAProxy
versus 70% for the kernel in HTTP keep-alive mode. This means that the operating
system and its tuning have a strong impact on the global performance.
 
Usage varies a lot between users: some focus on bandwidth, others on request
rate, others on connection concurrency, and others on SSL performance. This
section aims at providing a few elements to help size a deployment for each of
these goals.
 
It is important to keep in mind that every operation comes with a cost, so each
individual operation adds its overhead on top of the other ones, which may be
negligible in certain circumstances, and which may dominate in other cases.
 
When processing the requests from a connection, we can say that :
 
  - forwarding data costs less than parsing request or response headers;
 
  - parsing request or response headers costs less than establishing then
    closing a connection to a server;
 
  - establishing and closing a connection costs less than a TLS resume operation;
 
  - a TLS resume operation costs less than a full TLS handshake with a key
    computation;
 
  - an idle connection costs less CPU than a connection whose buffers hold data;
 
  - a TLS context costs even more memory than a connection with data.
 
So in practice, it is cheaper to process payload bytes than header bytes, thus
it is easier to achieve high network bandwidth with large objects (few requests
per volume unit) than with small objects (many requests per volume unit). This
explains why maximum bandwidth is always measured with large objects, while
request rate or connection rates are measured with small objects.
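
The relation between object size and request rate can be made concrete with a
little arithmetic. The sketch below is only an illustration: the function name
and the object sizes are assumptions chosen for the example, and header
overhead is ignored, so the real request rate needed to fill a link would be
somewhat lower.

```python
# Back-of-the-envelope estimate: how many objects per second are needed to
# fill a link of a given speed. Header and protocol overhead are ignored
# (assumption), so this only gives an order of magnitude.

def requests_per_second(link_gbps: float, object_bytes: int) -> float:
    """Return the object rate required to fill a link of link_gbps."""
    bits_per_object = object_bytes * 8
    return link_gbps * 1e9 / bits_per_object


if __name__ == "__main__":
    # Hypothetical object sizes, from small API responses to large files.
    for size in (1_000, 10_000, 100_000, 1_000_000):
        rate = requests_per_second(10, size)
        print(f"{size:>9} B objects: {rate:>12,.0f} req/s to fill 10 Gbps")
```

Filling 10 Gbps with 1 kB objects would require over a million requests per
second, while 1 MB objects only require about a thousand, which is why
bandwidth tests use large objects.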
 
Some operations scale well on multiple processes spread over multiple CPUs,
and others don't scale as well. Network bandwidth doesn't scale very far because
the CPU is rarely the bottleneck for large objects; the limit is mostly the
network bandwidth and the data buses used to reach the network interfaces. The
connection rate doesn't scale well over multiple processors due to a few locks
in the system when dealing with the local ports table. The request rate over
persistent connections scales very well as it doesn't involve much memory or
network bandwidth and doesn't require access to locked structures. TLS key
computation
scales very well as it's totally CPU-bound. TLS resume scales moderately well,
but reaches its limits around 4 processes where the overhead of accessing the
shared table offsets the small gains expected from more power.
 
The performance numbers one can expect from a very well tuned system are in the
following range. It is important to take them as orders of magnitude and to
expect significant variations in any direction based on the processor, IRQ
setting, memory type, network interface type, operating system tuning and so on.
 
The following numbers were found on a Core i7 running at 3.7 GHz equipped with
a dual-port 10 Gbps NIC, running Linux kernel 3.10, HAProxy 1.6 and OpenSSL
1.0.2. HAProxy was running as a single process on a single dedicated CPU core,
and two extra cores were dedicated to network interrupts :
 
  - 20 Gbps of maximum network bandwidth in clear text for objects of 256 kB
    or larger, 10 Gbps for objects of 41 kB or larger;
 
  - 4.6 Gbps of TLS traffic using AES256-GCM cipher with large objects;
 
  - 83000 TCP connections per second from client to server;
 
  - 82000 HTTP connections per second from client to server;
 
  - 97000 HTTP requests per second in server-close mode (keep-alive with the
    client, close with the server);
 
  - 243000 HTTP requests per second in end-to-end keep-alive mode;
 
  - 300000 filtered TCP connections per second (anti-DDoS);
 
  - 160000 HTTPS requests per second in keep-alive mode over persistent TLS
    connections;
 
  - 13100 HTTPS requests per second using TLS resumed connections;
 
  - 1300 HTTPS connections per second using TLS connections renegotiated with
    RSA2048;
 
  - 20000 concurrent saturated connections per GB of RAM, including the memory
    required for system buffers; it is possible to do better with careful tuning
    but this result is easy to achieve;
 
  - about 8000 concurrent TLS connections (client-side only) per GB of RAM,
    including the memory required for system buffers;
 
  - about 5000 concurrent end-to-end TLS connections (both sides) per GB of
    RAM, including the memory required for system buffers.
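
The per-GB figures above can be turned into a rough memory estimate. The
sketch below is only an illustration under the assumption that memory usage
grows linearly with the number of concurrent connections; the dictionary keys
and the function name are hypothetical and were chosen for this example.

```python
# Rough RAM estimate from the "connections per GB" figures listed above.
# These are orders of magnitude measured on one specific setup, and linear
# scaling with the connection count is an assumption.

CONNS_PER_GB = {
    "clear": 20000,       # saturated clear-text connections
    "tls_client": 8000,   # TLS on the client side only
    "tls_both": 5000,     # TLS on both client and server sides
}


def ram_gb_needed(concurrent_conns: int, kind: str = "clear") -> float:
    """Return the approximate RAM in GB for concurrent_conns connections."""
    return concurrent_conns / CONNS_PER_GB[kind]


if __name__ == "__main__":
    # Hypothetical target: 100000 concurrent connections.
    for kind in CONNS_PER_GB:
        print(f"{kind:>10}: {ram_gb_needed(100_000, kind):.1f} GB")
```

With these figures, 100000 concurrent connections would need about 5 GB in
clear text, 12.5 GB with client-side TLS and 20 GB with TLS on both sides.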
 
Thus a good rule of thumb to keep in mind is that the request rate is divided
by 10 between TLS keep-alive and TLS resume, and between TLS resume and TLS
renegotiation, while it's only divided by 3 between HTTP keep-alive and HTTP
close. Another good rule of thumb is to remember that a high frequency core
with AES instructions can do around 5 Gbps of AES-GCM per core.
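
These rules of thumb can be checked against the measurements listed above.
The sketch below is only an illustration: the dictionary keys and function
names are hypothetical, and the core estimate assumes that AES-GCM throughput
scales linearly with the number of cores, which will not hold exactly in
practice.

```python
import math

# Request and connection rates copied from the measurements listed above.
RATES = {
    "http_keepalive": 243000,  # end-to-end keep-alive
    "http_close": 82000,       # HTTP connections per second
    "tls_keepalive": 160000,   # persistent TLS connections
    "tls_resume": 13100,       # TLS resumed connections
    "tls_rsa2048": 1300,       # TLS renegotiated with RSA2048
}


def ratio(faster: str, slower: str) -> float:
    """Return how many times the 'faster' mode outperforms the 'slower' one."""
    return RATES[faster] / RATES[slower]


def aes_gcm_cores(target_gbps: float, gbps_per_core: float = 5.0) -> int:
    """Estimate the cores needed for target_gbps of AES-GCM traffic.

    Assumes linear scaling at about 5 Gbps per high frequency core.
    """
    return math.ceil(target_gbps / gbps_per_core)


if __name__ == "__main__":
    print(f"TLS keep-alive / TLS resume: {ratio('tls_keepalive', 'tls_resume'):.1f}")
    print(f"TLS resume / TLS RSA2048:    {ratio('tls_resume', 'tls_rsa2048'):.1f}")
    print(f"HTTP keep-alive / close:     {ratio('http_keepalive', 'http_close'):.1f}")
    print(f"Cores for 20 Gbps AES-GCM:   {aes_gcm_cores(20)}")
```

The measured ratios come out at roughly 12, 10 and 3, which matches the
"divided by 10" and "divided by 3" orders of magnitude.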
 
Having more cores rarely helps (except for TLS) and can even be
counter-productive, since processors with more cores generally run at a lower
frequency. In general a small number of high frequency cores is better.
 
Another good rule of thumb is to consider that on the same server, HAProxy will
be able to saturate :
 
  - about 5-10 static file servers or caching proxies;
 
  - about 100 anti-virus proxies;
 
  - and about 100-1000 application servers depending on the technology in use.