3.5. Sizing
- Typical CPU usage figures show 15% of the processing time spent in HAProxy
- versus 85% in the kernel in TCP or HTTP close mode, and about 30% for HAProxy
- versus 70% for the kernel in HTTP keep-alive mode. This means that the operating
- system and its tuning have a strong impact on the global performance.
-
- Usages vary a lot between users, some focus on bandwidth, other ones on request
- rate, others on connection concurrency, others on SSL performance. This section
- aims at providing a few elements to help with this task.
-
- It is important to keep in mind that every operation comes with a cost, so each
- individual operation adds its overhead on top of the other ones, which may be
- negligible in certain circumstances, and which may dominate in other cases.
-
- When processing the requests from a connection, we can say that :
-
- - forwarding data costs less than parsing request or response headers;
-
- - parsing request or response headers cost less than establishing then closing
- a connection to a server;
-
- - establishing an closing a connection costs less than a TLS resume operation;
-
- - a TLS resume operation costs less than a full TLS handshake with a key
- computation;
-
- - an idle connection costs less CPU than a connection whose buffers hold data;
-
- - a TLS context costs even more memory than a connection with data;
-
- So in practice, it is cheaper to process payload bytes than header bytes, thus
- it is easier to achieve high network bandwidth with large objects (few requests
- per volume unit) than with small objects (many requests per volume unit). This
- explains why maximum bandwidth is always measured with large objects, while
- request rate or connection rates are measured with small objects.
-
- Some operations scale well on multiple processes spread over multiple CPUs,
- and others don't scale as well. Network bandwidth doesn't scale very far because
- the CPU is rarely the bottleneck for large objects, it's mostly the network
- bandwidth and data buses to reach the network interfaces. The connection rate
- doesn't scale well over multiple processors due to a few locks in the system
- when dealing with the local ports table. The request rate over persistent
- connections scales very well as it doesn't involve much memory nor network
- bandwidth and doesn't require to access locked structures. TLS key computation
- scales very well as it's totally CPU-bound. TLS resume scales moderately well,
- but reaches its limits around 4 processes where the overhead of accessing the
- shared table offsets the small gains expected from more power.
-
- The performance numbers one can expect from a very well tuned system are in the
- following range. It is important to take them as orders of magnitude and to
- expect significant variations in any direction based on the processor, IRQ
- setting, memory type, network interface type, operating system tuning and so on.
-
- The following numbers were found on a Core i7 running at 3.7 GHz equipped with
- a dual-port 10 Gbps NICs running Linux kernel 3.10, HAProxy 1.6 and OpenSSL
- 1.0.2. HAProxy was running as a single process on a single dedicated CPU core,
- and two extra cores were dedicated to network interrupts :
-
- - 20 Gbps of maximum network bandwidth in clear text for objects 256 kB or
- higher, 10 Gbps for 41kB or higher;
-
- - 4.6 Gbps of TLS traffic using AES256-GCM cipher with large objects;
-
- - 83000 TCP connections per second from client to server;
-
- - 82000 HTTP connections per second from client to server;
-
- - 97000 HTTP requests per second in server-close mode (keep-alive with the
- client, close with the server);
-
- - 243000 HTTP requests per second in end-to-end keep-alive mode;
-
- - 300000 filtered TCP connections per second (anti-DDoS)
-
- - 160000 HTTPS requests per second in keep-alive mode over persistent TLS
- connections;
-
- - 13100 HTTPS requests per second using TLS resumed connections;
-
- - 1300 HTTPS connections per second using TLS connections renegotiated with
- RSA2048;
-
- - 20000 concurrent saturated connections per GB of RAM, including the memory
- required for system buffers; it is possible to do better with careful tuning
- but this result it easy to achieve.
-
- - about 8000 concurrent TLS connections (client-side only) per GB of RAM,
- including the memory required for system buffers;
-
- - about 5000 concurrent end-to-end TLS connections (both sides) per GB of
- RAM including the memory required for system buffers;
-
- Thus a good rule of thumb to keep in mind is that the request rate is divided
- by 10 between TLS keep-alive and TLS resume, and between TLS resume and TLS
- renegotiation, while it's only divided by 3 between HTTP keep-alive and HTTP
- close. Another good rule of thumb is to remember that a high frequency core
- with AES instructions can do around 5 Gbps of AES-GCM per core.
-
- Having more cores rarely helps (except for TLS) and is even counter-productive
- due to the lower frequency. In general a small number of high frequency cores
- is better.
-
- Another good rule of thumb is to consider that on the same server, HAProxy will
- be able to saturate :
-
- - about 5-10 static file servers or caching proxies;
-
- - about 100 anti-virus proxies;
-
- - and about 100-1000 application servers depending on the technology in use.