7. CPU usage
HAProxy normally spends most of its time in the system and a smaller part in
userland. A finely tuned 3.5 GHz CPU can sustain a rate of about 80000
end-to-end connection setups and closes per second at 100% CPU on a single
core. When one core is saturated, typical figures are :
  - 95% system, 5% user for long TCP connections or large HTTP objects
  - 85% system and 15% user for short TCP connections or small HTTP objects in
    close mode
  - 70% system and 30% user for small HTTP objects in keep-alive mode

The amount of rule processing and regular expression matching will increase
the userland part. The presence of firewall rules, connection tracking and
complex routing tables in the system will instead increase the system part.

On most systems, the CPU time observed during network transfers can be cut in
4 parts (see the example after this list) :
  - the interrupt part, which concerns all the processing performed upon I/O
    receipt, before the target process is even known. Typically Rx packets are
    accounted for in interrupt. On some systems such as Linux where interrupt
    processing may be deferred to a dedicated thread, it can appear as softirq,
    and the thread is called ksoftirqd/0 (for CPU 0). The CPU taking care of
    this load is generally defined by the hardware settings, though in the case
    of softirq it is often possible to remap the processing to another CPU.
    This interrupt part will often be perceived as parasitic since it's not
    associated with any process, but it actually is some processing being done
    to prepare the work for the process.

  - the system part, which concerns all the processing done using kernel code
    called from userland. System calls are accounted as system for example. All
    synchronously delivered Tx packets will be accounted for as system time. If
    some packets have to be deferred due to queues filling up, they may then be
    processed in interrupt context later (eg: upon receipt of an ACK opening a
    TCP window).

  - the user part, which exclusively runs application code in userland. HAProxy
    runs exclusively in this part, though it makes heavy use of system calls.
    Rules processing, regular expressions, compression and encryption all add
    to the user portion of CPU consumption.

  - the idle part, which is what the CPU does when there is nothing to do. For
    example HAProxy waits for an incoming connection, or waits for some data to
    leave, meaning the system is waiting for an ACK from the client to push
    these data.

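On Linux, a quick way to see how the load splits between these four parts is
the mpstat utility from the sysstat package; the exact column names may vary
slightly between versions, but they typically map as follows :

  $ mpstat -P ALL 1
  # per-CPU figures refreshed every second :
  #   %usr         -> user part (haproxy itself)
  #   %sys         -> system part (system calls, mostly the Tx path)
  #   %irq, %soft  -> interrupt/softirq part (mostly Rx processing)
  #   %idle        -> idle part
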
In practice regarding HAProxy's activity, it is in general reasonably accurate
(though not strictly exact) to consider that interrupt/softirq time is caused
by Rx processing in kernel drivers, that userland time is caused by layer 7
processing in HAProxy, and that system time is caused by network processing on
the Tx path.

Since HAProxy runs around an event loop, it waits for new events using poll()
(or any alternative) and processes all these events as fast as possible before
going back to poll() waiting for new events. It measures the time spent
waiting in poll() compared to the time spent processing events. The ratio of
polling time vs total time is called the "idle" time; it's the amount of time
spent waiting for something to happen. This ratio is reported in the stats
page on the "idle" line, or "Idle_pct" on the CLI. When it's close to 100%, it
means the load is extremely low. When it's close to 0%, it means that there is
constantly some activity. While it cannot be very accurate on an overloaded
system due to other processes possibly preempting the CPU from the haproxy
process, it still provides a good estimate about how HAProxy considers it is
working : if the load is low and the idle ratio is low as well, it may indicate
that HAProxy has a lot of work to do, possibly due to very expensive rules that
have to be processed. Conversely, if HAProxy indicates the idle is close to
100% while things are slow, it means that it cannot do anything to speed things
up because it is already waiting for incoming data to process. In the example
below, haproxy is completely idle :

  $ echo "show info" | socat - /var/run/haproxy.sock | grep ^Idle
  Idle_pct: 100

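To follow how this ratio evolves while traffic is applied, the same command
may simply be repeated at a regular interval, for example with watch :

  $ watch -n 1 'echo "show info" | socat - /var/run/haproxy.sock | grep ^Idle'
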
When the idle ratio starts to become very low, it is important to tune the
system and place processes and interrupts correctly to save as much CPU
capacity as possible for all tasks. If a firewall is present, it may be worth
trying to disable it or to tune it to ensure it is not responsible for a large
part of the performance limitation. It's worth noting that unloading a
stateful firewall generally reduces both the amount of interrupt/softirq and
of system usage since such firewalls act both on the Rx and the Tx paths. On
Linux, unloading the nf_conntrack and ip_conntrack modules will show whether
there is anything to gain. If so, then the module runs with default settings
and you'll have to figure out how to tune it for better performance. In
general this consists in considerably increasing the hash table size. On
FreeBSD, "pfctl -d" will disable the "pf" firewall and its stateful engine at
the same time.

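On Linux, a minimal sketch of this check and tuning could look like the
following; exact names may differ between kernel versions and the values are
only examples to adapt to the expected number of tracked connections :

  # check whether connection tracking is loaded and how close it is to its
  # limit
  $ lsmod | grep conntrack
  $ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

  # if the module cannot simply be unloaded (rmmod nf_conntrack), raise the
  # limits instead; the values below are only examples
  $ sysctl -w net.netfilter.nf_conntrack_max=1048576
  $ echo 262144 > /sys/module/nf_conntrack/parameters/hashsize
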
If it is observed that a lot of time is spent in interrupt/softirq, it is
important to ensure that these interrupts do not run on the same CPU as the
haproxy process. Most systems tend to pin tasks on the CPU where they receive
the network traffic because for certain workloads it improves things. But with
heavily network-bound workloads it is the opposite, as the haproxy process
will have to fight against its kernel counterpart. Pinning haproxy to one CPU
core and the interrupts to another one, all sharing the same L3 cache, tends
to noticeably increase network performance because in practice the amounts of
work for haproxy and for the network stack are quite close, so they can almost
fill an entire CPU each. On Linux this is done using taskset (for haproxy) or
using cpu-map (from the haproxy config), and the interrupts are assigned under
/proc/irq. Many network interfaces support multiple queues and multiple
interrupts. In general it helps to spread them across a small number of CPU
cores provided they all share the same L3 cache. Please always stop
irqbalance, which always does the worst possible thing on such workloads.

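As an illustration, the following sketch pins the network interrupts to CPU 0
and haproxy to CPU 1 on a Linux system; the IRQ number, CPU numbers and paths
are examples to adapt to what /proc/interrupts and the CPU topology report
(stopping the irqbalance service assumes a systemd-based distribution) :

  # stop irqbalance so it does not override manual IRQ affinity settings
  $ systemctl stop irqbalance

  # pin the NIC's interrupt (IRQ number found in /proc/interrupts) to CPU 0;
  # the value written is a hexadecimal CPU mask, 1 means CPU 0
  $ echo 1 > /proc/irq/123/smp_affinity

  # pin haproxy to CPU 1; the equivalent from the configuration would be
  # "cpu-map 1 1" in the global section (process 1 bound to CPU 1)
  $ taskset -c 1 haproxy -f /etc/haproxy/haproxy.cfg
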
For CPU-bound workloads consisting of a lot of SSL traffic or a lot of
compression, it may be worth using multiple processes dedicated to certain
tasks, though there is no universal rule here and experimentation will have to
be performed.

In order to increase the CPU capacity, it is possible to make HAProxy run as
several processes, using the "nbproc" directive in the global section. There
are some limitations though :
  - health checks are run per process, so the target servers will get as many
    checks as there are running processes ;
  - maxconn values and queues are per-process so the correct value must be set
    to avoid overloading the servers ;
  - outgoing connections should avoid using port ranges to avoid conflicts ;
  - stick-tables are per process and are not shared between processes ;
  - each peers section may only run on a single process at a time ;
  - the CLI operations will only act on a single process at a time (see the
    sketch after this list).

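The CLI limitation is commonly worked around by declaring one stats socket per
process, so that each process can be queried separately. A minimal sketch,
where the socket paths and values are only examples :

  global
      nbproc 4
      # maxconn is enforced per process, so with 4 processes the total number
      # of connections may reach 4 times this value
      maxconn 10000
      # one CLI socket per process
      stats socket /var/run/haproxy-1.sock process 1
      stats socket /var/run/haproxy-2.sock process 2
      stats socket /var/run/haproxy-3.sock process 3
      stats socket /var/run/haproxy-4.sock process 4
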
With this in mind, it appears that the easiest setup often consists in having
a first layer running on multiple processes and in charge of the heavy
processing, passing the traffic to a second layer running in a single process.
This mechanism is suited to SSL and compression which are the two CPU-heavy
features. Instances can easily be chained over UNIX sockets (which are cheaper
than TCP sockets and which do not waste ports), using the PROXY protocol to
pass client information to the next stage. When doing so, it is generally a
good idea to bind all the single-process tasks to process number 1 and the
extra tasks to the next processes, as this will make it easier to generate
similar configurations for different machines.

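A minimal sketch of such a chaining could look like this, with SSL terminated
on processes 2 to 4 and everything else on process 1; the certificate path,
socket path, addresses and timeouts are only examples :

  global
      nbproc 4

  defaults
      mode http
      timeout connect 5s
      timeout client  30s
      timeout server  30s

  # layer 1 : SSL termination, spread over processes 2 to 4
  frontend ssl-offload
      bind-process 2-4
      bind :443 ssl crt /etc/haproxy/site.pem
      default_backend to-layer2

  backend to-layer2
      # chain to the second layer over a UNIX socket, passing client
      # information with the PROXY protocol
      server layer2 unix@/var/run/haproxy-l2.sock send-proxy

  # layer 2 : single process handling the rest of the processing
  frontend clear-in
      bind-process 1
      bind unix@/var/run/haproxy-l2.sock accept-proxy
      default_backend app

  backend app
      server app1 192.0.2.10:80
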
On Linux versions 3.9 and above, running HAProxy in multi-process mode is much
more efficient when each process uses a distinct listening socket on the same
IP:port ; this will make the kernel evenly distribute the load across all
processes instead of waking them all up. Please check the "process" option of
the "bind" keyword lines in the configuration manual for more information.

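For example, with 4 processes, a distinct listening socket per process can be
created by repeating the bind line with the "process" parameter; the port and
server address below are only examples :

  global
      nbproc 4

  listen web
      bind :80 process 1
      bind :80 process 2
      bind :80 process 3
      bind :80 process 4
      server app1 192.0.2.10:80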