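Linux network packet receive path
The NIC raises an interrupt to notify the kernel of new packets
When a NIC receives a packet, it usually raises a hardware interrupt (IRQ) to notify the kernel. The driver’s interrupt handler does very little work itself and hands packet processing to the kernel’s softIRQ system, which is set up as follows: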
- softIRQ kernel threads are created (one per CPU) in spawn_ksoftirqd in kernel/softirq.c with a call to smpboot_register_percpu_thread from kernel/smpboot.c. As seen in the code, the function run_ksoftirqd is listed as thread_fn, which is the function that will be executed in a loop.
- The ksoftirqd threads begin executing their processing loops in the run_ksoftirqd function.
- Next, the softnet_data structures are created, one per CPU. These structures hold references to important data structures for processing network data. One we’ll see again is the poll_list. The poll_list is where NAPI poll worker structures will be added by calls to napi_schedule or other NAPI APIs from device drivers.
- net_dev_init then registers the NET_RX_SOFTIRQ softirq with the softirq system by calling open_softirq. The handler function that is registered is called net_rx_action. This is the function the softirq kernel threads will execute to process packets.
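The setup steps above can be sketched as a small user-space model. This is only an illustration, not kernel code: the Python names mirror the kernel functions they stand in for, the SoftnetData class is a hypothetical stand-in for softnet_data, and the handler body is a placeholder.

```python
from collections import deque
from dataclasses import dataclass, field

# NET_RX_SOFTIRQ is index 3 in the mainline softirq enum; treat it as illustrative.
NET_RX_SOFTIRQ = 3


@dataclass
class SoftnetData:
    """Toy stand-in for the per-CPU softnet_data structure."""
    cpu: int
    # NAPI structures waiting to be polled on this CPU.
    poll_list: deque = field(default_factory=deque)


# softirq number -> handler, modelled on the kernel's softirq vector table.
softirq_vec = {}


def open_softirq(nr, action):
    """Register the handler that runs when softirq `nr` is pending."""
    softirq_vec[nr] = action


def net_rx_action(sd):
    """Placeholder handler: report how many NAPI structures are waiting."""
    return len(sd.poll_list)


def net_dev_init(num_cpus):
    """Create one SoftnetData per CPU, then register the RX handler."""
    per_cpu = [SoftnetData(cpu) for cpu in range(num_cpus)]
    open_softirq(NET_RX_SOFTIRQ, net_rx_action)
    return per_cpu


per_cpu = net_dev_init(num_cpus=4)
```

The point of the model is the separation of concerns: registration happens once at init time, while each CPU keeps its own poll_list so that packet processing needs no cross-CPU locking in the common case.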
When a new packet arrives, the NIC writes it to RAM (a ring buffer) via DMA and raises a hardware interrupt, and the driver’s IRQ handler is called
When network data arrives at a NIC, the NIC will use DMA to write the packet data to RAM. In the case of the igb network driver, a ring buffer is set up in RAM that points to received packets. It is important to note that some NICs are “multiqueue” NICs, meaning that they can DMA incoming packets to one of many ring buffers in RAM. As we’ll see soon, such NICs are able to make use of multiple processors for processing incoming network data. Read more about multiqueue NICs. The diagram above shows just a single ring buffer for simplicity, but depending on the NIC you are using and your hardware settings you may have multiple queues on your system.
Read more detail about the process described below in this section of the networking blog post.
Let’s walk through the process of receiving data:
- Data is received by the NIC from the network.
- The NIC uses DMA to write the network data to RAM.
- The NIC raises an IRQ.
- The device driver’s registered IRQ handler is executed.
- The IRQ is cleared on the NIC, so that it can generate IRQs for new packet arrivals.
- NAPI softIRQ poll loop is started with a call to napi_schedule.
The call to napi_schedule triggers the start of steps 5 - 8 in the previous diagram. As we’ll see, the NAPI softIRQ poll loop is started by simply flipping a bit in a bitfield and adding a structure to the poll_list for processing. No other work is done by napi_schedule and this is precisely how a driver defers processing to the softIRQ system.
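As a rough illustration of how little napi_schedule does, here is a toy model in Python. The Napi and Cpu classes, the bit values, and the irq_handler function are assumptions made for this sketch; only the overall behaviour (set a bit, queue the structure, mark the softIRQ pending) comes from the description above.

```python
from collections import deque

NAPI_STATE_SCHED = 1 << 0   # toy "poll is already scheduled" bit
NET_RX_SOFTIRQ = 3          # illustrative softirq number


class Napi:
    """Toy NAPI structure owned by a driver."""
    def __init__(self, name):
        self.name = name
        self.state = 0


class Cpu:
    """Toy per-CPU state: the poll list and the softirq pending bitfield."""
    def __init__(self):
        self.poll_list = deque()
        self.softirq_pending = 0


def napi_schedule(cpu, napi):
    """Return True if this call scheduled the NAPI structure."""
    if napi.state & NAPI_STATE_SCHED:
        return False                             # already queued, nothing to do
    napi.state |= NAPI_STATE_SCHED               # flip the bit
    cpu.poll_list.append(napi)                   # queue for polling
    cpu.softirq_pending |= 1 << NET_RX_SOFTIRQ   # ask for net_rx_action to run
    return True


def irq_handler(cpu, napi):
    """Toy hard-IRQ handler: touch no packets, only defer the work."""
    napi_schedule(cpu, napi)
```

Because a second interrupt for an already scheduled structure returns early, a burst of IRQs results in a single poll_list entry rather than one per packet.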
Continuing on to the diagram in the previous section, using the numbers found there:
- The call to napi_schedule in the driver adds the driver’s NAPI poll structure to the poll_list for the current CPU.
- The softirq pending bit is set so that the ksoftirqd process on this CPU knows that there are packets to process.
- The run_ksoftirqd function (which is being run in a loop by the ksoftirqd kernel thread) executes.
- __do_softirq is called, which checks the pending bitfield, sees that a softIRQ is pending, and calls the handler registered for the pending softIRQ: net_rx_action, which does all the heavy lifting for incoming network data processing.
It is important to note that the softIRQ kernel thread is executing net_rx_action, not the device driver IRQ handler.
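The dispatch step can be pictured with a minimal sketch. It assumes a plain integer as the pending bitfield and a dictionary as the handler table; the real __do_softirq also re-checks for newly raised softIRQs and bounds how long it runs, which this model leaves out.

```python
NET_TX_SOFTIRQ, NET_RX_SOFTIRQ = 2, 3   # illustrative softirq numbers


def do_softirq(pending, handlers):
    """Run the handler for each set bit, lowest bit first.

    `pending` is an integer bitfield and `handlers` maps a softirq number
    to a zero-argument callable. Returns the softirq numbers that ran.
    """
    ran = []
    nr = 0
    while pending:
        if pending & 1 and nr in handlers:
            handlers[nr]()
            ran.append(nr)
        pending >>= 1
        nr += 1
    return ran
```

A usage example: with handlers registered for both network softIRQs, `do_softirq(1 << NET_RX_SOFTIRQ, handlers)` runs only the receive handler, and the driver's IRQ handler is never involved.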
The kernel starts processing
Now, data processing begins. The net_rx_action function (called from the ksoftirqd kernel thread) will start to process any NAPI poll structures that have been added to the poll_list for the current CPU. Poll structures are added in two general cases:
- From device drivers with calls to napi_schedule.
- With an Inter-processor Interrupt in the case of Receive Packet Steering. Read more about how Receive Packet Steering uses IPIs to process packets.
We’re going to start by walking through what happens when a driver’s NAPI structure is retrieved from the poll_list. (The next section covers how NAPI structures registered with IPIs for RPS work.)
- The net_rx_action loop starts by checking the NAPI poll list for NAPI structures.
- The budget and elapsed time are checked to ensure that the softIRQ will not monopolize CPU time.
- The registered poll function is called. In this case, the function igb_poll was registered by the igb driver.
- The driver’s poll function harvests packets from the ring buffer in RAM.
- Packets are handed over to napi_gro_receive, which will deal with possible Generic Receive Offload (GRO).
- Packets are either held for GRO, in which case the call chain ends here, or passed on to netif_receive_skb to proceed up toward the protocol stacks.
We’ll see next how netif_receive_skb deals with Receive Packet Steering to distribute the packet processing load amongst multiple CPUs.
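The budget check in the poll loop can be sketched as follows. This is a simplified model under stated assumptions: the Napi class and its ring attribute are hypothetical, the defaults of 300 (overall budget) and 64 (per-poll weight) match common kernel defaults but are used here only for illustration, weights are assumed to be positive, and the elapsed-time limit and GRO are omitted.

```python
from collections import deque


class Napi:
    """Toy NAPI structure with a ring buffer of received packets."""
    def __init__(self, name, ring, weight=64):
        self.name = name
        self.ring = deque(ring)   # packets the NIC has written via DMA
        self.weight = weight      # max packets per poll call (assumed > 0)

    def poll(self, budget, deliver):
        """Harvest up to `budget` packets; return how many were processed."""
        done = 0
        while self.ring and done < budget:
            deliver(self.ring.popleft())
            done += 1
        return done


def net_rx_action(poll_list, deliver, budget=300):
    """Poll NAPI structures until the list is empty or the budget runs out.

    Returns the unused budget. Structures that still hold packets are put
    back at the tail of the list so other queues get a turn.
    """
    while poll_list and budget > 0:
        napi = poll_list.popleft()
        budget -= napi.poll(min(napi.weight, budget), deliver)
        if napi.ring:
            poll_list.append(napi)
    return budget
```

If the budget reaches zero while structures remain on the list, the function returns with work outstanding, which is how the model keeps one busy queue from monopolizing the CPU.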
Network data processing continues from netif_receive_skb, but the path of the data depends on whether Receive Packet Steering (RPS) is enabled. An “out of the box” Linux kernel will not have RPS enabled by default and it will need to be explicitly enabled and configured if you want to use it.
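RPS is configured per receive queue by writing a hexadecimal CPU bitmask to that queue's rps_cpus file in sysfs. The helpers below are a small sketch for building the mask and the path; the function names are made up for this example, the device name in the usage note is a placeholder, and the mask helper assumes fewer than 32 CPUs so that no comma-separated groups are needed.

```python
def rps_cpus_mask(cpus):
    """Return the hex bitmask string for a collection of CPU numbers."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def rps_cpus_path(dev, rx_queue):
    """Return the sysfs file that holds the RPS CPU mask for one RX queue."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"
```

For example, `rps_cpus_mask([0, 1, 2, 3])` gives `f`, which could then be written (as root) to `rps_cpus_path("eth0", 0)`. A mask of `0` leaves RPS disabled for that queue.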
In the case where RPS is disabled, using the numbers in the above diagram:
- netif_receive_skb passes the data on to __netif_receive_skb_core.
- __netif_receive_skb_core delivers data to any taps (like PCAP).
- __netif_receive_skb_core delivers data to registered protocol layer handlers. In many cases, this would be the ip_rcv function that the IPv4 protocol stack has registered.
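The two delivery steps can be modelled in a few lines. In this sketch a packet is a plain dictionary with an ethertype key, taps are callables in a list, and protocol handlers live in a dictionary keyed by EtherType; all of these representations are assumptions chosen for illustration.

```python
ETH_P_IP = 0x0800   # EtherType for IPv4


def netif_receive_skb_core(packet, taps, handlers):
    """Deliver to every tap, then to the handler for the packet's EtherType.

    Returns True if a protocol handler took the packet, False if none is
    registered for its EtherType.
    """
    for tap in taps:                 # e.g. a packet capture socket
        tap(packet)
    handler = handlers.get(packet["ethertype"])
    if handler is None:
        return False
    handler(packet)
    return True
```

Taps see every packet regardless of protocol, which is why a capture tool can show traffic that no protocol handler on the system accepts.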
In the case where RPS is enabled:
- netif_receive_skb passes the data on to enqueue_to_backlog.
- Packets are placed on a per-CPU input queue for processing.
- The remote CPU’s NAPI structure is added to that CPU’s poll_list and an IPI is queued, which will trigger the softIRQ kernel thread on the remote CPU to wake up if it is not running already.
- When the ksoftirqd kernel thread on the remote CPU runs, it follows the same pattern described in the previous section, but this time the registered poll function is process_backlog, which harvests packets from the current CPU’s input queue.
- Packets are passed on toward __netif_receive_skb_core.
- __netif_receive_skb_core delivers data to any taps (like PCAP).
- __netif_receive_skb_core delivers data to registered protocol layer handlers. In many cases, this would be the ip_rcv function that the IPv4 protocol stack has registered.
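The RPS path above can be sketched as a toy model. Several details are assumptions made for the example: the flow hash is a CRC32 of the flow tuple (the kernel uses its own flow hash), the queue limit of 1000 mirrors a common default, and the ipi_pending flag stands in for the queued IPI.

```python
import zlib
from collections import deque


class Cpu:
    """Toy per-CPU backlog state."""
    def __init__(self, cpu_id, max_backlog=1000):
        self.cpu_id = cpu_id
        self.input_pkt_queue = deque()
        self.max_backlog = max_backlog
        self.ipi_pending = False
        self.dropped = 0


def get_rps_cpu(flow, rps_cpus):
    """Pick a CPU from the flow tuple so that one flow stays on one CPU."""
    key = repr(flow).encode()
    return rps_cpus[zlib.crc32(key) % len(rps_cpus)]


def enqueue_to_backlog(cpu, packet):
    """Queue a packet on the target CPU; drop it if the queue is full."""
    if len(cpu.input_pkt_queue) >= cpu.max_backlog:
        cpu.dropped += 1
        return False
    if not cpu.input_pkt_queue:
        cpu.ipi_pending = True       # first packet: the CPU needs a wake-up
    cpu.input_pkt_queue.append(packet)
    return True


def process_backlog(cpu, deliver, quota=64):
    """Poll function for the backlog: harvest up to `quota` packets."""
    cpu.ipi_pending = False
    done = 0
    while cpu.input_pkt_queue and done < quota:
        deliver(cpu.input_pkt_queue.popleft())
        done += 1
    return done
```

Hashing on the flow tuple rather than distributing packets round-robin keeps the packets of one connection in order, since they are all processed by the same CPU.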
Protocol stacks and userland sockets
Next up are the protocol stacks, netfilter, Berkeley Packet Filters, and finally the userland socket. This code path is long, but linear and relatively straightforward.
You can continue following the detailed path for network data. A very brief, high level summary of the path is:
- Packets are received by the IPv4 protocol layer with ip_rcv.
- Netfilter and a routing optimization are performed.
- Data destined for the current system is delivered to higher-level protocol layers, like UDP.
- Packets are received by the UDP protocol layer with udp_rcv and are queued to the receive buffer of a userland socket by udp_queue_rcv_skb and sock_queue_rcv_skb. Prior to queuing to the receive buffer, Berkeley Packet Filters are processed.
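The final queuing step can be illustrated with a toy socket model. It is a sketch under simplifying assumptions: the Sock class and recv function are hypothetical, the filter is any Python callable rather than a real BPF program, and memory is accounted by payload length, whereas the kernel charges the full size of the packet buffer.

```python
from collections import deque


class Sock:
    """Toy socket with a bounded receive buffer."""
    def __init__(self, rcvbuf):
        self.rcvbuf = rcvbuf           # receive buffer limit in bytes
        self.rmem_alloc = 0            # bytes currently queued
        self.receive_queue = deque()
        self.drops = 0


def sock_queue_rcv_skb(sk, payload, bpf_filter=None):
    """Run the filter, then queue the payload if the buffer has room."""
    if bpf_filter is not None and not bpf_filter(payload):
        return False                   # rejected by the socket filter
    if sk.rmem_alloc + len(payload) > sk.rcvbuf:
        sk.drops += 1                  # receive buffer full
        return False
    sk.receive_queue.append(payload)
    sk.rmem_alloc += len(payload)
    return True


def recv(sk):
    """Toy userland read: take the oldest datagram and free its memory."""
    payload = sk.receive_queue.popleft()
    sk.rmem_alloc -= len(payload)
    return payload
```

The model shows why a slow reader loses UDP datagrams: once the queued bytes reach the buffer limit, new packets are dropped until the application reads and frees space.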
Note that netfilter is consulted multiple times throughout this process. The exact locations can be found in our detailed walk-through.