Subject: Re: Some performance measurements on the FreeBSD network stack
From: Luigi Rizzo (riz@iet.unipi.it)
Date: Apr 19, 2012 1:26:40 pm
List: org.freebsd.freebsd-current

On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:

On 19.04.2012 15:30, Luigi Rizzo wrote:

I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting.

Jumping over very interesting analysis...

- the next expensive operation, consuming another 100ns, is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator seems to scale decently at least with 4 cores. The copyin() is relatively inexpensive (not reported in the data below, but disabling it saves only 15-20ns for a short packet).

I have not followed the details, but the allocator calls the zone allocator and there is at least one critical_enter()/critical_exit() pair, and the highly modular architecture invokes long chains of indirect function calls both on allocation and release.

It might make sense to keep a small pool of mbufs attached to the socket buffer instead of going to the zone allocator. Or defer the actual encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.

The UMA mbuf allocator is certainly not perfect but rather good. It has a per-CPU cache of mbuf's that are very fast to allocate from. Once it has used them it needs to refill from the global pool which may happen from time to time and show up in the averages.
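To make the cache/refill behaviour concrete, here is a rough userland sketch of a two-level allocator: a small lock-free cache in front of a global pool that is touched only when the cache runs dry. This is not the UMA code; all names (`cpu_cache`, `global_pool`, `CACHE_SIZE`, ...) are made up for illustration, and the locking a real global pool needs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SIZE 8            /* hypothetical per-CPU cache capacity */

struct buf {
    struct buf *next;           /* link used while on the global free list */
    char data[256];
};

/* Global pool: a real one would be protected by a lock (omitted here). */
struct global_pool {
    struct buf *free_list;
    unsigned long refills;      /* how often a cache had to come here */
};

/* Per-CPU cache (here simply per-caller): the fast path needs no lock. */
struct cpu_cache {
    struct buf *items[CACHE_SIZE];
    int count;
};

/* Slow path helper: take one buffer from the global pool, or malloc one. */
static struct buf *global_get(struct global_pool *gp)
{
    struct buf *b = gp->free_list;

    if (b != NULL)
        gp->free_list = b->next;
    else
        b = malloc(sizeof(*b));
    return b;
}

/* Refill an empty cache in bulk, so the global pool is visited rarely. */
static void cache_refill(struct cpu_cache *c, struct global_pool *gp)
{
    assert(c->count == 0);
    gp->refills++;
    while (c->count < CACHE_SIZE) {
        struct buf *b = global_get(gp);

        if (b == NULL)
            break;
        c->items[c->count++] = b;
    }
}

/* Fast path: pop from the local cache; refill only when it is empty. */
static struct buf *cache_alloc(struct cpu_cache *c, struct global_pool *gp)
{
    if (c->count == 0)
        cache_refill(c, gp);
    if (c->count == 0)
        return NULL;
    return c->items[--c->count];
}

/* Free to the local cache if there is room, else spill to the global pool. */
static void cache_free(struct cpu_cache *c, struct global_pool *gp,
    struct buf *b)
{
    if (c->count < CACHE_SIZE) {
        c->items[c->count++] = b;
        return;
    }
    b->next = gp->free_list;
    gp->free_list = b;
}
```

With a steady alloc/free pattern such as netsend's, the `refills` counter in this sketch stops growing after the first allocation, which is the effect described above: the global pool only shows up occasionally in the averages.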

indeed i was pleased to see no difference between 1 and 4 threads. This also suggests that the global pool is accessed very seldom, and for short times, otherwise you'd see the effect with 4 threads.

What might be moderately expensive are the critical_enter()/critical_exit() calls around individual allocations. The allocation happens while the code already has an exclusive lock on so->snd_buf so a pool of fresh buffers could be attached there.

But the other consideration is that one could defer the mbuf allocation to a later time when the packet is actually built (or anyways right before the thread returns). What i envision (and this would fit nicely with netmap) is the following:

- have a (possibly readonly) template for the headers (MAC+IP+UDP) attached to the socket, built on demand, and cached and managed with similar invalidation rules as used by fastforward;
- possibly extend the pru_send interface so one can pass down the uio instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream, where the code already has an x-lock on some resource (could be the snd_buf, the interface, ...) so the allocation comes for free.
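A very rough userland sketch of the header-template idea, to show what "built on demand, cached, invalidated" could look like. Nothing here is existing kernel code: `hdr_template`, `route_gen`, `template_build()` and `packet_fill()` are hypothetical names, the generation counter is only a stand-in for whatever invalidation rule fastforward-style code would use, and addresses, ports and checksums are left at zero for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_LEN 14
#define IP_LEN  20
#define UDP_LEN 8
#define HDR_LEN (ETH_LEN + IP_LEN + UDP_LEN)

/* Cached MAC+IP+UDP header attached to a socket (hypothetical layout). */
struct hdr_template {
    uint8_t  bytes[HDR_LEN];
    unsigned gen;               /* generation the template was built against */
    int      valid;
};

/* Hypothetical counter, bumped whenever a route/ARP change occurs. */
static unsigned route_gen = 1;

/* Store a 16-bit value in network byte order. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Build the constant part of the headers; addresses/ports left as zero. */
static void template_build(struct hdr_template *t)
{
    memset(t->bytes, 0, HDR_LEN);
    put16(t->bytes + 12, 0x0800);       /* ethertype: IPv4 */
    t->bytes[ETH_LEN] = 0x45;           /* IP version 4, header length 5 */
    t->bytes[ETH_LEN + 8] = 64;         /* TTL */
    t->bytes[ETH_LEN + 9] = 17;         /* protocol: UDP */
    t->gen = route_gen;
    t->valid = 1;
}

/*
 * Per-packet work: rebuild the template only if stale, copy it, patch the
 * two length fields, append the payload. Returns the frame length, or 0 if
 * the payload does not fit.
 */
static size_t packet_fill(struct hdr_template *t, uint8_t *out, size_t outlen,
    const void *payload, size_t plen)
{
    if (plen > 65535 - IP_LEN - UDP_LEN || HDR_LEN + plen > outlen)
        return 0;
    if (!t->valid || t->gen != route_gen)
        template_build(t);
    memcpy(out, t->bytes, HDR_LEN);
    put16(out + ETH_LEN + 2, (uint16_t)(IP_LEN + UDP_LEN + plen));
    put16(out + ETH_LEN + IP_LEN + 4, (uint16_t)(UDP_LEN + plen));
    memcpy(out + HDR_LEN, payload, plen);
    return HDR_LEN + plen;
}
```

The point of the sketch is only that the per-packet path reduces to one generation check, one fixed-size copy and two 16-bit stores, while the expensive lookups happen when the template is (re)built.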

- another big bottleneck is the route lookup in ip_output() (between entries 51 and 56). Not only does it eat another 100ns+ on an empty routing table, but it also causes heavy contention when multiple cores are involved.

This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which

i was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets ?

cheers luigi