Luigi Rizzo - Apr 19, 2012 6:12 am
Slawa Olhovchenkov - Apr 19, 2012 11:53 am
Andre Oppermann - Apr 19, 2012 1:05 pm
Luigi Rizzo - Apr 19, 2012 1:26 pm
K. Macy - Apr 19, 2012 1:34 pm
Luigi Rizzo - Apr 19, 2012 2:03 pm
K. Macy - Apr 19, 2012 2:06 pm
Andre Oppermann - Apr 19, 2012 2:11 pm
K. Macy - Apr 19, 2012 2:17 pm
Andre Oppermann - Apr 19, 2012 2:19 pm
Andre Oppermann - Apr 19, 2012 2:26 pm
K. Macy - Apr 19, 2012 2:35 pm
K. Macy - Apr 19, 2012 2:36 pm
Luigi Rizzo - Apr 19, 2012 2:43 pm
Andre Oppermann - Apr 19, 2012 3:36 pm
Luigi Rizzo - Apr 19, 2012 11:16 pm
Alexander V. Chernikov - Apr 20, 2012 1:26 am
Andre Oppermann - Apr 20, 2012 2:00 am
Andre Oppermann - Apr 20, 2012 2:25 am
John Baldwin - Apr 20, 2012 5:11 am
Luigi Rizzo - Apr 20, 2012 7:26 am
K. Macy - Apr 20, 2012 9:28 am
Luigi Rizzo - Apr 20, 2012 11:46 am
Bruce Evans - Apr 20, 2012 11:33 pm
Adrian Chadd - Apr 21, 2012 7:14 pm
K. Macy - Apr 22, 2012 7:04 am
Andre Oppermann - Apr 24, 2012 6:16 am
Luigi Rizzo - Apr 24, 2012 6:44 am
Li, Qing - Apr 24, 2012 7:15 am
K. Macy - Apr 24, 2012 8:03 am
K. Macy - Apr 24, 2012 8:05 am
Luigi Rizzo - Apr 24, 2012 9:16 am
K. Macy - Apr 24, 2012 9:18 am
Fabien Thomas - Apr 24, 2012 9:34 am
Li, Qing - Apr 24, 2012 10:39 am
Li, Qing - Apr 24, 2012 10:42 am
Bjoern A. Zeeb - Apr 24, 2012 5:01 pm
Maxim Konovalov - Apr 25, 2012 2:21 am
Slawa Olhovchenkov - Apr 25, 2012 3:19 am
K. Macy - Apr 25, 2012 8:44 am
Bjoern A. Zeeb - Apr 25, 2012 11:53 am
George Neville-Neil - May 1, 2012 7:27 am
Luigi Rizzo - May 1, 2012 8:21 am
George Neville-Neil - May 1, 2012 10:33 am
Bjoern A. Zeeb - May 1, 2012 2:08 pm
Luigi Rizzo - May 1, 2012 2:22 pm
Luigi Rizzo - May 3, 2012 9:32 am
Subject: Re: Some performance measurements on the FreeBSD network stack
From: Luigi Rizzo (riz...@iet.unipi.it)
Date: Apr 19, 2012 1:26:40 pm
On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> On 19.04.2012 15:30, Luigi Rizzo wrote:
>> I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting.
> Jumping over very interesting analysis...
>> - the next expensive operation, consuming another 100ns, is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator seems to scale decently at least with 4 cores. The copyin() is relatively inexpensive (not reported in the data below, but disabling it saves only 15-20ns for a short packet).
>> I have not followed the details, but the allocator calls the zone allocator and there is at least one critical_enter()/critical_exit() pair, and the highly modular architecture invokes long chains of indirect function calls both on allocation and release.
>> It might make sense to keep a small pool of mbufs attached to the socket buffer instead of going to the zone allocator. Or defer the actual encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.
> The UMA mbuf allocator is certainly not perfect but rather good. It has a per-CPU cache of mbufs that are very fast to allocate from. Once it has used them it needs to refill from the global pool, which may happen from time to time and show up in the averages.
Indeed, I was pleased to see no difference between 1 and 4 threads. This also suggests that the global pool is accessed very seldom, and for short times; otherwise you would see the effect with 4 threads.
What might be moderately expensive are the critical_enter()/critical_exit() calls around individual allocations. The allocation happens while the code already holds an exclusive lock on so->snd_buf, so a pool of fresh buffers could be attached there.
But the other consideration is that one could defer the mbuf allocation to a later time, when the packet is actually built (or anyway right before the thread returns). What I envision (and this would fit nicely with netmap) is the following; a rough sketch of the first point follows the list:
- have a (possibly readonly) template for the headers (MAC+IP+UDP) attached to the socket, built on demand, cached and managed with invalidation rules similar to those used by fastforward;
- possibly extend the pru_send interface so one can pass down the uio instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream, where the code already has an x-lock on some resource (could be the snd_buf, the interface, ...) so the allocation comes for free.
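To make the first point concrete, here is a minimal userspace sketch of how a cached header template with generation-based invalidation could behave. This is not FreeBSD kernel code, and every name in it (hdr_template, route_generation, sock_send_udp, ...) is invented for illustration; the point is only that once a generation counter guards the cached headers, the common-case per-packet work shrinks to a memcpy plus a late buffer allocation.

/*
 * Hypothetical userspace sketch (not FreeBSD kernel code): a prebuilt
 * MAC+IP+UDP header template is cached per socket and invalidated by a
 * generation counter.  All names (hdr_template, route_generation,
 * sock_send_udp, ...) are invented for illustration.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HDR_LEN (14 + 20 + 8)           /* Ethernet + IPv4 + UDP headers */

static uint64_t route_generation = 1;   /* bumped on route/ARP changes */

struct hdr_template {
    uint64_t gen;                       /* generation the template was built at */
    uint8_t  bytes[HDR_LEN];            /* prebuilt MAC+IP+UDP header */
};

struct fake_socket {
    struct hdr_template tmpl;           /* cached on the socket, built on demand */
};

/* Slow path: route/ARP lookup and header construction, done rarely. */
static void
build_template(struct fake_socket *so)
{
    memset(so->tmpl.bytes, 0, sizeof(so->tmpl.bytes));
    /* ... fill in real MAC/IP/UDP fields here ... */
    so->tmpl.gen = route_generation;
    printf("rebuilt header template (gen %llu)\n",
        (unsigned long long)so->tmpl.gen);
}

/*
 * Send path: the template is rebuilt only when the generation changed;
 * otherwise the per-packet cost is a memcpy, and the packet buffer can
 * be allocated late, right before transmission.
 */
static void
sock_send_udp(struct fake_socket *so, const void *payload, size_t len)
{
    uint8_t pkt[1500];

    if (so->tmpl.gen != route_generation)
        build_template(so);             /* rare */

    memcpy(pkt, so->tmpl.bytes, HDR_LEN);
    memcpy(pkt + HDR_LEN, payload, len);
    printf("sent %zu byte packet\n", HDR_LEN + len);
}

int
main(void)
{
    struct fake_socket so = { .tmpl = { .gen = 0 } };

    sock_send_udp(&so, "hello", 5);     /* first send builds the template */
    sock_send_udp(&so, "world", 5);     /* fast path: memcpy only */
    route_generation++;                 /* e.g. a route changed */
    sock_send_udp(&so, "again", 5);     /* template invalidated, rebuilt */
    return (0);
}

A real implementation would hang the template off the socket (or inpcb) and bump the generation from the routing/ARP code, i.e. the same kind of invalidation rules mentioned above for fastforward.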
>> - another big bottleneck is the route lookup in ip_output() (between entries 51 and 56). Not only does it eat another 100ns+ on an empty routing table, but it also causes huge contention when multiple cores are involved.
> This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which [...]
I was wondering: is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets?
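As a side note on the locking change mentioned above: the pattern is the classic read-mostly one, where many concurrent lookups take a shared lock and only table updates take the exclusive lock. Below is a small userspace sketch using a pthread rwlock as a stand-in for the kernel's rmlock(9); the names (rt_lock, rt_table, rt_lookup, rt_update) are made up for illustration.

/*
 * Userspace sketch of the read-mostly pattern (not kernel code): a
 * pthread rwlock stands in for rmlock(9), and rt_table/rt_lookup/
 * rt_update are invented names for illustration only.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t rt_lock = PTHREAD_RWLOCK_INITIALIZER;
static int rt_table[256];               /* toy "routing table": dst -> nexthop */

/* Hot path: shared (read) lock, concurrent lookups do not block each other. */
static int
rt_lookup(int dst)
{
    int nh;

    pthread_rwlock_rdlock(&rt_lock);
    nh = rt_table[dst & 0xff];
    pthread_rwlock_unlock(&rt_lock);
    return (nh);
}

/* Rare path: exclusive (write) lock while the table is modified. */
static void
rt_update(int dst, int nexthop)
{
    pthread_rwlock_wrlock(&rt_lock);
    rt_table[dst & 0xff] = nexthop;
    pthread_rwlock_unlock(&rt_lock);
}

int
main(void)
{
    rt_update(10, 42);
    printf("nexthop for 10: %d\n", rt_lookup(10));
    return (0);
}

Compile with -pthread. Readers no longer block each other, only the rare rt_update(); an rmlock additionally avoids most of the reader-side cache-line traffic that a plain rwlock still incurs, which is what matters for the contention seen with multiple cores.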