| From | Sent On | Attachments |
|---|---|---|
| Poul-Henning Kamp | Nov 13, 2006 8:55 pm | |
| Luigi Rizzo | Nov 13, 2006 9:39 pm | |
| Colin Percival | Nov 13, 2006 9:43 pm | |
| Poul-Henning Kamp | Nov 13, 2006 9:44 pm | |
| Poul-Henning Kamp | Nov 13, 2006 9:47 pm | |
| John-Mark Gurney | Nov 13, 2006 11:07 pm | |
| Luigi Rizzo | Nov 14, 2006 7:42 am | |
| Poul-Henning Kamp | Nov 14, 2006 7:45 am | |
| Poul-Henning Kamp | Nov 14, 2006 8:40 am | |
| Ivan Voras | Nov 14, 2006 11:56 am | |
| Poul-Henning Kamp | Nov 14, 2006 12:03 pm | |
| Ivan Voras | Nov 14, 2006 12:13 pm | |
| Poul-Henning Kamp | Nov 14, 2006 3:08 pm | |
| Andre Oppermann | Nov 14, 2006 3:11 pm | |
| Andre Oppermann | Nov 14, 2006 3:26 pm | |
| Luigi Rizzo | Nov 14, 2006 3:27 pm | |
| Andre Oppermann | Nov 14, 2006 3:38 pm | |
| Poul-Henning Kamp | Nov 14, 2006 3:47 pm | |
| Andre Oppermann | Nov 14, 2006 4:08 pm | |
| John Polstra | Nov 14, 2006 5:23 pm | |
| John Polstra | Nov 14, 2006 5:28 pm | |
| Ricardo Nabinger Sanchez | Nov 14, 2006 6:29 pm | |
| Poul-Henning Kamp | Nov 14, 2006 6:31 pm | |
| Peter Jeremy | Nov 15, 2006 8:40 am | |
| Poul-Henning Kamp | Nov 15, 2006 9:04 am | |
| gn...@freebsd.org | Nov 15, 2006 1:21 pm | |
| John Baldwin | Nov 28, 2006 1:40 pm | |
| Julian Elischer | Nov 28, 2006 2:54 pm | |
| Julian Elischer | Nov 28, 2006 3:03 pm | |
| Poul-Henning Kamp | Nov 28, 2006 3:28 pm | |
| Ricardo Nabinger Sanchez | Nov 28, 2006 5:10 pm | |
| Poul-Henning Kamp | Nov 29, 2006 2:21 am | |
| Robert Watson | Nov 29, 2006 2:45 am | |
| Ricardo Nabinger Sanchez | Nov 29, 2006 8:10 am | |
| Matthew Dillon | Nov 29, 2006 11:24 am | |
| John Baldwin | Nov 29, 2006 11:29 am | |
| Poul-Henning Kamp | Nov 29, 2006 11:32 am | |
| Poul-Henning Kamp | Nov 29, 2006 11:45 am | |
| Matthew Dillon | Nov 29, 2006 1:17 pm | |
| Poul-Henning Kamp | Nov 29, 2006 1:23 pm | |
| Matthew Dillon | Nov 29, 2006 1:51 pm | |
| John Baldwin | Nov 29, 2006 1:55 pm | |
| Poul-Henning Kamp | Nov 29, 2006 2:00 pm | |
| Poul-Henning Kamp | Nov 29, 2006 2:01 pm | |
| Matthew Dillon | Nov 29, 2006 2:43 pm | |
| Matthew Dillon | Nov 29, 2006 2:59 pm | |
| M. Warner Losh | Nov 29, 2006 10:35 pm | |
| M. Warner Losh | Nov 29, 2006 10:47 pm | |
| Poul-Henning Kamp | Nov 29, 2006 10:53 pm | |
| M. Warner Losh | Nov 29, 2006 10:59 pm | |
| Poul-Henning Kamp | Nov 29, 2006 11:05 pm | |
| Hans Petter Selasky | Nov 30, 2006 1:42 am | |
| Ivan Voras | Nov 30, 2006 6:56 am | |
| John Baldwin | Nov 30, 2006 11:25 am | |
| Poul-Henning Kamp | Nov 30, 2006 1:56 pm | |
| John Birrell | Nov 30, 2006 3:33 pm | |
| Robert Watson | Nov 30, 2006 5:39 pm | |
| George V. Neville-Neil | Nov 30, 2006 9:04 pm | |
| Matthew Dillon | Dec 1, 2006 1:30 am | |
| Matthew Dillon | Dec 1, 2006 2:09 am | |
| Poul-Henning Kamp | Dec 1, 2006 2:21 am | |
| Matthew Dillon | Dec 1, 2006 10:39 am | |
| Julian Elischer | Dec 1, 2006 11:02 am | |
| M. Warner Losh | Dec 1, 2006 11:17 am | |
| Gary Palmer | Dec 1, 2006 2:33 pm |
| Subject: | a proposed callout API | |
|---|---|---|
| From: | Poul-Henning Kamp (ph...@phk.freebsd.dk) | |
| Date: | Nov 13, 2006 8:55:08 pm | |
| List: | org.freebsd.freebsd-arch | |
A number of problems have been identified with our current callout code and I have been thinking about and discussed various aspects with people during the EuroBSDcon2007 conference.
A lot of people are interested in this, so here is a quick sketch of what I'm thinking about:
The Problems
------------
1. We need better resolution than a periodic "hz" clock can give us. Highspeed networking, gaming servers and other real-time apps want this.
2. We "pollute" our call-wheel with tons of callouts that we know are unlikely to happen.
3. We have many operations on the callout wheel because certain callouts gets rearmed for later in the future. (TCP keepalives).
4. We execute all callouts on one CPU only.
5. Most of the specified timeouts are bogus, because of the imprecision inheret in the current 1/hz method of scheduling them.
and a number of other issues.
The proposed API
----------------
tick_t XXX_ns_tick(unsigned nsec, unsigned *low, unsigned *high); Caculate the tick value for a given timeout. Optionally return theoretical lower and upper limits to actual value,
tick_t XXX_s_tick(unsigned seconds) Caculate the tick value for a given timeout.
The point behind these two functions is that we do not want to incur a scaling operating at every arming of a callout. Very few callouts use varying timeouts (and for those, no avoidance is possible), but for the rest, precalculating the correct (opaque) number is a good optimization.
XXX_arm(struct xxx*, tick_t, func *, arg *, int flag, struct mtx *); Arm timer. Struct xxx must be zeroed before first call.
If mtx pointer is non-NULL, acq mutex before calling.
flags: XXX_REPEAT XXX_UNLIKELY
Arm a callout with a number of optional behaviours specified.
XXX_rearm(struct xxx*, tick_t) Rearm timer.
XXX_disarm(struct xxx*) Unarm the timer.
XXX_drain(struct xxx*) Drain the timer.
The functions above will actually be wrappers for a more generic set of the same family, which also takes a pointer to a callout-group.
This is so that we can have different groups of callouts, for instance one group for the/each netstack and one for the disk-I/O stuff etc.
Implementation
--------------
Behind the scenes, we will have to support hardware that only has a HZ style periodic interrupt but also hardware that can do deadline interrupts (like HPET).
Short callouts, less than 2 seconds, will be stored in a binary heap (A tree where a node is numerically lower than its parents.)
The depth of the heap is Log2(nodes) and there are very efficient ways to store and access the heap.
Locking will be with one mutex for the heap.
The top element which is always the next one to be executed, will be left in place during execution, so that any rescheduling (automatic or explicit) will only have to do as little work as necessary to trickle it down to the right place in the heap.
Rescheduling a callout is a matter of trickling it up- or downwards in the tree to a correct position.
For the long callouts, and for callouts unlikely to happen, a group of lists are used for storage.
Imagine the number of lists is four, we then label them with Tnow+2s, Tnow+8s, Tnow+32s and "the rest".
Armed callouts go into the first list that is labeled with a higher strike time.
If a callout is rescheduled to later, it's timeout is updated, but it is not moved in the list.
If a callout is cancled, it is removed from the list.
Twice per second, the first list is scanned, and any due callouts called and any callouts later than the lists strike time is moved into the right list.
When "Tnow+2s" rolls around, the lists are rotated to the left: the Tnow+8s becomes "Tnow+2s" etc.
The idea behind this (untried) scheme for the long callouts, is to distribute the callouts somewhat evenly in the lists, while maintaining only the relevant entries in the first list, the point being that most of them (TCP keepalive, CAM etc) will never happen, so spending time sorting them more than necessary is pointless. Obviously, this algorithm needs to be tested in practice and tuned/changed/discared depending on the results.
All numbers given are subject to tuning.
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 ph...@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.





