Subject: Re: async call gates
From:    Terry Lambert (tlam...@primenet.com)
Date:    Jun 30, 1999 6:00:14 pm
> Mr Lambert, I noticed you've been pushing for async call gates
> functionality in FreeBSD.  I have several assumptions about how this
> would work that make it expensive to accomplish; perhaps you or
> someone on this list can clarify by explanation and/or a URL to some
> paper or a book reference.
It's my design, from around 1993. I wanted to think of CPU's as resources which crawled over code, like spiders over a web. It steals liberally from the VMS idea of asynchronous system traps.
Peter's explanation is pretty much spot-on. The assumption that the call contexts (you called them 'threads') have to exist for the calls to be made in the first place is a wrong turn.
At the simplest level, there are three kinds of system calls:
1) Calls that will never block
2) Calls that will always block
3) Calls that may or may not block
It would be trivial to add a flag to the sysent structure for each system call to indicate what category it was in, if you had a mind to optimize behaviour for type #1 calls.
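A minimal sketch of what such a flag might look like (the sy_flags field
and the SY_* names below are my own illustration, not existing FreeBSD
code; the sysent of the day carries just an argument count and a function
pointer):

    /*
     * Illustrative only: tag each system call with its blocking category.
     */
    #define SY_NEVER_BLOCKS   0x01    /* type 1: never blocks          */
    #define SY_ALWAYS_BLOCKS  0x02    /* type 2: always blocks         */
    #define SY_MAY_BLOCK      0x04    /* type 3: may or may not block  */

    struct sysent {
            int        sy_narg;       /* number of arguments           */
            sy_call_t  *sy_call;      /* implementing function         */
            int        sy_flags;      /* blocking category (new field) */
    };

    /*
     * The async dispatch path could then run SY_NEVER_BLOCKS calls
     * synchronously, since no blocking context will ever be needed.
     */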
Now for each system call, there is some context:
1) The calling process' address space
2) Certain aspects of the calling process, unrelated to the address
   space, per se. This is a subset of the contents of the proc
   structure: credentials, PID, etc.
3) A kernel stack
4) A list of system resources, accumulated as they are held
This arrangement is symmetric between the kernel and user space, as we will see below.
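As a rough illustration, a per-call kernel context divorced from the proc
structure might be no more than a small record like the following (all
names here are mine, purely for illustration, not existing FreeBSD code):

    /*
     * Illustrative only.  Everything a blocked system call needs can
     * live in a record of its own rather than being welded to struct
     * proc.  "held_resource" is likewise a hypothetical type.
     */
    struct async_call_ctx {
            struct vmspace  *acc_vmspace;   /* 1) caller's address space */
            struct pcred    *acc_cred;      /* 2) credentials ...        */
            pid_t            acc_pid;       /*    ... and identity       */
            caddr_t          acc_kstack;    /* 3) kernel stack           */
            TAILQ_HEAD(, held_resource) acc_resources;
                                            /* 4) resources held so far  */
    };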
Now for the process model in user space.
Let's assume that all existing system calls which are of type "will always block" or of type "may or may not block" can now be executed asynchronously, and that this is not an attribute of a particular system call, but is an attribute of the call mechanism itself. We might also place in this category some of the "will never block" calls, if they take "a long time" to complete.
This, in a nutshell, is the "async call gate".
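To make the idea concrete, the user-visible side of such a gate might
look something like the following; the acg_* names and the ticket/poll
details are invented here for illustration, and are not a real FreeBSD
interface:

    /*
     * Hypothetical user-level view of an async call gate.  Any system
     * call can be entered through the gate; the return value is a
     * ticket that can later be polled or waited on.
     */
    typedef int acg_ticket_t;

    #define ACG_PENDING     (-2)    /* call has not completed yet */

    /* Submit syscall `num` with its usual arguments; never blocks.    */
    acg_ticket_t    acg_submit(int num, ...);

    /* Poll (or block waiting) for completion; on completion, *result
     * and *error carry the call's return value and errno.             */
    int             acg_poll(acg_ticket_t t, int *result, int *error);
    int             acg_wait(acg_ticket_t t, int *result, int *error);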
This provides us with two interesting concurrency models which we can use in user space:
1) We can have a process with multiple outstanding asynchronous system calls pending simultaneously.
This is the "aio" model, except we are not limited to the POSIX subset "aioread" and "aiowrite". We can, instead, take "takes a long time" operations, such as "open", "sync", "getdents", "connect", "syslog", and so on, and also make *them* asynchronously.
2) We can implement on top of the asynchronous (non-blocking) calls, a set of primitives that emulate blocking calls.
This emulation is done by making an async call and then putting the caller to sleep until the async call returns. Rather than giving up our quantum, however, we change to another set of registers, another program counter, and another stack.
This implementation is called "user space threading using a call conversion scheduler".
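A stripped-down sketch of one such emulated call, written against the
hypothetical acg_* interface above (the uthread_yield_to_ready() hook is
likewise an assumption about the threads library, not an existing API):

    #include <sys/types.h>
    #include <sys/syscall.h>        /* SYS_open */
    #include <errno.h>

    extern void uthread_yield_to_ready(void);   /* assumed library hook */

    /*
     * Blocking open() emulated by the call conversion scheduler: issue
     * the call asynchronously, then run other ready user space threads
     * (switching registers, PC, and stack) until it completes.
     */
    int
    open(const char *path, int flags, mode_t mode)
    {
            acg_ticket_t t;
            int result, error;

            t = acg_submit(SYS_open, path, flags, mode);

            while (acg_poll(t, &result, &error) == ACG_PENDING)
                    uthread_yield_to_ready();   /* keep our quantum busy */

            if (result == -1)
                    errno = error;
            return (result);
    }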
The advantage of this threads model over a kernel threads model is that, so long as we have work pending, we can utilize the full measure of our CPU quantum without taking a full context switch overhead. This was the original promise that the threads model made to us when it first appeared, and then reneged upon when threads moved into the kernel.
The main problem with kernel threads is that kernel threads are not really threads, per se, they are KSE's -- Kernel Schedulable Entities. We call them KSE's because the scheduler makes no distinction between the work-to-do of a thread and that of a traditional process. The result is that kernel threads compete for quantum as processes, both with other threads in the same thread group ("multithreaded process"), and with other threads in other thread groups (a thread group with but a single member is a traditional UNIX process).
The main benefit of kernel threads is SMP scalability for programmer parallelized processes: processes which have been intentionally multithreaded (as opposed to using asynchronous calls) in order to increase their concurrency.
There is really no benefit (and in fact, a large amount of cache busting and context switch overhead) to using kernel threads as opposed to asynchronous calls in order to achieve SMP scaling. It is a lot of work to try and get the scheduler to intentionally minimize address space changes. You can do opportunistic state change avoidance when you are switching between one thread in a process group and another thread in the same process group, but that really buys you very little. More complex solutions lead to starvation deadlocks and other bad behaviour.
This leaves one area where kernel threads still have better SMP scalability: when most of the cycles are spent in user space code.
Right now, in SMP FreeBSD, each CPU can be in user space at the same time; in fact, given the Big Giant Lock(tm), multiple CPU's can only be in the kernel simultaneously under some rigidly controlled circumstances. BSDI gets around the reentrancy issue for interrupts by moving the interrupt handling to kernel threads; that's one way to do it, but it has some really bad behaviour compared to "top" and "bottom" half drivers, as in NT's and Solaris' SMP implementations.
What kernel threads promise, and what make people willing to put up with their obvious drawbacks, is the ability for multiple CPU's to be in user space _in the same process_ at the same time.
This really impresses some people, when it probably shouldn't.
This seems to really be a feat of legerdemain; in fact, it's not that hard, and you don't need kernel threads to get there, only the ability to queue scheduler reservations for user space code already in a ready-to-run state. This is best accomplished with a user space hook into a scheduler activation mechanism. We can see this rather handily by asking ourselves "What is a user space process?"; the answer is:
1) A set of registers, including:
   o A stack
   o A program counter
2) An address space
In other words, the same mechanism that is used by the call conversion to switch between one "blocked" user space thread and another "ready to run" user space thread can be used to dispatch a second (or third or fourth or Nth) CPU into the user space code as well. All we are missing is a tiny bit of glue in the user space call conversion scheduler. This glue says:
1) When you get a chance, "return" a quantum to user space to this address, which happens to be my user space scheduler dispatcher.
2) I have an aversion to this request being enqueued on any of the ready-to-run lists for CPU's where I already have one of these requests enqueued, or which are currently running in my user space at my behest.
The first is trivial to code up: it's called "vfork". One might make certain optimizations based on the system call architecture divorcing the kernel stack and other context from the process structure itself; I would advise this. The result is significantly lighter weight than the current vfork, since the entirety of the context is actually just the user space context. The only real "trick" is to ensure that your scheduler dispatcher has a stack available for every one of these which may return to user space; you wouldn't want to use a handy user space thread's stack for this, since you want to eliminate quantum bias.
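For the stack requirement, the user space side might simply reserve a
dedicated dispatcher stack per activation slot; a rough sketch, with all
names invented here for illustration:

    /*
     * Hypothetical: one pre-allocated stack per activation slot, so a
     * "returned" quantum never lands on a stack that some blocked user
     * space thread already owns (and never biases its quantum).
     */
    #define MAX_ACTIVATIONS         32
    #define DISPATCH_STACK_SIZE     8192

    static char dispatch_stack[MAX_ACTIVATIONS][DISPATCH_STACK_SIZE];

    extern void *uthread_pick_ready(void);      /* assumed library hooks */
    extern void  uthread_run(void *thread);

    void
    scheduler_dispatch(int slot)
    {
            /* Entered on dispatch_stack[slot] when the kernel "returns"
             * a quantum; pick a ready user space thread and run it.    */
            uthread_run(uthread_pick_ready());
    }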
The second takes a bit more work. A very rough implementation is trivial, however: a 32 bit bitmask associated with the process context, with one bit per available processor.
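That bitmask scheme might look roughly like this (field and function
names are again mine, for illustration only):

    /*
     * Hypothetical per-process record of which CPUs already have one of
     * our scheduler reservations enqueued, or are already running in
     * our user space.
     */
    struct activation_mask {
            u_int32_t am_cpumask;           /* one bit per processor */
    };

    /* Enqueue a reservation only on a CPU not already working for us. */
    static int
    activation_try_cpu(struct activation_mask *am, int cpu)
    {
            if (am->am_cpumask & (1U << cpu))
                    return (0);             /* already have one there */
            am->am_cpumask |= (1U << cpu);
            return (1);
    }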
What's the result? The result is a highly SMP scalable multiple CPU utilizing user space threading system, which neatly sidesteps the context switch address space thrashing and N:M kernel:user space thread mapping problems.
The kernel reentrancy issues for fault/call/interrupt can be handled separately, without a lot of overly complicated (by kernel threads) effort. Ideally, the BSDI approach would *NOT* be used; it's dated. Here's a reference from 1991:
    Lazy Task Creation: A Technique for Increasing the Granularity of
    Parallel Programs
    E. Mohr, D.A. Kranz, and R.H. Halstead, Jr.
    IEEE Transactions on Parallel and Distributed Systems,
    July 1991, pages 264-280
And there are other references (which I don't have immediately at hand) showing this approach to be more than a decade and a half behind the current state of the art. Some people have already referenced more recent work on the FreeBSD SMP list. Late-binding resources is a useful technique, but it's not the only technique, and for the particular implementation (interrupts -- yes, I know that I'm suggesting lazy binding of blocking contexts when I suggest an async call gate), it's probably not the best technique.
Terry Lambert ter...@lambert.org
--- Any opinions in this posting are my own and not those of my present or previous employers.