Subject: Re: [Python-ideas] [Python-Dev] PyParallel: alternate async I/O and GIL removal
From: Trent Nelson
Date: Nov 17, 2013 2:34:09 pm

(I saw that there were a number of additional e-mails echo'ing Guido's sentiment/concerns re: shared nothing. I picked this thread to reply to and tried to provide as much info as possible in lieu of replying to everyone individually.)

On Sat, Nov 16, 2013 at 06:56:11PM -0800, Guido van Rossum wrote:

> I wish you had spent more time on explaining how IOCP works and less on judging other approaches.

Heh, it's funny, with previous presentations, I didn't labor the point anywhere near as much, and I found that when presenting to UNIX people, they were very defensive of the status quo.

I probably over-compensated a little in the opposite direction this time. I don't think anyone is going to argue vehemently that the UNIX status quo is optimal on Windows; but a side-effect is that it unnecessarily slanders existing bodies of work (Twisted et al.) that have undeniably improved the overall ecosystem over the past decade.

> Summarizing my understanding of what you're saying, it seems the "right" way to use IOCP on a multi-core machine is to have one thread per core (barring threads you need for unavoidably blocking stuff) and to let the kernel schedule callbacks on all those threads. As long as the callbacks don't block and events come in at a rate to keep all those cores busy this will be optimal.

The only thing I'd add is that, when speaking in terms of socket servers and whatnot, it helps to visualize Python callbacks as "the bits of logic that need to run before invoking the next asynchronous call".

Anything I/O related can be done via an asynchronous call; that's basically the exit point of the processing thread -- it dispatches the async WSARecv() (for example), then moves onto the next request in the I/O completion port's queue.

When that WSARecv() returns, we get all the info we need from the completion context to figure out what we just did, and, based on the protocol we provided, what needs to be done next.

So, we do a little more pure Python processing and then dispatch the next asynchronous call, which, in this case, might be a WSASend(); the thread then moves onto the next request in the queue.

That's all handled by the PxSocket_IOLoop monstrosity:

I got the inspiration for that implementation from CEval_FrameEx; you basically have one big inline method where you can go from anything to anything without having to call additional C functions; thus, doing back-to-back sends, for example, won't exhaust your stack.
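To make the shape of that loop concrete, here's a rough Python model of it (illustrative only -- the Completion class and worker() names are invented, and the real PxSocket_IOLoop is a single inlined C function driving real overlapped I/O):

```python
import queue
import threading

# Illustrative model of the completion-port worker loop: each
# completion packet carries enough context to pick up where the last
# async call left off.  Completion/worker are invented names.

class Completion:
    def __init__(self, op, data, protocol):
        self.op = op              # e.g. 'recv_complete', 'send_complete'
        self.data = data
        self.protocol = protocol

completion_port = queue.Queue()   # stands in for the kernel's IOCP queue

def worker():
    while True:
        c = completion_port.get()
        if c is None:             # sentinel: shut this worker down
            break
        if c.op == 'recv_complete':
            # run the bit of Python that decides what happens next...
            reply = c.protocol.data_received(c.data)
            # ...then dispatch the next "async" call (a send here) and
            # immediately go back for the next completion packet
            completion_port.put(Completion('send_complete', reply, c.protocol))
        elif c.op == 'send_complete':
            pass                  # nothing left to do for this client
        completion_port.task_done()
```

In the real thing the kernel, not a Python queue, hands the next packet to whichever of the per-core threads is free.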

That allows us to do the dynamic switch between sync and async depending on protocol preference, current client load, number of active I/O hogs, that sort of thing:

PxSocket_IOLoop currently only handles 1:1 TCP/IP connections, which limits its applicability. I want to expand that -- I should be able to connect any sort of end points together in any fashion -- similar to how ZeroMQ allows the bridge/fan-out/router type composition.

An endpoint would be anything that allows me to initiate an async operation against it, e.g. file, device, socket, whatever. This is where Windows really shines, because you can literally do everything either synchronously or asynchronously. There should also be support for 1:m and m:n relationships between endpoints (i.e. an IRC chat server).

So I see PxSocket_IOLoop turning into a more generic PxThread_Loop that can handle anything-to-anything -- basically, calling the Python code that needs to run before dispatching the next async call.

The current implementation also does a lot of live introspection against the protocol object to figure out what to do next; i.e. first entry point for a newly-connected client is here:

At every entry point into the loop, and at every point *after* the relevant Python code has been run, we're relying on the protocol to tell us what to do next in a very hard-coded fashion.

I think for PxThread_Loop to become truly dynamic, it should mirror CEval_FrameEx even closer; the protocol analysis should be done separately, the output of which is a stream of async-opcode bytes that direct the main dispatching logic:

dispatch:
    switch (next_opcode) {
        TARGET(maybe_shutdown_send_or_recv);
        TARGET(handle_error);
        TARGET(connection_made_callback);
        TARGET(data_received_callback);
        TARGET(send_complete_callback);
        TARGET(overlapped_recv_callback);
        TARGET(post_callback_that_supports_sending_retval);
        TARGET(post_callback_that_does_not_support_sending_retval);
        TARGET(close_);
        TARGET(try_send);

        default:
            break;
    }

Then we'd have one big case statement just like with CEval_FrameEx that handles all possible async-opcodes, rather than the goto spaghetti in the current PxSocket_IOLoop.

The async opcodes would be generic and platform-independent; i.e. file write, file read, single socket write, multi-socket write, etc.

On Windows/Solaris/AIX, everything could be handled asynchronously; on other platforms, you would have to fake it using an event loop + multiplex method, identical to how twisted/tornado/tulip do it currently.
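On the faked side, a "single socket read" opcode would bottom out in a readiness multiplexer: wait until the fd is readable, then do the read synchronously. A minimal sketch using the stdlib selectors module (the 'single_socket_read' tag is an invented stand-in for an async opcode):

```python
import selectors
import socket

# Readiness-based fallback: the selector only says "this fd is ready",
# so the I/O itself still happens synchronously afterwards -- unlike
# IOCP, where the kernel tells you the I/O has already completed.

sel = selectors.DefaultSelector()
r, w = socket.socketpair()
sel.register(r, selectors.EVENT_READ, data='single_socket_read')

w.sendall(b'ping')

payload = None
for key, events in sel.select(timeout=1):
    if key.data == 'single_socket_read':
        payload = key.fileobj.recv(4)   # the read happens after readiness

sel.close()
r.close()
w.close()
```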

> But this is almost tautological. It only works if the threads don't communicate with each other or with the main thread (all shared data must be read-only). But heh, if that's all, one process per core works just as well. :-)

Ok, so, heh, I lied in the presentation. The main thread won't be frozen per-se, and the parallel threads will have a way to share state. I've already done a huge amount of work on this, but it's very involved and that presentation was long enough as it is.

Also, it's easier to understand why reference counting and GC isn't needed in parallel contexts if you just assume the main thread isn't running.

In reality, one of the first things I had to figure out was how these parallel contexts could communicate state back to the main thread -- because without this ability, how the heck would you propagate an exception raised in a parallel thread back to the main thread? The exception will be backed by memory allocated in the parallel context -- that can't be freed until the exception has been dealt with and no references to it remain.

As that was one of the first problems I had to solve, it has one of the hackiest implementations :-) The main thread's async.run_once() implementation can detect which parallel threads raised exceptions (because they've done an interlocked push to the main thread's error list) and it will extend the lifetime of the context for an additional number of subsequent runs of run_once(). Once the TTL of the context drops to 0, it is finally released.

The reason it's hacky is because there's no direct correlation between the exception object finally having no references to it and the point we destroy the context. If you persisted the exception object to a list in the main thread somewhere, you'd segfault down the track when trying to access that memory.
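In pseudo-Python, the TTL hack looks something like this (ParallelContext, error_list and ERROR_TTL are invented names for illustration):

```python
# A context that raised an exception is kept alive for a fixed number
# of run_once() passes, then freed whether or not the exception object
# is still referenced -- which is exactly why the scheme can dangle.

ERROR_TTL = 3   # how many run_once() passes keep an erroring context alive

class ParallelContext:
    def __init__(self, exc):
        self.exc = exc          # backed by this context's heap in reality
        self.ttl = ERROR_TTL
        self.freed = False

error_list = []                 # parallel threads interlocked-push onto this

def run_once():
    for ctx in list(error_list):
        ctx.ttl -= 1
        if ctx.ttl == 0:
            ctx.freed = True    # heap released; exc memory now invalid
            error_list.remove(ctx)
```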

So, on the second iteration, I came up with some new concepts; context persistence and object promotion. A main-thread list or dict could be async protected such that this would work:

# main thread
d1 = {}
l1 = []
async.protect(d1)
async.protect(l1)

# this would also work
d2 = async.dict()
l2 = async.list()

# fyi: async.rdtsc() returns a PyLong wrapped
# version of the CPU TSC; handy for generating
# non-interned objects allocated from a parallel
# context

def callback(name):
    d1[name] = async.rdtsc()
    l1.append(async.rdtsc())

async.submit_work(callback, 'foo')
async.submit_work(callback, 'bar')
async.submit_work(callback, 'moo')

That actually works; the async.protect() call intercepts the object's tp_as_mapping and tp_as_sequence fields and redirects them to thread-safe versions that use read/write locks. It also toggles a persistence bit on both the parallel context and the parallel long object, such that reference counting *is* actually enabled on it once it's back in the main thread -- when the ref count drops to 0, we check to see if it's an object that's been persisted, and if so, we decref the original context -- when the context's refcount gets to 0, only *then* do we free it.
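A rough pure-Python analogue of what the protection buys you (ProtectedDict is just an illustration with a plain mutex; the real implementation swaps the tp_as_mapping/tp_as_sequence slots at the C level and uses read/write locks):

```python
import threading

# Wrap a dict's mutating entry points behind a lock so parallel
# callbacks can safely assign into a main-thread container.

class ProtectedDict:
    def __init__(self, d):
        self._d = d
        self._lock = threading.Lock()

    def __setitem__(self, key, value):
        with self._lock:            # serialize parallel writers
            self._d[key] = value

    def __getitem__(self, key):
        with self._lock:            # an rwlock would allow shared readers
            return self._d[key]
```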

(I also did some stuff where you could promote simple objects where it made sense -- i.e. there's no need to keep a 4k context around if the end result was a scalar that could be represented in 50-200 bytes; just memcpy it from the main thread ("promote it to a main thread object with reference counting") and free the context.)

You can see some examples of the different type of stuff you can do here:

The problem though was that none of my unit tests assigned more than ten items to a list/dict, so I never encountered a resize :-)

You can imagine what happens when a resize takes place within a parallel context -- the list/dict is realloc'd using the parallel context heap allocator -- that's not ideal, it's a main thread object, it shouldn't be reallocated with temporary parallel thread memory.

I think that was the point where I went "oh, bollocks!" and switched over to tackling the async socket stuff.

However, the async socket work forced me to implement all sorts of new concepts, including the heap snapshots and TLS heap overrides (for interned strings).

Pair that with the page locking stuff and I have a much richer set of tools at my disposal to solve that problem -- I just need to completely overhaul everything memory related now that I know how it needs to be implemented :-)

As for the dict/list assignment/resize, the key problem is figuring out whether a PyObject_Realloc call is taking place because we're resizing a main thread container object -- that's not an easy thing to figure out -- all you have is a pointer at the time you need to make the decision.

That's where the memory refactoring work comes in -- I'm still working on the details, but the general idea is that you'll be able to do very efficient pointer address tests against known base address masks to figure out the origin of the object and how the current memory request needs to be satisfied.
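Something along these lines, assuming every heap base sits at a known alignment and each heap spans at most that alignment (all constants here are made up for illustration):

```python
# Masking off the low bits of any pointer recovers which heap -- and
# hence which allocator -- owns it, with no lookup structure needed.

HEAP_ALIGNMENT     = 1 << 20        # heap bases 1MB-aligned, heaps <= 1MB
MAIN_HEAP_BASE     = 0x10000000     # main-thread (refcounted) heap
PARALLEL_HEAP_BASE = 0x20000000     # a parallel context's heap

def heap_of(ptr):
    base = ptr & ~(HEAP_ALIGNMENT - 1)
    if base == MAIN_HEAP_BASE:
        return 'main'
    if base == PARALLEL_HEAP_BASE:
        return 'parallel'
    return 'unknown'
```

So a PyObject_Realloc on a pointer whose base resolves to the main-thread heap could be serviced there instead of via the parallel context's allocator.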

The other option I played around with was an interlocked list type that is exposed directly to Python:

x = async.xlist()

def callback():
    x.push(async.rdtsc())

for _ in xrange(0, 10):
    async.submit_work(callback)

# interlocked flush of all results into a list.
results = x.flush()

The key difference between an interlocked list and a normal list is that an interlocked list has its very own localized heap, just like parallel contexts have; pushing a scalar onto the list automatically "promotes it". That is, the object is memcpy'd directly using the xlist's heap, and we can keep that heap alive independently to the parallel contexts that pushed objects onto it.

I was also planning on using this as a waitable queue, so you could compose pipelines of producers/consumers and that sort of thing. Then I ran out of time :-)
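A toy model of the xlist semantics (XList is an illustrative stand-in; the C version uses an interlocked singly-linked list and copies pushed objects into the list's own heap, for which deepcopy plays the role here):

```python
import copy
import threading

# Pushed objects are "promoted": copied into storage the list itself
# owns, so results outlive the parallel contexts that produced them.

class XList:
    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def push(self, obj):
        with self._lock:
            self._items.append(copy.deepcopy(obj))

    def flush(self):
        with self._lock:    # atomic swap, like the interlocked flush
            items, self._items = self._items, []
        return items
```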

> I don't really care how well CHARGEN (I had to look it up) scales. For HTTP, it's great for serving static contents from a cache or from the filesystem, but if that's all you serve, why use Python? Real web apps use intricate combinations of databases, memcache, in-memory cache, and template expansion. The biggest difference you can make there is probably getting rid of the ORM in favor of more direct SQL, and next on the list would be reimplementing template expansion in C. (And heck, you could release the GIL while you're doing that. :-)

Agree with the general sentiment "if that's all you're doing, why use Python?". The async HTTP server should allow other things to be built on top of it such that it's adding value over and above, say, an apache instance serving static files.

And in hindsight, perhaps I need to put more emphasis on the fact that it *is* very experimental work with a long-term view, versus Tulip/asyncio, which was intended for *now*. So although Tulip and PyParallel spawned from the same discussions and are attempting to attack the same problem -- it's really not fair for me to discredit Tulip/Twisted in favor of PyParallel because they're on completely different playing fields with vastly different implementation time frames (I'm thinking 5+ years before this work lands in a mainstream Python release -- if it ever does. And if not, hey, it can live on as another interpreter, just like Stackless et al).

> I would love it if you could write a list of things a callback *cannot* do when it is in parallel mode. I believe that list includes mutating any kind of global/shared state (any object created in the main thread is read-only in parallel mode -- it seems you had to work hard to make string interning work, which is semantically transparent but mutates hidden global state). In addition (or, more likely, as a consequence!) a callback cannot create anything that lasts beyond the callback's lifetime, except for the brief time between the callback's return and the completion of the I/O operation involving the return value. (Actually, I missed how you do this -- doesn't this mean you cannot release the callback's heap until much later?)

So, I think I already answered that above. The next presentation (PyCon Montreal) will be purely focused on this stuff -- I've been beating the alternate approach to async I/O for long enough ;-)

> So it seems that the price for extreme concurrency is the same as always -- you can only run purely functional code. Haskell fans won't mind, but for Python this seems to be putting the cart before the horse -- who wants to write Python with those constraints?

Basically it's all still work in progress, but the PyParallel-for-parallel-compute use case is very important. And there's no way that can be done without having a way to return the results of parallel computation back into the next stage of your pipeline where more analysis is done.

Getting hired by Continuum is actually great for this use case; we're in the big data, parallel task decomposition space, after all, not the writing-async-socket-server business ;-) I know Peter and Travis are both very supportive of PyParallel, so it's just a matter of trying to find time to work on it between consultancy engagements.