Gordon Messmer writes:
Unfortunately, I don't have a 'ps' list from the time that the problem
was occurring to back me up on this, but I recall that when I looked at
'ps axf', I did not see any courierlocal processes.
You wouldn't see one courierlocal process for each local delivery. There
will always be a single courierlocal, running as root. For each local
delivery it forks, drops root, and runs courierdeliver.
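To illustrate the general fork, drop root, exec pattern described above, here is a minimal Python sketch. It is not Courier's actual code: the function name run_delivery and its uid/gid parameters are hypothetical, and the privilege drop only takes effect when the caller is root and passes explicit IDs.

```python
import os
import sys

def run_delivery(argv, uid=None, gid=None):
    """Fork a child, optionally drop privileges in it, then exec argv.

    Hypothetical helper for illustration only. Returns the child's exit
    code, or the negated signal number if the child was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: must never fall back into the parent's code path.
        try:
            if gid is not None:
                os.setgroups([])  # clear supplementary groups (requires root)
                os.setgid(gid)    # change group while still privileged
            if uid is not None:
                os.setuid(uid)    # give up root last; cannot be undone
            os.execvp(argv[0], argv)
        except OSError:
            pass
        os._exit(127)             # privilege drop or exec failed
    # Parent process: stays as it was and only waits for the child.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)

if __name__ == "__main__":
    # Run a trivial child without changing uid/gid.
    code = run_delivery([sys.executable, "-c", "raise SystemExit(0)"])
    print("child exit code:", code)
```

The point of the pattern is that only the short-lived child ever runs under the recipient's uid, which is why a process listing shows one long-lived parent rather than one parent per delivery.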
I'm at a loss to explain the problem, as clearly the kernel should have
been returning the fd's for the courierlocal process when courierd was
calling select() on them, and it also seems clear that it did not (or
those fd's were not in the set ??).
In any case, I'd like to propose a max delivery lifetime on any module,
such that if a module exceeds the configured lifetime, the courierd
"panics", and restarts itself from scratch, without trying to shut down
active deliveries.
I'm not sure that this is the right thing, but I'm also not sure that
there is a "right thing" in cases like this. Is there a better
solution? Would such a patch be accepted, if it was sufficiently clear?
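As a rough sketch of what a "max delivery lifetime" check around select() could look like, the following Python example waits on a set of descriptors until one becomes readable or a deadline passes. It is an assumption-laden illustration, not courierd's implementation: wait_for_modules and max_lifetime are hypothetical names, and the sketch only detects the expiry; what to do afterwards (panic, restart) is left open.

```python
import os
import select
import time

def wait_for_modules(fds, max_lifetime):
    """Wait until any fd in fds is readable or max_lifetime seconds elapse.

    Hypothetical helper for illustration only. Returns the list of readable
    descriptors; an empty list means the lifetime was exceeded.
    """
    deadline = time.monotonic() + max_lifetime
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return []  # lifetime exceeded with nothing readable
        # select() blocks for at most `remaining` seconds.
        readable, _, _ = select.select(fds, [], [], remaining)
        if readable:
            return readable

if __name__ == "__main__":
    r, w = os.pipe()
    # Nothing has been written yet, so this call runs into the deadline.
    print("expired:", wait_for_modules([r], 0.2) == [])
    os.write(w, b"done")
    # Now the read end is readable and is returned immediately.
    print("readable:", wait_for_modules([r], 0.2) == [r])
```

Using a monotonic clock keeps the deadline stable even if the system time is adjusted while the loop is waiting.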
I'd rather have a better idea what the problem is, before trying to fix it.
Since this sounds like a freak occurrence so far, you'll get better results
by keeping a close eye on things and waiting patiently until this happens
again, then doing a quick investigation, getting some good data, and
restarting everything to get back in business.
What are you doing for local deliveries? If you're letting people install
arbitrary scripts, this can cause a lot of mischief even though they're
running scripts under their own uid.