| From | Sent On | Attachments |
|---|---|---|
| Marcus Bianchy | Aug 22, 2008 4:50 am | |
| Maxim Dounin | Aug 22, 2008 8:06 am | |
| Igor Sysoev | Aug 23, 2008 12:04 am | |
| Marcus Bianchy | Aug 23, 2008 4:17 am | |
| Igor Sysoev | Aug 26, 2008 9:25 am | |
| Marcus Bianchy | Aug 26, 2008 11:53 am | |
| Igor Sysoev | Aug 26, 2008 12:03 pm | |
| Subject: | Re: Question about http_stub_status | |
|---|---|---|
| From: | Igor Sysoev (is-G...@public.gmane.org) | |
| Date: | Aug 26, 2008 9:25:01 am | |
| List: | ru.sysoev.nginx | |
On Sat, Aug 23, 2008 at 01:18:07PM +0200, Marcus Bianchy wrote:
This means that either someone killed nginx workers using SIGTERM/INT/KILL or workers exited abnormally. Could you run
grep alert error_log
Well, I can say that no one on our team sends such signals around... We have been observing strange signal 8 (SIGFPE) errors lately. A typical "grep/zgrep signal" of our error.logs shows something like this:
############ snip ############
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 10:09:42 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:11 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:11 [alert] 28631#0: worker process 27811 exited on signal 8
2008/08/22 12:58:11 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:20 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:20 [alert] 28631#0: worker process 785 exited on signal 8
2008/08/22 12:58:20 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:59:36 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:59:36 [alert] 28631#0: worker process 1342 exited on signal 8
2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 13:00:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 13:00:06 [alert] 28631#0: worker process 1343 exited on signal 8
2008/08/22 13:00:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/23 04:02:18 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
2008/08/23 04:02:18 [notice] 28631#0: signal 29 (SIGIO) received
################## snip #############
The logrotate runs at 04:00 in the morning, which would explain the SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at around 1pm; neither the access.log nor the error.log gives any hint as to what produces this behaviour. And guess what: yesterday at 1pm the values for active/waiting connections increased to ~30000/35000.
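To see at a glance when the worker exits cluster, the alert lines in an excerpt like the one above could be tallied per hour and signal. The following is only a minimal sketch: the function name `tally_worker_exits` and the `sample` string are illustrative assumptions, and the regular expression is written against the line format shown in this thread, not against any documented nginx log specification.

```python
import re
from collections import Counter

# Matches master-process alerts in the form seen in this thread, e.g.
# 2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
# (pattern inferred from the excerpt; an assumption, not an official format)
EXIT_RE = re.compile(
    r"(?P<date>\d{4}/\d{2}/\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} \[alert\] "
    r"\d+#\d+: worker process (?P<pid>\d+) exited on signal (?P<sig>\d+)"
)


def tally_worker_exits(log_text):
    """Count worker exits per (date, hour, signal number)."""
    counts = Counter()
    for m in EXIT_RE.finditer(log_text):
        key = (m.group("date"), m.group("hour"), int(m.group("sig")))
        counts[key] += 1
    return counts


# A few lines copied from the excerpt, used here only as sample input.
sample = """\
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
"""

for (date, hour, sig), n in sorted(tally_worker_exits(sample).items()):
    print(f"{date} {hour}:00 signal {sig}: {n} worker exit(s)")
```

In practice the text would be read from the real error log file instead of the `sample` string; notice lines are ignored because only `[alert]` entries match.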
Maybe it would be a good idea to enable core dumps to find out exactly what causes these signals?
It seems that you have "max_fails=0" in some upstream. Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
-- Igor Sysoev http://sysoev.ru/en/
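Since the diagnosis above points at "max_fails=0" in an upstream, one quick way to locate such entries might be to scan the configuration text for `server` lines carrying that parameter. This is a rough sketch only: the helper name `find_zero_max_fails`, the regular expression, and the sample configuration (including its addresses) are hypothetical and not taken from the thread.

```python
import re

# A `server` directive that carries max_fails=0. Commented-out lines are
# skipped because the directive must be the first token on its line.
ZERO_FAILS_RE = re.compile(
    r"^\s*server\s+[^;#]*\bmax_fails=0\b[^;]*;", re.MULTILINE
)


def find_zero_max_fails(config_text):
    """Return the server directives in config_text that set max_fails=0."""
    return [m.group(0).strip() for m in ZERO_FAILS_RE.finditer(config_text)]


# Hypothetical configuration used only to exercise the helper.
sample_conf = """\
upstream backend {
    server 10.0.0.1:8080 max_fails=0;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=10s;
    # server 10.0.0.4:8080 max_fails=0;
    server 10.0.0.3:8080;
}
"""

for line in find_zero_max_fails(sample_conf):
    print(line)
```

A text scan like this does not follow `include` directives, so every included file would need to be checked separately.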





