Re: Question about http_stub_status (7 messages in ru.sysoev.nginx)
Messages in thread (from, sent on):
Marcus Bianchy, Aug 22, 2008 4:50 am
Maxim Dounin, Aug 22, 2008 8:06 am
Igor Sysoev, Aug 23, 2008 12:04 am
Marcus Bianchy, Aug 23, 2008 4:17 am
Igor Sysoev, Aug 26, 2008 9:25 am
Marcus Bianchy, Aug 26, 2008 11:53 am
Igor Sysoev, Aug 26, 2008 12:03 pm
Subject:Re: Question about http_stub_status
From:Igor Sysoev (is-G@public.gmane.org)
Date:Aug 26, 2008 9:25:01 am
List:ru.sysoev.nginx

On Sat, Aug 23, 2008 at 01:18:07PM +0200, Marcus Bianchy wrote:

This means that either someone killed the nginx workers using SIGTERM/INT/KILL or the workers exited abnormally. Could you run

grep alert error_log

Well, I can say that no one on our team sends such signals around... Lately we have been observing strange signal 8 (SIGFPE) errors. A typical "grep/zgrep signal" of our error logs shows something like this:

############ snip ############
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 10:09:42 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:11 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:11 [alert] 28631#0: worker process 27811 exited on signal 8
2008/08/22 12:58:11 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:20 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:20 [alert] 28631#0: worker process 785 exited on signal 8
2008/08/22 12:58:20 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:59:36 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:59:36 [alert] 28631#0: worker process 1342 exited on signal 8
2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 13:00:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 13:00:06 [alert] 28631#0: worker process 1343 exited on signal 8
2008/08/22 13:00:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/23 04:02:18 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
2008/08/23 04:02:18 [notice] 28631#0: signal 29 (SIGIO) received
################## snip #############

Logrotate runs at 04:00 in the morning, which would explain the SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at around 1 pm; neither access.log nor error.log gives any hint about what causes this behaviour. And guess what: yesterday at 1 pm the active/waiting connection counts increased to ~30000/35000.
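To pull the worker crashes out of both current and logrotate-compressed error logs and see how they cluster in time, a small script along these lines could help. This is a minimal sketch assuming Python 3 and the error_log line format shown in the excerpt above; the helper names (`open_log`, `worker_exits`, `signal_name`, `exits_by_hour`) are hypothetical, not part of nginx. Signal numbers are mapped to names with the standard `signal` module, so the names reflect the platform the script runs on (on Linux, 8 is SIGFPE).

```python
import gzip
import re
import signal
from collections import Counter

# Assumed format of an nginx master-process alert, e.g.:
# 2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
EXIT_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[alert\] \d+#\d+: "
    r"worker process (?P<pid>\d+) exited on signal (?P<sig>\d+)"
)


def open_log(path):
    """Open a plain or gzip-rotated log as text (the grep vs. zgrep case)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def worker_exits(lines):
    """Yield (timestamp, pid, signal_number) for each abnormal worker exit."""
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            yield m["ts"], int(m["pid"]), int(m["sig"])


def signal_name(num):
    """Map a signal number to its name on this platform, falling back to the number."""
    try:
        return signal.Signals(num).name
    except ValueError:
        return f"signal {num}"


def exits_by_hour(lines):
    """Count worker exits per 'YYYY/MM/DD HH' bucket to spot clusters."""
    return Counter(ts[:13] for ts, _pid, _sig in worker_exits(lines))


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open_log(path) as f:
            lines = list(f)
        for ts, pid, sig in worker_exits(lines):
            print(ts, pid, signal_name(sig))
        for hour, count in sorted(exits_by_hour(lines).items()):
            print(hour, count)
```

It could be run as `python3 worker_exits.py error.log error.log.1.gz` (script and file names are placeholders). Bucketing by hour is deliberately coarse; it is enough to show whether the exits pile up around one time of day, as with the ~1 pm crashes.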

Would it be a good idea to enable core dumps so we can find out exactly what causes these signals?
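Before relying on core dumps, it may be worth confirming that the running workers are actually allowed to write one. On Linux, the effective limits of a process are listed in `/proc/<pid>/limits`, and the sketch below parses its "Max core file size" line. It is Linux-specific and the helper names are hypothetical. In nginx itself, the `worker_rlimit_core` directive raises this limit for worker processes, and `working_directory` sets where core files are written (the directory must be writable by the worker user).

```python
def _parse_limit(value):
    """Convert a /proc limits field to an int (bytes), or None for 'unlimited'."""
    return None if value == "unlimited" else int(value)


def core_limit(limits_text):
    """Return (soft, hard) core-size limits from the text of /proc/<pid>/limits."""
    label = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units
            soft, hard = line[len(label):].split()[:2]
            return _parse_limit(soft), _parse_limit(hard)
    raise ValueError("no 'Max core file size' line found")


def worker_can_dump_core(pid):
    """True if the process's soft core limit is unlimited or above zero (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = core_limit(f.read())
    return soft is None or soft > 0
```

The PIDs in the log excerpt belong to workers that have already died, so `worker_can_dump_core` would need the PID of a currently running worker.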

It seems that you have "max_fails=0" in some upstream. Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
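To locate the upstream that Igor suspects, a quick scan of the configuration for `server` lines carrying `max_fails=0` may be enough. This is a minimal, line-based sketch: it assumes each `server` directive sits on a single line, skips commented-out lines, and does not follow `include` files. `find_max_fails_zero` is a hypothetical helper, not an nginx tool.

```python
import re

# A 'server' directive line that carries max_fails=0, e.g.
#     server 10.0.0.1:8080 max_fails=0;
# Lines starting with '#' are not matched because 'server' must come first.
MAX_FAILS_ZERO = re.compile(r"^\s*server\s+\S+[^;]*\bmax_fails=0\b")


def find_max_fails_zero(conf_text):
    """Return (line_number, stripped_line) pairs for servers with max_fails=0."""
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        if MAX_FAILS_ZERO.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

Each file pulled in via `include` would have to be scanned separately.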