Subject:[GE issues] [Issue 3113] New - too many qsub -sync jobs wedge the qmaster on failover
From:templedf (dan.@sun.com)
Date:Aug 24, 2009 12:47:41 pm
List:net.sunsource.gridengine.issues

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3113

          Issue #|3113
          Summary|too many qsub -sync jobs wedge the qmaster on failover
        Component|gridengine
          Version|6.2u3
         Platform|All
              URL|
       OS/Version|All
           Status|NEW
Status whiteboard|
         Keywords|
       Resolution|
       Issue type|DEFECT
         Priority|P1
     Subcomponent|qmaster
      Assigned to|ernst
      Reported by|templedf

------- Additional comments from temp@sunsource.net Mon Aug 24 12:48:22 -0700 2009 -------

If the qmaster restarts or fails over while too many qsub -sync processes are running, it becomes completely unresponsive to all client commands. qping is still able to connect and show the communications traffic, however. The dtrace monitoring script shows no activity at any of the probe points. With up to 4 or 5 qsub -sync processes the problem does not appear; with more than 30 it definitely appears. I did not test further to find the exact threshold at which the problem starts to occur.

To reproduce:

> qsub -sync y $SGE_ROOT/examples/jobs/sleeper.sh 600 &
[1] 16669
Your job 29 ("Sleeper") has been submitted
[32 qsub's removed for ease of reading]
> qsub -sync y $SGE_ROOT/examples/jobs/sleeper.sh 600 &
[34] 16709
Your job 59 ("Sleeper") has been submitted
> qstat

job-ID  prior    name      user   state  submit/start at      queue          slots  ja-task-ID
-----------------------------------------------------------------------------------------------
    29  0.55500  Sleeper   dant   r      08/18/2009 13:15:22  all.q@ultra20      1
    30  0.55500  Sleeper   dant   r      08/18/2009 13:15:22  all.q@ultra20      1
    31  0.55500  Sleeper   dant   qw     08/18/2009 13:15:19                     1
   ...
    59  0.00000  Sleeper   dant   qw     08/18/2009 13:15:26                     1

> su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_112 November 2008
# qconf -km
root@ultra20 kills qmaster
# pgrep master
#

The qmaster has gone down.  Waiting to reconnect.
[33 copies of the error message removed for readability]
error: commlib error: can't connect to service (Connection refused)
[33 copies of the error message removed for readability]
error: unable to send message to qmaster using port 6444 on host "ultra20": got send error
[33 copies of the error message removed for readability]

# default/common/sgemaster
starting sge_qmaster
#^D

Reconnected to qmaster.
[33 copies of the message removed for readability]

[Wait about two minutes]

error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
[This error message repeats forever]

> qstat
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
^C
> qping ultra20 6444 qmaster 1
08/18/2009 13:19:23 endpoint ultra20/qmaster/1 at port 6444 is up since 208 seconds
08/18/2009 13:19:24 endpoint ultra20/qmaster/1 at port 6444 is up since 209 seconds
08/18/2009 13:19:25 endpoint ultra20/qmaster/1 at port 6444 is up since 210 seconds
^C
> qping -info ultra20 6444 qmaster 1
08/18/2009 13:44:05:
SIRM version:              0.1
SIRM message id:           1
start time:                08/18/2009 13:15:55 (1250626555)
run time [s]:              1690
messages in read buffer:   74
messages in write buffer:  0
nr. of connected clients:  33
status:                    2
info:                      MAIN: E (1689.34) | signaler000: E (1689.27) | event_master000: E (1685.75) | timer000: E (1675.17) | worker000: E (1685.75) | worker001: E (1685.75) | listener000: E (1685.75) | listener001: E (1685.75) | scheduler000: E (0.17) | ERROR
Monitor:                   disabled

Sample DTrace output after the master hangs:

# ./monitor.sh -interval 5
dtrace: script './monitor.d' matched 64 probes
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN  Time                 | #wrt wrt/ms |#rep #gdi #ack| #dsp dsp/ms #sad| #snd #rcv| #in++ #in-- #out++ #out--| #lck0 #ulck0 #lck1 #ulck1
  0  87367                          :tick-5  2009 Aug 18 13:21:05 |    0      0 |   0    0    0|    0      0    0|    0    0|     0     0      0      0|     0      0     1      1
  0  87367                          :tick-5  2009 Aug 18 13:21:05 |    0      0 |   0    0    0|    0      0    0|    0    0|     0     0      0      0|     0      0     0      0
  0  87367                          :tick-5  2009 Aug 18 13:21:05 |    0      0 |   0    0    0|    0      0    0|    0    0|     0     0      0      0|     0      0     0      0
  0  87367                          :tick-5  2009 Aug 18 13:21:05 |    0      0 |   0    0    0|    0      0    0|    0    0|     0     0      0      0|     0      0     0      0
  0  87367                          :tick-5  2009 Aug 18 13:21:05 |    0      0 |   0    0    0|    0      0    0|    0    0|     0     0      0      0|     0      0     0      0
  0  87367                          :tick-5  2009 Aug 18 13:21:06 |    0      0 |   0    0    0|    0      0    0|    0    0|     0     0      0      0|     0      0     0      0
  0  87367                          :tick-5  2009 Aug 18 13:21:06 |    0      0 |   0    0    0|    0      0    0|    0    0|     0     0      0      0|     0      0     0      0

Evaluation:

The root cause of this issue is the introduction of worker threads in the qmaster. As a result, the locking mechanism for GDI request processing has become much coarser grained than before: essentially, the entire processing of a GDI request is protected by GLOBAL_LOCK. The access type of the request (read or write) only determines whether the lock is grabbed as reader or as writer; either way, the whole request holds GLOBAL_LOCK.
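
For illustration, the coarse-grained scheme can be sketched with plain POSIX primitives. This is only a minimal sketch: global_lock and handle_gdi_request() are placeholder names chosen here, not the actual qmaster sources.

#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-in for GLOBAL_LOCK: a single process-wide rwlock. */
static pthread_rwlock_t global_lock = PTHREAD_RWLOCK_INITIALIZER;

/* A worker thread wraps the entire GDI request in the global lock.
 * The access type of the request only decides reader vs. writer mode. */
static void handle_gdi_request(bool is_write_request)
{
    if (is_write_request)
        pthread_rwlock_wrlock(&global_lock);
    else
        pthread_rwlock_rdlock(&global_lock);

    /* ... process the complete GDI request while holding the lock ... */

    pthread_rwlock_unlock(&global_lock);
}

int main(void)
{
    handle_gdi_request(false);   /* read-type request: reader lock  */
    handle_gdi_request(true);    /* write-type request: writer lock */
    return 0;
}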

This works fine, with one exception: the event master thread (inside the qmaster) modifying an event client. That thread acquires Event_Master_Control.mutex first and then, just before updating the subscription list, additionally takes GLOBAL_LOCK.
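
To make the inversion concrete, here is a minimal sketch of the two acquisition orders, assuming the canonical order elsewhere in the qmaster is GLOBAL_LOCK first and the event master mutex second; global_lock, event_master_mutex and both functions are illustrative placeholders, not the actual Grid Engine code.

#include <pthread.h>

/* Placeholders for GLOBAL_LOCK and Event_Master_Control.mutex. */
static pthread_rwlock_t global_lock        = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  event_master_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Canonical order (assumed for the rest of qmaster):
 * GLOBAL_LOCK first, event master mutex second. */
static void worker_touches_event_client(void)
{
    pthread_rwlock_wrlock(&global_lock);
    pthread_mutex_lock(&event_master_mutex);
    /* ... modify event client state ... */
    pthread_mutex_unlock(&event_master_mutex);
    pthread_rwlock_unlock(&global_lock);
}

/* Inverted order described above: the event master takes its own mutex
 * first and only then, before updating the subscription list, GLOBAL_LOCK.
 * Run concurrently with the function above, each thread can end up waiting
 * for the lock the other already holds -- and the qmaster wedges. */
static void event_master_mod_event_client(void)
{
    pthread_mutex_lock(&event_master_mutex);
    pthread_rwlock_rdlock(&global_lock);
    /* ... update the subscription list ... */
    pthread_rwlock_unlock(&global_lock);
    pthread_mutex_unlock(&event_master_mutex);
}

Since every waiting qsub -sync process is an event client (note the 33 connected clients in the qping output above), this kind of collision would presumably become far more likely on failover once many -sync clients need to be serviced at once, which would be consistent with the observed threshold.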

Right now, sge_event_master_process_mod_event_client() is the only routine that exhibits this inverted ordering scheme. The good news: it looks like we can safely swap the locking order there without causing any damage.
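
If swapping really is safe, the fix would amount to reordering the acquisitions in that routine so they match the canonical order; again only a sketch under the same placeholder names, not the actual patch to sge_event_master_process_mod_event_client().

#include <pthread.h>

static pthread_rwlock_t global_lock        = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  event_master_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Same work as before, but GLOBAL_LOCK is taken before the event master
 * mutex, so both code paths use one consistent order and no cycle can form. */
static void event_master_mod_event_client_fixed(void)
{
    pthread_rwlock_rdlock(&global_lock);
    pthread_mutex_lock(&event_master_mutex);
    /* ... update the subscription list ... */
    pthread_mutex_unlock(&event_master_mutex);
    pthread_rwlock_unlock(&global_lock);
}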

------------------------------------------------------ http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=214035
