Subject: Re: [GE users] JSV scripts running unreliably
From: ernst (Erns@sun.com)
Date: Jun 10, 2009 4:32:23 am
List: net.sunsource.gridengine.users

Hi Andreas,

Your JSV scripts are restarted for two reasons:

1) The message "JSV modification time in ..." indicates that the modification timestamp of your JSV script has changed. Within GE, a worker thread detects this and restarts the corresponding JSV process the next time an incoming job has to be verified.

2) There is a protocol error between a JSV process and the corresponding thread in the master. I assume that your JSV script is not implemented correctly: the first job verified by a JSV process is handled correctly, but the second results in a protocol error. To debug your JSV script, you can set the "logging_enabled" and "log_file" variables in the file that is included in your JSV script (e.g. JSV.pm, jsv_include.tcl or jsv_include.sh). Once logging is enabled, the data exchanged between the master and the JSV process is written to the log_file.
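
For the first reason, it can help to check whether something keeps changing the timestamp of the script between submissions. The following is only a sketch of such a check, not the mechanism the master itself uses. The function name mtime_changed and the stamp file are hypothetical, and the -nt/-ot tests are assumed to be supported by the shell in use (they are in bash, dash and ksh).

```shell
# Return 0 if the modification time of FILE differs from that of STAMP
# (or if STAMP does not exist yet), 1 if it is unchanged.
# Afterwards FILE's current mtime is recorded in STAMP.
# Both names are hypothetical helpers, not part of Grid Engine.
mtime_changed() {
    file=$1
    stamp=$2
    changed=1
    if [ ! -e "$stamp" ] || [ "$file" -nt "$stamp" ] || [ "$file" -ot "$stamp" ]; then
        changed=0
    fi
    touch -r "$file" "$stamp"   # copy FILE's mtime onto STAMP
    return "$changed"
}
```

Called repeatedly with the JSV script and a private stamp file, for example from a loop or a cron job, it reports a change only when the timestamp really moved since the previous call.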

Cheers,

Ernst

ah_sunsource wrote:

Hi,

I'm experimenting a bit with the new JSV feature in SGE 6.2u2. I've written a server-side JSV that checks whether the user requests at least 256M for h_vmem (below that, the prolog script might die for lack of memory and leave the queue in an error state).
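
The threshold check described above can be sketched on its own, independent of the JSV include files. This is not the actual job_verifier script. The function names mem_to_bytes and h_vmem_ok are hypothetical, only plain integer values with an optional single suffix are handled, and the multipliers assume the usual Grid Engine convention that lowercase suffixes are powers of 1000 and uppercase suffixes powers of 1024.

```shell
# Convert a memory value such as 128M, 1G or 500000 to bytes.
# Assumption: k/m/g are powers of 1000, K/M/G are powers of 1024.
mem_to_bytes() {
    value=$1
    number=${value%[kKmMgG]}      # strip one trailing suffix, if any
    suffix=${value#"$number"}     # whatever was stripped
    case $number in
        ''|*[!0-9]*) return 1 ;;  # only plain non-negative integers are handled
    esac
    case $suffix in
        '') factor=1 ;;
        k)  factor=1000 ;;
        K)  factor=1024 ;;
        m)  factor=1000000 ;;
        M)  factor=1048576 ;;
        g)  factor=1000000000 ;;
        G)  factor=1073741824 ;;
        *)  return 1 ;;
    esac
    echo $((number * factor))
}

# Return 0 if the request (first argument) is at least the minimum
# (second argument, default 256M), 1 if it is smaller, 2 if a value
# cannot be parsed.
h_vmem_ok() {
    requested=$(mem_to_bytes "$1") || return 2
    minimum=$(mem_to_bytes "${2:-256M}") || return 2
    [ "$requested" -ge "$minimum" ]
}
```

With these helpers, h_vmem_ok 128M fails and h_vmem_ok 256M succeeds, so a verifier could reject the job in the first case and accept it in the second.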

Unfortunately the jsv feature is not reliable:

[oreade38] ~ % for i in {1..5}; do echo hostname | qsub -l h_vmem=128M
done
Unable to run job: Do not require less than 256M for h_vmem.
Exiting.
Unable to run job: Do not require less than 256M for h_vmem.
Exiting.
Unable to run job: master got unknown command from JSV: "ERROR".
Exiting.
Unable to run job: master got unknown command from JSV: "ERROR".
Exiting.
Unable to run job: Do not require less than 256M for h_vmem.
Exiting.

In the server logs I see messages like this:

06/10/2009 11:30:35|worker|lolek-vm1|I|JSV modification time in "worker001" has changed
06/10/2009 11:30:36|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
06/10/2009 11:30:36|worker|lolek-vm1|I|JSV modification time in "worker001" has changed
06/10/2009 11:30:36|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" rejected job 921
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV modification time in "worker000" has changed
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV modification time in "worker000" has changed
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker000" rejected job 922
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" rejected job 923
06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" will be restarted.
06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "worker000" rejected job 924
06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "worker000" will be restarted.
06/10/2009 11:30:39|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
06/10/2009 11:30:39|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
06/10/2009 11:30:40|worker|lolek-vm1|I|JSV "worker001" rejected job 925
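
Because the entries of the two worker threads are interleaved, a per-worker tally may make the pattern easier to see. The following is a rough sketch that assumes the pipe-separated layout shown above (date and time, thread, host, level, message) and matches only the three message texts that appear in it. The function name summarize_jsv_log is hypothetical.

```shell
# Read qmaster log lines on stdin and print one summary line per JSV worker.
# Field 5 holds the message text; the worker name is its first quoted string.
summarize_jsv_log() {
    awk -F'|' '
        function worker(msg,    q) { split(msg, q, "\""); return q[2] }
        $5 ~ /^JSV modification time in "worker[0-9]+" has changed/ {
            w = worker($5); mtime[w]++; seen[w] = 1
        }
        $5 ~ /^JSV "worker[0-9]+" rejected job/ {
            w = worker($5); rejected[w]++; seen[w] = 1
        }
        $5 ~ /^JSV "worker[0-9]+" will be restarted/ {
            w = worker($5); restarts[w]++; seen[w] = 1
        }
        END {
            for (w in seen)
                printf "%s mtime_changes=%d rejected=%d restarts=%d\n",
                       w, mtime[w], rejected[w], restarts[w]
        }
    ' | sort
}
```

The function reads from standard input, so it can be fed the qmaster messages file, whose location depends on the installation.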

It looks like the success of the script is oscillating. Could this be a bug?

--
Ernst Bablick, Software Engineer
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7, D-93049 Regensburg, Germany
Phone: +49 (0)941 3075 135
Fax: +49 (0)941 3075 222
http://www.sun.de
mailto: erns@sun.com


------------------------------------------------------ http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201411
