Slurm: Jobs go straight to state CG and fail

Summary:

Problem: slurm.conf was updated on the head node but not propagated to the compute nodes. Once the configuration was synchronized across the cluster, jobs could be submitted again.

Details:

A user reported that all of their jobs were stuck in the CG (completing) state:

[root@hyalite ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             94189      defq slurm_te  k58w857 CG       0:01      1 compute024
             94188  priority PARP_40q  p89v618 CG       0:01      1 compute023
             94187  priority PARP_40q  p89v618 CG       0:01      1 compute022
             94186  priority PARP_40q  p89v618 CG       0:01      1 compute021
             94185  priority PARP_40q  p89v618 CG       0:01      1 compute020
             94183  priority PARP_40q  p89v618 CG       0:01      1 compute018
             94184  priority PARP_40q  p89v618 CG       0:01      1 compute019
             94182  priority PARP_40q  p89v618 CG       0:01      1 compute017
             94180  priority PARP_40q  p89v618 CG       0:01      1 compute015
             94181  priority PARP_40q  p89v618 CG       0:01      1 compute016
             94179  priority PARP_40q  p89v618 CG       0:01      1 compute014
             94177  priority PARP_40q  p89v618 CG       0:01      1 compute012
             94178  priority PARP_40q  p89v618 CG       0:01      1 compute013
             94176  priority PARP_40q  p89v618 CG       0:01      1 compute011
             94175  priority PARP_40q  p89v618 CG       0:01      1 compute010
             94174  priority PARP_40q  p89v618 CG       0:01      1 compute009
             94173  priority PARP_40q  p89v618 CG       0:01      1 compute008
             94172  priority PARP_40q  p89v618 CG       0:01      1 compute007
             94171      defq slurm_te  k58w857 CG       0:01      1 compute006
             94170      defq slurm_te  k58w857 CG       0:01      1 compute005
             94169      defq slurm_te  k58w857 CG       0:01      1 compute004
             94191  priority NE_1patc  j22f331 CG       0:00      1 compute026
             94190  priority NE_1patc  j22f331 CG       0:01      1 compute025
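
For any job stuck in CG, scontrol gives more detail than squeue, including which node the job is still completing on, its exit code, and the reason field. A minimal check, using one of the job IDs from the output above:

[root@hyalite ~]# scontrol show job 94189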

The slurmctld log on the head node shows that the controller cannot start the batch jobs on the compute nodes:

==> /var/log/slurmctld <==
[2015-10-21T13:52:19.603] _slurm_rpc_submit_batch_job JobId=94196 usec=698
[2015-10-21T13:52:19.958] sched: Allocate JobId=94196 NodeList=compute005 #CPUs=32
[2015-10-21T13:52:20.542] Killing non-startable batch job 94196: Header lengths are longer than data received
[2015-10-21T13:52:20.543] job_complete: JobID=94196 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 1
[2015-10-21T13:52:20.543] job_complete: JobID=94196 State=0x8005 NodeCnt=1 done
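
The excerpt above comes from tailing the controller log on the head node. The matching slurmd log on an affected compute node is worth watching at the same time. A rough sketch, assuming the controller log path shown above and a common default slurmd log location (the real paths are set by SlurmctldLogFile and SlurmdLogFile in slurm.conf):

[root@hyalite ~]# tail -f /var/log/slurmctld
[root@hyalite ~]# ssh compute005 tail -f /var/log/slurmd.log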

A quick Google search for 'SLURM Header lengths are longer than data received' suggests a mismatch between the slurm.conf on the head node and the copies on the compute nodes.
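
One quick way to confirm a mismatch is to compare checksums of slurm.conf on the head node and every compute node. A sketch, assuming the config lives at /etc/slurm/slurm.conf on all nodes and that pdsh/dshbak are installed (the node range is taken from the squeue output above; adjust paths and hostnames to match the actual install):

[root@hyalite ~]# md5sum /etc/slurm/slurm.conf
[root@hyalite ~]# pdsh -w compute[004-026] md5sum /etc/slurm/slurm.conf | dshbak -c

Any node whose checksum differs from the head node's copy is running with a stale configuration.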

Resolution: the slurm.conf update had been made on the head node but never propagated to the compute nodes. Once the same slurm.conf was in place everywhere, jobs could be submitted again.
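
The fix is simply to get the same slurm.conf onto every node and have the daemons pick it up. A hedged sketch of one way to do that (the config path, node range, and use of systemd are assumptions; on installs where slurm.conf is shared over NFS or pushed by the provisioning system, only the restart/reconfigure step applies):

# copy the head node's slurm.conf to each compute node
[root@hyalite ~]# for n in compute0{04..26}; do scp /etc/slurm/slurm.conf $n:/etc/slurm/slurm.conf; done

# restart slurmd so every node re-reads the config (the service name may differ on older init-script installs)
[root@hyalite ~]# pdsh -w compute[004-026] systemctl restart slurmd

# have slurmctld re-read its own copy as well
[root@hyalite ~]# scontrol reconfigure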

With the configuration back in sync, jobs now submit, run, and complete normally:

[k58w857@hyalite test1]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             94200      defq slurm_te  k58w857  R       1:12      1 compute004
[k58w857@hyalite test1]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[k58w857@hyalite test1]$ sacct -u k58w857
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
94169        slurm_tes+       defq                    32     FAILED      1:0 
94170        slurm_tes+       defq                    32     FAILED      1:0 
94171        slurm_tes+       defq                    32     FAILED      1:0 
94189        slurm_tes+       defq                    32     FAILED      1:0 
94195        slurm_tes+       defq                    32     FAILED      1:0 
94200        slurm_tes+       defq                    32  COMPLETED      0:0 
94200.batch       batch