Slurm: Jobs go straight to state CG and fail
Summary:
Problem: slurm.conf was updated on the head node but not propagated to the compute nodes. Once this was resolved, jobs could be submitted again.
Details:
The user reported that all jobs were stuck in the CG (completing) state:
[root@hyalite ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
94189 defq slurm_te k58w857 CG 0:01 1 compute024
94188 priority PARP_40q p89v618 CG 0:01 1 compute023
94187 priority PARP_40q p89v618 CG 0:01 1 compute022
94186 priority PARP_40q p89v618 CG 0:01 1 compute021
94185 priority PARP_40q p89v618 CG 0:01 1 compute020
94183 priority PARP_40q p89v618 CG 0:01 1 compute018
94184 priority PARP_40q p89v618 CG 0:01 1 compute019
94182 priority PARP_40q p89v618 CG 0:01 1 compute017
94180 priority PARP_40q p89v618 CG 0:01 1 compute015
94181 priority PARP_40q p89v618 CG 0:01 1 compute016
94179 priority PARP_40q p89v618 CG 0:01 1 compute014
94177 priority PARP_40q p89v618 CG 0:01 1 compute012
94178 priority PARP_40q p89v618 CG 0:01 1 compute013
94176 priority PARP_40q p89v618 CG 0:01 1 compute011
94175 priority PARP_40q p89v618 CG 0:01 1 compute010
94174 priority PARP_40q p89v618 CG 0:01 1 compute009
94173 priority PARP_40q p89v618 CG 0:01 1 compute008
94172 priority PARP_40q p89v618 CG 0:01 1 compute007
94171 defq slurm_te k58w857 CG 0:01 1 compute006
94170 defq slurm_te k58w857 CG 0:01 1 compute005
94169 defq slurm_te k58w857 CG 0:01 1 compute004
94191 priority NE_1patc j22f331 CG 0:00 1 compute026
94190 priority NE_1patc j22f331 CG 0:01 1 compute025

The log files show a problem submitting jobs from the head node to the compute nodes:
==> /var/log/slurmctld <==
[2015-10-21T13:52:19.603] _slurm_rpc_submit_batch_job JobId=94196 usec=698
[2015-10-21T13:52:19.958] sched: Allocate JobId=94196 NodeList=compute005 #CPUs=32
[2015-10-21T13:52:20.542] Killing non-startable batch job 94196: Header lengths are longer than data received
[2015-10-21T13:52:20.543] job_complete: JobID=94196 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 1
[2015-10-21T13:52:20.543] job_complete: JobID=94196 State=0x8005 NodeCnt=1 done

A quick Google search for 'SLURM Header lengths are longer than data received' suggests a mismatch between the slurm.conf files on different nodes.
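One quick way to confirm this kind of mismatch is to compare checksums of slurm.conf on the head node against each compute node. A minimal sketch, assuming the file lives at /etc/slurm/slurm.conf and the compute nodes are reachable over ssh (both assumptions; the path and node names here are illustrative):

# Checksum the head node's copy of slurm.conf
md5sum /etc/slurm/slurm.conf

# Compare against each compute node's copy; any node that
# prints a different checksum is running a stale configuration
for node in compute004 compute005 compute006; do
    ssh "$node" md5sum /etc/slurm/slurm.conf
done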
Resolution: slurm.conf had been updated on the head node but the change had not been propagated to the compute nodes. Once the updated file was in place everywhere, jobs could be submitted again.
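For reference, propagating such a fix looks roughly like the sketch below, assuming systemd-managed slurmd daemons and the same /etc/slurm/slurm.conf path as above (the service name and path vary by distribution and Slurm packaging):

# Push the corrected slurm.conf from the head node to each
# compute node, then restart slurmd so it rereads the file
for node in compute004 compute005 compute006; do
    scp /etc/slurm/slurm.conf "$node":/etc/slurm/slurm.conf
    ssh "$node" systemctl restart slurmd
done

# Have the controller re-read its own configuration
scontrol reconfigure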
Jobs now submit, run, and complete normally:
[k58w857@hyalite test1]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
94200 defq slurm_te k58w857 R 1:12 1 compute004
[k58w857@hyalite test1]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[k58w857@hyalite test1]$ sacct -u k58w857
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
94169 slurm_tes+ defq 32 FAILED 1:0
94170 slurm_tes+ defq 32 FAILED 1:0
94171 slurm_tes+ defq 32 FAILED 1:0
94189 slurm_tes+ defq 32 FAILED 1:0
94195 slurm_tes+ defq 32 FAILED 1:0
94200 slurm_tes+ defq 32 COMPLETED 0:0
94200.batch batch
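As a final sanity check after a fix like this, it is worth confirming that the controller and nodes agree and are healthy again, for example (output omitted here):

# Show which configuration file the controller actually loaded
scontrol show config | grep -i SLURM_CONF

# Confirm no nodes are left down or draining
sinfo -N -l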