Slurm: Node remains in a Drain state permanently
- In this example compute006 was stuck in a permanent drain state. The steps below show how the issue was diagnosed and worked around.
Let's take a look at the queue:
[root@hyalite ~]# sinfo | grep compute006
defq* up infinite 1 drain compute006
Do a little more digging around the problem:
[root@hyalite ~]# scontrol show node compute006
NodeName=compute006 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUErr=0 CPUTot=32 CPULoad=0.00 Features=(null)
Gres=(null)
NodeAddr=compute006 NodeHostName=compute006 Version=14.03.0
OS=Linux RealMemory=64523 AllocMem=0 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=2015 Weight=1
BootTime=2015-07-15T16:06:12 SlurmdStartTime=2015-07-23T13:48:17
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Epilog error [slurm@2015-07-13T02:10:24]
Epilog error, eh? What's that?!
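Of everything scontrol prints for the node, State and Reason are the two fields that explain the drain. As a minimal sketch (node_field is a made-up helper name, not a Slurm command), they can be pulled out of the output like this, assuming the multi-line layout shown above where Reason= runs to the end of its line:

```shell
# Print the value of one key from `scontrol show node` output read on stdin.
# node_field is a hypothetical helper, not part of Slurm.
# Most values are single tokens (State=IDLE+DRAIN). Reason= can contain
# spaces, so it is taken to the end of its line.
node_field() {
    key=$1
    if [ "$key" = "Reason" ]; then
        sed -n 's/^.*Reason=//p'
    else
        # One key=value token per line, then strip the "key=" prefix.
        tr ' ' '\n' | sed -n "s/^${key}=//p"
    fi | head -n 1
}
```

Typical use would be `scontrol show node compute006 | node_field Reason`.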
[root@hyalite ~]# scontrol show conf | grep -i epilog
Epilog = /cm/local/apps/cmd/scripts/epilog
Let's jump to compute006 and see what the problem is:
[root@hyalite ~]# ssh compute006
Last login: Fri Jul 17 10:09:05 2015 from hyalite.global.cluster
[root@compute006 ~]# /cm/local/apps/cmd/scripts/epilog
Cannot determine a workload manager name
Check the Slurm logs on compute006:
[root@compute006 log]# grep -i epilog slurmd
[2015-03-20T18:06:39.952] epilog for job 30435 ran for 37 seconds
[2015-05-04T10:28:37.268] epilog for job 65954 ran for 98 seconds
[2015-05-18T14:20:58.712] epilog for job 75682 ran for 5 seconds
[2015-05-19T00:28:22.042] epilog for job 75685 ran for 32 seconds
[2015-05-19T11:57:59.541] epilog for job 75713 ran for 17 seconds
[2015-05-19T12:07:39.376] epilog for job 75722 ran for 21 seconds
[2015-05-19T12:12:33.672] epilog for job 75725 ran for 18 seconds
[2015-05-19T12:24:19.152] epilog for job 75728 ran for 23 seconds
[2015-05-19T13:27:22.518] epilog for job 75704 ran for 9 seconds
[2015-05-19T13:35:49.976] epilog for job 75707 ran for 7 seconds
[2015-05-19T13:42:44.965] epilog for job 75733 ran for 58 seconds
[2015-05-19T13:59:21.960] epilog for job 75741 ran for 80 seconds
[2015-05-19T14:02:36.265] epilog for job 75737 ran for 28 seconds
[2015-05-19T14:07:13.998] epilog for job 75750 ran for 57 seconds
[2015-05-19T14:09:30.041] epilog for job 75751 ran for 18 seconds
[2015-05-19T14:33:12.895] epilog for job 75753 ran for 10 seconds
[2015-05-19T14:48:49.078] epilog for job 75734 ran for 11 seconds
[2015-05-19T14:58:11.019] epilog for job 75755 ran for 41 seconds
[2015-05-19T15:02:07.103] epilog for job 75756 ran for 93 seconds
[2015-05-19T15:30:26.124] epilog for job 75761 ran for 25 seconds
[2015-05-19T15:39:45.714] epilog for job 75769 ran for 38 seconds
[2015-05-19T15:40:06.888] epilog for job 75770 ran for 38 seconds
[2015-05-19T15:40:58.159] epilog for job 75771 ran for 29 seconds
[2015-05-19T15:44:27.719] epilog for job 75771 ran for 45 seconds
[2015-05-22T15:57:28.916] epilog for job 76059 ran for 60 seconds
[2015-07-08T10:40:07.200] epilog for job 82214 ran for 5 seconds
[2015-07-10T16:31:39.060] epilog for job 82521 ran for 23 seconds
[2015-07-13T02:10:24.951] error: spank/epilog returned status 0x0007
[2015-07-13T02:10:24.955] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T02:10:24.955] error: [job 82768] epilog failed status=0:7
[2015-07-13T02:16:26.816] error: spank/epilog returned status 0x0007
[2015-07-13T02:16:26.818] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T02:16:26.818] error: [job 82769] epilog failed status=0:7
[2015-07-13T03:31:52.313] error: spank/epilog returned status 0x0007
[2015-07-13T03:31:52.315] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T03:31:52.315] error: [job 82811] epilog failed status=0:7
[2015-07-13T06:55:06.851] error: spank/epilog returned status 0x0007
[2015-07-13T06:55:06.854] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T06:55:06.854] error: [job 82759] epilog failed status=0:7
[2015-07-13T08:48:22.003] error: spank/epilog returned status 0x0007
[2015-07-13T08:48:22.006] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T08:48:22.006] error: [job 82761] epilog failed status=0:7
[2015-07-13T09:55:37.196] error: spank/epilog returned status 0x0007
[2015-07-13T09:55:37.198] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T09:55:37.199] error: [job 82760] epilog failed status=0:7
[2015-07-13T10:09:28.484] error: spank/epilog returned status 0x0007
[2015-07-13T10:09:28.487] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T10:09:28.487] error: [job 82765] epilog failed status=0:7
[2015-07-13T10:56:08.574] error: spank/epilog returned status 0x0007
[2015-07-13T10:56:08.577] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T10:56:08.577] error: [job 82771] epilog failed status=0:7
[2015-07-13T10:57:43.741] error: spank/epilog returned status 0x0007
[2015-07-13T10:57:43.743] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T10:57:43.743] error: [job 82764] epilog failed status=0:7
[2015-07-13T11:33:17.734] error: spank/epilog returned status 0x0007
[2015-07-13T11:33:17.737] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T11:33:17.737] error: [job 82762] epilog failed status=0:7
[2015-07-13T11:43:18.878] error: spank/epilog returned status 0x0007
[2015-07-13T11:43:18.879] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T11:43:18.879] error: [job 82766] epilog failed status=0:7
[2015-07-13T11:57:10.830] error: spank/epilog returned status 0x0007
[2015-07-13T11:57:10.832] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T11:57:10.832] error: [job 82770] epilog failed status=0:7
[2015-07-13T12:08:24.953] error: spank/epilog returned status 0x0007
[2015-07-13T12:08:24.955] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T12:08:24.955] error: [job 82763] epilog failed status=0:7
[2015-07-13T12:09:20.336] error: spank/epilog returned status 0x0007
[2015-07-13T12:09:20.338] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T12:09:20.338] error: [job 82772] epilog failed status=0:7
[2015-07-13T13:10:59.023] error: spank/epilog returned status 0x0007
[2015-07-13T13:10:59.025] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T13:10:59.025] error: [job 82767] epilog failed status=0:7
[2015-07-13T16:54:00.996] error: spank/epilog returned status 0x0007
[2015-07-13T16:54:01.001] error: /cm/local/apps/cmd/scripts/epilog: exited with status 0x0007
[2015-07-13T16:54:01.001] error: [job 82810] epilog failed status=0:7
Strange - it used to work fine...
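Every failure in the log reports status 0x0007 followed by status=0:7, and the first one (2015-07-13T02:10:24) matches the timestamp in the node's Reason field. If the hex value is a raw wait status in the usual Linux encoding (an assumption, the log itself does not say so), it can be split into an exit code and a signal number. A rough sketch, with decode_wait_status being a made-up helper name:

```shell
# Split a raw wait status such as 0x0007 into exit code and signal number.
# decode_wait_status is a hypothetical helper. It assumes the conventional
# Linux encoding: low 7 bits = terminating signal, next byte up = exit code.
decode_wait_status() {
    ws=$(( $1 ))
    echo "exit=$(( (ws >> 8) & 0xff )) signal=$(( ws & 0x7f ))"
}
```

Under that assumption `decode_wait_status 0x0007` gives exit code 0 and signal 7, which lines up with the 0:7 on the following log line. Running `kill -l 7` on the node prints the name of that signal.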
Quick hack:
# added this line to the epilog file as per conf above
WLM=slurm
Not sure if that's the problem. Let's try to undrain the node:
scontrol update NodeName=compute006 State=RESUME
Check the queues:
[root@hyalite ~]# sinfo | grep 006
defq* up infinite 10 idle compute[006,023-028,033-034,038]
Submit a quick job to test:
[root@hyalite ~]# srun -w compute006 sleep 30
[root@hyalite ~]#
# check logs for output / errors
[2015-07-23T14:09:03.702] [83436] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2015-07-23T14:09:03.705] [83436] done with job
Seems OK - that doesn't explain why I had to edit the epilog script though... TBC
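Since the root cause is still open, it may be worth checking whether other nodes have drained for the same reason. A small sketch, assuming sinfo is asked for node, state and reason separated by "|" (drained_nodes is a made-up helper name):

```shell
# Read "node|state|reason" lines on stdin and print the nodes whose state
# starts with "drain" (drained, draining), together with the recorded reason.
# drained_nodes is a hypothetical helper, not part of Slurm.
drained_nodes() {
    awk -F'|' '$2 ~ /^drain/ { print $1 ": " $3 }'
}
```

It would be fed with something like `sinfo -N -h -o '%N|%T|%E' | drained_nodes`. For a quick look without any filtering, `sinfo -R` lists down and drained nodes with their reasons directly.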