Slurm: Kill jobs in SLURM when memory usage exceeds requested amount

From Define Wiki

I finally managed to get jobs terminated once they exceed their requested memory allocation. Here is the configuration I used:

slurm.conf

EnforcePartLimits=ALL
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=yes
KillOnBadExit=1

cgroup.conf

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
TaskAffinity=no
MaxSwapPercent=10

To test this, I ran a job that simply allocates RAM in a loop:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1024MB

./bigmem 100000

This produced the following error once the job exceeded 1 GB RSS:

slurmstepd: error: Detected 1 oom-kill event(s) in step 125.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
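
The source of bigmem is not shown on this page, so here is a minimal shell sketch of a stand-in that could be used to reproduce the test. It is an assumption, not the original program: the function name alloc_mb is hypothetical, and the argument is assumed to be the number of MiB to allocate. It appends a 1 MiB block to a shell variable on each iteration, so resident memory keeps growing until the target is reached or the cgroup OOM handler kills the process.

```bash
#!/bin/bash
# Hypothetical stand-in for ./bigmem (the original source is not shown).
# Assumption: the argument is the number of MiB to allocate.

# Grow a shell variable by 1 MiB per iteration so that resident memory
# climbs until the target is reached or the process is killed.
alloc_mb() {
    target_mb=$1
    # Build one 1 MiB block of 'x' characters; NUL bytes cannot be
    # stored in a shell variable, so they are translated first.
    block=$(head -c 1048576 /dev/zero | tr '\0' 'x')
    data=""
    i=0
    while [ "$i" -lt "$target_mb" ]; do
        data="${data}${block}"
        i=$((i + 1))
        echo "allocated ${i} MiB"
    done
}

# Run only when an amount is given on the command line.
if [ "$#" -ge 1 ]; then
    alloc_mb "$1"
fi
```

Saved as an executable file named bigmem, it can be called from the batch script exactly as above. Repeated string concatenation gets slow as the variable grows, which is acceptable here because the job is expected to be killed shortly after passing the 1024 MB limit.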