Software Versions
$ snakemake --version
9.9.0
$ conda list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 1.6.0 pyhdfd78af_0 bioconda
snakemake-executor-plugin-slurm-jobstep 0.3.0 pyhdfd78af_0 bioconda
$ sinfo --version
slurm 24.11
Describe the bug
When a slurm_jobstep hits OOM, it keeps retrying even though it has already hit OOM. I would expect the job to fail and be resubmitted with more RAM instead.
Logs
~# sacct -j [JOBID]
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[JOBID] [JOBNAME]-+ debug default 1 RUNNING 0:0
[JOBID].bat+ batch default 1 RUNNING 0:0
[JOBID].0 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].1 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].2 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].3 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].4 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].5 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].6 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].7 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].8 python3.12 default 1 OUT_OF_ME+ 0:125
[JOBID].9 python3.12 default 1 OUT_OF_ME+ 0:125
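All job steps end in the OUT_OF_MEMORY state (shown truncated as OUT_OF_ME+ above) while the batch job itself keeps running. For illustration only, not the plugin's actual code, here is a minimal sketch of how that state could be detected from sacct, assuming sacct is on PATH and using a hypothetical helper name:

import subprocess

def job_hit_oom(jobid: str) -> bool:
    # Query per-step states for the job; --parsable2 gives pipe-separated
    # fields and --noheader drops the header line.
    out = subprocess.run(
        ["sacct", "-j", jobid, "-o", "JobID,State", "--parsable2", "--noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    # Any step reporting OUT_OF_MEMORY means the current allocation is too small.
    return any("OUT_OF_MEMORY" in line for line in out.splitlines())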
Minimal example
from snakemake.shell import shell

rule use_ram:
    input:
        input='/dev/zero'
    threads: 1
    resources:
        cores=1,
        mem_mb=10240,
        partition='debug',
        nodes=1
    retries: 10
    shell:
        "head -c 12G {input} | tail"
Additional context
What I would expect is that when a step runs into OOM, the plugin should either cancel the sbatch job or report the failure back to the main Snakemake process so that a retry can potentially request more RAM. I am well aware that in this specific example more RAM would not help, but I still think that should be the behaviour instead of retrying repeatedly inside a job that has already hit OOM.
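For completeness, a sketch of the attempt-based resource scaling that plain Snakemake supports, using the same rule as the minimal example above; this only helps if the plugin reports the OOM back to Snakemake as a failed job, which is the behaviour I am asking for:

rule use_ram:
    input:
        '/dev/zero'
    threads: 1
    resources:
        # scale the memory request with the retry attempt (10 GiB, 20 GiB, ...)
        mem_mb=lambda wildcards, attempt: 10240 * attempt,
        partition='debug'
    retries: 10
    shell:
        "head -c 12G {input} | tail"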