
OOM handling for jobstep plugin #348

@selten

Description

Software Versions

$ snakemake --version
9.9.0
$ conda list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 1.6.0              pyhdfd78af_0    bioconda
snakemake-executor-plugin-slurm-jobstep 0.3.0              pyhdfd78af_0    bioconda
$ sinfo --version
slurm 24.11.

Describe the bug
When a job step run by slurm_jobstep hits OOM, the plugin keeps retrying the step even though it has already failed with OOM. I would expect the job to be resubmitted instead, so it can potentially be given more RAM.

Logs

~# sacct -j [JOBID]
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[JOBID]      [JOBNAME]-+      debug    default          1    RUNNING      0:0
[JOBID].bat+      batch               default          1    RUNNING      0:0
[JOBID].0    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].1    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].2    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].3    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].4    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].5    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].6    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].7    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].8    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].9    python3.12               default          1 OUT_OF_ME+    0:125
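
Every failing step above ends in the OUT_OF_MEMORY state (shown truncated as OUT_OF_ME+) with exit code 0:125, so the condition is visible to Slurm's accounting. As a minimal sketch of how it could be detected, here is a hypothetical helper (not part of the plugin's actual API) that assumes sacct is available where the check runs:

import subprocess

# Hypothetical helper, not part of snakemake-executor-plugin-slurm-jobstep;
# assumes `sacct` is on PATH.
def step_hit_oom(jobid: str) -> bool:
    """Return True if any step of `jobid` ended in the OUT_OF_MEMORY state."""
    out = subprocess.run(
        ["sacct", "-j", jobid, "--parsable2", "--noheader",
         "--format=JobID,State"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        _, _, state = line.partition("|")
        # sacct prints the full state here, e.g. "OUT_OF_MEMORY"
        # (truncated to "OUT_OF_ME+" in the table above)
        if state.strip().startswith("OUT_OF_MEMORY"):
            return True
    return False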

Minimal example

rule use_ram:
    input:
        '/dev/zero'
    threads: 1
    resources:
        mem_mb=10240,
        partition='debug',
        nodes=1
    retries: 10
    shell:
        "head -c 12G {input} | tail"

Additional context
What I would expect is that, when a step runs into OOM, the plugin either cancels the sbatch job or reports the failure back to the controlling Snakemake process, which could then resubmit the job, potentially with more RAM. I am well aware that with the static mem_mb in this specific example a resubmission would not actually get more RAM, but I still think that should be the behaviour instead of retrying the step repeatedly after it has already hit OOM.
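
If the OOM failure were propagated back to the controlling Snakemake process as a failed job, the existing retries mechanism could provide the resubmission with more RAM via an attempt-scaled resource callable, which is standard Snakemake. A sketch of the rule above adjusted that way (same hypothetical setup as the minimal example):

rule use_ram:
    input:
        '/dev/zero'
    threads: 1
    resources:
        # `attempt` starts at 1 and increments on each resubmission,
        # so every retry requests 10 GiB more than the previous one
        mem_mb=lambda wildcards, attempt: attempt * 10240,
        partition='debug',
        nodes=1
    retries: 10
    shell:
        "head -c 12G {input} | tail"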
