
OOM handling for jobstep plugin #348

@selten

Description

Software Versions

$ snakemake --version
9.9.0
$ conda list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 1.6.0              pyhdfd78af_0    bioconda
snakemake-executor-plugin-slurm-jobstep 0.3.0              pyhdfd78af_0    bioconda
$ sinfo --version
slurm 24.11.

Describe the bug
When a job step run by slurm_jobstep hits OOM, the plugin keeps retrying the step even though it has already failed with OOM. I would expect the job to be resubmitted instead, so it can potentially be given more RAM.

Logs

~# sacct -j [JOBID]
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[JOBID]      [JOBNAME]-+      debug    default          1    RUNNING      0:0
[JOBID].bat+      batch               default          1    RUNNING      0:0
[JOBID].0    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].1    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].2    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].3    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].4    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].5    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].6    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].7    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].8    python3.12               default          1 OUT_OF_ME+    0:125
[JOBID].9    python3.12               default          1 OUT_OF_ME+    0:125
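
Every failing step above ends in the OUT_OF_MEMORY state (shown truncated as OUT_OF_ME+) with exit code 0:125, so the condition is visible to Slurm's accounting. As a minimal sketch of how it could be detected, here is a hypothetical helper (not part of the plugin's actual API) that assumes sacct is available where the check runs:

import subprocess

# Hypothetical helper, not part of snakemake-executor-plugin-slurm-jobstep;
# assumes `sacct` is on PATH.
def step_hit_oom(jobid: str) -> bool:
    """Return True if any step of `jobid` ended in the OUT_OF_MEMORY state."""
    out = subprocess.run(
        ["sacct", "-j", jobid, "--parsable2", "--noheader",
         "--format=JobID,State"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        _, _, state = line.partition("|")
        # sacct prints the full state here, e.g. "OUT_OF_MEMORY"
        # (truncated to "OUT_OF_ME+" in the table above)
        if state.strip().startswith("OUT_OF_MEMORY"):
            return True
    return False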

Minimal example

rule use_ram:
    input:
        '/dev/zero'
    threads: 1
    resources:
        mem_mb=10240,
        partition='debug',
        nodes=1
    retries: 10
    shell:
        "head -c 12G {input} | tail"

Additional context
What I would expect is that, when a step runs into OOM, the plugin either cancels the sbatch job or reports the failure back to the controlling Snakemake process, which could then resubmit the job, potentially with more RAM. I am well aware that with the static mem_mb in this specific example a resubmission would not actually get more RAM, but I still think that should be the behaviour instead of retrying the step repeatedly after it has already hit OOM.
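
If the OOM failure were propagated back to the controlling Snakemake process as a failed job, the existing retries mechanism could provide the resubmission with more RAM via an attempt-scaled resource callable, which is standard Snakemake. A sketch of the rule above adjusted that way (same hypothetical setup as the minimal example):

rule use_ram:
    input:
        '/dev/zero'
    threads: 1
    resources:
        # `attempt` starts at 1 and increments on each resubmission,
        # so every retry requests 10 GiB more than the previous one
        mem_mb=lambda wildcards, attempt: attempt * 10240,
        partition='debug',
        nodes=1
    retries: 10
    shell:
        "head -c 12G {input} | tail"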
