Checkpointing in signal handlers for SLURM auto-requeueing leads to intermittent failures #21406

@martenlienen

Bug description

I have been experiencing intermittent, hard-to-pin-down failures when Lightning tries to auto-requeue my jobs on our SLURM cluster on timeout. After some debugging (a print in the signal handler for SIGUSR1 and many runs), I saw that the handler sometimes just stops running after or while saving the HPC checkpoint to disk.

Signal handlers are a special environment because they can run after any Python bytecode instruction, including in the middle of other functions. This can cause hard-to-trace problems and even crashes, e.g. python/cpython#112608. A handler could also run in the middle of a backward pass and store a corrupted checkpoint. So the safe option would be to do as little work as possible in the handler and move the actual signal handling into normal code.
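
To illustrate the "do as little as possible in the handler" idea, here is a minimal sketch in plain Python. It is not Lightning's implementation: `RequeueFlag`, `install`, `train`, `save_checkpoint` and `run_step` are hypothetical names made up for this example. The handler only records that the signal arrived, and the loop checks the flag at a step boundary, where saving a checkpoint is safe.

```python
import signal


class RequeueFlag:
    """Hypothetical helper: remembers that a signal arrived and does nothing else."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Keep the handler trivial: no I/O, no locks, no checkpointing.
        self.requested = True


def install(flag, signum):
    # Register the flag-setting handler, e.g. install(flag, signal.SIGUSR1).
    signal.signal(signum, flag.handler)


def train(num_steps, flag, save_checkpoint, run_step=lambda step: None):
    """Toy loop standing in for a trainer; returns the step at which it stopped."""
    for step in range(num_steps):
        run_step(step)  # forward/backward/optimizer step would happen here
        # Check the flag only between steps, so the checkpoint is never
        # written in the middle of a backward pass.
        if flag.requested:
            save_checkpoint(step)
            return step
    return None
```

With this structure, the checkpoint and requeue logic run as ordinary code in the main loop, at the cost of reacting only once the current step has finished. That delay has to fit within the lead time SLURM gives before killing the job.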

What version are you seeing the problem on?

v2.5, master

Reproduced in studio

No response

How to reproduce the bug

The issue is difficult to reproduce as it depends on signal timing and the exact environment. It never happens on one cluster but regularly on the other.

Error messages and logs

No error messages. The signal handler just never completes until the program gets killed by SLURM.

Environment

Current environment
  • CUDA:
    • GPU: None
    • available: False
    • version: 12.6
  • Lightning:
    • lightning: 2.6.0
    • lightning-utilities: 0.15.2
    • pytorch-lightning: 2.6.0
    • torch: 2.9.1+cu126
    • torch-fidelity: 0.3.0
    • torchdata: 0.11.0
    • torchmetrics: 1.8.2
    • torchvision: 0.24.1+cu126
  • Packages:
    • aiohappyeyeballs: 2.6.1
    • aiohttp: 3.13.2
    • aiosignal: 1.4.0
    • annotated-types: 0.7.0
    • antlr4-python3-runtime: 4.9.3
    • anyio: 4.12.0
    • argon2-cffi: 25.1.0
    • argon2-cffi-bindings: 25.1.0
    • arrow: 1.4.0
    • asttokens: 3.0.1
    • async-lru: 2.0.5
    • attrs: 25.4.0
    • autocommand: 2.2.2
    • babel: 2.17.0
    • backports.tarfile: 1.2.0
    • beautifulsoup4: 4.14.3
    • bleach: 6.3.0
    • brezn: 0.1.0
    • bsi: 0.1.0
    • cachetools: 6.2.2
    • cattrs: 25.3.0
    • certifi: 2025.11.12
    • cffi: 2.0.0
    • charset-normalizer: 3.4.4
    • click: 8.3.1
    • cloudpickle: 3.1.2
    • comm: 0.2.3
    • contourpy: 1.3.3
    • cycler: 0.12.1
    • debugpy: 1.8.17
    • decorator: 5.2.1
    • defusedxml: 0.7.1
    • einops: 0.8.1
    • executing: 2.2.1
    • fastjsonschema: 2.21.2
    • filelock: 3.20.0
    • fonttools: 4.61.0
    • fqdn: 1.5.1
    • frozenlist: 1.8.0
    • fsspec: 2025.12.0
    • gitdb: 4.0.12
    • gitignorant: 0.4.0
    • gitpython: 3.1.45
    • h11: 0.16.0
    • h5py: 3.15.1
    • httpcore: 1.0.9
    • httpx: 0.28.1
    • hydra-core: 1.3.2
    • hydra-submitit-launcher: 1.4.0.dev0
    • idna: 3.11
    • importlib-metadata: 8.0.0
    • inflect: 7.3.1
    • iniconfig: 2.3.0
    • ipdb: 0.13.13
    • ipykernel: 7.1.0
    • ipympl: 0.9.8
    • ipython: 9.8.0
    • ipython-pygments-lexers: 1.1.1
    • ipywidgets: 8.1.8
    • isoduration: 20.11.0
    • jaraco.collections: 5.1.0
    • jaraco.context: 5.3.0
    • jaraco.functools: 4.0.1
    • jaraco.text: 3.12.1
    • jaxtyping: 0.3.3
    • jedi: 0.19.2
    • jinja2: 3.1.6
    • json5: 0.12.1
    • jsonpointer: 3.0.0
    • jsonschema: 4.25.1
    • jsonschema-specifications: 2025.9.1
    • jupyter-client: 8.6.3
    • jupyter-core: 5.9.1
    • jupyter-events: 0.12.0
    • jupyter-lsp: 2.3.0
    • jupyter-server: 2.17.0
    • jupyter-server-terminals: 0.5.3
    • jupyterlab: 4.5.0
    • jupyterlab-pygments: 0.3.0
    • jupyterlab-server: 2.28.0
    • jupyterlab-widgets: 3.0.16
    • kiwisolver: 1.4.9
    • lark: 1.3.1
    • lightning: 2.6.0
    • lightning-utilities: 0.15.2
    • loky: 3.5.6
    • markdown-it-py: 4.0.0
    • markupsafe: 3.0.3
    • matplotlib: 3.10.7
    • matplotlib-inline: 0.2.1
    • mdurl: 0.1.2
    • mistune: 3.1.4
    • more-itertools: 10.3.0
    • mpmath: 1.3.0
    • multidict: 6.7.0
    • nbclient: 0.10.2
    • nbconvert: 7.16.6
    • nbformat: 5.10.4
    • nest-asyncio: 1.6.0
    • networkx: 3.6
    • notebook-shim: 0.2.4
    • numpy: 2.3.5
    • nvidia-cublas-cu12: 12.6.4.1
    • nvidia-cuda-cupti-cu12: 12.6.80
    • nvidia-cuda-nvrtc-cu12: 12.6.77
    • nvidia-cuda-runtime-cu12: 12.6.77
    • nvidia-cudnn-cu12: 9.10.2.21
    • nvidia-cufft-cu12: 11.3.0.4
    • nvidia-cufile-cu12: 1.11.1.6
    • nvidia-curand-cu12: 10.3.7.77
    • nvidia-cusolver-cu12: 11.7.1.2
    • nvidia-cusparse-cu12: 12.5.4.2
    • nvidia-cusparselt-cu12: 0.7.1
    • nvidia-nccl-cu12: 2.27.5
    • nvidia-nvjitlink-cu12: 12.6.85
    • nvidia-nvshmem-cu12: 3.3.20
    • nvidia-nvtx-cu12: 12.6.77
    • omegaconf: 2.3.0
    • packaging: 25.0
    • pandocfilters: 1.5.1
    • parso: 0.8.5
    • pexpect: 4.9.0
    • pillow: 12.0.0
    • platformdirs: 4.5.0
    • pluggy: 1.6.0
    • prometheus-client: 0.23.1
    • prompt-toolkit: 3.0.52
    • propcache: 0.4.1
    • protobuf: 6.33.1
    • psutil: 7.1.3
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.3
    • pycparser: 2.23
    • pydantic: 2.12.5
    • pydantic-core: 2.41.5
    • pygments: 2.19.2
    • pyparsing: 3.2.5
    • pytest: 9.0.1
    • python-dateutil: 2.9.0.post0
    • python-json-logger: 4.0.0
    • pytorch-lightning: 2.6.0
    • pyyaml: 6.0.3
    • pyzmq: 27.1.0
    • referencing: 0.37.0
    • requests: 2.32.5
    • rfc3339-validator: 0.1.4
    • rfc3986-validator: 0.1.1
    • rfc3987-syntax: 1.1.0
    • rich: 14.2.0
    • rpds-py: 0.30.0
    • scipy: 1.16.3
    • send2trash: 1.8.3
    • sentry-sdk: 2.47.0
    • setuptools: 80.9.0
    • six: 1.17.0
    • smmap: 5.0.2
    • soupsieve: 2.8
    • stack-data: 0.6.3
    • submitit: 1.5.3
    • sympy: 1.14.0
    • terminado: 0.18.1
    • tinycss2: 1.4.0
    • toml: 0.10.2
    • tomli: 2.0.1
    • torch: 2.9.1+cu126
    • torch-fidelity: 0.3.0
    • torchdata: 0.11.0
    • torchmetrics: 1.8.2
    • torchvision: 0.24.1+cu126
    • tornado: 6.5.2
    • tqdm: 4.67.1
    • traitlets: 5.14.3
    • triton: 3.5.1
    • typeguard: 4.3.0
    • typing-extensions: 4.15.0
    • typing-inspection: 0.4.2
    • tzdata: 2025.2
    • uri-template: 1.3.0
    • urllib3: 2.5.0
    • wadler-lindig: 0.1.7
    • wandb: 0.23.1
    • wcwidth: 0.2.14
    • webcolors: 25.10.0
    • webencodings: 0.5.1
    • websocket-client: 1.9.0
    • wheel: 0.45.1
    • widgetsnbextension: 4.0.15
    • yarl: 1.22.0
    • zipp: 3.19.2
  • System:

More info

No response

cc @ethanwharris

Labels

bug, needs triage, ver: 2.5.x