How to auto-resume torchrun multi-node training after hitting SLURM time limit? #20263
Unanswered
amorehead asked this question in DDP / multi-GPU / multi-node
Hello. I have recently been trying to set up fault-tolerant (and time-limit-tolerant) multi-node training with a SLURM script on a cluster I have access to. I can successfully train a model using 2 nodes with 4 GPUs per node. However, when my SLURM job hits its time limit (e.g., after 1 hour), the job is not resubmitted automatically (as it would be when using the `SLURMEnvironment` plugin), nor does my rendezvous node automatically restart the timed-out workers. My question is: what is the standard process for setting up auto-restarts with `torchrun` and PyTorch Lightning on a SLURM cluster?
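
For context, here is roughly what I have in mind for the training entrypoint. This is only a minimal, self-contained sketch of my current understanding, not a working recipe: I am assuming the resume side is covered by Lightning's `ModelCheckpoint(save_last=True)` plus `trainer.fit(..., ckpt_path="last")`, and that the requeue side would have to live in the batch script itself (e.g. `#SBATCH --requeue` together with an `scontrol requeue $SLURM_JOB_ID` hook on a signal), since `torchrun` does not seem to do that for me.

```python
# Sketch of the training entrypoint (assumptions noted in comments).
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint


class ToyModel(pl.LightningModule):
    """Stand-in for my real model, only to make the sketch self-contained."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


def main():
    # Keep a rolling "last" checkpoint; the dirpath would need to be on a
    # shared filesystem visible to all nodes (my assumption about the cluster).
    checkpoint_cb = ModelCheckpoint(dirpath="checkpoints/", save_last=True)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,       # 4 GPUs per node
        num_nodes=2,     # 2 nodes
        strategy="ddp",
        max_epochs=10,
        callbacks=[checkpoint_cb],
    )

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    train_loader = DataLoader(dataset, batch_size=64)

    # ckpt_path="last" resumes from the most recent saved checkpoint if one
    # exists; on the very first run there is none and training starts fresh.
    trainer.fit(ToyModel(), train_loader, ckpt_path="last")


if __name__ == "__main__":
    main()
```

As far as I can tell, the missing piece is the requeue itself: because the job is launched with `torchrun`, Lightning picks up the torchelastic environment rather than `SLURMEnvironment`, so the auto-requeue-on-signal behavior never kicks in, and I am not sure what the recommended way is to wire that up.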