How to auto-resume torchrun multi-node training after hitting SLURM time limit? #20263
Unanswered
amorehead asked this question in DDP / multi-GPU / multi-node
Hello. I have recently been trying to set up fault-tolerant (and time-limit-tolerant) multi-node training with a SLURM script on a cluster I have access to. I can successfully train a model using 2 nodes with 4 GPUs per node. However, when my SLURM job hits its time limit (e.g., after 1 hour), the job is not resubmitted automatically (as it would be when using the `SLURMEnvironment` plugin), nor does my rendezvous node automatically restart the timed-out workers. My question is: what is the standard process for setting up auto-restarts with `torchrun` and PyTorch Lightning on a SLURM cluster?
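
For context, here is roughly what I have in mind for the training entrypoint. This is only a minimal, self-contained sketch of my current understanding, not a working recipe: I am assuming the resume side is covered by Lightning's `ModelCheckpoint(save_last=True)` plus `trainer.fit(..., ckpt_path="last")`, and that the requeue side would have to live in the batch script itself (e.g. `#SBATCH --requeue` together with an `scontrol requeue $SLURM_JOB_ID` hook on a signal), since `torchrun` does not seem to do that for me.

```python
# Sketch of the training entrypoint (assumptions noted in comments).
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint


class ToyModel(pl.LightningModule):
    """Stand-in for my real model, only to make the sketch self-contained."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


def main():
    # Keep a rolling "last" checkpoint; the dirpath would need to be on a
    # shared filesystem visible to all nodes (my assumption about the cluster).
    checkpoint_cb = ModelCheckpoint(dirpath="checkpoints/", save_last=True)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,       # 4 GPUs per node
        num_nodes=2,     # 2 nodes
        strategy="ddp",
        max_epochs=10,
        callbacks=[checkpoint_cb],
    )

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    train_loader = DataLoader(dataset, batch_size=64)

    # ckpt_path="last" resumes from the most recent saved checkpoint if one
    # exists; on the very first run there is none and training starts fresh.
    trainer.fit(ToyModel(), train_loader, ckpt_path="last")


if __name__ == "__main__":
    main()
```

As far as I can tell, the missing piece is the requeue itself: because the job is launched with `torchrun`, Lightning picks up the torchelastic environment rather than `SLURMEnvironment`, so the auto-requeue-on-signal behavior never kicks in, and I am not sure what the recommended way is to wire that up.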