DeepSpeed on Multi Node Cluster #11647
Unanswered
maxzvyagin
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Hi,
Is there a best practice for starting a run with PyTorch Lightning and DeepSpeed on a local multi-node cluster? I'm able to get things working on a single node just fine but would like to scale up. Currently, my sbatch command results in the single-node program running independently on each node, which isn't the desired behavior. The cluster is managed with Slurm, but I'm able to launch jobs interactively/manually if need be. Thanks!
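In case it helps with diagnosing this, here is a minimal sketch of a check that could go at the top of the training script to show how Slurm laid out each process. It assumes the standard Slurm environment variables (`SLURM_NNODES`, `SLURM_NTASKS`, `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NODEID`) are set for the launched tasks; `slurm_layout` is a hypothetical helper name, not part of Lightning or DeepSpeed.

```python
import os


def slurm_layout(env=None):
    """Summarize the Slurm task layout seen by the current process.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    Missing variables fall back to single-process defaults (assumption: the
    script may also be run outside of Slurm).
    """
    env = os.environ if env is None else env
    return {
        "num_nodes": int(env.get("SLURM_NNODES", 1)),    # nodes in the allocation
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total tasks in the job
        "global_rank": int(env.get("SLURM_PROCID", 0)),  # rank of this task in the job
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # rank of this task on its node
        "node_rank": int(env.get("SLURM_NODEID", 0)),    # index of this node
    }


if __name__ == "__main__":
    layout = slurm_layout()
    print(" ".join(f"{key}={value}" for key, value in layout.items()))
```

If every process reports `world_size=1` or `global_rank=0`, that would suggest the processes were started as independent single-node runs rather than as tasks of one distributed job.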