DDP fails on multinode if each node has a different number of GPUs #15695
Unanswered
SerezD asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Hi there,
I'm launching a PyTorch-Lightning script in a multinode environment.
In order to do so, I have followed the suggestions at this link:
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_intermediate_1.html
I have configured a bash script that gets the number of nodes assigned and the number of GPUs on each node, as well as the master node, master address, and so on.
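For reference, a minimal sketch of that part of the script, assuming a SLURM-like scheduler (the commands and variable names here are illustrative, not the exact ones I use):

```bash
# Illustrative sketch of gathering the job information (assuming SLURM)
NODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")  # one hostname per line
export MASTER_ADDR=$(echo "$NODES" | head -n 1)         # first node acts as master
export MASTER_PORT=12910                                 # any free port, same on every node
export NUM_NODES=$(echo "$NODES" | wc -l)
GPUS_ON_NODE=$(nvidia-smi --list-gpus | wc -l)           # GPUs visible on the current node
```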
In the bash script I launch the mpirun command on every node, like:
`mpirun -np 1 -H $NODE $EXEC $SCRIPT &`
where `$NODE` is the current node, `$EXEC` is the `python3` command, and `$SCRIPT` is my Python script.
In the Python script (which is launched once per node, so N times in total), I correctly assign the variables:
Then, in the trainer:
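Roughly like this (a minimal sketch; in the real script the numbers are filled in from the scheduler, here they match the 4-node example below):

```python
import pytorch_lightning as pl

# Minimal sketch: devices / num_nodes are filled in from the scheduler in the real script
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,       # GPUs assigned on this node
    num_nodes=4,     # number of nodes in the job
    strategy="ddp",
)
```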
Everything works fine if the scheduler assigns the same number of GPUs to each node
(2 GPUs on each of 4 nodes in the following example, 8 GPUs in total).
The problem arises when the scheduler assigns a different number of GPUs to each node, which of course happens very often since I am not the only person using the cluster.
In the following example, 4 GPUs are assigned across 3 nodes (1, 2, 1).
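My understanding of the resulting rank arithmetic for this (1, 2, 1) assignment, assuming each node derives the world size from its own device count times the number of nodes (a sketch of the arithmetic, not the actual log output):

```python
# Sketch of the per-node arithmetic under the homogeneous-node assumption
num_nodes = 3
gpus_per_node = {0: 1, 1: 2, 2: 1}  # node_rank -> GPUs assigned on that node

for node_rank, devices in gpus_per_node.items():
    world_size = devices * num_nodes  # what this node believes the total is
    global_ranks = [node_rank * devices + local_rank for local_rank in range(devices)]
    print(f"node {node_rank}: world_size={world_size}, global_ranks={global_ranks}")

# Prints:
# node 0: world_size=3, global_ranks=[0]
# node 1: world_size=6, global_ranks=[2, 3]
# node 2: world_size=3, global_ranks=[2]
```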
As you can see, the GLOBAL_RANK and MEMBER values assigned are wrong. In particular, for MEMBER, the node with 2 GPUs assumes a total number of processes of 6 (2 times WORLD_SIZE), while the nodes with one GPU assume a total number of processes of 3 (1 times WORLD_SIZE).
Does anyone know how to solve this?
Is it a problem with DDP?
Thank you