How to set up DDP correctly if processes are created externally & CUDA_VISIBLE_DEVICES is set
#13736
Unanswered
yongsiang-fb
asked this question in
DDP / multi-GPU / multi-node
Replies: 1 comment 4 replies
-
Do you mean each node has 1 GPU visible? If yes, in such a case, ...
-
Hi,

I am trying to get PyTorch Lightning to work within a certain cluster environment. In particular, the DDP processes are created externally, and `CUDA_VISIBLE_DEVICES` is set by the cluster manager so that only 1 device is visible, which is the device the process is supposed to use.

I found that in this situation, if I define a subclass of `ClusterEnvironment` where `local_rank` is set to the real local rank of the process, an exception is thrown, because PyTorch Lightning attempts to access `self.parallel_devices[self.local_rank]` but there is only 1 device present in `self.parallel_devices` because of `CUDA_VISIBLE_DEVICES`.

What would be the best approach to make it work? Should I implement my own strategy class to override the behavior of `self.parallel_devices[self.local_rank]`?

Thanks a lot!
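
To make the mismatch concrete, here is a minimal, library-free sketch of the situation described above. It does not use or reproduce PyTorch Lightning's code. The names `visible_devices` and `pick_device_index` are hypothetical helpers invented for illustration, and falling back to index 0 when a single device is visible is only one possible workaround, stated as an assumption rather than a recommended fix.

```python
import os


def visible_devices(env=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES (empty list if unset)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [d.strip() for d in raw.split(",") if d.strip()]


def pick_device_index(local_rank, num_visible):
    """Map a process's real local rank to an index into its visible devices.

    Hypothetical helper (assumption, not a Lightning API): when the cluster
    manager exposes exactly one device per process, that device is always at
    index 0 inside the process, whatever the real local rank is.
    """
    if num_visible <= 0:
        raise ValueError("no visible devices")
    if num_visible == 1:
        return 0
    if not 0 <= local_rank < num_visible:
        raise IndexError("local_rank is out of range for the visible devices")
    return local_rank


# Simulate the process with real local rank 3 on a node where the cluster
# manager has set CUDA_VISIBLE_DEVICES=3 for it.
env = {"CUDA_VISIBLE_DEVICES": "3"}
local_rank = 3
parallel_devices = [f"cuda:{i}" for i in range(len(visible_devices(env)))]

# Indexing the one-element list by the real local rank fails, which is the
# kind of exception described in the question.
try:
    device = parallel_devices[local_rank]
except IndexError:
    device = None

# Using the remapped index selects the only visible device.
remapped = parallel_devices[pick_device_index(local_rank, len(parallel_devices))]
print(device, remapped)
```

Inside a process, CUDA renumbers the visible devices from 0, so the single exposed GPU is `cuda:0` even though its physical ID is 3. Whether the remapping belongs in a `ClusterEnvironment` subclass or in a custom strategy is the open question here.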