Implementing a Metric that includes an nn.Module doesn't work correctly in parallel #6693
Unanswered
import-antigravity
asked this question in
DDP / multi-GPU / multi-node
Replies: 3 comments 8 replies
-
Bump
1 reply
-
Sidenote: using |
0 replies
-
Are you using a shared cluster/machine by any chance? That error can occur when another user is using the GPU resources and the GPUs are set to exclusive mode.
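One way to check the exclusive-mode theory is to ask `nvidia-smi` for each GPU's compute mode. The snippet below is only a rough sketch: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_compute_modes` and `exclusive_gpus` are made up for illustration.

```python
import subprocess

# nvidia-smi query for "index, compute_mode" per GPU, one CSV row each.
QUERY = ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"]


def parse_compute_modes(csv_text):
    """Map GPU index -> compute mode string from nvidia-smi CSV output."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def exclusive_gpus(modes):
    """Return the indices of GPUs whose compute mode starts with 'Exclusive'."""
    return sorted(i for i, mode in modes.items() if mode.startswith("Exclusive"))


if __name__ == "__main__":
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not query nvidia-smi: {exc}")
    else:
        modes = parse_compute_modes(result.stdout)
        print("compute modes:", modes)
        print("GPUs in an exclusive mode:", exclusive_gpus(modes))
```

If any GPU shows up in an exclusive mode, a second process trying to create a context on that same device would be refused, which could explain the "busy or unavailable" message.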
7 replies
-
I implemented the FID metric, which involves using a pre-trained Inception network. I have the following code to move it to CUDA:
When I train on more than one GPU with DDP, this raises an exception:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I'm not sure what's causing this. I know that device placement is supposed to be taken care of automatically, but for some reason it's not working for me.
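For context, here is a guess at one thing that can go wrong. A bare `.cuda()` call targets the current CUDA device, which is `cuda:0` unless it has been changed, so every DDP process may end up touching the same GPU. The sketch below picks a per-process device string instead. It assumes the launcher exports a `LOCAL_RANK` environment variable (as `torchrun` does), and `device_for_this_process` is a hypothetical helper, not part of any library.

```python
import os


def device_for_this_process(env=None, cuda_available=True):
    """Pick a device string for the current DDP process.

    Assumption: the launcher exports LOCAL_RANK (one value per process on a node).
    Falls back to rank 0 when the variable is missing, and to "cpu" without CUDA.
    """
    env = os.environ if env is None else env
    if not cuda_available:
        return "cpu"
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"


if __name__ == "__main__":
    # Hypothetical usage: inception.to(device_for_this_process())
    print(device_for_this_process())
```

Since a torchmetrics `Metric` is itself an `nn.Module`, assigning the network as an attribute in `__init__` registers it as a submodule, so moving the metric should move the network with it and an explicit device call may not be needed at all.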