DDP training hangs when one GPU returns zero loss #12189
Unanswered
kazimpal87
asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 2 replies
-
Interesting case! In theory, if the loss is 0 for some batch on a specific device, the gradients for the parameters on that device are zero, so ideally no gradients would need to be computed there and the gradient sync should still happen. Can you try returning None instead of 0?
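A minimal sketch of that suggestion, assuming this is a PyTorch Lightning LightningModule (the discussion category suggests Lightning, but the post only mentions train_step); the model, batch format, and the empty-targets check are illustrative assumptions, not the poster's code:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitModel(pl.LightningModule):  # illustrative module, shapes are arbitrary
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        if targets.numel() == 0:
            # No ground truth in this batch: return None instead of a detached
            # zero loss; with automatic optimization Lightning skips the batch.
            # Whether this avoids the DDP hang is exactly what this thread asks.
            return None
        preds = self.layer(inputs)
        return F.mse_loss(preds, targets)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```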
-
I have a situation where sometimes a batch contains no ground truth, so train_step needs to return a zero loss. In the single-GPU case I can just return
torch.tensor([0.0], requires_grad=True, device=self.device)
and it works. However, in the DDP case, if one GPU gets a batch like this, the entire training process hangs. I guess this is because the zero tensor doesn't carry any gradient information, so the gradient reduction never completes. Is there some way around this? How can I return a zero loss from train_step?
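For reference, a minimal sketch of the setup described above; the wrapper class, batch format, and the empty-targets check are illustrative assumptions rather than the poster's actual code:

```python
import torch
import torch.nn.functional as F

class MyTrainer:  # hypothetical wrapper; the real training loop is not shown in the post
    def __init__(self, model, device):
        self.model = model
        self.device = device

    def train_step(self, batch):
        inputs, targets = batch
        if targets.numel() == 0:
            # No ground truth: this works on a single GPU, but the tensor is not
            # connected to any model parameter, so this rank produces no parameter
            # gradients while the other DDP ranks wait in the gradient all-reduce.
            return torch.tensor([0.0], requires_grad=True, device=self.device)
        preds = self.model(inputs)
        return F.mse_loss(preds, targets)
```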