DDP training hangs when one GPU returns zero loss #12189
Unanswered
kazimpal87
asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 2 replies
-
Interesting case! In theory, if the loss is 0 for some batch on a specific device, the gradients for the parameters on that device are zero, so ideally no gradients would need to be computed there and the gradient sync should still happen. Can you try returning None instead of 0?
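A minimal sketch of that suggestion, assuming this is a PyTorch Lightning LightningModule (the discussion category suggests Lightning, but the post only mentions train_step); the model, batch format, and the empty-targets check are illustrative assumptions, not the poster's code:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitModel(pl.LightningModule):  # illustrative module, shapes are arbitrary
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        if targets.numel() == 0:
            # No ground truth in this batch: return None instead of a detached
            # zero loss; with automatic optimization Lightning skips the batch.
            # Whether this avoids the DDP hang is exactly what this thread asks.
            return None
        preds = self.layer(inputs)
        return F.mse_loss(preds, targets)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```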
-
I have a situation where sometimes a batch contains no ground truth, so train_step needs to return a zero loss. In the single-GPU case I can just return
torch.tensor([0.0], requires_grad=True, device=self.device)
and it works. However, in the DDP case, if one GPU gets a batch like this, the entire training process hangs. I guess this is because the zero tensor doesn't carry any gradient information, so the gradient reduction never completes. Is there some way around this? How can I return a zero loss from train_step?
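For reference, a minimal sketch of the setup described above; the wrapper class, batch format, and the empty-targets check are illustrative assumptions rather than the poster's actual code:

```python
import torch
import torch.nn.functional as F

class MyTrainer:  # hypothetical wrapper; the real training loop is not shown in the post
    def __init__(self, model, device):
        self.model = model
        self.device = device

    def train_step(self, batch):
        inputs, targets = batch
        if targets.numel() == 0:
            # No ground truth: this works on a single GPU, but the tensor is not
            # connected to any model parameter, so this rank produces no parameter
            # gradients while the other DDP ranks wait in the gradient all-reduce.
            return torch.tensor([0.0], requires_grad=True, device=self.device)
        preds = self.model(inputs)
        return F.mse_loss(preds, targets)
```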