DataLoader producing NaNs with DDP #7570
Unanswered
jopo666 asked this question in DDP / multi-GPU / multi-node

I have a weird and persistent problem where everything works fine when training with one GPU, but when I move on to multi-GPU training, some individual pixels in my input images become NaNs, which of course crashes the training. It happens randomly, and there is nothing wrong with the images themselves, as I check for NaNs in my `Dataset.__call__` function. Then, in `training_step`, the NaNs magically appear, in random inputs at random pixels. So could there be a problem inside the `collate_fn`?

I'm not sure whether this is a `pytorch-lightning`, `pytorch`, or hardware problem (that's why I didn't create an Issue). Has anyone encountered anything similar?
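
For reference, this is roughly how I could rule the `collate_fn` in or out. The wrapper below is only a sketch I haven't run yet, it uses PyTorch's stock `default_collate`, and the `DataLoader` arguments at the bottom are placeholders rather than my real settings:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate


def checked_collate(samples):
    """Wrap the stock collate_fn and check its input and output for NaNs."""
    for i, sample in enumerate(samples):
        # Samples arriving here already passed the per-sample check in the
        # dataset, so this assert should never fire.
        assert not torch.isnan(sample).any(), f"NaN already present in sample {i}"
    batch = default_collate(samples)  # stack the samples into one tensor
    assert not torch.isnan(batch).any(), "NaN introduced during collation"
    return batch


# Hypothetical loader, just to show where the wrapper plugs in:
# loader = DataLoader(my_dataset, batch_size=32, num_workers=8,
#                     pin_memory=True, collate_fn=checked_collate)
```

If the per-sample asserts stay silent but the batch-level one fires, the corruption happens during collation; if both stay silent, the NaNs must be introduced later, e.g. during the transfer to the GPU.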

Replies: 1 comment

Yes, this is common enough that we have a flag. You can try and use it, but anything could be causing it. You should try debugging step by step.
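
A minimal sketch of what that could look like. The flag isn't named above, so `terminate_on_nan` below is only a guess at which `Trainer` argument is meant (and the argument names have changed across Lightning versions), while `find_first_nan` is one way to step through the pipeline outside of Lightning and see where the NaNs first appear:

```python
import torch
import pytorch_lightning as pl

# Assumption: the flag referred to is `terminate_on_nan`, which stops training
# as soon as a non-finite value shows up in the loss or parameters. The exact
# Trainer arguments differ between Lightning versions.
trainer = pl.Trainer(gpus=2, accelerator="ddp", terminate_on_nan=True)


def find_first_nan(loader, device="cuda:0", max_batches=100):
    """Pull batches straight from the DataLoader, outside of Lightning, and
    report the first stage at which NaNs show up: in the worker/collate
    output on the CPU, or only after the copy to the GPU."""
    for i, batch in zip(range(max_batches), loader):
        if torch.isnan(batch).any():
            return i, "cpu (worker / collate output)"
        gpu_batch = batch.to(device)
        if torch.isnan(gpu_batch).any():
            return i, "after transfer to GPU"
    return None  # no NaNs seen in the first `max_batches` batches
```

Running this once per GPU (with the same `num_workers` and `pin_memory` settings as the real training run) should at least narrow down whether the NaNs come from the data pipeline or from the device transfer.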