init.cc:1256 NCCL WARN Your program may be hanging, after validation. how to fix it #8213
Unanswered
Devoe-97 asked this question in DDP / multi-GPU / multi-node
Replies: 3 comments
-
Hi, I have often run into similar NCCL hangs when there wasn't enough data for all dataloader workers on all GPUs, for both the training and the validation data.
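To illustrate the point above, here is a rough sketch of a sanity check you could run before launching. It assumes a naive strided sharding (rank `r` takes samples `r`, `r + world_size`, ...), which is not necessarily what your sampler does, and the helper names `batches_per_rank` and `is_balanced` are made up for this example. If some rank ends up with fewer batches than the others (or none at all), the other ranks can end up waiting forever in a collective call.

```python
def batches_per_rank(num_samples, world_size, batch_size, drop_last=False):
    """Count the batches each rank would see under naive strided sharding.

    Assumption: rank r takes indices r, r + world_size, r + 2 * world_size, ...
    A real sampler may pad or drop samples differently.
    """
    counts = []
    for rank in range(world_size):
        n = len(range(rank, num_samples, world_size))  # samples owned by this rank
        if drop_last:
            counts.append(n // batch_size)       # incomplete last batch is dropped
        else:
            counts.append(-(-n // batch_size))   # ceiling division keeps the last batch
    return counts


def is_balanced(counts):
    """True when every rank gets the same, non-zero number of batches."""
    return min(counts) > 0 and len(set(counts)) == 1


if __name__ == "__main__":
    # Hypothetical sizes: a small validation set spread over 4 GPUs.
    for n in (3, 10, 16):
        counts = batches_per_rank(n, world_size=4, batch_size=2)
        print(n, counts, "balanced" if is_balanced(counts) else "UNBALANCED")
```

If the validation set comes out unbalanced, a larger validation set, a smaller batch size, or fewer GPUs would be the first things I'd try.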
-
Could you please provide some sample code to reproduce this, along with more details about your environment? 🐰
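For the environment details, something like the following might be a convenient starting point. It is only a sketch using the standard library: the helper name `collect_env_details` is made up here, and the list of variables is my guess at what is relevant (the `NCCL_*` variables plus the usual distributed launch variables), so add framework and driver versions by hand.

```python
import os
import platform
import sys

# Launch-related variables that are commonly set for distributed runs.
DIST_KEYS = (
    "CUDA_VISIBLE_DEVICES",
    "WORLD_SIZE",
    "RANK",
    "LOCAL_RANK",
    "MASTER_ADDR",
    "MASTER_PORT",
)


def collect_env_details(environ=None):
    """Gather basic interpreter, OS, and distributed-related environment info."""
    environ = os.environ if environ is None else environ
    details = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for key in sorted(environ):
        # Keep only NCCL settings and the launch variables listed above.
        if key.startswith("NCCL_") or key in DIST_KEYS:
            details[key] = environ[key]
    return details


if __name__ == "__main__":
    for name, value in collect_env_details().items():
        print(f"{name}: {value}")
```

Pasting that output together with the GPU model, driver version, and framework versions would make the report much easier to act on.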
-
Have you followed the instructions on the screen there? It says adding
-
When I run my code on one GPU, it seems to work fine.

When I run it on multiple GPUs, training works, but the NCCL warning appears after validation and saving the model.
How should I fix it?