DDP deadlock detected from rank 1 and CUDA error: operation not supported on A10 #14322
Unanswered
fugokidi asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
When using 2 GPUs with the ddp strategy, CUDA error: operation not supported occurs just before
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: Deadlock detected from rank:1
This only happens on the A10 VM. I have to set NCCL_P2P_LEVEL=PXB or P2P; otherwise an NCCL error is triggered.
PyTorch version: 1.12.1
CUDA runtime: 11.6
PyTorch Lightning: 1.7.2
When using only 1 GPU, training with the Lightning framework works perfectly.
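For reference, the two-GPU setup described above looks roughly like the sketch below. It is a minimal illustration, not the exact training script: the helper name `configure_nccl` is hypothetical, `NCCL_DEBUG=INFO` is an extra assumption added only to make NCCL log more detail, and the commented-out `model` is a placeholder for any LightningModule.

```python
import os


def configure_nccl(p2p_level: str = "PXB", debug: bool = True) -> dict:
    """Set NCCL environment variables and return the NCCL_* values in effect.

    Hypothetical helper. It should run at the top of the script, before
    Lightning launches the DDP processes and NCCL is initialised.
    """
    os.environ["NCCL_P2P_LEVEL"] = p2p_level
    if debug:
        # Assumption: verbose NCCL logging, useful when reporting the failure.
        os.environ["NCCL_DEBUG"] = "INFO"
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}


nccl_env = configure_nccl()

# Trainer arguments for two GPUs with the ddp strategy (Lightning 1.7 style).
trainer_kwargs = dict(accelerator="gpu", devices=2, strategy="ddp")

# With pytorch_lightning installed (model is a placeholder LightningModule):
# import pytorch_lightning as pl
# trainer = pl.Trainer(**trainer_kwargs)
# trainer.fit(model)
```

Changing `devices=2` to `devices=1` gives the single-GPU configuration that trains without errors.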
I have searched for similar errors, but the reports I found all involve old Lightning versions, whereas I'm using a recent one.
I would really appreciate it if somebody could share a working PyTorch Lightning setup for an A10 VM.
Thank you.