Multi-GPU training fails #931
Description
Search before asking
- I have searched the RF-DETR issues and found no similar bug report.
Bug
NCCL timeout in multi-GPU training
I trained for 2 epochs on a single-GPU machine and then moved to a multi-GPU machine for faster training. At the end of evaluation the run hangs for a long time and then fails with an NCCL collective timeout:
Epoch 2: 100%|███████████████████████████████████████████████████████| 2289/2289 [19:11<00:00, 1.99it/s, train/lr=0.0001, train/lr_min=4.04e-6, train/lr_max=0.0001]
Validation DataLoader 0: 100%|█████████████████████████████████████████████████| 303/303 [02:01<00:00, 2.49it/s]
[rank1]:[E407 10:02:26.551756518 ProcessGroupNCCL.cpp:683] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49209, OpType=ALLGATHER, NumelIn=43, NumelOut=172, Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out.
[rank1]:[E407 10:02:26.551997332 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 49209 PG status: last enqueued work: 49209, last completed work: 49208
[rank1]:[E407 10:02:26.552039573 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E407 10:02:26.552111664 ProcessGroupNCCL.cpp:2573] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank3]:[E407 10:02:26.564157461 ProcessGroupNCCL.cpp:683] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49210, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800069 milliseconds before timing out.
[rank3]:[E407 10:02:26.564359555 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 49210 PG status: last enqueued work: 49210, last completed work: 49209
[rank3]:[E407 10:02:26.564374605 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
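The watchdog message itself points at FlightRecorder. As a debugging sketch (not an official RF-DETR workaround), the standard PyTorch/NCCL environment knobs below can be set before torch initializes the process group to capture the failing collective and get verbose NCCL logs; the specific values are illustrative:

```python
import os

# Must run before torch.distributed initializes NCCL, e.g. at the very top
# of train_rfdetr.py, before importing rfdetr or torch.
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")  # enable FlightRecorder
os.environ.setdefault("NCCL_DEBUG", "INFO")                    # verbose NCCL logging
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")  # fail fast on timeout

# ... then import rfdetr and start training as usual
```

Alternatively, export the same variables in the shell before invoking torchrun. Since the hang happens at the end of eval on an ALLGATHER with mismatched sizes, it is also worth checking whether all ranks see the same number of validation batches.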
Environment
- rfdetr: latest
- num GPUs: 4
- Driver Version: 580.95.05
- CUDA Version: 13.0
- PyTorch: 2.9.1+cu128
Minimal Reproducible Example
from rfdetr import RFDETRBase

model = RFDETRBase()
model.train(
    dataset_dir="IE_Z22_043/images",
    epochs=200,
    batch_size=8,
    grad_accum_steps=4,
    task="segment",
    resolution=624,
    output_dir="rfdetr_ie_z22_043_finetuning",
    progress_bar=True,
    # fp16_eval=True,
    early_stopping=True,
    devices="auto",  # required; see note below
    strategy="ddp_find_unused_parameters_true",
    persistent_workers=True,
    device="cuda",
    checkpoint_interval=1,
    num_workers=4,
    warmup_epochs=5,
    pin_memory=True,
    resume="rfdetr_ie_z22_043_finetuning/checkpoint_1.ckpt",
)
I launched it with: torchrun --nproc_per_node=4 train_rfdetr.py
Additional
No response
Are you willing to submit a PR?
- Yes, I'd like to help by submitting a PR!