Multi-GPU training fails #931

@abhiagwl4262

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

NCCL timeout in multi-GPU training

I trained for 2 epochs on a single-GPU machine and then moved training to a multi-GPU machine for speed. At the end of evaluation, training hangs for a long time and then fails with an NCCL timeout:

Epoch 2: 100%|███████████████████████████████████████████████████████| 2289/2289 [19:11<00:00, 1.99it/s, train/lr=0.0001, train/lr_min=4.04e-6, train/lr_max=0.0001]
Validation DataLoader 0: 100%|█████████████████████████████████████████████████| 303/303 [02:01<00:00, 2.49it/s]

[rank1]:[E407 10:02:26.551756518 ProcessGroupNCCL.cpp:683] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49209, OpType=ALLGATHER, NumelIn=43, NumelOut=172, Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out.
[rank1]:[E407 10:02:26.551997332 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 49209 PG status: last enqueued work: 49209, last completed work: 49208
[rank1]:[E407 10:02:26.552039573 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E407 10:02:26.552111664 ProcessGroupNCCL.cpp:2573] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank3]:[E407 10:02:26.564157461 ProcessGroupNCCL.cpp:683] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49210, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800069 milliseconds before timing out.
[rank3]:[E407 10:02:26.564359555 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 49210 PG status: last enqueued work: 49210, last completed work: 49209
[rank3]:[E407 10:02:26.564374605 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
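For the next run I plan to follow the hint in the error output and enable more diagnostics. This is just a sketch of what I would set before relaunching (TORCH_NCCL_TRACE_BUFFER_SIZE is the variable named in the log; NCCL_DEBUG is NCCL's standard debug switch; the buffer size of 2000 is an arbitrary non-zero choice):

```shell
# Enable FlightRecorder so a timeout dumps the stack of the failed collective
# (any non-zero value works, per the error message above)
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000

# Verbose NCCL logging to see which collective and rank get stuck
export NCCL_DEBUG=INFO

torchrun --nproc_per_node=4 train_rfdetr.py
```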

Environment

  • rfdetr: latest
  • num-gpus: 4
  • Driver Version: 580.95.05
  • CUDA Version: 13.0
  • PyTorch: 2.9.1+cu128

Minimal Reproducible Example

from rfdetr import RFDETRBase

model = RFDETRBase()

model.train(
    dataset_dir="IE_Z22_043/images",
    epochs=200,
    batch_size=8,
    grad_accum_steps=4,
    task="segment",
    resolution=624,
    output_dir="rfdetr_ie_z22_043_finetuning",
    progress_bar=True,
    # fp16_eval=True,
    early_stopping=True,
    devices="auto",  # required
    strategy="ddp_find_unused_parameters_true",
    persistent_workers=True,
    device="cuda",
    checkpoint_interval=1,
    num_workers=4,
    warmup_epochs=5,
    pin_memory=True,
    resume="rfdetr_ie_z22_043_finetuning/checkpoint_1.ckpt",
)

I ran with torchrun --nproc_per_node=4 train_rfdetr.py
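As a temporary workaround while debugging (not verified against RF-DETR's internals, which may create the process group themselves), the 30-minute collective timeout seen in the log (1800000 ms) could be raised by initializing the process group before model.train() with an explicit timeout. A minimal sketch, using the gloo backend and single-process env settings only so it runs standalone; with torchrun the MASTER_ADDR/RANK variables are set for you and backend="nccl" would be used:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Single-process settings so this sketch runs standalone;
# torchrun sets all four of these in a real multi-GPU launch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Raise the collective timeout from the 30-minute default to 2 hours.
# With NCCL you would pass backend="nccl" instead of "gloo".
dist.init_process_group(backend="gloo", timeout=timedelta(hours=2))

ok = dist.is_initialized()
print(ok)  # True

dist.destroy_process_group()
```

This only papers over the hang if one rank is genuinely slower at eval; if a rank never reaches the collective at all, a longer timeout just delays the failure.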

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!

Labels: bug