Multi-GPU training fails #931
Description
Search before asking
- I have searched the RF-DETR issues and found no similar bug report.
Bug
NCCL timeout in multi-GPU training
I trained for 2 epochs on a single-GPU machine and then moved to a multi-GPU machine for faster training. At the end of evaluation the run hangs for a long time and then fails with an NCCL collective timeout:
Epoch 2: 100%|███████████████████████████████████████████████████████| 2289/2289 [19:11<00:00, 1.99it/s, train/lr=0.0001, train/lr_min=4.04e-6, train/lr_max=0.0001]
Validation DataLoader 0: 100%|█████████████████████████████████████████████████| 303/303 [02:01<00:00, 2.49it/s]
[rank1]:[E407 10:02:26.551756518 ProcessGroupNCCL.cpp:683] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49209, OpType=ALLGATHER, NumelIn=43, NumelOut=172, Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out.
[rank1]:[E407 10:02:26.551997332 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 49209 PG status: last enqueued work: 49209, last completed work: 49208
[rank1]:[E407 10:02:26.552039573 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E407 10:02:26.552111664 ProcessGroupNCCL.cpp:2573] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank3]:[E407 10:02:26.564157461 ProcessGroupNCCL.cpp:683] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49210, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800069 milliseconds before timing out.
[rank3]:[E407 10:02:26.564359555 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 49210 PG status: last enqueued work: 49210, last completed work: 49209
[rank3]:[E407 10:02:26.564374605 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
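The watchdog message itself points at FlightRecorder. As a debugging sketch (not an official RF-DETR workaround), the standard PyTorch/NCCL environment knobs below can be set before torch initializes the process group to capture the failing collective and get verbose NCCL logs; the specific values are illustrative:

```python
import os

# Must run before torch.distributed initializes NCCL, e.g. at the very top
# of train_rfdetr.py, before importing rfdetr or torch.
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")  # enable FlightRecorder
os.environ.setdefault("NCCL_DEBUG", "INFO")                    # verbose NCCL logging
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")  # fail fast on timeout

# ... then import rfdetr and start training as usual
```

Alternatively, export the same variables in the shell before invoking torchrun. Since the hang happens at the end of eval on an ALLGATHER with mismatched sizes, it is also worth checking whether all ranks see the same number of validation batches.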
Environment
- rfdetr: latest
- num GPUs: 4
- Driver Version: 580.95.05
- CUDA Version: 13.0
- PyTorch: 2.9.1+cu128
Minimal Reproducible Example
from rfdetr import RFDETRBase

model = RFDETRBase()
model.train(
    dataset_dir="IE_Z22_043/images",
    epochs=200,
    batch_size=8,
    grad_accum_steps=4,
    task="segment",
    resolution=624,
    output_dir="rfdetr_ie_z22_043_finetuning",
    progress_bar=True,
    # fp16_eval=True,
    early_stopping=True,
    devices="auto",  # required; see note below
    strategy="ddp_find_unused_parameters_true",
    persistent_workers=True,
    device="cuda",
    checkpoint_interval=1,
    num_workers=4,
    warmup_epochs=5,
    pin_memory=True,
    resume="rfdetr_ie_z22_043_finetuning/checkpoint_1.ckpt",
)
I launched it with: torchrun --nproc_per_node=4 train_rfdetr.py
Additional
No response
Are you willing to submit a PR?
- Yes, I'd like to help by submitting a PR!