replace_sampler_ddp and NCCL timeout #12283
Unanswered
dahjungc asked this question in DDP / multi-GPU / multi-node
Hello,
My dataset is very large, and I noticed that PyTorch Lightning loads the dataset separately for each GPU.
I verified that my code runs without memory issues on a machine with enough memory to hold the entire dataset multiple times.
On the specific machine I have to use, however, loading the dataset 8 times for 8 GPUs runs out of memory.
So I divided the dataset into multiple JSON files (96 shard files, in case I use more GPUs later) and feed non-overlapping subsets of those files to each GPU, following Sharded data loading with DDP #8795.
I have to concatenate datasets because each GPU loads several JSON files. After loading them with my Dataset class, I concatenate them like this to create the final single train_dataset, so each rank ends up with its own chunk of the full training set.
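A minimal sketch of the per-rank loading and concatenation; JsonShardDataset, the paths, and the shard-selection scheme are illustrative stand-ins for my actual code, and it assumes the process group is already initialized (e.g. inside a `setup()` hook):

```python
import json
import torch.distributed as dist
from torch.utils.data import Dataset, ConcatDataset

class JsonShardDataset(Dataset):
    """Placeholder for my actual Dataset class: one JSON shard file -> one dataset."""
    def __init__(self, path):
        with open(path) as f:
            self.samples = json.load(f)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# Give each rank a non-overlapping slice of the 96 shard files,
# then concatenate the per-rank shards into a single train_dataset.
shard_paths = [f"shards/train_{i:03d}.json" for i in range(96)]  # illustrative paths
rank = dist.get_rank()            # assumes torch.distributed is already initialized
world_size = dist.get_world_size()
my_shards = shard_paths[rank::world_size]  # e.g. 12 files per rank with 8 GPUs

train_dataset = ConcatDataset([JsonShardDataset(p) for p in my_shards])
```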
Dataloader without sampler
I defined the dataloader without a sampler, along the lines shown below:
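Roughly like this (batch size and worker count are illustrative, and the exact Trainer arguments depend on the Lightning version); the Trainer is created with `replace_sampler_ddp=False` so Lightning does not inject a DistributedSampler:

```python
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

# Each rank already holds its own non-overlapping chunk of the data,
# so the loader uses a plain shuffling sampler instead of DistributedSampler.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,    # illustrative
    shuffle=True,
    num_workers=4,    # illustrative
    pin_memory=True,
)

trainer = Trainer(
    gpus=8,
    strategy="ddp",
    replace_sampler_ddp=False,  # keep the plain sampler on every rank
)
```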
With this implementation, I verified that training runs successfully if I use one JSON file per GPU (a small sampled dataset).
However, with the entire dataset (12 JSON files per GPU), training fails with an NCCL timeout at the end of the first epoch.
I believe it times out at the ALLREDUCE and BROADCAST collectives issued by the Trainer's log function with sync_dist=True (I do not call those collectives myself in my code).
However, if this is simply a timeout, I am not sure why it only happens when the data is sharded and the full dataset is used; loading the full dataset on a larger machine does not trigger it.
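For reference, the kind of logging call I mean is roughly this (a fragment from the LightningModule; the loss computation is a placeholder):

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # placeholder for the actual loss computation
    # sync_dist=True makes Lightning reduce the value across ranks,
    # which is where the collective appears to hang and time out.
    self.log("train_loss", loss, sync_dist=True)
    return loss
```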
Has anyone experienced the same thing?