replace_sampler_ddp and NCCL timeout #12283
Unanswered
dahjungc asked this question in DDP / multi-GPU / multi-node
Hello,
My dataset is very large, and I noticed that PyTorch Lightning loads the dataset separately for each GPU.
I verified that my code runs without memory issues on a machine with enough memory to hold the entire dataset multiple times.
On the specific machine I have to use, however, loading the dataset 8 times for 8 GPUs runs out of memory.
So I divided the dataset into multiple JSON files (96 shard files, in case I use more GPUs later) and feed non-overlapping subsets of those files to each GPU, following Sharded data loading with DDP #8795.
I have to concatenate datasets because each GPU loads several JSON files. After loading them with my Dataset class, I concatenate them like this to create the final single train_dataset, so each rank ends up with its own chunk of the full training set.
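A minimal sketch of the per-rank loading and concatenation; JsonShardDataset, the paths, and the shard-selection scheme are illustrative stand-ins for my actual code, and it assumes the process group is already initialized (e.g. inside a `setup()` hook):

```python
import json
import torch.distributed as dist
from torch.utils.data import Dataset, ConcatDataset

class JsonShardDataset(Dataset):
    """Placeholder for my actual Dataset class: one JSON shard file -> one dataset."""
    def __init__(self, path):
        with open(path) as f:
            self.samples = json.load(f)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# Give each rank a non-overlapping slice of the 96 shard files,
# then concatenate the per-rank shards into a single train_dataset.
shard_paths = [f"shards/train_{i:03d}.json" for i in range(96)]  # illustrative paths
rank = dist.get_rank()            # assumes torch.distributed is already initialized
world_size = dist.get_world_size()
my_shards = shard_paths[rank::world_size]  # e.g. 12 files per rank with 8 GPUs

train_dataset = ConcatDataset([JsonShardDataset(p) for p in my_shards])
```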
Dataloader without sampler
I defined the dataloader without a sampler, along the lines shown below:
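Roughly like this (batch size and worker count are illustrative, and the exact Trainer arguments depend on the Lightning version); the Trainer is created with `replace_sampler_ddp=False` so Lightning does not inject a DistributedSampler:

```python
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

# Each rank already holds its own non-overlapping chunk of the data,
# so the loader uses a plain shuffling sampler instead of DistributedSampler.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,    # illustrative
    shuffle=True,
    num_workers=4,    # illustrative
    pin_memory=True,
)

trainer = Trainer(
    gpus=8,
    strategy="ddp",
    replace_sampler_ddp=False,  # keep the plain sampler on every rank
)
```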
With this implementation, I verified that training runs successfully if I use one JSON file per GPU (a small sampled dataset).
However, with the entire dataset (12 JSON files per GPU), training fails with an NCCL timeout at the end of the first epoch.
I believe it times out at the ALLREDUCE and BROADCAST collectives issued by the Trainer's log function with sync_dist=True (I do not call those collectives myself in my code).
However, if this is simply a timeout, I am not sure why it only happens when the data is sharded and the full dataset is used; loading the full dataset on a larger machine does not trigger it.
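For reference, the kind of logging call I mean is roughly this (a fragment from the LightningModule; the loss computation is a placeholder):

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # placeholder for the actual loss computation
    # sync_dist=True makes Lightning reduce the value across ranks,
    # which is where the collective appears to hang and time out.
    self.log("train_loss", loss, sync_dist=True)
    return loss
```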
Has anyone experienced the same thing?