Not sure how to make DP and DDP work on a single-node, 2-GPU setup #268
Replies: 3 comments
-
I think the documentation and examples cover this case. Did those not work for you? The speedup with DP varies according to what you're doing. With DDP you (mostly) double the speed every time you double the number of GPUs.
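As a rough illustration of why DP alone may not speed things up: PyTorch's DataParallel splits each batch across the visible GPUs, so keeping the same batch size means each GPU does less work per step while the number of steps stays the same. Below is a minimal sketch of that trade-off; the toy dataset and batch sizes are assumptions for illustration, not from this thread.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real training data (illustrative only).
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# Single-GPU baseline: 32 samples per step on one device.
single_gpu_loader = DataLoader(dataset, batch_size=32)

# DP on 2 GPUs: DataParallel scatters each batch, so batch_size=32 means
# only 16 samples per GPU per step. Doubling the batch size keeps each GPU
# as busy as in the single-GPU run and halves the number of steps.
dp_loader = DataLoader(dataset, batch_size=64)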
-
Will reopen if you are still having issues.
-
Yes, this is most likely what you want. Otherwise you run the same batch on both accelerators.
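For reference, a minimal sketch of pairing a DataLoader with a DistributedSampler so each DDP process trains on its own shard of the data instead of the same batch. The toy dataset, num_replicas, rank, and batch size are assumptions for illustration; in a Lightning DDP run the replica count and rank normally come from the initialized process group.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for the real training data (illustrative only).
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# Each DDP process constructs a sampler with its own rank, so the two GPUs
# see disjoint shards of the dataset rather than identical batches.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)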
-
When I use DDP, the process hangs and no metric log files are created for me to view with TensorBoard; I just see two tf event files.
On the other hand, when I use DP, the code runs and I can see the loss going down in TensorBoard, but I don't see any accelerated training. Running with 1 GPU and running with DP on 2 GPUs gives the same training time of 12 minutes. I've tried different batch sizes, and there is virtually no difference in training time.
Do I have to create a DistributedSampler or do something else to see accelerated training using DP?
Code
My code is as follows:
import os
# Imports assumed from the Lightning/test_tube API of this era.
from test_tube import Experiment
from pytorch_lightning import Trainer

model = ConvNet()  # user-defined LightningModule
# most basic trainer, uses good defaults
exp = Experiment(save_dir=os.getcwd())
trainer = Trainer(experiment=exp, gpus=[0, 1], max_nb_epochs=20, distributed_backend='dp')
trainer.fit(model)
What's your environment?