trainer.fit(strategy='ddp') executes code repeatedly #11938
-
Hi everyone. I am trying to use 4 GPUs on a single node to train my model with the DDP strategy. But every time I run trainer.fit, the whole script is executed 4 times, and it requires 4 times the CPU memory compared to the single-GPU case. I am not sure whether this is intended behavior or not. I ran the following sample code, which trains on MNIST data with 4 GPUs.
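A minimal sketch of such a script (the exact code from the question isn't reproduced here; the class name, hyperparameters, and Trainer arguments below are assumptions, written against a recent PyTorch Lightning API):

```python
# Sketch only: a small LightningModule trained on MNIST with 4 GPUs and DDP.
# Names and hyperparameters are illustrative, not the original snippet.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST


class LitMNIST(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Module-level code: with strategy="ddp", each GPU process re-runs the script,
# so this print and the dataset construction happen once per process.
print("Hello world!")

train_set = MNIST("data", train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, num_workers=2)

trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
trainer.fit(LitMNIST(), train_loader)
```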
And I got the following output:
The training itself works fine, but 'Hello world!' is printed four times. My problem is that the training data is also loaded four times, which takes four times the CPU memory. Is this the intended behavior, or am I doing something wrong? How do you deal with DDP when the training data is too large to be copied once per GPU?
Replies: 1 comment 1 reply
-
hey @earendil25!
This is exactly how DDP works. To distribute data across devices, a DistributedSampler is added to avoid data duplication on each device, and the model is wrapped in DistributedDataParallel to sync gradients. The launch command is executed on each device individually, which is why your script runs once per GPU. Alternatively, you can try ddp_spawn, which creates spawned processes and won't execute the whole script on each device.
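A sketch of the ddp_spawn alternative mentioned above (again an illustration, not an official recipe): with spawned processes a `if __name__ == "__main__":` guard is required, and module-level work moves inside it, reusing the `LitMNIST` module from the sketch in the question.

```python
# Sketch: the same training run with strategy="ddp_spawn".
# trainer.fit spawns the worker processes itself instead of re-launching the
# script, so the guarded code below runs only in the launching process.
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

if __name__ == "__main__":
    print("Hello world!")  # printed once, in the launching process

    train_set = MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(train_set, batch_size=64, num_workers=2)

    trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn", max_epochs=1)
    trainer.fit(LitMNIST(), train_loader)  # LitMNIST as defined in the sketch above
```

Note that ddp_spawn comes with its own caveats, e.g. objects handed to the spawned processes must be picklable, so it is a trade-off rather than a drop-in fix.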