trainer.fit(strategy='ddp') executes code repeatedly #11938
-
Hi everyone. I am trying to use 4 GPUs on a single node to train my model with the DDP strategy. But every time I run trainer.fit, the whole script is executed 4 times, and it requires 4 times the CPU memory compared to the single-GPU case. I am not sure whether this is intended behavior or not. I ran the following sample code, which trains on MNIST data with 4 GPUs.
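A minimal sketch of such a script (the exact code from the question isn't reproduced here; the class name, hyperparameters, and Trainer arguments below are assumptions, written against a recent PyTorch Lightning API):

```python
# Sketch only: a small LightningModule trained on MNIST with 4 GPUs and DDP.
# Names and hyperparameters are illustrative, not the original snippet.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST


class LitMNIST(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Module-level code: with strategy="ddp", each GPU process re-runs the script,
# so this print and the dataset construction happen once per process.
print("Hello world!")

train_set = MNIST("data", train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, num_workers=2)

trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
trainer.fit(LitMNIST(), train_loader)
```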
And I got the following output:
The training itself works fine, but 'Hello world!' is printed four times. My problem is that the training data is also loaded four times, which takes four times the CPU memory. Is this the intended behavior, or am I doing something wrong? How do you deal with DDP when the training data is too large to be copied once per GPU?
Replies: 1 comment 1 reply
-
hey @earendil25!
This is exactly how DDP works. To distribute data across devices, a DistributedSampler is added to avoid data duplication on each device, and the model is wrapped in DistributedDataParallel to sync gradients. The launch command is executed on each device individually, which is why your script runs once per GPU. Alternatively, you can try ddp_spawn, which creates spawned processes and won't execute the whole script on each device.
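A sketch of the ddp_spawn alternative mentioned above (again an illustration, not an official recipe): with spawned processes a `if __name__ == "__main__":` guard is required, and module-level work moves inside it, reusing the `LitMNIST` module from the sketch in the question.

```python
# Sketch: the same training run with strategy="ddp_spawn".
# trainer.fit spawns the worker processes itself instead of re-launching the
# script, so the guarded code below runs only in the launching process.
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

if __name__ == "__main__":
    print("Hello world!")  # printed once, in the launching process

    train_set = MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(train_set, batch_size=64, num_workers=2)

    trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn", max_epochs=1)
    trainer.fit(LitMNIST(), train_loader)  # LitMNIST as defined in the sketch above
```

Note that ddp_spawn comes with its own caveats, e.g. objects handed to the spawned processes must be picklable, so it is a trade-off rather than a drop-in fix.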