Better understanding how data is loaded in the DataModule setup method for a multi-GPU setting in NLP #7186
-
I want to better understand the setup and prepare_data methods in a multi-GPU scenario, in the context of NLP and text processing. I have prepared a DataModule which processes a JSON-lines file with pairs of sentences for a translation task. The file contains 10M lines. When using it with multiple GPUs (2 GPUs), will each process have its own copy of the train and validation set (am I right)? Which approach is better in terms of data utilization, loading the data randomly or deterministically? If the data is loaded deterministically, then all GPU processes, in particular the forward and backward passes, will return the same values (for GPU 1 and GPU 2); is that efficient? How are gradients merged, and how are the network weight updates performed?
Replies: 1 comment 7 replies
-
Do all of that either offline in a different script, or do it in the prepare_data hook.
Sounds good. Each GPU/node will run the same, so you will have the same train and val split in all of them (initially). Don't split the data differently for each GPU; that part will be done by the DistributedSampler [1].
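For reference, a minimal sketch of a DataModule along those lines (the file name, dataset class, split ratio, and batch size are made up for illustration): one-off heavy work goes into prepare_data, setup loads the same seeded split on every process, and the per-GPU partitioning is left to the sampler Lightning injects.

```python
import json

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset, random_split


class TranslationPairs(Dataset):
    """Reads one JSON object (a sentence pair) per line."""

    def __init__(self, path):
        with open(path) as f:
            self.pairs = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]


class TranslationDataModule(pl.LightningDataModule):
    def __init__(self, path="pairs.jsonl", batch_size=32):
        super().__init__()
        self.path = path
        self.batch_size = batch_size

    def prepare_data(self):
        # one-time work such as downloading or tokenizing to disk;
        # runs in a single process, so don't assign state here
        pass

    def setup(self, stage=None):
        # runs on every GPU process; a seeded split keeps train/val
        # identical across processes
        full = TranslationPairs(self.path)
        n_val = int(0.01 * len(full))
        self.train_set, self.val_set = random_split(
            full,
            [len(full) - n_val, n_val],
            generator=torch.Generator().manual_seed(42),
        )

    def train_dataloader(self):
        # no manual sharding here: under DDP, Lightning swaps the default
        # sampler for a DistributedSampler
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)
```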
Lightning takes your DataLoader and adds a DistributedSampler. The sampler knows which GPU it is on and will sample only a portion of your data on one GPU and another portion on the other GPU. Each GPU sees a different split of train and a different split of val.
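A quick way to see what that sampler does (a toy standalone example, not Lightning's internal code): constructed with num_replicas=2, each rank iterates over a disjoint subset of the indices.

```python
from torch.utils.data import DistributedSampler

dataset = list(range(8))  # stand-in for the 10M-line dataset

# passing num_replicas/rank explicitly avoids needing an initialized
# process group for this demonstration
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print(f"rank {rank} gets indices {list(sampler)}")
# rank 0 gets indices [0, 2, 4, 6]
# rank 1 gets indices [1, 3, 5, 7]
```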
As explained above, the dataloader will return different samples on each GPU automatically. Each GPU has the same network weights but uses different data to compute gradients; the gradients are then averaged, so every GPU applies the same update and starts the next forward/backward pass with identical weights [2].
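Roughly what that averaging amounts to, written out by hand (DDP does this for you and overlaps the communication with the backward pass, so you never call anything like this yourself; the function name is just for illustration):

```python
import torch.distributed as dist


def average_gradients(model):
    """Manual equivalent of DDP's gradient all-reduce (illustrative only)."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # sum the gradient across all ranks, then divide by the number
            # of ranks so every process holds the same averaged gradient
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```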
Yes, again this is done for you automatically. References: