Get batch’s datapoints across all GPUs #11667
-
Hello, I'm running my model on a cluster with multiple GPUs (2). My problem is that I would like to access all the datapoints in the batch (predictions and labels). Because I'm using more than one GPU, my batch is divided between the two devices for parallelisation purposes, which means that when I access the batch data in eval/training, I'm getting only half of it. How can I obtain the complete batch and the model predictions that are split across the different devices/GPUs? @rohitgr7 suggested using self.all_gather, but after trying it in my LightningModule's forward method, I still get just half the batch, i.e. only the data stored on one of the two GPUs. Thanks! PS: would it be possible to access this info through "validation_epoch_end", "test_epoch_end", etc.?
Replies: 1 comment 5 replies
-
If you are going to use the complete batch on a single GPU, then why use DDP? If you need the predictions on a single device, you can instead gather all the predictions using `all_gather`.
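A minimal sketch of that approach, assuming a classification-style LightningModule and Lightning's `LightningModule.all_gather` / `validation_epoch_end` hook (the metric name `val_acc_full` is just illustrative):

```python
import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        # Each process only sees its own shard of the batch here.
        return {"preds": preds, "labels": y}

    def validation_epoch_end(self, outputs):
        # Concatenate this process's outputs, then gather the
        # corresponding tensors from every GPU participating in DDP.
        preds = torch.cat([o["preds"] for o in outputs])
        labels = torch.cat([o["labels"] for o in outputs])

        # all_gather returns tensors with an extra leading world-size
        # dimension; flatten it to recover the full validation set.
        all_preds = self.all_gather(preds).flatten(0, 1)
        all_labels = self.all_gather(labels).flatten(0, 1)

        if self.trainer.is_global_zero:
            # Compute metrics over the complete set on rank 0 only.
            acc = (all_preds.argmax(dim=-1) == all_labels).float().mean()
            self.log("val_acc_full", acc, rank_zero_only=True)
```

The same pattern works in `test_epoch_end`; the key point is that `self.all_gather` is called on every rank, and only the metric computation/logging is restricted to rank zero.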