Get batch’s datapoints across all GPUs #11667
-
Hello, I'm running my model on a cluster with multiple GPUs (2). My problem is that I would like to access all the datapoints in the batch (predictions and labels). Because I'm using more than one GPU, my batch is divided between the two devices for parallelisation purposes, which means that when I access the batch data in eval/training, I'm getting only half of it. How can I obtain the complete batch and the model predictions that are split across the different devices/GPUs? @rohitgr7 suggested using self.all_gather, but after trying it in my LightningModule's forward method, I still get just half the batch, i.e. only the data stored on one of the two GPUs. Thanks! PS: would it be possible to access this info through "validation_epoch_end", "test_epoch_end", etc.?
Replies: 1 comment 5 replies
-
If you are going to use the complete batch on a single GPU, then why use DDP? If you need the predictions on a single device, you can instead gather all the predictions using `all_gather`.
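A minimal sketch of that approach, assuming a classification-style LightningModule and Lightning's `LightningModule.all_gather` / `validation_epoch_end` hook (the metric name `val_acc_full` is just illustrative):

```python
import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        # Each process only sees its own shard of the batch here.
        return {"preds": preds, "labels": y}

    def validation_epoch_end(self, outputs):
        # Concatenate this process's outputs, then gather the
        # corresponding tensors from every GPU participating in DDP.
        preds = torch.cat([o["preds"] for o in outputs])
        labels = torch.cat([o["labels"] for o in outputs])

        # all_gather returns tensors with an extra leading world-size
        # dimension; flatten it to recover the full validation set.
        all_preds = self.all_gather(preds).flatten(0, 1)
        all_labels = self.all_gather(labels).flatten(0, 1)

        if self.trainer.is_global_zero:
            # Compute metrics over the complete set on rank 0 only.
            acc = (all_preds.argmax(dim=-1) == all_labels).float().mean()
            self.log("val_acc_full", acc, rank_zero_only=True)
```

The same pattern works in `test_epoch_end`; the key point is that `self.all_gather` is called on every rank, and only the metric computation/logging is restricted to rank zero.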