The loss value in the tensorboard is different from the actual value in the DDP environment. #10910
devjwsong asked this question in DDP / multi-GPU / multi-node
Hi, I'm currently trying to pre-train a model using 4 GPUs with the DDP accelerator. I saved the loss and perplexity from each training step and calculated the average and last values after each epoch finished. But there is a huge difference between the saved loss value and the value recorded in TensorBoard.
Here is what I did in the PyTorch Lightning module class.
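The snippet below is a minimal sketch of that logic rather than the exact code, assuming a Lightning 1.x-style `training_epoch_end` hook; `compute_loss` is a hypothetical placeholder for the real forward pass.

```python
import torch
import pytorch_lightning as pl


class PretrainingModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder for the actual forward pass
        ppl = torch.exp(loss)            # perplexity derived from the loss
        # These log calls are what shows up in TensorBoard.
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        self.log("train_ppl", ppl, on_step=True, on_epoch=True)
        return {"loss": loss, "ppl": ppl}

    def training_epoch_end(self, training_step_outputs):
        # Collect the loss returned from every training step of this epoch.
        train_losses = [output["loss"].item() for output in training_step_outputs]
        avg_loss_per_epoch = sum(train_losses) / len(train_losses)
        last_loss_per_epoch = train_losses[-1]
        # Log the manually computed per-epoch statistics.
        self.log("avg_loss_per_epoch", avg_loss_per_epoch)
        self.log("last_loss_per_epoch", last_loss_per_epoch)
```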
As you can see, I parsed all values from the `training_step_outputs` and made a list containing the loss value from each step. I then logged `last_loss_per_epoch` by accessing the last index of that list. However, the loss recorded in TensorBoard is totally different.
To make it clear, I parsed `train_losses` separately and compared it to the values in TensorBoard. The last value in the list is 1.1184, but the loss recorded is 6.774.
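The check itself is nothing fancy; roughly something like this (hypothetical, assuming the per-step losses were dumped to a file at the end of the epoch):

```python
import json

# Hypothetical: the per-step losses saved separately during the epoch.
with open("train_losses_epoch0.json") as f:
    train_losses = json.load(f)

print(train_losses[-1])  # ~1.1184 in my run
# TensorBoard's last_loss_per_epoch for the same epoch shows ~6.774.
```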
This also happens for the perplexity, and the difference is much larger.
Interestingly, this only happens when using multiple GPUs.
When I tested with a single GPU, all values were synchronized properly.
Can you tell me what I am missing? I really don't know what the problem is.
The environment I am using is as follows.
Thank you.