The loss value in the tensorboard is different from the actual value in the DDP environment. #10910
devjwsong asked this question in DDP / multi-GPU / multi-node
Hi, I'm currently trying to pre-train a model using 4 GPUs with the DDP accelerator. I saved the loss and perplexity from each training step and calculated the average and last values after each epoch finished. But there is a huge difference between the saved loss value and the value recorded in TensorBoard.
Here is what I did in the PyTorch Lightning module class.
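The snippet below is a minimal sketch of that logic rather than the exact code, assuming a Lightning 1.x-style `training_epoch_end` hook; `compute_loss` is a hypothetical placeholder for the real forward pass.

```python
import torch
import pytorch_lightning as pl


class PretrainingModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder for the actual forward pass
        ppl = torch.exp(loss)            # perplexity derived from the loss
        # These log calls are what shows up in TensorBoard.
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        self.log("train_ppl", ppl, on_step=True, on_epoch=True)
        return {"loss": loss, "ppl": ppl}

    def training_epoch_end(self, training_step_outputs):
        # Collect the loss returned from every training step of this epoch.
        train_losses = [output["loss"].item() for output in training_step_outputs]
        avg_loss_per_epoch = sum(train_losses) / len(train_losses)
        last_loss_per_epoch = train_losses[-1]
        # Log the manually computed per-epoch statistics.
        self.log("avg_loss_per_epoch", avg_loss_per_epoch)
        self.log("last_loss_per_epoch", last_loss_per_epoch)
```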
As you can see, I parsed all values from the `training_step_outputs` and made a list containing the loss value from each step. I then logged `last_loss_per_epoch` by accessing the last index of that list. However, the loss recorded in TensorBoard is totally different.
To make it clear, I parsed `train_losses` separately and compared it to the values in TensorBoard. The last value in the list is 1.1184, but the loss recorded is 6.774.
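The check itself is nothing fancy; roughly something like this (hypothetical, assuming the per-step losses were dumped to a file at the end of the epoch):

```python
import json

# Hypothetical: the per-step losses saved separately during the epoch.
with open("train_losses_epoch0.json") as f:
    train_losses = json.load(f)

print(train_losses[-1])  # ~1.1184 in my run
# TensorBoard's last_loss_per_epoch for the same epoch shows ~6.774.
```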
This also happens for the perplexity, and the difference is much larger.
Interestingly, this only happens when using multiple GPUs.
When I tested with a single GPU, all values were synchronized properly.
Can you tell me what I am missing? I really don't know what the problem is.
The environment I am using is as follows.
Thank you.