Wrong module being called when entering .training_step() with 'dp' accelerator #8221
Unanswered
roman-vygon
asked this question in DDP / multi-GPU / multi-node
Replies: 2 comments 1 reply
UPDATE 2
1 reply
Are you using an older PL version here? In DP, the […] Anyway, mind trying to update PL to the latest version, please?
0 replies
Hi!
I am trying to train a model with the 'dp' accelerator, using 2 GPUs on the same machine.
Here are my model's _step functions:
The model class inherits from LightningModule.
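Roughly, they look like this (a minimal sketch with illustrative names such as MyModel and encoder, not the exact code):

```python
import pytorch_lightning as pl
import torch
from torch import nn


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 32)

    def forward(self, x):
        # print the device the encoder's parameters currently live on
        print("I AM IN ENCODER NOW", next(self.encoder.parameters()).device)
        return self.encoder(x)

    def training_step(self, batch, batch_idx):
        return self(batch).mean()

    def validation_step(self, batch, batch_idx):
        return self(batch).mean()

    def test_step(self, batch, batch_idx):
        return self(batch).mean()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# PL 1.x style trainer setup for DP on two GPUs
trainer = pl.Trainer(gpus=2, accelerator="dp", max_epochs=1)
```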
During testing (trainer.test) and the validation sanity checks everything is fine; I receive:
I AM IN ENCODER NOW cuda:0 I AM IN ENCODER NOW cuda:1
as expected, but when I call trainer.fit(), both workers output:
I AM IN ENCODER NOW cuda:0 I AM IN ENCODER NOW cuda:0
I've been tracing the devices of the two modules and managed to get down to this line, which should run the training_step function. If I print the module's device right before this line, everything is fine (the devices are cuda:0 and cuda:1), so it seems that the device somehow changes when entering the training_step function. How could that be?
I'm sorry that I didn't provide a minimal example; I couldn't reproduce this with simple models.
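For context, here is how I understand the DP path to work (a simplified sketch, not the actual PL source): the LightningModule gets wrapped so that the wrapper's forward dispatches to the right *_step hook, and torch.nn.DataParallel then replicates that wrapper across the GPUs for every batch.

```python
import torch
from torch import nn


class StepDispatcher(nn.Module):
    """Simplified stand-in for Lightning's DP wrapper (illustrative only)."""

    def __init__(self, lightning_module):
        super().__init__()
        self.module = lightning_module

    def forward(self, *args, **kwargs):
        # nn.DataParallel calls forward() on each replica of this wrapper;
        # the wrapper redirects that call to the LightningModule hook.
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)


# wrapped = nn.DataParallel(StepDispatcher(model), device_ids=[0, 1])
# wrapped(batch, batch_idx)  # each GPU runs the step on its own replica
```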
"""UPDATE"""
I thought that it may not be the device that is wrong but the module itself, so I printed the id of the module before and after entering training_step; this is what I get:
The module changes to a new one when entering training_step().
When testing, everything is ok again:
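If it helps, the id change itself seems consistent with how torch.nn.DataParallel works: it replicates the wrapped module onto every device for each forward pass, so the object that actually runs the step is a fresh copy with a different id() than the module held outside. A small check (assuming at least 2 GPUs are available):

```python
import torch
from torch import nn
from torch.nn.parallel import replicate

if torch.cuda.device_count() >= 2:
    module = nn.Linear(8, 8).cuda(0)
    # DataParallel does this internally before every forward pass
    replicas = replicate(module, [0, 1])
    print(id(module), [id(r) for r in replicas])            # all different objects
    print([next(r.parameters()).device for r in replicas])  # cuda:0, cuda:1
```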