Wrong module being called when entering .training_step() with 'dp' accelerator #8221
Unanswered
roman-vygon
asked this question in DDP / multi-GPU / multi-node
Replies: 2 comments 1 reply
UPDATE 2
1 reply
Are you using an older PL version here? In DP, the […] Anyway, mind trying to update PL to the latest version, please?
0 replies
Hi!
I am trying to train a model with the 'dp' accelerator, using 2 GPUs on the same machine.
Here are my model's _step functions:
The model class inherits from LightningModule.
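Roughly, they look like this (a minimal sketch with illustrative names such as MyModel and encoder, not the exact code):

```python
import pytorch_lightning as pl
import torch
from torch import nn


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 32)

    def forward(self, x):
        # print the device the encoder's parameters currently live on
        print("I AM IN ENCODER NOW", next(self.encoder.parameters()).device)
        return self.encoder(x)

    def training_step(self, batch, batch_idx):
        return self(batch).mean()

    def validation_step(self, batch, batch_idx):
        return self(batch).mean()

    def test_step(self, batch, batch_idx):
        return self(batch).mean()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# PL 1.x style trainer setup for DP on two GPUs
trainer = pl.Trainer(gpus=2, accelerator="dp", max_epochs=1)
```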
During testing (trainer.test) and the validation sanity checks everything is fine; I receive:
I AM IN ENCODER NOW cuda:0 I AM IN ENCODER NOW cuda:1
as expected, but when I call trainer.fit(), both workers output:
I AM IN ENCODER NOW cuda:0 I AM IN ENCODER NOW cuda:0
I've been tracing the devices of the two modules and managed to get down to this line, which should run the training_step function. If I print the module's device right before this line, everything is fine (the devices are cuda:0 and cuda:1), so it seems that the device somehow changes when entering the training_step function. How could that be?
I'm sorry that I didn't provide a minimal example; I couldn't reproduce this with simple models.
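For context, here is how I understand the DP path to work (a simplified sketch, not the actual PL source): the LightningModule gets wrapped so that the wrapper's forward dispatches to the right *_step hook, and torch.nn.DataParallel then replicates that wrapper across the GPUs for every batch.

```python
import torch
from torch import nn


class StepDispatcher(nn.Module):
    """Simplified stand-in for Lightning's DP wrapper (illustrative only)."""

    def __init__(self, lightning_module):
        super().__init__()
        self.module = lightning_module

    def forward(self, *args, **kwargs):
        # nn.DataParallel calls forward() on each replica of this wrapper;
        # the wrapper redirects that call to the LightningModule hook.
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)


# wrapped = nn.DataParallel(StepDispatcher(model), device_ids=[0, 1])
# wrapped(batch, batch_idx)  # each GPU runs the step on its own replica
```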
"""UPDATE"""
I thought that it may not be the device that is wrong but the module itself, so I printed the id of the module before and after entering training_step; this is what I get:
The module changes to a new one when entering training_step().
When testing, everything is ok again:
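If it helps, the id change itself seems consistent with how torch.nn.DataParallel works: it replicates the wrapped module onto every device for each forward pass, so the object that actually runs the step is a fresh copy with a different id() than the module held outside. A small check (assuming at least 2 GPUs are available):

```python
import torch
from torch import nn
from torch.nn.parallel import replicate

if torch.cuda.device_count() >= 2:
    module = nn.Linear(8, 8).cuda(0)
    # DataParallel does this internally before every forward pass
    replicas = replicate(module, [0, 1])
    print(id(module), [id(r) for r in replicas])            # all different objects
    print([next(r.parameters()).device for r in replicas])  # cuda:0, cuda:1
```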