Mitigating program hang from on_train_epoch_end() with self.all_gather() call #20294
Unanswered
isaacgerg asked this question in DDP / multi-GPU / multi-node
I am trying to manually track my loss, a single scalar per step, over each epoch and then print it out in on_train_epoch_end().
In training_step() I do self.train_loss.append(loss.item()). Then, at the top of on_train_epoch_end(), I call self.all_gather(self.train_loss), but it hangs until NCCL times out. I am on a single node with 2 GPUs.
What really stumps me is that this same pattern works fine for test_step() and on_test_epoch_end(). A minimal sketch of the setup is below.
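For concreteness, here is a minimal sketch of what I am doing (the network, data shapes, and class name are placeholders rather than my actual model; the analogous test_step()/on_test_epoch_end() pair, which works fine, is omitted):

```python
import torch
import torch.nn.functional as F
import lightning as L


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 1)  # placeholder network
        self.train_loss = []               # per-step scalar losses

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.net(x), y)
        self.train_loss.append(loss.item())  # store the scalar each step
        return loss

    def on_train_epoch_end(self):
        # With DDP on 2 GPUs (single node) this call hangs until NCCL times out.
        gathered = self.all_gather(self.train_loss)
        print(f"epoch {self.current_epoch} train losses: {gathered}")
        self.train_loss.clear()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```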
Any thoughts on how to debug or fix this? What is the "right" way for this code to look when it is operating correctly?
Reference: using PyTorch 2.4 and Lightning 2.4.0.