Unstable training duration #12371
Unanswered
murinmat asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
-
Hi @gorcurek! Do you have a reproducible script? I would try using one of the supported profilers to identify which part of the code varies that significantly in time:

Trainer(profiler="simple")
Trainer(profiler="advanced")
Trainer(profiler="pytorch")

Docs (stable, 1.5.10): https://pytorch-lightning.readthedocs.io/en/1.5.10/advanced/profiler.html
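For reference, a minimal sketch of wiring a profiler into a short diagnostic run; the LightningModule and DataModule names are placeholders, not from this thread:

```python
from pytorch_lightning import Trainer

# A short profiled run is enough to compare the ~30 s/it and ~1.3 s/it
# regimes and see which hooks dominate the step time.
trainer = Trainer(
    profiler="simple",   # or "advanced" / "pytorch" for more detail
    max_steps=50,
    gpus=2,
    strategy="ddp",
)
trainer.fit(MyGANModule(), datamodule=my_datamodule)  # placeholders for your own code
```

The simple profiler prints a per-hook timing summary at the end of the run; the pytorch profiler additionally records operator-level traces.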
-
Hi, I am trying to train a fairly complicated GAN.
What I am experiencing is very unstable training speed. I have to restart the training a few times to get the iterations/second that our rig is capable of. Most of the time I get ~30 sec/iteration, but after restarting the training a few times I get the desired ~1.3 sec/iteration.
Do you have any tips on where the issue could be and what to investigate? I am training with the automatic_optimization = False flag and the DDPPlugin strategy.
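For context, a minimal sketch of this kind of setup (losses and network internals omitted; all names here are illustrative, not from the actual project):

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import DDPPlugin

class GAN(pl.LightningModule):
    def __init__(self, generator, discriminator):
        super().__init__()
        self.automatic_optimization = False  # manual optimization
        self.generator = generator
        self.discriminator = discriminator

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return opt_g, opt_d

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        # ... compute d_loss from the batch, then:
        # self.manual_backward(d_loss); opt_d.step(); opt_d.zero_grad()
        # ... compute g_loss, then:
        # self.manual_backward(g_loss); opt_g.step(); opt_g.zero_grad()

trainer = pl.Trainer(gpus=2, strategy=DDPPlugin(find_unused_parameters=False))
```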
Version:
Hardware: