Unstable training duration #12371
Unanswered
murinmat asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
-
Hi @gorcurek! Do you have a reproducible script? I would try using one of the supported profilers to identify which part of the code varies that significantly in time:

Trainer(profiler="simple")
Trainer(profiler="advanced")
Trainer(profiler="pytorch")

Docs (stable, 1.5.10): https://pytorch-lightning.readthedocs.io/en/1.5.10/advanced/profiler.html
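For reference, a minimal sketch of wiring a profiler into a short diagnostic run; the LightningModule and DataModule names are placeholders, not from this thread:

```python
from pytorch_lightning import Trainer

# A short profiled run is enough to compare the ~30 s/it and ~1.3 s/it
# regimes and see which hooks dominate the step time.
trainer = Trainer(
    profiler="simple",   # or "advanced" / "pytorch" for more detail
    max_steps=50,
    gpus=2,
    strategy="ddp",
)
trainer.fit(MyGANModule(), datamodule=my_datamodule)  # placeholders for your own code
```

The simple profiler prints a per-hook timing summary at the end of the run; the pytorch profiler additionally records operator-level traces.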
-
Hi, I am trying to train a fairly complicated GAN.
What I am experiencing is very unstable training speed. I have to restart the training a few times to get the iterations/second that our rig is capable of. Most of the time I get ~30 sec/iteration, but after restarting the training a few times I get the desired ~1.3 sec/iteration.
Do you have any tips on where the issue could be and what to investigate? I am training with the automatic_optimization = False flag and the DDPPlugin strategy.
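For context, a minimal sketch of this kind of setup (losses and network internals omitted; all names here are illustrative, not from the actual project):

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import DDPPlugin

class GAN(pl.LightningModule):
    def __init__(self, generator, discriminator):
        super().__init__()
        self.automatic_optimization = False  # manual optimization
        self.generator = generator
        self.discriminator = discriminator

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return opt_g, opt_d

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        # ... compute d_loss from the batch, then:
        # self.manual_backward(d_loss); opt_d.step(); opt_d.zero_grad()
        # ... compute g_loss, then:
        # self.manual_backward(g_loss); opt_g.step(); opt_g.zero_grad()

trainer = pl.Trainer(gpus=2, strategy=DDPPlugin(find_unused_parameters=False))
```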
Version:
Hardware: