Training stuck at the beginning #8321
Unanswered
MendelXu asked this question in DDP / multi-GPU / multi-node

When I use 2 GPUs, my training process gets stuck at the beginning of the first epoch, and I am not even able to kill it with Ctrl+C. However, with 1 GPU or 4 GPUs it works fine. As there is no error information, how can I debug it and find the problem?
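For context, when a multi-GPU job hangs silently, a common first step is to turn on the verbose logging that PyTorch and NCCL already provide and to make the stuck processes dump their Python stacks on demand. A minimal sketch, assuming a Unix machine and a reasonably recent PyTorch; the environment variables are standard PyTorch/NCCL knobs, everything else is illustrative:

```python
import faulthandler
import os
import signal

# Assumption: this runs at the top of the training script, so every DDP worker
# (the script is re-executed per rank with strategy="ddp") inherits the settings.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL logs its setup and collective activity
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra torch.distributed checks/logging (PyTorch >= 1.9)
# os.environ.setdefault("NCCL_P2P_DISABLE", "1")            # a common experiment if GPU peer-to-peer traffic is suspected

# Make a stuck process dump the Python stacks of all its threads to stderr
# when it receives SIGUSR1, e.g. `kill -USR1 <pid>` from another shell.
faulthandler.register(signal.SIGUSR1)

# ... then build the LightningModule / Trainer and call trainer.fit(...) as usual.
```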
Replies: 2 comments · 7 replies
- Could you please provide some sample code to reproduce it?
6 replies
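In case it helps anyone hitting the same thing, a stripped-down reproducer of the kind the reply asks for could look like the sketch below. The `RandomDataset` / `BoringModel` classes, the tensor shapes, and the Trainer arguments are illustrative assumptions (PyTorch Lightning 1.x API), not the original poster's code.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors so the script needs no external data."""
    def __init__(self, size: int = 64, length: int = 256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    """A single linear layer, just enough to exercise the DDP setup."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        # Sum of the outputs stands in for a real loss.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":  # required: "ddp" re-launches the script per rank
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(), batch_size=8, num_workers=2)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,              # the configuration that reportedly hangs
        strategy="ddp",
        max_epochs=1,
        limit_train_batches=10,
    )
    trainer.fit(model, train_loader)
```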
- @MendelXu @stonelazy I am not sure how the issue arises in your specific cases, but FYI, here is the general guide for debugging: https://pytorch-lightning.readthedocs.io/en/1.7.7/debug/debugging.html
1 reply
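Guides of that kind generally revolve around shrinking the run with Trainer switches so that problems surface in seconds rather than hours. A hedged sketch of that style of setup; the specific flags and values chosen here are illustrative, not quoted from the linked page:

```python
import pytorch_lightning as pl

# Shrink the run so a hang or error reproduces quickly.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    fast_dev_run=True,           # run a single batch of train/val/test as a smoke test
    # or, for a slightly longer run:
    # limit_train_batches=10,
    # num_sanity_val_steps=0,    # rule out the pre-training validation sanity check
    # profiler="simple",         # report where time is spent once it does run
)
# trainer.fit(model, train_loader)  # e.g. with the reproducer model/data above
```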