init_process_group not called when training on multiple-GPUs #8517

The issue comes from this line:

  File "train.py", line 173, in main
    print(f"Logs for this experiment are being saved to {trainer.log_dir}")

which tries to access trainer.log_dir outside of the Trainer's scope, i.e. before trainer.fit() has set up distributed training.

trainer.log_dir tries to broadcast the resolved directory to all processes, but the broadcast fails because DDP hasn't been initialized yet:

  File ".../pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/properties.py", line 137, in log_dir
    dirpath = self.accelerator.broadcast(dirpath)
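
In practice this means trainer.log_dir is only safe to read once the process group exists. A minimal sketch of the failing pattern and the workaround, assuming a PL 1.3-era multi-GPU setup (model here is a placeholder LightningModule):

  import pytorch_lightning as pl

  trainer = pl.Trainer(gpus=2, accelerator="ddp")

  # Fails on 1.3.x with multiple GPUs: log_dir broadcasts the resolved
  # path across ranks, but init_process_group() has not been called yet.
  # print(f"Logs for this experiment are being saved to {trainer.log_dir}")

  trainer.fit(model)  # DDP is initialized at the start of fit()

  # Safe: the process group now exists, so the broadcast succeeds.
  print(f"Logs for this experiment are being saved to {trainer.log_dir}")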

This is fixed in the 1.4 release, where broadcast becomes a no-op when the process group has not been initialized.
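
A sketch of what such a guard looks like, using the raw torch.distributed API (this shows the shape of the fix, not Lightning's actual code):

  import torch.distributed as dist

  def broadcast(obj, src: int = 0):
      # 1.4-style behaviour: if init_process_group() has not been called,
      # there are no other ranks to broadcast to, so return the object
      # unchanged instead of crashing.
      if not dist.is_available() or not dist.is_initialized():
          return obj
      buf = [obj]
      dist.broadcast_object_list(buf, src=src)
      return buf[0]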
