init_process_group not called when training on multiple GPUs #8517
-
Hi, I'm trying to train a model on 2 GPUs by specifying Trainer(..., gpus=2). ddp_spawn should be selected automatically as the distributed backend, but instead I get a message and error indicating that init_process_group was not called (as in the title).
I looked at the source code of ddp_spawn and it looks like it should print a message when initializing DDP, but it didn't. Could I have some advice on how to correct this error? Thank you!
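For reference, a minimal sketch of the setup described (the `accelerator="ddp_spawn"` string is an assumption about the 1.3.x-era API in use): pinning the backend explicitly, instead of relying on auto-selection from `gpus=2`, can confirm whether ddp_spawn is actually being picked.

```python
import pytorch_lightning as pl

# Sketch: pin the DDP flavor explicitly rather than letting Lightning
# auto-select it from gpus=2 (accelerator="ddp_spawn" is the 1.3.x-era
# spelling for choosing the distributed backend).
trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn")
```

If the explicit form prints the expected DDP initialization message, the problem lies in backend auto-selection rather than in the process-group setup itself.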
Replies: 2 comments
-
Found a relevant discussion, but I don't think it's applicable here because I construct the dataloader with:

```python
class CustomDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        train_dataset = CustomDataset(
            params=self.params,
            data_params=self.data_params,
            num_workers=self.num_workers,
        )
        return DataLoader(
            train_dataset,
            timeout=self.data_loader_timeout,
            num_workers=self.num_workers,
            batch_size=self.batch_size,
            worker_init_fn=worker_init_fn,
        )
```

and here's what the order of operations looks like:

```python
data_module = CustomDataModule(...)
model = CustomLightningModule(...)
tb_logger = TensorBoardLogger(...)
checkpoint_callback = ModelCheckpoint(...)
trainer = Trainer.from_argparse_args(
    args,
    logger=tb_logger,
    default_root_dir=args.output_dir,
    profiler="pytorch",  # tried removing this and it doesn't make a difference
    callbacks=[checkpoint_callback],
    gpus=args.gpus,
)
trainer.fit(model, data_module)
```
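For completeness, `worker_init_fn` is referenced in the snippet above but its definition isn't shown; a typical definition (a hypothetical stand-in, not the original) seeds each DataLoader worker so their randomness is distinct but reproducible:

```python
import numpy as np
import torch

def worker_init_fn(worker_id):
    # Hypothetical stand-in for the worker_init_fn referenced above:
    # derive a per-worker seed from the process's base seed so each
    # DataLoader worker gets distinct, reproducible randomness.
    worker_seed = (torch.initial_seed() + worker_id) % 2**32
    np.random.seed(worker_seed)
```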
-
The issue comes from the line

```
File "train.py", line 173, in main
    print(f"Logs for this experiment are being saved to {trainer.log_dir}")
```

which tries to access `trainer.log_dir` outside of the trainer scope:

```
File ".../pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/properties.py", line 137, in log_dir
    dirpath = self.accelerator.broadcast(dirpath)
```

`trainer.log_dir` tries to `broadcast` the directory but fails, as DDP hasn't been initialized yet. This is fixed in the 1.4 release, as `broadcast` becomes a no-op in that case.
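To make the fix concrete, here is a minimal sketch of the guard described above (an illustration of the behavior, not the actual Lightning source): `broadcast` returns its input unchanged when the default process group has not been initialized, so reading `trainer.log_dir` before `fit()` no longer raises.

```python
import torch.distributed as dist

def broadcast(obj, src=0):
    # Sketch of the 1.4 no-op behavior: if the default process group
    # hasn't been initialized (e.g. before trainer.fit() spawns DDP
    # workers), return the object as-is instead of attempting a
    # collective op.
    if not dist.is_available() or not dist.is_initialized():
        return obj
    obj_list = [obj]
    dist.broadcast_object_list(obj_list, src=src)
    return obj_list[0]
```

On 1.3.x, a workaround is to avoid `trainer.log_dir` before `fit()`, e.g. by reading the logger's own directory (`tb_logger.log_dir` for a `TensorBoardLogger`), which is computed locally without a broadcast.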