init_process_group not called when training on multiple GPUs #8517
-
Hi, I'm trying to train a model on 2 GPUs by specifying Trainer(..., gpus=2). ddp_spawn should be selected automatically as the distributed backend, but instead I get a message and error indicating that init_process_group was not called (as in the title).
I looked at the source code of ddp_spawn and it looks like it should print a message when initializing DDP, but it didn't. Could I have some advice on how to correct this error? Thank you!
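For reference, a minimal sketch of the setup described (the `accelerator="ddp_spawn"` string is an assumption about the 1.3.x-era API in use): pinning the backend explicitly, instead of relying on auto-selection from `gpus=2`, can confirm whether ddp_spawn is actually being picked.

```python
import pytorch_lightning as pl

# Sketch: pin the DDP flavor explicitly rather than letting Lightning
# auto-select it from gpus=2 (accelerator="ddp_spawn" is the 1.3.x-era
# spelling for choosing the distributed backend).
trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn")
```

If the explicit form prints the expected DDP initialization message, the problem lies in backend auto-selection rather than in the process-group setup itself.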
Replies: 2 comments
-
Found a relevant discussion, but I don't think it's applicable here because I construct the dataloader with:

```python
class CustomDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        train_dataset = CustomDataset(
            params=self.params,
            data_params=self.data_params,
            num_workers=self.num_workers,
        )
        return DataLoader(
            train_dataset,
            timeout=self.data_loader_timeout,
            num_workers=self.num_workers,
            batch_size=self.batch_size,
            worker_init_fn=worker_init_fn,
        )
```

and here's what the order of operations looks like:

```python
data_module = CustomDataModule(...)
model = CustomLightningModule(...)
tb_logger = TensorBoardLogger(...)
checkpoint_callback = ModelCheckpoint(...)
trainer = Trainer.from_argparse_args(
    args,
    logger=tb_logger,
    default_root_dir=args.output_dir,
    profiler="pytorch",  # tried removing this and it doesn't make a difference
    callbacks=[checkpoint_callback],
    gpus=args.gpus,
)
trainer.fit(model, data_module)
```
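For completeness, `worker_init_fn` is referenced in the snippet above but its definition isn't shown; a typical definition (a hypothetical stand-in, not the original) seeds each DataLoader worker so their randomness is distinct but reproducible:

```python
import numpy as np
import torch

def worker_init_fn(worker_id):
    # Hypothetical stand-in for the worker_init_fn referenced above:
    # derive a per-worker seed from the process's base seed so each
    # DataLoader worker gets distinct, reproducible randomness.
    worker_seed = (torch.initial_seed() + worker_id) % 2**32
    np.random.seed(worker_seed)
```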
-
The issue comes from the line

```
File "train.py", line 173, in main
    print(f"Logs for this experiment are being saved to {trainer.log_dir}")
```

which tries to access `trainer.log_dir` outside of the trainer scope:

```
File ".../pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/properties.py", line 137, in log_dir
    dirpath = self.accelerator.broadcast(dirpath)
```

`trainer.log_dir` tries to `broadcast` the directory but fails, as DDP hasn't been initialized yet. This is fixed in the 1.4 release, as `broadcast` becomes a no-op in that case.
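To make the fix concrete, here is a minimal sketch of the guard described above (an illustration of the behavior, not the actual Lightning source): `broadcast` returns its input unchanged when the default process group has not been initialized, so reading `trainer.log_dir` before `fit()` no longer raises.

```python
import torch.distributed as dist

def broadcast(obj, src=0):
    # Sketch of the 1.4 no-op behavior: if the default process group
    # hasn't been initialized (e.g. before trainer.fit() spawns DDP
    # workers), return the object as-is instead of attempting a
    # collective op.
    if not dist.is_available() or not dist.is_initialized():
        return obj
    obj_list = [obj]
    dist.broadcast_object_list(obj_list, src=src)
    return obj_list[0]
```

On 1.3.x, a workaround is to avoid `trainer.log_dir` before `fit()`, e.g. by reading the logger's own directory (`tb_logger.log_dir` for a `TensorBoardLogger`), which is computed locally without a broadcast.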