Error resuming from checkpoint with multiple GPUs #11435
Replies: 4 comments 1 reply
-
Okay, so I found a solution to this, although I don't understand why it works. I hadn't specified a distributed mode in my trainer, so it defaulted to "ddp_spawn". Setting strategy="ddp" solved the problem. I hadn't really looked into distributed modes much (figuring the default would probably be fine) and I still don't fully understand the differences, but from what I can tell "ddp" is actually better for me anyway. I'll leave my post up in case anyone else runs into the same problem.
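For reference, the change amounts to passing the strategy explicitly to the Trainer. A minimal sketch, assuming the 1.5-era API (the gpus value is illustrative):

```python
import pytorch_lightning as pl

# With multiple GPUs and no strategy given, training used to default to "ddp_spawn";
# requesting "ddp" launches one subprocess per GPU instead of using
# torch.multiprocessing.spawn.
trainer = pl.Trainer(gpus=2, strategy="ddp")
```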
-
@rubvber https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html has info about ddp vs ddp_spawn. Looking at the trace, the issue seems to happen in move_optimizer_state(); could you share the full log if you still have it? Also, I think it's possible the problem has already been solved by #10059; could you try with a newer Lightning version?
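If it helps with debugging, a quick way to see where the restored optimizer state actually lives is to print the device of every state tensor. This is plain PyTorch, not a Lightning API; `optimizer` is whatever configure_optimizers returned:

```python
import torch

def report_optimizer_state_devices(optimizer: torch.optim.Optimizer) -> None:
    """Print the device of every tensor held in the optimizer's per-parameter state."""
    for param_idx, state in enumerate(optimizer.state.values()):
        for key, value in state.items():
            if torch.is_tensor(value):
                print(f"param {param_idx}: state['{key}'] on {value.device}")
```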
-
Encountered the exact same bug when trying to continue training from a model trained with the ddp_spawn strategy.
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 208, in _wrapped_function
result = function(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 236, in new_process
results = trainer.run_stage()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 219, in advance
self.optimizer_idx,
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 386, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/optimizer.py", line 164, in step
trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 339, in optimizer_step
self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
optimizer.step(closure=closure, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/optim/radam.py", line 128, in step
eps=group['eps'])
File "/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py", line 436, in radam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!
-
Please take a look at this post, #12327, as I have found a quick fix to this issue.
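For anyone stuck on an older release, a generic workaround for this class of device-mismatch error (not necessarily the same as the fix described in #12327) is to move the restored optimizer state onto the local process's device before the first step. A rough sketch:

```python
import torch

def move_optimizer_state_to_device(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    """Move every tensor in the optimizer's state (exp_avg, exp_avg_sq, ...) to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

# e.g. from on_train_start() in a LightningModule:
#     for opt in self.trainer.optimizers:
#         move_optimizer_state_to_device(opt, self.device)
```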
-
I started training a model on two GPUs, using the following trainer:
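Roughly, a two-GPU trainer with the 1.5-era API might look like the sketch below; max_epochs is an illustrative value, and checkpoint_callback refers to the ModelCheckpoint built in the next snippet:

```python
import pytorch_lightning as pl

# Two GPUs; with no strategy specified, the default at the time was "ddp_spawn".
trainer = pl.Trainer(
    gpus=2,
    max_epochs=100,                   # illustrative value
    callbacks=[checkpoint_callback],  # ModelCheckpoint instance, built below
)
```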
This is set to save the best three epochs (based on the validation loss) and the last epoch:
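A checkpoint callback with that behaviour might look like this; the dirpath and monitored metric name are assumptions:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",  # assumed output directory
    monitor="val_loss",      # keep the three best epochs by validation loss
    save_top_k=3,
    save_last=True,          # also write last.ckpt so training can be resumed
)
```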
Training halted unexpectedly and I now want to resume it, which I did by configuring my trainer as follows (note the addition of the last line):
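The "last line" presumably points the trainer at the saved checkpoint. With the 1.5-era API that is typically resume_from_checkpoint (replaced on newer releases by passing ckpt_path to fit); the path below is illustrative:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,
    max_epochs=100,
    callbacks=[checkpoint_callback],
    resume_from_checkpoint="checkpoints/last.ckpt",  # illustrative path
)
# On newer releases the equivalent is:
#     trainer.fit(model, datamodule=dm, ckpt_path="checkpoints/last.ckpt")
```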
But, after initializing the two distributed processes and completing the validation sanity check, this crashes on starting the first step of the new training epoch, giving a long error stack that ends with the same device-mismatch RuntimeError quoted in the reply above ("Expected all tensors to be on the same device, but found at least two devices").
So somehow it seems that it's not correctly dividing all the tensors onto the two GPUs. I wonder if this has to do with how it's loading the checkpoint. Am I doing something wrong here? Is this even possible, and if so how do I do it correctly?
(If I try to resume with a trainer that's set to use just one GPU, there's no problem.)