Error resuming from checkpoint with multiple GPUs #11435
Replies: 4 comments 1 reply
-
Okay, so I found a solution to this, although I don't understand why it works. I hadn't specified a distributed mode in my trainer, so it defaulted to "ddp_spawn". Setting strategy="ddp" solved the problem. I hadn't really looked into distributed modes much (figuring the default would probably be fine) and I still don't fully understand the differences, but from what I can tell "ddp" is actually better for me anyway. I'll leave my post up in case anyone else runs into the same problem.
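For reference, the change amounts to passing the strategy explicitly to the Trainer. A minimal sketch, assuming the 1.5-era API (the gpus value is illustrative):

```python
import pytorch_lightning as pl

# With multiple GPUs and no strategy given, training used to default to "ddp_spawn";
# requesting "ddp" launches one subprocess per GPU instead of using
# torch.multiprocessing.spawn.
trainer = pl.Trainer(gpus=2, strategy="ddp")
```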
-
@rubvber https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html has info about ddp vs ddp_spawn. Looking at the trace, the issue seems to happen in move_optimizer_state(); could you share the full log if you still have it? Also, I think it's possible the problem has already been solved by #10059; could you try with a newer Lightning version?
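If it helps with debugging, a quick way to see where the restored optimizer state actually lives is to print the device of every state tensor. This is plain PyTorch, not a Lightning API; `optimizer` is whatever configure_optimizers returned:

```python
import torch

def report_optimizer_state_devices(optimizer: torch.optim.Optimizer) -> None:
    """Print the device of every tensor held in the optimizer's per-parameter state."""
    for param_idx, state in enumerate(optimizer.state.values()):
        for key, value in state.items():
            if torch.is_tensor(value):
                print(f"param {param_idx}: state['{key}'] on {value.device}")
```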
-
Encountered the exact same bug when trying to continue training from a model trained with the ddp_spawn strategy.
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 208, in _wrapped_function
result = function(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 236, in new_process
results = trainer.run_stage()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 219, in advance
self.optimizer_idx,
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 386, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/optimizer.py", line 164, in step
trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 339, in optimizer_step
self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
optimizer.step(closure=closure, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/optim/radam.py", line 128, in step
eps=group['eps'])
File "/usr/local/lib/python3.7/dist-packages/torch/optim/_functional.py", line 436, in radam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!
-
Please take a look at this post, #12327, as I have found a quick fix to this issue.
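For anyone stuck on an older release, a generic workaround for this class of device-mismatch error (not necessarily the same as the fix described in #12327) is to move the restored optimizer state onto the local process's device before the first step. A rough sketch:

```python
import torch

def move_optimizer_state_to_device(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    """Move every tensor in the optimizer's state (exp_avg, exp_avg_sq, ...) to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

# e.g. from on_train_start() in a LightningModule:
#     for opt in self.trainer.optimizers:
#         move_optimizer_state_to_device(opt, self.device)
```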
-
I started training a model on two GPUs, using the following trainer:
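Roughly, a two-GPU trainer with the 1.5-era API might look like the sketch below; max_epochs is an illustrative value, and checkpoint_callback refers to the ModelCheckpoint built in the next snippet:

```python
import pytorch_lightning as pl

# Two GPUs; with no strategy specified, the default at the time was "ddp_spawn".
trainer = pl.Trainer(
    gpus=2,
    max_epochs=100,                   # illustrative value
    callbacks=[checkpoint_callback],  # ModelCheckpoint instance, built below
)
```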
This is set to save the best three epochs (based on the validation loss) and the last epoch:
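A checkpoint callback with that behaviour might look like this; the dirpath and monitored metric name are assumptions:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",  # assumed output directory
    monitor="val_loss",      # keep the three best epochs by validation loss
    save_top_k=3,
    save_last=True,          # also write last.ckpt so training can be resumed
)
```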
Training halted unexpectedly and I now want to resume it, which I did by configuring my trainer as follows (note the addition of the last line):
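The "last line" presumably points the trainer at the saved checkpoint. With the 1.5-era API that is typically resume_from_checkpoint (replaced on newer releases by passing ckpt_path to fit); the path below is illustrative:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,
    max_epochs=100,
    callbacks=[checkpoint_callback],
    resume_from_checkpoint="checkpoints/last.ckpt",  # illustrative path
)
# On newer releases the equivalent is:
#     trainer.fit(model, datamodule=dm, ckpt_path="checkpoints/last.ckpt")
```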
But, after initializing the two distributed processes and completing the validation sanity check, this crashes on starting the first step of the new training epoch, giving a long error stack that ends with the same device-mismatch RuntimeError quoted in the reply above ("Expected all tensors to be on the same device, but found at least two devices").
So somehow it seems that it's not correctly dividing all the tensors onto the two GPUs. I wonder if this has to do with how it's loading the checkpoint. Am I doing something wrong here? Is this even possible, and if so how do I do it correctly?
(If I try to resume with a trainer that's set to use just one GPU, there's no problem.)