ddp_sharded crash during model save #13951
-
I am trying to train a big StyleGAN model on 4 V100 GPUs. I used the
-
This seems to be consolidating the state dicts from all ranks onto rank 0, which causes the memory issue. If you only need the model's weights in the checkpoint, you can set `save_weights_only=True` on the `ModelCheckpoint` callback. Note that if you do that, you will not be able to resume training from such a checkpoint.
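For reference, a minimal sketch of that setup; the `MyStyleGAN` LightningModule name is hypothetical and only stands in for your own model:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# save_weights_only=True skips optimizer/loop state, so rank 0 does not have to
# hold the full consolidated training state. The trade-off: such a checkpoint
# cannot be used to resume training.
checkpoint_cb = ModelCheckpoint(save_weights_only=True)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp_sharded",
    callbacks=[checkpoint_cb],
)
# trainer.fit(MyStyleGAN(), datamodule=...)  # MyStyleGAN is a placeholder
```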
-
Hi,
-
OK, DeepSpeed with a single optimizer works quite well with only minor changes to the code.
I still have one issue: training on 1 node with 4 GPUs works fine, but switching to 2 nodes with 4 GPUs per node gives CUDA out of memory, which is surprising.
If anyone runs into the same problem with two optimizers, my simple workaround was to use a single optimizer and toggle `requires_grad` to `True` or `False` on either the Generator or the Discriminator. The other limitation of DeepSpeed is that, I think, you can only call `.step()` once on the optimizer per training step.