How save deepspeed stage 3 model with pickle or torch #8910

ViktorThink · 2021-08-14T12:40:26Z

ViktorThink
Aug 14, 2021

Hi, I'm trying to save a model trained using deepspeed stage 2 using this code:

trainer = pl.Trainer(
    gpus=4,
    plugins=DeepSpeedPlugin(
              stage=3,
              cpu_offload=True,
              partition_activations=True,),
    precision=16,
    accelerator="ddp",
    )
trainer.fit(model, train_dataloader)

With stage 2 it worked if I added this code:

trainer = pl.Trainer(gpus=0,max_epochs=0,)
trainer.fit(model, train_dataloader)
pickle.dump(model,open("model.p","wb")

But using stage=3 I get this error:

Traceback (most recent call last):
File "t5-11b-regression.py", line 227, in
torch.save(model,fileName)
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/serialization.py", line 484, in _save
pickler.dump(obj)
AttributeError: Can't pickle local object 'FP16_DeepSpeedZeroOptimizer_Stage3._register_hooks_recursively.._post_forward_module_hook'

I also tried saving using torch.save, but got same error. I also tried both pytorch-lightning version 1.3.8 and 1.4.1

cc: @SeanNaren

Answered by SeanNaren

Dec 17, 2021

After some debugging with a user, I've come up with a final script to show how you can use the convert_zero_checkpoint_to_fp32_state_dict to generate a single file that can be loaded using pickle, or lightning.

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins import DeepSpeedPlugin
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __g…

View full answer

SeanNaren · 2021-08-14T14:14:07Z

SeanNaren
Aug 14, 2021

Hey @ViktorThink!

Thanks for bringing this up, I think we can make this clearer in the documentation for next time.

To save I recommend you using trainer.save_checkpoint('model.pt') once your model has been trained. This is because DeepSpeed requires special care that is handled via the pytorch Trainer, so in your above example:

trainer = pl.Trainer(
    gpus=4,
    plugins=DeepSpeedPlugin(
              stage=3,
              cpu_offload=True,
              partition_activations=True,),
    precision=16,
    accelerator="ddp",
    )
trainer.fit(model, train_dataloader)
trainer.save_checkpoint('model.pt')

Note when using DeepSpeed we save a directory not a single file. More information can be read in the documentation here: https://pytorch-lightning.readthedocs.io/en/latest/advanced/advanced_gpu.html#deepspeed

1 reply

ViktorThink Aug 17, 2021
Author

Hi, thank you so much for the answer.

I have tried saving using save_checkpoint, but have run into some errors.

Firstly, I got: TypeError: init() missing 1 required positional argument: 'encoder'
Same as 4390

Then I tried adding self.save_hyperparameters('encoder') to my init function of the class, but then the training never started, the script just froze when it got to trainer.fit

I then tried passing encoder to MyModel.load_from_checkpoint, but got an error about wrong dimensions.

Preferably, I would be able to just convert the model back to normal pytorch and use torch save, like it's possible with deepspeed stage=2. Is that possible?

Zhylkaaa · 2021-10-19T17:20:21Z

Zhylkaaa
Oct 19, 2021

Same error with pytorch-lightning 1.4.9 and 1.5.0rc1 on python 3.7 and 3.8 (DeepSpeedPlugin version is 0.5.4)

after evaluation phase, checkpoint callback tries to save shared model to disk, but torch can't pickle deepspeed's hooks as functions are not pickleble. Is there any way to fix this? (I am also not sure why _save_checkpoint is called in save_non_zero_checkpoint context in deepspeed, maybe this is the problem with miss configuration?)

with Pytorch-lightning 1.4.9 I can use save_full_weights to overcome this.
This is my stack trace:

  File "main.py", line 457, in <module>                                                                                                                                                         
    trainer.fit(model, datamodule=dpr_datamodule)                                                                                                                                               
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 635, in fit                                      
    self._call_and_handle_interrupt(self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule)                                                                                      
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 587, in _call_and_handle_interrupt               
    return trainer_fn(*args, **kwargs)                                                                                                                                                          
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 665, in _fit_impl                                
    self._run(model)                                                                                                                                                                            
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1100, in _run                                    
    self._dispatch()                                                                                                                                                                            
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1179, in _dispatch                               
    self.training_type_plugin.start_training(self)                                                                                                                                              
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 183, in start_training
    self._results = trainer.run_stage()                                                                                                                                                         
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1189, in run_stage                               
    return self._run_train()                                                                                                                                                                    
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1228, in _run_train                              
    self.fit_loop.run()                                                                                                                                                                         
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run                                           
    self.advance(*args, **kwargs)                                                                                                                                                               
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 211, in advance                                   
    self.epoch_loop.run(data_fetcher)                                                                                                                                                           
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 146, in run                                           
    self.on_advance_end()                                                                                                                                                                       
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 233, in on_advance_end           
    self._run_validation()                                                                                                                                                                      
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 330, in _run_validation          
    self.val_loop.run()                                                                                                                                                                         
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 151, in run                                           
    output = self.on_run_end()                                                                                                                                                                  
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 137, in on_run_end              
    self._on_evaluation_end()                                                                                                                                                                   
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 199, in _on_evaluation_end      
    self.trainer.call_hook("on_validation_end", *args, **kwargs)                                                                                                                                
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1392, in call_hook                               
    callback_fx(*args, **kwargs)                                                                                                                                                                
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 221, in on_validation_end                  
    callback.on_validation_end(self, self.lightning_module)                                                                                                                                     
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 327, in on_validation_end             
    self.save_checkpoint(trainer)                                                                                                                                                               
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 384, in save_checkpoint               
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)                                                                                                                             
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 681, in _save_none_monitor_checkpoint 
    trainer.save_checkpoint(filepath, self.save_weights_only)                                                                                                                                   
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1809, in save_checkpoint                         
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)                                                                                                                           
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 465, in save_checkpoint  
    self.trainer.training_type_plugin.save_checkpoint(_checkpoint, filepath)                                                                                                                    
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 714, in save_checkpoint          
    self.deepspeed_engine.save_checkpoint(filepath, client_state=checkpoint)                                                                                                                    
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2231, in save_checkpoint                                  
    self._save_checkpoint(save_dir, tag, client_state=client_state)                                                                                                                             
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2408, in _save_checkpoint                                 
    torch.save(state, save_path)                                                                                                                                                                
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/torch/serialization.py", line 379, in save                                                   
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)                                                                                                                                  
  File "/net/ascratch/people/plgdmytro/anaconda3/envs/deepspeed_env_37/lib/python3.7/site-packages/torch/serialization.py", line 484, in _save                                                  
    pickler.dump(obj)                                                                                                                                                                           
AttributeError: Can't pickle local object 'FP16_DeepSpeedZeroOptimizer_Stage3._register_hooks_recursively.<locals>._post_forward_module_hook'```

1 reply

Zhylkaaa Dec 17, 2021

Turned out I had args with DeepSpeedPlugin itself saved to hparams, that's why this problem occurred. Sorry for bothering, and maybe some lost soul will find this useful.

SeanNaren · 2021-12-17T11:00:19Z

SeanNaren
Dec 17, 2021

After some debugging with a user, I've come up with a final script to show how you can use the convert_zero_checkpoint_to_fp32_state_dict to generate a single file that can be loaded using pickle, or lightning.

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins import DeepSpeedPlugin
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        strategy=DeepSpeedPlugin(stage=2),
        precision=16,
        gpus=2,
        callbacks=ModelCheckpoint(dirpath='checkpoints', save_last=True)
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)

    # once saved via the model checkpoint callback,
    # it saves a folder containing the deepspeed checkpoint rather than a single file
    checkpoint_path = "checkpoints/last.ckpt/"

    if trainer.is_global_zero:
        single_ckpt_path = "single_model.pt"

        # magically converts the folder into a single lightning loadable pytorch file (for ZeRO 1,2 and 3)
        convert_zero_checkpoint_to_fp32_state_dict(checkpoint_path, single_ckpt_path)
        loaded_parameters = BoringModel.load_from_checkpoint(single_ckpt_path).parameters()

        model = model.cpu()
        # Assert model parameters are identical after loading
        for orig_param, saved_model_param in zip(model.parameters(), loaded_parameters):
            if model.dtype == torch.half:
                # moved model to float32 for comparison with single fp32 saved weights
                saved_model_param = saved_model_param.half()
            assert torch.equal(orig_param, saved_model_param)

The above where we use the Trainer as an engine still works, but now you'd need to pass the checkpoint path like so trainer.predict(ckpt_path=..., ...)

1 reply

ViktorThink Dec 17, 2021
Author

Wow, that's great stuff! Super useful for using models trained using deepspeed! Deeply appreciated by me and I'm sure many others.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How save deepspeed stage 3 model with pickle or torch #8910

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

How save deepspeed stage 3 model with pickle or torch #8910

Uh oh!

Uh oh!

ViktorThink Aug 14, 2021

Replies: 3 comments · 3 replies

Uh oh!

SeanNaren Aug 14, 2021

Uh oh!

ViktorThink Aug 17, 2021 Author

Uh oh!

Uh oh!

Zhylkaaa Oct 19, 2021

Uh oh!

Zhylkaaa Dec 17, 2021

Uh oh!

SeanNaren Dec 17, 2021

Uh oh!

ViktorThink Dec 17, 2021 Author

ViktorThink
Aug 14, 2021

Replies: 3 comments 3 replies

SeanNaren
Aug 14, 2021

ViktorThink Aug 17, 2021
Author

Zhylkaaa
Oct 19, 2021

SeanNaren
Dec 17, 2021

ViktorThink Dec 17, 2021
Author