Sharding and training multiple models at once for a large scale reinforcement learning #13601
-
🤯 this is an epic application! I haven't read this paper yet, but from a quick skim it seems really interesting; I'll read it in more depth. From what you've described, the model weights are successfully sharded and kept on all devices (may I ask how many GPUs are being used?). It seems like your observation is that the activations are fairly large, and you'd like to try offloading/partitioning them. From what I see, this is a case for activation checkpointing, which enables partitioning of activations and potentially offloading them to CPU memory. Also you should check out the really helpful guide for transformer models here. You can either pass these arguments directly to the Strategy, or make your own custom DeepSpeed config.
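Something like this should be a reasonable starting point (a sketch; the exact stage/offload flags will depend on your setup, and the activation flags only take effect if the model checkpoints its activations):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

# Stage 3 shards parameters, gradients and optimizer states across GPUs.
# partition_activations / cpu_checkpointing apply to activations saved
# via activation checkpointing inside the model.
strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,       # move optimizer states to CPU RAM
    offload_parameters=True,      # move sharded parameters to CPU RAM
    partition_activations=True,   # shard checkpointed activations across GPUs
    cpu_checkpointing=True,       # offload checkpointed activations to CPU
)

trainer = pl.Trainer(accelerator="gpu", devices=8, precision=16, strategy=strategy)
```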
-
@thomfoster have you cracked this? I am working on the exact same problem with some friends: using PPO to make GPT-J better at conversations (our reward model is trained on a large dataset of user conversations from our app chai.ml). I got good results applying PPO to GPT-2 as my initial policy, but I want to initialise it from GPT-J. I've stuck with DeepSpeed so far, but I'm not getting speedups from increasing the number of GPUs, so I must be doing something wrong.
-
Hey Lightning team (perhaps @SeanNaren would be best placed),
I'm currently replicating this paper by OpenAI, in which they fine-tune large language models with the PPO algorithm from reinforcement learning.
Whilst I have successfully used the Lightning Trainer and DeepSpeed integration in the past to train LLMs up to 20B parameters, I am struggling to get DeepSpeed to correctly shard the models in this case. This is because my LightningModule contains 4 transformers (actor, critic, reference and reward networks). Whilst only 2 of the transformers are being trained and require gradients, all the networks involved are multiple billions of parameters and do not fit onto a 40GB A100 without DeepSpeed.
A simple proof of concept that would be amazing to get up and running is below. The exact architectures / loss computation aren't important; I just want to prove out that I can do two forward passes with gradients and two forward passes without, and combine the outputs together.
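Roughly along these lines (a minimal sketch of the idea; the gpt2-xl checkpoints, the placeholder loss, and the `lightning_transformers` import path are stand-ins, not the real code):

```python
import torch
import pytorch_lightning as pl
from transformers import AutoModelForCausalLM
from lightning_transformers.utilities.deepspeed import (
    enable_transformers_pretrained_deepspeed_sharding,
)

class PPOProofOfConcept(pl.LightningModule):
    def setup(self, stage):
        # Must run before from_pretrained so weights are sharded
        # across devices as each model is instantiated.
        enable_transformers_pretrained_deepspeed_sharding(self)
        self.actor = AutoModelForCausalLM.from_pretrained("gpt2-xl")      # trained
        self.critic = AutoModelForCausalLM.from_pretrained("gpt2-xl")     # trained
        self.reference = AutoModelForCausalLM.from_pretrained("gpt2-xl")  # frozen
        self.reward = AutoModelForCausalLM.from_pretrained("gpt2-xl")     # frozen
        self.reference.requires_grad_(False)
        self.reward.requires_grad_(False)

    def training_step(self, batch, batch_idx):
        # Two forward passes with gradients...
        r1 = self.actor(batch["input_ids"]).logits
        r2 = self.critic(batch["input_ids"]).logits  # <- OOMs here
        # ...and two without.
        with torch.no_grad():
            r3 = self.reference(batch["input_ids"]).logits
            r4 = self.reward(batch["input_ids"]).logits
        # Placeholder loss: the real PPO objective combines all four outputs.
        return (r1 - r3).pow(2).mean() + (r2 - r4).pow(2).mean()

    def configure_optimizers(self):
        # Only the actor and critic are optimised; reference/reward are frozen.
        trainable = [p for p in self.parameters() if p.requires_grad]
        return torch.optim.AdamW(trainable, lr=1e-5)
```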
As you can see, I am initialising the models inside setup after calling the `enable_transformers_pretrained_deepspeed_sharding` method. Whilst the models initialise successfully, we run out of CUDA memory attempting to do the second forward pass (to get r2). Perhaps one way to solve this would be to somehow mark r2 as an activation in the training step so that it can be offloaded to RAM? (Although right now I'm not even sure that's the issue, and am struggling to debug haha!)
Any help would be much appreciated!
Best,
Thom
My trainer config is below:
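Something along these lines (a sketch; the device count, precision and offload flags are assumptions):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,            # e.g. 4x A100 40GB
    precision=16,
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=True,
        offload_parameters=True,
    ),
    max_epochs=1,
)
```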