Gradient Checkpointing breaks peft trainer. #742

Closed
2 of 4 tasks
nivibilla opened this issue Jul 21, 2023 · 8 comments
Comments

@nivibilla (Author)

System Info

I'm using the notebook from the QLoRA repo. I am trying to train with gradient checkpointing enabled; however, this causes an issue.

```
None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

I have tried the solution mentioned in #137; however, that causes the 'Expected to mark a variable ready only once...' error.
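(For context, a commonly suggested workaround for this warning, which may or may not be what #137 proposes, is sketched below; it makes the checkpointed blocks receive inputs that require gradients.)

```python
# Sketch of a commonly suggested workaround (not necessarily the #137 fix):
# ensure inputs to the checkpointed blocks carry requires_grad=True.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
```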

Who can help?

I think this is a library error, so I'm tagging @pacman100 @younesbelkada @sayakpaul.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

I am installing all libraries from source.


```python
import os

from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)

# peft_model, dataset, tokenizer, model, models and model_name come from
# earlier notebook cells (QLoRA-style setup).


class PeftSavingCallback(TrainerCallback):
    """Save only the PEFT adapter weights at each checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_path)

        # Drop the full model weights; only the adapter is needed.
        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))


trainer = Trainer(
    model=peft_model,
    train_dataset=dataset["train"],
    args=TrainingArguments(
        save_steps=0.1,
        gradient_checkpointing=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=10,
        learning_rate=2e-5,
        fp16=True,
        max_grad_norm=0.03,
        logging_steps=1,
        output_dir=models[model_name]["folder_name"],
        optim="paged_adamw_8bit",
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[PeftSavingCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

trainer.train()
```

Expected behavior

Training with gradient checkpointing enabled should work with PEFT.
@nivibilla (Author)

@younesbelkada tagging you as you provided the solution in the other issue.

@younesbelkada (Contributor)

Hi @nivibilla
Which version of PEFT are you using? This should have been fixed by #404, which automatically prepares the model for gradient checkpointing, and we have added CI tests for it, so this feature should work as of today. Can you double-check by uninstalling peft and re-installing it from source?

pip uninstall peft
pip install git+https://github.com/huggingface/peft.git

@nivibilla (Author)

Hey @younesbelkada, thanks for your reply. I am installing from source, but I seem to have solved it.

The code change was to call this before getting the PEFT model.

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

After this I was getting the "Expected to mark a variable ready only once..." error. Then I added

ddp_find_unused_parameters = False

to the TrainingArguments. Even though I was only training on one GPU, this seemed to fix it.
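For anyone hitting the same thing, here is a minimal sketch of how those two changes slot into a QLoRA-style setup. The model name, LoRA settings, and output directory below are placeholders, not taken from the notebook:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# 1) Prepare the quantized model *before* wrapping it with PEFT, with
#    gradient checkpointing enabled so inputs keep requires_grad=True.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
peft_model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

training_args = TrainingArguments(
    output_dir="out",  # placeholder
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    fp16=True,
    # 2) Avoids the "Expected to mark a variable ready only once" error
    #    if the trainer ends up wrapping the model in DDP.
    ddp_find_unused_parameters=False,
)
```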

The Colab notebook from the QLoRA repo also works on Colab if you run it as-is, so I'm not sure why I was having the issue; the only things I changed were the model (Llama 2) and the data.

@nivibilla (Author)

As far as I'm concerned this issue is solved, so I'm closing it, but please let me know if you need any more information.

@younesbelkada (Contributor)

I see, this is a bit strange, as I have managed to fine-tune Llama 2 on a single T4 without any problem using PEFT. Which transformers version are you using?

@nivibilla (Author)

I am also installing transformers from source. Also (as part of another issue) I can confirm that you can fine-tune on a T4, but I didn't realise that the sequence length affected the VRAM usage so much. I had some rare 4k-token samples in my data which caused OOM, but limiting the sequence length to 1024 meant that I could use batch size 32 on a 24GB GPU.
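A rough sketch of such a length cap during tokenization, assuming a plain "text" column and an already loaded tokenizer (the column name and the 1024 limit are just example values):

```python
# Sketch only: "text" column name and the 1024 cap are example values,
# assuming `tokenizer` and `dataset` are already loaded as in the notebook.
MAX_LEN = 1024

def tokenize(batch):
    # Truncate the rare very long samples so they don't spike VRAM usage.
    return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
```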

@nivibilla (Author)

nivibilla commented Jul 24, 2023

Also, I should probably note that I am doing this on Databricks, so there are a lot of issues with the way they set up their clusters. That could be it too.

@younesbelkada
Copy link
Contributor

> I am doing this on Databricks, so there are a lot of issues with the way they set up their clusters. That could be it too.

Yes, that could definitely explain the issue! Maybe there is some sort of silent bug that converts the model into a DDP model.
Thanks a lot for sharing all these details!
