Gradient Checkpointing breaks peft trainer. #742

Closed
2 of 4 tasks
nivibilla opened this issue Jul 21, 2023 · 8 comments
Comments

@nivibilla (Author)

System Info

I'm using the notebook from the QLoRA repo. I am trying to train with gradient checkpointing enabled; however, this causes an issue.

```
None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

I have tried the solution mentioned in #137; however, that causes the 'Expected to mark a variable ready only once...' error.
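(For context, a commonly suggested workaround for this warning, which may or may not be what #137 proposes, is sketched below; it makes the checkpointed blocks receive inputs that require gradients.)

```python
# Sketch of a commonly suggested workaround (not necessarily the #137 fix):
# ensure inputs to the checkpointed blocks carry requires_grad=True.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
```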

Who can help?

I think this is a library error, so I'm tagging @pacman100 @younesbelkada @sayakpaul.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

I am installing all libraries from source.


```python
import os

from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)

# peft_model, dataset, tokenizer, model, models and model_name come from
# earlier notebook cells (QLoRA-style setup).


class PeftSavingCallback(TrainerCallback):
    """Save only the PEFT adapter weights at each checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_path)

        # Drop the full model weights; only the adapter is needed.
        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))


trainer = Trainer(
    model=peft_model,
    train_dataset=dataset["train"],
    args=TrainingArguments(
        save_steps=0.1,
        gradient_checkpointing=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=10,
        learning_rate=2e-5,
        fp16=True,
        max_grad_norm=0.03,
        logging_steps=1,
        output_dir=models[model_name]["folder_name"],
        optim="paged_adamw_8bit",
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[PeftSavingCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

trainer.train()
```

Expected behavior

Training with gradient checkpointing enabled should work with PEFT.
@nivibilla (Author)

@younesbelkada tagging you as you provided the solution in the other issue.

@younesbelkada (Contributor)

Hi @nivibilla
Which version of PEFT are you using? This should have been fixed by #404, which automatically prepares the model for gradient checkpointing, and we have added CI tests for it, so this feature should work as of today. Can you double-check by uninstalling peft and re-installing it from source?

pip uninstall peft
pip install git+https://github.com/huggingface/peft.git

@nivibilla (Author)

Hey @younesbelkada, thanks for your reply. I am installing from source, but I seem to have solved it.

The code change was to call this before getting the PEFT model.

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

After this I was getting the "Expected to mark a variable ready only once..." error. Then I added

ddp_find_unused_parameters = False

to the TrainingArguments. Even though I was only training on one GPU, this seemed to fix it.
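For anyone hitting the same thing, here is a minimal sketch of how those two changes slot into a QLoRA-style setup. The model name, LoRA settings, and output directory below are placeholders, not taken from the notebook:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# 1) Prepare the quantized model *before* wrapping it with PEFT, with
#    gradient checkpointing enabled so inputs keep requires_grad=True.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
peft_model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

training_args = TrainingArguments(
    output_dir="out",  # placeholder
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    fp16=True,
    # 2) Avoids the "Expected to mark a variable ready only once" error
    #    if the trainer ends up wrapping the model in DDP.
    ddp_find_unused_parameters=False,
)
```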

The Colab notebook from the QLoRA repo also works on Colab if you run it as-is, so I'm not sure why I was having the issue; the only things I changed were the model (Llama 2) and the data.

@nivibilla (Author)

As far as I'm concerned this issue is solved, so I'm closing it, but please let me know if you need any more information.

@younesbelkada (Contributor)

I see, this is a bit strange, as I have managed to fine-tune Llama 2 on a single T4 without any problem using PEFT. Which transformers version are you using?

@nivibilla (Author)

I am also installing transformers from source. Also (as part of another issue) I can confirm that you can fine-tune on a T4, but I didn't realise that the sequence length affected the VRAM usage so much. I had some rare 4k-token samples in my data which caused OOM, but limiting the sequence length to 1024 meant that I could use batch size 32 on a 24GB GPU.
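A rough sketch of such a length cap during tokenization, assuming a plain "text" column and an already loaded tokenizer (the column name and the 1024 limit are just example values):

```python
# Sketch only: "text" column name and the 1024 cap are example values,
# assuming `tokenizer` and `dataset` are already loaded as in the notebook.
MAX_LEN = 1024

def tokenize(batch):
    # Truncate the rare very long samples so they don't spike VRAM usage.
    return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
```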

@nivibilla (Author)

nivibilla commented Jul 24, 2023

Also, I should probably note that I am doing this on Databricks, so there are a lot of issues with the way they set up their clusters. That could be it too.

@younesbelkada
Copy link
Contributor

> I am doing this on Databricks, so there are a lot of issues with the way they set up their clusters. That could be it too.

Yes, that could definitely explain the issue! Maybe there is some sort of silent bug that converts the model into a DDP model.
Thanks a lot for sharing all these details!
