Gradient checkpointing breaks the PEFT trainer #742
Comments
@younesbelkada tagging you as you provided the solution in the other issue.
Hi @nivibilla

pip uninstall peft
pip install git+https://github.com/huggingface/peft.git
Hey @younesbelkada, thanks for your reply. I am installing from source, but I seem to have solved it. The code change was to call this before getting the PEFT model.
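The snippet this comment refers to did not survive the page capture. The change usually needed in this situation (an assumption based on the symptoms, not confirmed by the thread) is to make the frozen base model's inputs require gradients before wrapping it with PEFT; model name and LoRA settings below are placeholders:

```python
# Hedged sketch: the model name and config values are placeholders, not from the thread.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", load_in_4bit=True)
model.gradient_checkpointing_enable()

# Without this call, no input to the checkpointed segments requires grad,
# so the backward graph through the frozen base model is never built.
model.enable_input_require_grads()

model = get_peft_model(model, LoraConfig(r=16, task_type="CAUSAL_LM"))
```

PEFT's `prepare_model_for_kbit_training` helper performs this call (plus some dtype housekeeping) in one step, which is why installing PEFT from source was suggested above.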
After this I was getting the "Expected to mark a variable ready only once..." error. Then I added this
in the trainer. Even though I was only training on one GPU, this seemed to fix it. The Colab notebook from the QLoRA repo also works on Colab if you run it as is, so I'm not sure why I was having the issue; the only things I changed were the model (Llama 2) and the data.
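The trainer-side change mentioned above was also lost in the capture. The usual fix for the "Expected to mark a variable ready only once" error (again an assumption, not confirmed by the thread) is to stop DDP from scanning for unused parameters, since that scan conflicts with gradient checkpointing's re-run of the forward pass:

```python
# Hedged sketch: output_dir and other values are placeholders.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    # Gradient checkpointing replays the forward pass during backward, so
    # DDP's unused-parameter scan can mark a variable ready twice; disable it.
    ddp_find_unused_parameters=False,
)
```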
As far as I'm concerned this issue is solved, so I'm closing it, but please let me know if you need any more information.
I see, this is a bit strange, as I have managed to fine-tune Llama 2 on a single T4 without any problem using PEFT. What transformers version are you using?
I am also installing transformers from source. And also (part of another issue) I can confirm you can fine-tune on a T4, but I didn't realise that the sequence length affected the VRAM usage so much. I had some rare 4k samples in my data which caused OOM, but limiting to 1024 meant that I could do batch size 32 on a 24 GB GPU.
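The jump in VRAM with sequence length is easy to underestimate because attention-score activations grow quadratically with it. A back-of-the-envelope sketch (the head count is an arbitrary placeholder and cancels out of the ratio; kernels like FlashAttention avoid materializing this matrix):

```python
def attn_score_elements(seq_len: int, n_heads: int = 32) -> int:
    """Elements in one layer's attention-score matrix: n_heads x seq_len x seq_len."""
    return n_heads * seq_len * seq_len

# A rare 4k sample versus the 1024 cap the commenter settled on:
ratio = attn_score_elements(4096) / attn_score_elements(1024)
print(ratio)  # 16.0 -- 16x the attention-score memory per layer
```

This is why a handful of 4k outliers can OOM a run that is otherwise comfortable at 1024.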
Also, I should probably note that I am doing this on Databricks, so there are a lot of issues with the way they make their clusters. Could be that too.
Yes, that could definitely explain the issue! Maybe there is some sort of silent bug that converts the model into a
System Info
I'm using the notebook shown in the QLoRA repo. I am trying to train using the notebook with gradient checkpointing enabled; however, this causes an issue.
I have tried the solution mentioned in #137; however, that causes the "Expected to mark a variable ready only once..." error.
Who can help?
I think this is a library error so tagging @pacman100 @younesbelkada @sayakpaul
Information
Tasks
An officially supported task in the examples folder
Reproduction
I am installing all libraries from source.
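"From source" here presumably means the main branches of both libraries; the peft URL appears earlier in the thread, and the transformers URL below is the standard equivalent (assumed, not quoted from the issue):

```shell
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/peft.git
```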