PR #389 breaks Flash Attention 2 with peft #790
Comments
You can check out my blog: https://www.philschmid.de/instruction-tune-llama-2. It includes flash attention and works with PEFT.
Thanks for your blog - it's helped me immensely! I followed the steps you gave at https://www.philschmid.de/instruction-tune-llama-2 and was able to get flash attention working (on an H100), but only after adding a couple of extra lines. Initially I was getting the same error, but I was finally able to solve it by following this line (from your instructions):
with these two extra lines (my addition):
Just FYI. I'm not sure why this worked (or why you didn't need this but I did), but I just thought it might be of interest to you, and possibly helpful to others. (BTW, for the sake of others besides @philschmid, ...)
@davidsvaughn
I was following @philschmid's blog (extremely useful btw, thank you!) and got the same error. @davidsvaughn's comment also worked for me, so ty!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
@davidsvaughn and @fullanton, which package is utils imported from? I am getting a "utils module not found" error. Please help!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hey @sids07, this is from here: https://github.com/philschmid/deep-learning-pytorch-huggingface/tree/main/training/
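For anyone else hitting that import error: the helpers come from utils/llama_patch.py inside that training/ folder, so the script has to be run from there (or the folder added to PYTHONPATH). A rough usage sketch follows; the helper names and call order are my recollection of that repo, so double-check them against the current code there:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: run from the training/ directory of that repo so that
# utils/llama_patch.py is importable; the helper names below are taken from
# that repo and may have changed since, so treat this as a sketch.
from utils.llama_patch import replace_attn_with_flash_attn, upcast_layer_for_flash_attention

# 1) Patch the Llama attention implementation before the model is created.
replace_attn_with_flash_attn()

# 2) Load the model as in the blog post (the 4-bit / QLoRA setup is omitted here).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)

# 3) Cast layer norms / embeddings so the patched attention receives bf16 inputs.
model = upcast_layer_for_flash_attention(model, torch.bfloat16)
```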
@younesbelkada, should this latest issue be fixed? I had a similar issue running https://huggingface.co/larryvrh/Yi-6B-200K-Llamafied today.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
This should be fixed, I think, with the latest PEFT! Closing it for now; feel free to open a new issue if that's not the case.
System Info
peft = 0.4.0
accelerate = 0.21.0
transformers = 4.31.0
Ubuntu 22.04, PyTorch 2.0.1, CUDA 11.8, NVIDIA A6000, Python 3.10
Who can help?
@pacman100
Information

Tasks

An officially supported task in the examples folder

Reproduction
Sorry to keep harping on this (see issues #422 and #423), but the type casting introduced in PR #389 now breaks Flash Attention for Llama with PEFT / QLoRA, since Flash Attention only works with fp16/bf16. Here is the relevant code:
https://huggingface.co/togethercomputer/LLaMA-2-7B-32K/blob/main/modeling_flash_llama.py
Without Flash Attention, runs that use the full context window (now 4,096 tokens for Llama 2) fail for lack of GPU memory. A condensed reproduction sketch is shown below.
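To make the failure mode concrete, here is a sketch of the kind of setup that triggers it (the model id, 4-bit config, and dummy forward pass are illustrative, not my exact training script): after PR #389, prepare_model_for_kbit_training upcasts the remaining fp16/bf16 weights to fp32, and the flash-attention forward then rejects the resulting fp32 activations.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Illustrative: a Llama checkpoint whose remote modeling code uses Flash
# Attention (e.g. the modeling_flash_llama.py linked above), loaded in 4-bit.
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Since PR #389, this call (in peft/utils/other.py) casts the non-quantized
# weights (layer norms, embeddings, lm_head) up to fp32 ...
model = prepare_model_for_kbit_training(model)

# ... so the attention layers now see fp32 activations and the forward pass
# fails with: RuntimeError: FlashAttention only support fp16 and bf16 data type
dummy_input = torch.ones((1, 8), dtype=torch.long, device=model.device)
model(dummy_input)
```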
Expected behavior
The model (Llama 2) should run, but instead it raises a type error:
RuntimeError: FlashAttention only support fp16 and bf16 data type
See also this issue: artidoro/qlora#221
Monkey-patching out the upcasting in other.py fixes the issue.
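For reference, rather than patching peft/utils/other.py directly, one possible workaround sketch (the helper name below is mine, not a PEFT API) is to cast the upcast fp32 weights back down to bf16 after prepare_model_for_kbit_training has run, so the flash-attention kernels receive a supported dtype again:

```python
import torch
from torch import nn

def downcast_fp32_params(model: nn.Module, dtype: torch.dtype = torch.bfloat16) -> nn.Module:
    """Hypothetical helper: cast any fp32 parameters (the layer norms, embeddings
    and lm_head that PEFT upcast) back to `dtype` so FlashAttention accepts them."""
    for param in model.parameters():
        if param.dtype == torch.float32:
            param.data = param.data.to(dtype)
    return model

# Usage, immediately after prepare_model_for_kbit_training(model):
# model = downcast_fp32_params(model, torch.bfloat16)
```

This is, I believe, essentially what the upcast helper in @philschmid's repo does for the layer norms; the trade-off is giving up the fp32 norms that PR #389 introduced, presumably for training stability.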