13B On 24GB go OOM #225
Related question: if I had to drop some of the modules being adapted (out of q, k, v, o, up, down, gate), which should I drop first?
You should probably first limit your batch size to 1 if you haven't already done so. Decreasing the LoRA r to 8 may also help to a limited extent without affecting model performance.
If you're still getting OOM after that, then perhaps your training data has excessively long sequence lengths. With my 24GB RTX 3090 I could go up to about 3750 tokens with these settings.
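For concreteness, here is a minimal sketch of that advice using peft and transformers; the module names and the remaining hyperparameters are illustrative assumptions, not settings taken from this thread.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# A lower LoRA rank and fewer adapted modules shrink the adapter footprint;
# a micro-batch of 1 keeps activation memory as small as possible.
lora_config = LoraConfig(
    r=8,                                   # try 8 (or lower) instead of larger ranks
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # adapt fewer modules if memory is tight
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,         # batch size 1 as suggested above
    gradient_accumulation_steps=16,        # recover a larger effective batch if needed
    gradient_checkpointing=True,           # trade compute for activation memory
    bf16=True,
)
```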
I've basically stripped everything down: I'm only training q and v, with batch size 1, gradient accumulation 1, and LoRA r = 4, but I still get OOM.
Do you mind sharing your code/script? 24GB should be sufficient for 13B.
Sure, here is the converted notebook, or here as a notebook in Drive.
The code looks reasonable. One thing I noticed is that you are not controlling the number of tokens in your data. I think in our experiments we keep the maximum sequence length to 256 or 512, depending on the dataset, and truncate the rest. This could explain your code running fine for a few steps and then hitting OOM on a longer sequence. If you do want to train on longer sequences that don't fit in memory, maybe try splitting your model across two GPUs. This should be handled pretty easily with Accelerate or by defining a custom device map that puts layers on different GPUs.
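A rough sketch of both suggestions (truncating every example to a fixed token budget, and sharding the model across GPUs with a device map); the model name, length limit, and memory caps below are assumptions for illustration.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder; use your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Cap every example at a fixed maximum length so a single long sample
# cannot blow up activation memory partway through training.
batch = tokenizer(
    ["a very long training example ..."],
    max_length=512,
    truncation=True,
    return_tensors="pt",
)

# Alternatively, split the model across two GPUs. device_map="auto" lets
# accelerate place layers automatically, respecting the max_memory caps.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},
    load_in_4bit=True,   # QLoRA-style 4-bit weights via bitsandbytes
)
```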
Ah okay, I didn't know that. I have some samples that are close to 4k tokens. Does fine-tuning at length 512 still perform well when you test on longer sequences?
I am not sure, to be honest. Maybe if you are using LLaMA 2, its pretraining on a 4k context is sufficient even if fine-tuning is only on 512 tokens, especially if you use relatively few examples, similar to LIMA or what we did with Guanaco. But this is still an open question as far as I know.
Ah okay, cool. I will let you know how it goes. I plan to train on 50k data rows max.
I will confirm that limiting to 512 fixes this and then close the issue.
You might want to look into flash attention for long sequences. There was some discussion of this in issue #221. If you have an interest in long-sequence modeling, we would appreciate some help getting an example with flash attention working in QLoRA.
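(Not from this thread, but possibly useful to later readers: recent transformers releases can enable FlashAttention 2 when loading a model, assuming the flash-attn package is installed; whether this composes cleanly with the QLoRA training setup was exactly the open question here. A hedged sketch:)

```python
import torch
from transformers import AutoModelForCausalLM

# Requires a recent transformers release plus the flash-attn package;
# the model name and dtype below are illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```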
Thanks! Will have a look.
@artidoro I have checked: limiting the sequence length to 512 works fine. It might be worth mentioning this somewhere in case others have the same issue. Closing this issue as my problem is fixed.
@artidoro For reference, I was able to go to sequence length 1024 with an effective batch size of 32 (actual batch size 4 and gradient accumulation 8, with gradient checkpointing enabled), all on a single 24GB GPU.
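Those numbers map roughly onto the following TrainingArguments; only the batch size, accumulation steps, and gradient checkpointing flag come from the comment above, the rest is an assumed sketch.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,    # actual batch size of 4
    gradient_accumulation_steps=8,    # 4 * 8 = effective batch size 32
    gradient_checkpointing=True,      # needed to fit seq len 1024 on 24GB
    bf16=True,
)
# The sequence length itself is capped at tokenization time,
# e.g. tokenizer(..., max_length=1024, truncation=True).
```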
Could you please share where to adjust the sequence length?
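(If it helps: as far as I can tell, the qlora.py script in this repo controls sequence length through its --source_max_len and --target_max_len arguments, which truncate the prompt and response before collation; check the current script for the exact flag names and defaults.)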
Hey, how can I stop training from going OOM when training a 13B model on a 24GB GPU?