
OOM error while running LoRA #5

Open
bmanikan opened this issue Sep 3, 2023 · 11 comments

Comments

@bmanikan

bmanikan commented Sep 3, 2023

My Parameters are:

python lit-gpt/finetune/lora.py --data_dir data/dolly/ --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --precision bf16-true --out_dir out/lora/llama-2-7b
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'override_max_seq_length': 512, 'learning_rate': 0.0002, 'batch_size': 1, 'micro_batch_size': 1, 'gradient_accumulation_iters': 1, 'max_iters': 20000, 'weight_decay': 0.01, 'lora_r': 4, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': True, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 4, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': True, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}

The error trace is:

Estimated TFLOPs: 154.46
Measured TFLOPs: 134.33
Traceback (most recent call last):
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 390, in <module>
    CLI(setup)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, cfg_init)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/jsonargparse/_cli.py", line 181, in _run_component
    return component(**cfg)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 116, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 834, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 177, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 248, in train
    logits = model(input_ids, max_seq_length=max_seq_length, lm_head_chunk_size=64)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 118, in forward
    output = self._forward_module(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/lora.py", line 525, in forward
    x, *_ = block(x, (cos, sin), max_seq_length)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/model.py", line 173, in forward
    x = x + self.mlp(self.norm_2(x))
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/model.py", line 294, in forward
    x_fc_2 = self.fc_2(x)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/lora.py", line 146, in forward
    pretrained = self.linear(x)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 41.75 MiB is free. Including non-PyTorch memory, this process has 23.15 GiB memory in use. Of the allocated memory 22.22 GiB is allocated by PyTorch, and 481.55 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I went all the way down to a batch_size of 1 and reduced all the other parameters, but I am still getting this OOM error.

I have an RTX 4090 GPU with 24 GB of VRAM.

Can anyone help?

@ayulockin
Owner

Try turning lora_key off (lora_key=False).

You can also try lowering lora_r from 4 to 2.

Your override_max_seq_length is 512, which can be reduced as well.

Note that all of this will reduce the quality of the fine-tuned model, but you will at least have a baseline and can work from there.
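
If your checkout is like mine, these knobs are module-level constants near the top of finetune/lora.py rather than CLI flags, so the change is a file edit along these lines (untested sketch; names taken from the config dict printed above):

# finetune/lora.py -- hyperparameter constants near the top of the file (untested sketch)
override_max_seq_length = 256   # was 512; shorter sequences mean smaller activation tensors
micro_batch_size = 1
lora_r = 2                      # was 4
lora_alpha = 16
lora_dropout = 0.05
lora_query = True
lora_key = False                # disable the key-projection adapter
lora_value = True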

@bmanikan
Author

bmanikan commented Sep 4, 2023

I tried that, but I am still getting the OOM error. I went all the way down to 8 for override_max_seq_length.
I am using CUDA version 12.1; could that be a problem?

@nahidalam
Contributor

I have the same problem. I think 24 GB of memory is not enough for this.

@ayulockin
Owner

Did you try fine-tuning with QLoRA? I guess quantising the weights to 4 bits might help.

Another suggestion would be to use the SGD optimiser instead of AdamW. The Adam optimiser maintains two states per trainable parameter, which adds roughly twice the trainable-parameter memory on top of the weights themselves. Using SGD might help. You can change the optimiser in this line of code: https://github.com/ayulockin/lit-gpt/blob/b6829289f977e65c3588bbb28737986fe38f8ec1/finetune/lora.py#L154
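
The quantize argument is already part of setup() (you can see it being passed to main in the traceback above), so QLoRA should mostly be a matter of passing that flag with bitsandbytes installed. For the optimiser swap, the change around the linked line would look roughly like this (untested sketch; the surrounding variable names may differ slightly in your checkout):

# finetune/lora.py, around the line linked above (untested sketch)
# AdamW keeps two extra state tensors per trainable parameter:
# optimizer = torch.optim.AdamW(trainable_params, lr=learning_rate, weight_decay=weight_decay)
# SGD keeps no extra state (unless momentum is enabled):
optimizer = torch.optim.SGD(trainable_params, lr=learning_rate, weight_decay=weight_decay)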

@bmanikan
Author

bmanikan commented Sep 7, 2023

I have tried both QLoRA and SGD, but no luck. The 3B model runs perfectly.
Does having 32 GB of RAM affect this?

@ayulockin
Owner

I don't think RAM should be an issue here. I am fine-tuning a 7B model on an A100 right now with the Lion optimizer, micro_batch_size=2, and a batch_size of 128.
[screenshot of the training run]
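
(If you want to try the same thing, the Lion swap looks roughly like this, assuming the third-party lion-pytorch package; untested sketch, variable names approximate.)

# pip install lion-pytorch
from lion_pytorch import Lion

# Lion tracks a single momentum buffer per parameter, so it sits between SGD and AdamW
# in optimizer-state memory; it typically wants a smaller learning rate than AdamW.
optimizer = Lion(trainable_params, lr=learning_rate, weight_decay=weight_decay)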

@nahidalam
Contributor

@ayulockin is your A100 40GB or 80GB?

@nahidalam
Contributor

@bmanikan which 3B model did you use?

@nahidalam
Contributor

I got hold of an A100 with 40 GB of memory and am running into the same issue. I tried everything that @ayulockin suggested.

My parameters are:

python lit-gpt/finetune/lora.py --data_dir data/dolly/ --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --precision bf16-true --out_dir out/lora/llama-2-7b
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'override_max_seq_length': 256, 'learning_rate': 0.0002, 'batch_size': 4, 'micro_batch_size': 2, 'gradient_accumulation_iters': 2, 'max_iters': 20000, 'weight_decay': 0.01, 'lora_r': 2, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 2, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}

OOM error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 99.06 MiB is free. Including non-PyTorch memory, this process has 39.29 GiB memory in use. Of the allocated memory 37.73 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@ayulockin
Owner

I have a 40 GB A100.

Is flash attention correctly installed in your environment?
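
One quick check from the training environment (as far as I know lit-gpt goes through torch's scaled_dot_product_attention, so this just confirms the flash kernel is available; if it prints False, the fallback path uses noticeably more memory):

python -c "import torch; print(torch.backends.cuda.flash_sdp_enabled())"

You could also try the allocator hint from the OOM message before rerunning, although it only mitigates fragmentation rather than an outright shortage:

# fragmentation hint suggested by the error message; the value is just a starting point
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128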

@nahidalam
Contributor

@ayulockin I validated the setup with the command you shared and it seemed fine.

python lit-gpt/generate/base.py --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --prompt "Tell me an interesting fun fact about earth:"
