
Device_map='auto' not working along with bitsandbytes (transformers) #1411

Closed

guillemram97 opened this issue Nov 12, 2024 · 3 comments

guillemram97 commented Nov 12, 2024

System Info

Hardware: Amazon Linux EC2 instance with 8× NVIDIA A10G (23 GB each)

Python 3.10.14
CUDA Version: 12.4
accelerate==0.34.2
bitsandbytes==0.44.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
torch==2.4.1
transformers==4.45.1

Reproduction

from accelerate import infer_auto_device_map
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto', quantization_config=bnb_config)

device_map = infer_auto_device_map(model, max_memory={0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('', 0)])

However, if I load without the quantization_config, there is no issue at all:

model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto')
device_map = infer_auto_device_map(model, max_memory={0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('model.embed_tokens', 0), ('lm_head', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7.self_attn', 0), ('model.layers.7.mlp.gate_proj', 0), ('model.layers.7.mlp.up_proj', 0), ('model.layers.7.mlp.down_proj', 1), ('model.layers.7.mlp.act_fn', 1), ('model.layers.7.input_layernorm', 1), ('model.layers.7.pre_feedforward_layernorm', 1), ('model.layers.7.post_feedforward_layernorm', 1), ('model.layers.7.post_attention_layernorm', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 1), ('model.layers.17.self_attn', 1), ('model.layers.17.mlp.gate_proj', 1), ('model.layers.17.mlp.up_proj', 1), ('model.layers.17.mlp.down_proj', 2), ('model.layers.17.mlp.act_fn', 2), ('model.layers.17.input_layernorm', 2), ('model.layers.17.pre_feedforward_layernorm', 2), ('model.layers.17.post_feedforward_layernorm', 2), ('model.layers.17.post_attention_layernorm', 2), ('model.layers.18', 2), ('model.layers.19', 2), ('model.layers.20', 2), ('model.layers.21', 2), ('model.layers.22', 2), ('model.layers.23', 2), ('model.layers.24', 2), ('model.layers.25', 2), ('model.layers.26', 2), ('model.layers.27.self_attn', 2), ('model.layers.27.mlp.gate_proj', 2), ('model.layers.27.mlp.up_proj', 2), ('model.layers.27.mlp.down_proj', 3), ('model.layers.27.mlp.act_fn', 3), ('model.layers.27.input_layernorm', 3), ('model.layers.27.pre_feedforward_layernorm', 3), ('model.layers.27.post_feedforward_layernorm', 3), ('model.layers.27.post_attention_layernorm', 3), ('model.layers.28', 3), ('model.layers.29', 3), ('model.layers.30', 3), ('model.layers.31', 3), ('model.layers.32', 3), ('model.layers.33', 3), ('model.layers.34', 3), ('model.layers.35', 3), ('model.layers.36', 3), ('model.layers.37.self_attn', 3), ('model.layers.37.mlp.gate_proj', 3), ('model.layers.37.mlp.up_proj', 3), ('model.layers.37.mlp.down_proj', 4), ('model.layers.37.mlp.act_fn', 4), ('model.layers.37.input_layernorm', 4), ('model.layers.37.pre_feedforward_layernorm', 4), ('model.layers.37.post_feedforward_layernorm', 4), ('model.layers.37.post_attention_layernorm', 4), ('model.layers.38', 4), ('model.layers.39', 4), ('model.layers.40', 4), ('model.layers.41', 4), ('model.layers.42', 4), ('model.layers.43', 4), ('model.layers.44', 4), ('model.layers.45', 4), ('model.norm', 4)])
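
As a workaround, a minimal sketch (not the resolution of this issue): from_pretrained also accepts max_memory directly, so the per-GPU budget can be applied while the quantized weights are being placed, instead of calling infer_auto_device_map afterwards. The 23 GB budgets mirror the values above, and hf_device_map is the attribute transformers sets whenever a device_map is used:

# Sketch of a workaround: pass max_memory to from_pretrained so the device
# map is computed at load time under the same 23 GB-per-GPU budget.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
max_memory = {i: "23GB" for i in range(8)}  # one entry per visible GPU

model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-27b-it',
    device_map='auto',
    max_memory=max_memory,
    quantization_config=bnb_config,
)
print(model.hf_device_map)  # the map actually used to place the modules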

Expected behavior

With quantization enabled, the model is being loaded (almost) entirely onto a single GPU, whereas I'd expect it to be spread across the different GPUs. Moreover, infer_auto_device_map does not seem to be working: it just returns OrderedDict([('', 0)]).
I have experienced a very similar issue on different hardware.
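
For reference, the documented infer_auto_device_map flow computes the map on an empty meta-device model before any weights are loaded; a minimal sketch of that pattern:

# Standard accelerate pattern: build the model on the meta device (no
# weight allocation), infer the map, then load with that map.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('google/gemma-2-27b-it')
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={i: "23GB" for i in range(8)},
)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map=device_map)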

ra-MANUJ-an commented
I'm getting the same issue. Can anyone answer?

guillemram97 (Author) commented

I think I've isolated part of the issue. When I hide one of the GPUs, the model is split across the remaining GPUs:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
I don't know whether the issue is version-specific or happens in any setup with more than 7 GPUs. Interestingly enough, 8 GPUs worked fine for Mistral-7B.
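
The same workaround from inside Python, as a minimal sketch (the variable must be set before anything initializes CUDA):

# Hide GPU 7 before torch/transformers touch CUDA; this must run before
# the first CUDA-initializing import.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6"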

guillemram97 (Author) commented

This was an issue with accelerate; the fix is here: huggingface/accelerate#3244
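
Assuming the linked fix has landed in a released accelerate version (worth checking against that PR), upgrading and re-running the original snippet should yield a multi-device map. A quick sanity check:

# After `pip install -U accelerate`, re-load and confirm the quantized
# model is now spread across devices (hf_device_map is set whenever a
# device_map is used).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto', quantization_config=bnb_config)
devices = set(model.hf_device_map.values())
assert len(devices) > 1, f"still on a single device: {devices}"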
