Description
System Info
transformers==5.0.0dev
autoawq==0.2.9
autoawq_kernels==0.0.9
torch==2.6.0+cu124
Who can help?
Since PR #35235, past_key_values is no longer returned by the attention modules.
However, when using AWQ models with fused modules (see the AWQ fused modules docs), this raises an error like the one in issue #38554:
hidden_states, _ = self.self_attn(
ValueError: too many values to unpack (expected 2)
We can patch awq.modules.fused.attn.QuantAttentionFused so that it no longer returns past_key_values; I have opened an initial PR #41909 to do that.
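For reference, here is a minimal sketch of that workaround as a user-side monkey patch. It assumes QuantAttentionFused.forward currently returns a 3-tuple (attn_output, attn_weights, past_key_value); the actual fix in the PR lives inside the integration code rather than in user scripts.

# Hypothetical workaround sketch: drop the trailing past_key_value from the fused
# attention output so that `hidden_states, _ = self.self_attn(...)` unpacks cleanly.
from awq.modules.fused.attn import QuantAttentionFused

_original_forward = QuantAttentionFused.forward

def _forward_without_cache(self, *args, **kwargs):
    outputs = _original_forward(self, *args, **kwargs)
    # Assumption: the fused module returns (attn_output, attn_weights, past_key_value).
    if isinstance(outputs, tuple) and len(outputs) == 3:
        return outputs[0], outputs[1]
    return outputs

QuantAttentionFused.forward = _forward_without_cache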
However, for special rope_type values such as llama3, AutoAWQ's RoPE implementation raises an error, because awq.modules.fused.attn.RoPE only supports the default RoPE.
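As an illustration, here is a sketch of how the unsupported case could be detected before enabling fusing. It assumes the checkpoint config exposes the rope type through rope_scaling, as Llama 3.1 checkpoints do; the local path is the one from the reproduction below.

from transformers import AutoConfig

# Hypothetical check: Llama 3.1 configs carry rope_scaling={"rope_type": "llama3", ...},
# which the fused AWQ RoPE cannot reproduce, so fusing should be skipped or extended.
config = AutoConfig.from_pretrained("./llama-3.1-8b-instruct-awq")
rope_scaling = getattr(config, "rope_scaling", None) or {}
rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))

if rope_type != "default":
    print(f"rope_type={rope_type!r} is not supported by the fused AWQ RoPE; disable do_fuse for now")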
Maybe we could implement and maintain AwqRoPE and AwqQuantAttentionFused in transformers.integrations.awq? Alternatively, huggingface/AutoAWQ could be maintained going forward, since casper-hansen/AutoAWQ is archived.
I'd like to refine my PR to help transformers fix this bug!
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
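# Multi-turn generation with a fused AWQ model; the first model.generate call
# hits the fused-attention unpacking error described above.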
from transformers import AwqConfig, AutoModelForCausalLM, AutoTokenizer
# model_path = "./llama-3.1-8b-instruct-awq"
model_path = "./qwen2.5-7b-instruct-awq"
# model_path = "./qwen3-8b-awq"
awq_config = AwqConfig(
bits=4,
do_fuse=True,
fuse_max_seq_len=8192
)
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=awq_config).to("cuda:0")
print(model)
tokenizer = AutoTokenizer.from_pretrained(model_path)
max_new_tokens = 1024 if "qwen3" in model_path else 32
messages = []
prompt1 = "What is the result of 3+5?"
messages.append({"role": "user", "content": prompt1})
text1 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs1 = tokenizer(text1, return_tensors="pt").to("cuda:0")
generated_ids1 = model.generate(**inputs1, max_new_tokens=max_new_tokens)
output_ids1 = generated_ids1[0, len(inputs1.input_ids[0]) :].tolist()
output1 = tokenizer.decode(output_ids1, skip_special_tokens=True)
messages.append({"role": "assistant", "content": output1})
print("Output 1:", output1)
prompt2 = "What about adding 10 to that result?"
messages.append({"role": "user", "content": prompt2})
text2 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs2 = tokenizer(text2, return_tensors="pt").to("cuda:0")
generated_ids2 = model.generate(**inputs2, max_new_tokens=max_new_tokens)
output_ids2 = generated_ids2[0, len(inputs2.input_ids[0]) :].tolist()
output2 = tokenizer.decode(output_ids2, skip_special_tokens=True)
messages.append({"role": "assistant", "content": output2})
print("Output 2:", output2)Expected behavior
There is no error.