Breaking change about AWQ Fused modules due to Attention Refactor #41910

@fanqiNO1

System Info

transformers==5.0.0dev
autoawq==0.2.9
autoawq_kernels==0.0.9
torch==2.6.0+cu124

Who can help?

Since PR #35235, attention modules no longer return past_key_values.

However, when using AWQ models with fused modules (see the AWQ Fused modules docs), this raises an error like the one in issue #38554:

    hidden_states, _ = self.self_attn(
ValueError: too many values to unpack (expected 2)

As a workaround, we can patch awq.modules.fused.attn.QuantAttentionFused so that it no longer returns past_key_values; I opened a preliminary PR #41909 for that. A rough sketch of the idea is shown below.
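
A minimal sketch of the kind of patch I mean, assuming QuantAttentionFused.forward currently returns a 3-tuple ending in past_key_value (the structure here is illustrative, not the actual PR #41909 code):

    # Illustrative monkey-patch: drop the trailing past_key_value from AutoAWQ's
    # fused attention so it matches the two values that the refactored decoder
    # layers in transformers now unpack.
    from awq.modules.fused.attn import QuantAttentionFused

    _original_forward = QuantAttentionFused.forward

    def _patched_forward(self, *args, **kwargs):
        outputs = _original_forward(self, *args, **kwargs)
        # Assumption: AutoAWQ returns (attn_output, attn_weights, past_key_value).
        if isinstance(outputs, tuple) and len(outputs) == 3:
            return outputs[:2]
        return outputs

    QuantAttentionFused.forward = _patched_forward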

However, for a special rope_type such as llama3 (used by Llama 3.1), the RoPE implementation in AutoAWQ still fails, because awq.modules.fused.attn.RoPE supports only the default RoPE; see the sketch after this paragraph.
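
For reference, the llama3 rope_type rescales the inverse frequencies, which the plain default table built by awq.modules.fused.attn.RoPE does not model. A rough sketch of how transformers computes them (assuming ROPE_INIT_FUNCTIONS is still exposed in transformers.modeling_rope_utils, and reusing the quantized Llama path from the reproduction below):

    # Rough sketch: inspect the rope_type-specific inverse frequencies that the
    # fused AutoAWQ RoPE would need to reproduce.
    from transformers import AutoConfig
    from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS

    config = AutoConfig.from_pretrained("./llama-3.1-8b-instruct-awq")
    rope_type = (config.rope_scaling or {}).get("rope_type", "default")  # "llama3" here

    # Each init function returns (inv_freq, attention_scaling); for "llama3" the
    # low- and mid-frequency components are rescaled relative to the default RoPE.
    inv_freq, attention_scaling = ROPE_INIT_FUNCTIONS[rope_type](config, device="cpu")
    print(rope_type, inv_freq.shape, attention_scaling)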

Perhaps we could implement and maintain AwqRoPE and AwqQuantAttentionFused in transformers.integrations.awq? Alternatively, we could maintain huggingface/AutoAWQ, since casper-hansen/AutoAWQ is archived.

I'd like to refine my PR to help transformers fix this bug!

@SunMarc @MekkCyber

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AwqConfig, AutoModelForCausalLM, AutoTokenizer


# model_path = "./llama-3.1-8b-instruct-awq"
model_path = "./qwen2.5-7b-instruct-awq"
# model_path = "./qwen3-8b-awq"

# do_fuse=True swaps in AutoAWQ's fused attention modules, which is what
# triggers the unpacking error described above
awq_config = AwqConfig(
    bits=4,
    do_fuse=True,
    fuse_max_seq_len=8192,
)

model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=awq_config).to("cuda:0")
print(model)
tokenizer = AutoTokenizer.from_pretrained(model_path)

max_new_tokens = 1024 if "qwen3" in model_path else 32


messages = []

prompt1 = "What is the result of 3+5?"
messages.append({"role": "user", "content": prompt1})
text1 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs1 = tokenizer(text1, return_tensors="pt").to("cuda:0")

generated_ids1 = model.generate(**inputs1, max_new_tokens=max_new_tokens)
output_ids1 = generated_ids1[0, len(inputs1.input_ids[0]) :].tolist()
output1 = tokenizer.decode(output_ids1, skip_special_tokens=True)
messages.append({"role": "assistant", "content": output1})
print("Output 1:", output1)

prompt2 = "What about adding 10 to that result?"
messages.append({"role": "user", "content": prompt2})
text2 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs2 = tokenizer(text2, return_tensors="pt").to("cuda:0")

generated_ids2 = model.generate(**inputs2, max_new_tokens=max_new_tokens)
output_ids2 = generated_ids2[0, len(inputs2.input_ids[0]) :].tolist()
output2 = tokenizer.decode(output_ids2, skip_special_tokens=True)
messages.append({"role": "assistant", "content": output2})
print("Output 2:", output2)

Expected behavior

Generation completes without errors.
