Breaking change about AWQ Fused modules due to Attention Refactor #41910

@fanqiNO1

System Info

transformers==5.0.0dev
autoawq==0.2.9
autoawq_kernels==0.0.9
torch==2.6.0+cu124

Who can help?

Since PR #35235, attention modules no longer return past_key_values.

However, when using AWQ models with fused modules (see the AWQ Fused modules docs), this raises an error like the one in issue #38554:

    hidden_states, _ = self.self_attn(
ValueError: too many values to unpack (expected 2)

As a workaround, we can patch awq.modules.fused.attn.QuantAttentionFused so that it no longer returns past_key_values; I opened a preliminary PR #41909 for that. A rough sketch of the idea is shown below.
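
A minimal sketch of the kind of patch I mean, assuming QuantAttentionFused.forward currently returns a 3-tuple ending in past_key_value (the structure here is illustrative, not the actual PR #41909 code):

    # Illustrative monkey-patch: drop the trailing past_key_value from AutoAWQ's
    # fused attention so it matches the two values that the refactored decoder
    # layers in transformers now unpack.
    from awq.modules.fused.attn import QuantAttentionFused

    _original_forward = QuantAttentionFused.forward

    def _patched_forward(self, *args, **kwargs):
        outputs = _original_forward(self, *args, **kwargs)
        # Assumption: AutoAWQ returns (attn_output, attn_weights, past_key_value).
        if isinstance(outputs, tuple) and len(outputs) == 3:
            return outputs[:2]
        return outputs

    QuantAttentionFused.forward = _patched_forward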

However, for a special rope_type such as llama3 (used by Llama 3.1), the RoPE implementation in AutoAWQ still fails, because awq.modules.fused.attn.RoPE supports only the default RoPE; see the sketch after this paragraph.
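
For reference, the llama3 rope_type rescales the inverse frequencies, which the plain default table built by awq.modules.fused.attn.RoPE does not model. A rough sketch of how transformers computes them (assuming ROPE_INIT_FUNCTIONS is still exposed in transformers.modeling_rope_utils, and reusing the quantized Llama path from the reproduction below):

    # Rough sketch: inspect the rope_type-specific inverse frequencies that the
    # fused AutoAWQ RoPE would need to reproduce.
    from transformers import AutoConfig
    from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS

    config = AutoConfig.from_pretrained("./llama-3.1-8b-instruct-awq")
    rope_type = (config.rope_scaling or {}).get("rope_type", "default")  # "llama3" here

    # Each init function returns (inv_freq, attention_scaling); for "llama3" the
    # low- and mid-frequency components are rescaled relative to the default RoPE.
    inv_freq, attention_scaling = ROPE_INIT_FUNCTIONS[rope_type](config, device="cpu")
    print(rope_type, inv_freq.shape, attention_scaling)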

Perhaps we could implement and maintain AwqRoPE and AwqQuantAttentionFused in transformers.integrations.awq? Alternatively, we could maintain huggingface/AutoAWQ, since casper-hansen/AutoAWQ is archived.

I'd like to refine my PR to help transformers fix this bug!

@SunMarc @MekkCyber

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AwqConfig, AutoModelForCausalLM, AutoTokenizer


# model_path = "./llama-3.1-8b-instruct-awq"
model_path = "./qwen2.5-7b-instruct-awq"
# model_path = "./qwen3-8b-awq"

# do_fuse=True swaps in AutoAWQ's fused attention modules, which is what
# triggers the unpacking error described above
awq_config = AwqConfig(
    bits=4,
    do_fuse=True,
    fuse_max_seq_len=8192,
)

model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=awq_config).to("cuda:0")
print(model)
tokenizer = AutoTokenizer.from_pretrained(model_path)

max_new_tokens = 1024 if "qwen3" in model_path else 32


messages = []

prompt1 = "What is the result of 3+5?"
messages.append({"role": "user", "content": prompt1})
text1 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs1 = tokenizer(text1, return_tensors="pt").to("cuda:0")

generated_ids1 = model.generate(**inputs1, max_new_tokens=max_new_tokens)
output_ids1 = generated_ids1[0, len(inputs1.input_ids[0]) :].tolist()
output1 = tokenizer.decode(output_ids1, skip_special_tokens=True)
messages.append({"role": "assistant", "content": output1})
print("Output 1:", output1)

prompt2 = "What about adding 10 to that result?"
messages.append({"role": "user", "content": prompt2})
text2 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs2 = tokenizer(text2, return_tensors="pt").to("cuda:0")

generated_ids2 = model.generate(**inputs2, max_new_tokens=max_new_tokens)
output_ids2 = generated_ids2[0, len(inputs2.input_ids[0]) :].tolist()
output2 = tokenizer.decode(output_ids2, skip_special_tokens=True)
messages.append({"role": "assistant", "content": output2})
print("Output 2:", output2)

Expected behavior

Generation completes without errors.
