
[Bug]: InternVL2 support for AWQ quantization #1929

@marvinzh

Description

⚙️ Your current environment

Environment Information

Operating System: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
Python Version: 3.10.18 (main, Jun 5 2025, 13:14:17) [GCC 11.2.0]
llm-compressor Version: 0.8.1
compressed-tensors Version: 0.12.2
transformers Version: 4.56.2
torch Version: 2.8.0
CUDA Devices: ['NVIDIA A100-SXM4-80GB']
AMD Devices: None

πŸ› Describe the bug

Hi team,
we are trying to use AWQ to quantize the InternVL2 model https://huggingface.co/OpenGVLab/InternVL2-8B.

At first, we hit a torch.fx tracing issue: the int() cast in the code below is not supported by torch.fx. The traceback points at extract_feature, and pixel_shuffle (shown after it) uses the same int() pattern:

   (self.create_arg(fn(*args)),),
  File "InternVLChatModel_8756586350349_autowrapped", line 20, in forward
    input_embeds = input_embeds.reshape(B * N, C)
  File "/home/xxx/.cache/huggingface/modules/transformers_modules/InternVL2-8B/modeling_internvl_chat.py", line 198, in extract_feature
    h = w = int(vit_embeds.shape[1] ** 0.5)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'HFProxy'
def pixel_shuffle(self, x, scale_factor=0.5):
        n, w, h, c = x.size()
        # N, W, H, C --> N, W, H * scale, C // scale
        x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
        # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
        x = x.permute(0, 2, 1, 3).contiguous()
        # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
        x = x.view(n, int(h * scale_factor), int(w * scale_factor),
                   int(c / (scale_factor * scale_factor)))
        if self.ps_version == 'v1':
            warnings.warn("In ps_version 'v1', the height and width have not been swapped back, "
                          'which results in a transposed image.')
        else:
            x = x.permute(0, 2, 1, 3).contiguous()
        return x

Since the cast is not strictly necessary (these values are fixed once the model is trained), we manually rewrote the expressions to remove the int() casts.
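For reference, the manual edit looks roughly like the sketch below (the concrete value is illustrative, not the real InternVL2-8B number; the only point is that int() is no longer applied to an fx proxy):

# modeling_internvl_chat.py, extract_feature (sketch of the manual rewrite)
# before: int() on an HFProxy breaks torch.fx tracing
h = w = int(vit_embeds.shape[1] ** 0.5)
# after: hard-code the patch-grid side, which is fixed for a trained checkpoint
# (16 is an illustrative value only)
h = w = 16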

The process can proceed after the rewrite; however, we then hit the following issue in the second propagation step:

Preparing cache: 100%|████████████████████████████████████████| 100/100 [00:02<00:00, 37.15it/s]
(1/2): Calibrating: 100%|████████████████████████████████████████| 100/100 [00:06<00:00, 14.79it/s]
Smoothing: 0it [00:00, ?it/s]
(1/2): Propagating: 100%|████████████████████████████████████████| 100/100 [00:05<00:00, 18.77it/s]
(2/2): Calibrating: 100%|████████████████████████████████████████| 100/100 [00:00<00:00, 130.18it/s]
Smoothing: 0it [00:00, ?it/s]
(2/2): Propagating:   0%|                                                                                                             | 0/100 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/mnt/xxxx/workspace/xxxxxx/quant/test_awq.py", line 97, in <module>
    oneshot(
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
    one_shot()
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
    self.apply_recipe_modifiers(
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
    pipeline(
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 112, in __call__
    inputs = activations.fetch(batch_idx, subgraph.input_names)
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/pipelines/cache.py", line 104, in fetch
    return {
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/pipelines/cache.py", line 105, in <dictcomp>
    key: self._onload_value(subgraph_input)
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/pipelines/cache.py", line 210, in _onload_value
    raise e
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/pipelines/cache.py", line 205, in _onload_value
    setattr(value, field.name, self._onload_value(v))
  File "/mnt/xxxxx/xxxxxx/miniconda/envs/quant/lib/python3.10/site-packages/llmcompressor/pipelines/cache.py", line 195, in _onload_value
    value = intermediate.value
AttributeError: 'NoneType' object has no attribute 'value'

Following the stack trace, we printed the variable that triggers the exception in cache.py using the following code snippet (its output is shown right after):

        if is_dataclass(value):
            for field in fields(value):  # `asdict` is recursive, not applicable here
                v = getattr(value, field.name)
                try: 
                    setattr(value, field.name, self._onload_value(v))
                except Exception as e:
                    print("value:", value)
                    print("field:", field)
                    print("v:", v)
                    raise e
value: CausalLMOutputWithPast(loss=None, logits=tensor([[[-6.5625, -5.8438, -5.3125,  ..., -4.2500, -5.4062, -5.5625],
         [ 3.1562,  4.3125,  1.0391,  ...,  5.6875,  5.4688,  5.0625],
         [ 4.5000,  4.5625,  3.0625,  ...,  6.9375,  7.0312,  6.5312],
         ...,
         [ 8.0625,  9.3750,  6.9688,  ...,  9.6250,  9.5000,  9.1875],
         [ 4.8750,  5.7500,  5.8438,  ...,  6.2500,  6.1562,  6.0000],
         [ 2.1562,  3.3906,  0.6875,  ...,  3.4688,  3.1406,  3.2812]]],
       device='cuda:0'), past_key_values=None, hidden_states=None, attentions=None)
field: Field(name='loss',type=typing.Optional[torch.FloatTensor],default=None,default_factory=<dataclasses._MISSING_TYPE object at 0x7f60485226b0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)
v: None

It looks like the loss field is None, which is unexpected. Could you please advise on what to do next to resolve this issue?
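In case it helps, a naive local workaround we considered (a sketch only, based on the cache.py excerpt above; we have not verified whether skipping None fields is safe for the rest of the pipeline) would be to leave None dataclass fields untouched instead of recursing into them:

        if is_dataclass(value):
            for field in fields(value):  # `asdict` is recursive, not applicable here
                v = getattr(value, field.name)
                if v is None:
                    # e.g. CausalLMOutputWithPast.loss when no loss was computed;
                    # there is nothing to onload, so keep the field as-is
                    continue
                setattr(value, field.name, self._onload_value(v))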

🛠️ Steps to reproduce

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping

from utils import load_image

# Load model.
model_id = "OpenGVLab/InternVL2-8B" # better to download and use following code snippet to replace `forward` method
model = AutoModel.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
print(model)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
NUM_CALIBRATION_SAMPLES = 100
DATASET = "xxx.jsonl"

def preprocess(example):
    messages = []
    for turn in example["conversations"]:
        if turn["from"] == "human":
            messages.append({
                "role": "user",
                "content": turn["value"]
            })
        elif turn["from"] == "gpt":
            pass
        else:
            raise ValueError
            
    prompt_ids = tokenizer.apply_chat_template(messages)
    example["input_ids"] = prompt_ids
    return example

# Load dataset and preprocess.
ds = load_dataset('json', data_files=DATASET, split='train')

ds = ds.map(preprocess)
print(ds[0])

def data_collator(batch):
    assert len(batch) == 1
    item = {key: value for key, value in batch[0].items()}
    item["pixel_values"] = torch.concat([load_image(x) for x in item["image"]])
    item["labels"] = torch.LongTensor([item["input_ids"]])
    item["input_ids"] = torch.LongTensor([item["input_ids"]])
    return item
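
# For reference, each record in DATASET is expected to look roughly like the
# dict below (inferred from `preprocess` and `data_collator` above; the path
# and text are placeholders, not values from the real calibration file):
#
# {
#     "conversations": [
#         {"from": "human", "value": "Describe this picture."},
#         {"from": "gpt", "value": "..."},  # assistant turns are skipped by preprocess
#     ],
#     "image": ["path/to/image_0.jpg"],     # each entry is passed to load_image
# }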


# Recipe
recipe = [
    AWQModifier(
        ignore=["re:.*lm_head", "re:mlp1.*", "re:.*vision_model.*"],
        scheme="W4A16", 
        targets=["Linear"],
        offload_device=torch.device("cpu"),
        mappings=[
                # AWQMapping(
                #     "re:.*input_layernorm",
                #     ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"],
                # ),
                AWQMapping("re:.*v_proj", ["re:.*o_proj"]),
                # AWQMapping(
                #     "re:.*post_attention_layernorm",
                #     ["re:.*gate_proj", "re:.*up_proj"],
                # ),
                # AWQMapping(
                #     "re:.*up_proj",
                #     ["re:.*down_proj"],
                # ),
            ]
    ),
]


# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    sequential_targets=["InternLM2ForCausalLM"]
)

Replace the forward method in modeling_internvl_chat.py with the following for a fast reproduction:

    def forward(
            self,
            pixel_values: torch.FloatTensor,
            input_ids: torch.LongTensor = None,
            attention_mask: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.LongTensor] = None,
            image_flags: Optional[torch.LongTensor] = None,
            past_key_values: Optional[List[torch.FloatTensor]] = None,
            labels: Optional[torch.LongTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # image_flags = image_flags.squeeze(-1)
        input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()

        # vit_embeds = self.extract_feature(pixel_values)
        # vit_embeds = vit_embeds[image_flags == 1]
        # vit_batch_size = pixel_values.shape[0]

        # B, N, C = input_embeds.shape
        # input_embeds = input_embeds.reshape(B * N, C)

        # if torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
        #     print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')

        # input_ids = input_ids.reshape(B * N)
        # selected = (input_ids == self.img_context_token_id)
        # try:
        #     input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
        # except Exception as e:
        #     vit_embeds = vit_embeds.reshape(-1, C)
        #     print(f'warning: {e}, input_embeds[selected].shape={input_embeds[selected].shape}, '
        #           f'vit_embeds.shape={vit_embeds.shape}')
        #     n_token = selected.sum()
        #     input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds[:n_token]

        # input_embeds = input_embeds.reshape(B, N, C)

        outputs = self.language_model(
            inputs_embeds=input_embeds,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        logits = outputs.logits

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Metadata

Labels: awq (For any issue / PR related to AWQ support), bug (Something isn't working)
