
Can't load a just-built Pixtral quant; RuntimeError: start (0) + length (1280) exceeds dimension size (1024). #1127

Open · sjuxax opened this issue Feb 5, 2025 · 6 comments
Labels: bug (Something isn't working)

sjuxax commented Feb 5, 2025

Describe the bug
Just built a Pixtral quant using the example script and git HEAD of llm-compressor. I can't load it in vLLM at HEAD; loading fails with RuntimeError: start (0) + length (1280) exceeds dimension size (1024).

Expected behavior
Expected the model to load and run correctly.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]: Arch
  2. Python version [e.g. 3.7]: 3.12
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: caee1c8
  4. ML framework version(s) [e.g. torch 2.3.1]: torch 2.5.1
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
     - compressed-tensors 0.9.1
     - numpy 1.26.4
     - vllm 0.7.2.dev59+g998669c7e.d20250205.cu128
  6. Other relevant environment information [e.g. hardware, CUDA version]: CUDA 12.8, GeForce RTX 3090 Ti, NVIDIA driver 570.86.16

To Reproduce
Exact steps to reproduce the behavior:
Build a Pixtral quant and observe that vLLM can't load it.
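
For concreteness, here is a minimal sketch of the one-shot W4A16 flow I mean by "build a Pixtral quant". This is a paraphrase of the example script, not a verbatim copy: the model ID, calibration dataset, and ignore patterns are assumptions based on llm-compressor's multimodal examples, and the real script also wires up a vision data collator.

# Hypothetical reconstruction of the quantization step (not the exact example
# script): one-shot GPTQ W4A16 with llm-compressor. The W4A16 scheme's default
# group size of 128 matches the "-G128" in the output path above.
from transformers import AutoProcessor, LlavaForConditionalGeneration

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistral-community/pixtral-12b"  # assumed source checkpoint

model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize only the language-model Linears; keep lm_head and the vision tower
# in full precision (ignore patterns assumed from the multimodal examples).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(
    model=model,
    recipe=recipe,
    dataset="flickr30k",              # calibration data (assumption)
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="pixtral-12b-W4A16-G128",
)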

Errors
If applicable, add a full print-out of any errors or exceptions that are raised or include screenshots to help explain your problem.

vLLM Traceback
INFO 02-05 16:23:33 gpu_model_runner.py:867] Starting to load model /intnvme/models/pixtral-12b-W4A16-G128/...
INFO 02-05 16:23:33 config.py:2993] cudagraph sizes specified by model runner [] is overridden by config []
DEBUG 02-05 16:23:33 decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.qkv_proj
INFO 02-05 16:23:33 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.o_proj
INFO 02-05 16:23:33 cuda.py:158] Using Flash Attention backend on V1 engine.
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.gate_up_proj
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.down_proj
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.1.self_attn.qkv_proj
[... identical "Using scheme: CompressedTensorsWNA16" lines for the qkv_proj, o_proj, gate_up_proj, and down_proj of layers 1 through 39 elided ...]
DEBUG 02-05 16:23:34 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.39.mlp.down_proj
No ROCm runtime is found, using ROCM_HOME='/opt/rocm'
INFO 02-05 16:23:34 topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
DEBUG 02-05 16:23:34 config.py:3408] enabled custom ops: Counter({'rms_norm': 130, 'silu_and_mul': 41, 'rotary_embedding': 1})
DEBUG 02-05 16:23:34 config.py:3410] disabled custom ops: Counter()
DEBUG 02-05 16:23:34 config.py:3408] enabled custom ops: Counter({'rms_norm': 130, 'silu_and_mul': 41, 'rotary_embedding': 1})
DEBUG 02-05 16:23:34 config.py:3410] disabled custom ops: Counter()
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
DEBUG 02-05 16:23:34 utils.py:154] Loaded weight lm_head.weight with shape torch.Size([131072, 5120])
ERROR 02-05 16:23:34 core.py:210] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 202, in run_engine_core
ERROR 02-05 16:23:34 core.py:210]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 02-05 16:23:34 core.py:210]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 156, in __init__
ERROR 02-05 16:23:34 core.py:210]     super().__init__(vllm_config, executor_class)
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 51, in __init__
ERROR 02-05 16:23:34 core.py:210]     self.model_executor = executor_class(vllm_config)
ERROR 02-05 16:23:34 core.py:210]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 51, in __init__
ERROR 02-05 16:23:34 core.py:210]     self._init_executor()
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 42, in _init_executor
ERROR 02-05 16:23:34 core.py:210]     self.collective_rpc("load_model")
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 02-05 16:23:34 core.py:210]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-05 16:23:34 core.py:210]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2220, in run_method
ERROR 02-05 16:23:34 core.py:210]     return func(*args, **kwargs)
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 143, in load_model
ERROR 02-05 16:23:34 core.py:210]     self.model_runner.load_model()
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 869, in load_model
ERROR 02-05 16:23:34 core.py:210]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 02-05 16:23:34 core.py:210]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 02-05 16:23:34 core.py:210]     return loader.load_model(vllm_config=vllm_config)
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 386, in load_model
ERROR 02-05 16:23:34 core.py:210]     loaded_weights = model.load_weights(
ERROR 02-05 16:23:34 core.py:210]                      ^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llava.py", line 727, in load_weights
ERROR 02-05 16:23:34 core.py:210]     return loader.load_weights(weights)
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 235, in load_weights
ERROR 02-05 16:23:34 core.py:210]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 02-05 16:23:34 core.py:210]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 196, in _load_module
ERROR 02-05 16:23:34 core.py:210]     yield from self._load_module(prefix,
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 173, in _load_module
ERROR 02-05 16:23:34 core.py:210]     loaded_params = module_load_weights(weights)
ERROR 02-05 16:23:34 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 567, in load_weights
ERROR 02-05 16:23:34 core.py:210]     return loader.load_weights(
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 235, in load_weights
ERROR 02-05 16:23:34 core.py:210]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 02-05 16:23:34 core.py:210]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 196, in _load_module
ERROR 02-05 16:23:34 core.py:210]     yield from self._load_module(prefix,
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 173, in _load_module
ERROR 02-05 16:23:34 core.py:210]     loaded_params = module_load_weights(weights)
ERROR 02-05 16:23:34 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 427, in load_weights
ERROR 02-05 16:23:34 core.py:210]     weight_loader(param, loaded_weight, shard_id)
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 812, in weight_loader_v2
ERROR 02-05 16:23:34 core.py:210]     param.load_qkv_weight(loaded_weight=loaded_weight,
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/parameter.py", line 151, in load_qkv_weight
ERROR 02-05 16:23:34 core.py:210]     loaded_weight = loaded_weight.narrow(self.output_dim,
ERROR 02-05 16:23:34 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] RuntimeError: start (0) + length (1280) exceeds dimension size (1024).
ERROR 02-05 16:23:34 core.py:210]

Additional context
Add any other context about the problem here. Also include any relevant files.

sjuxax added the bug label on Feb 5, 2025
kylesayrs self-assigned this on Feb 5, 2025
kylesayrs (Collaborator) commented

I suspect this is related to the most recent updates to Pixtral in vLLM and transformers. You may have to update to the most recent transformers version. I'll attempt to verify on my side.

sjuxax (Author) commented Feb 5, 2025

fwiw, I couldn't get it to build unless I was on 4.48.2 exactly -- 4.49.x complains about a missing image_sizes forward argument. I'd think the version that builds the quant should be able to run it, but I'll try with transformers git HEAD and report back shortly.

sjuxax (Author) commented Feb 5, 2025

❯ bat /home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/transformers-4.49.0.dev0.dist-info/direct_url.json
───────┬──────────────────────────────────────────────────────────
       │ File: /home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/transformers-4.49.0.dev0.dist-info/direct_url.json
───────┼──────────────────────────────────────────────────────────
   1   │ {"url":"https://github.com/huggingface/transformers.git","vcs_info":{"vcs":"git","commit_id":"0de15c988b0d27758ce360adb2627e9ea99e91b3"}}

Same error with Transformers 0de15c988b0d27758ce360adb2627e9ea99e91b3

kylesayrs (Collaborator) commented

I was able to replicate this issue; working on a fix.

kylesayrs (Collaborator) commented

This is an ongoing issue with saving the Pixtral config, being tracked here.

In the meantime, you can patch your config with these options:

  "text_config": {
    "hidden_size": 5120,
    "head_dim": 128,
    "intermediate_size": 14336,
    "is_composition": true,
    "max_position_embeddings": 1024000,
    "model_type": "mistral",
    "num_hidden_layers": 40,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-05,
    "rope_theta": 1000000000.0,
    "sliding_window": null,
    "vocab_size": 131072
  },
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "vision_config": {
    "head_dim": 64,
    "hidden_act": "silu",
    "image_size": 1024,
    "is_composition": true,
    "model_type": "pixtral",
    "patch_size": 16,
    "rope_theta": 10000.0,
    "tie_word_embeddings": false
  },
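
Why the missing head_dim produces exactly this error (my reading of the traceback, not an upstream-confirmed diagnosis): without an explicit head_dim, the loader falls back to hidden_size // num_attention_heads, which is wrong for Mistral-NeMo-style models whose head size is 128, not 160. The shape arithmetic lines up with the narrow() failure:

# Shape arithmetic behind the narrow() failure (hidden_size and
# num_key_value_heads from the config above; num_attention_heads = 32 is
# assumed from the Mistral-NeMo architecture).
hidden_size = 5120
num_attention_heads = 32
num_key_value_heads = 8

inferred_head_dim = hidden_size // num_attention_heads  # 160, fallback when head_dim is absent
actual_head_dim = 128                                   # value the patch restores

print(num_key_value_heads * inferred_head_dim)  # 1280 -> "length (1280)" in the error
print(num_key_value_heads * actual_head_dim)    # 1024 -> "dimension size (1024)"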

Then run with vLLM:

from vllm import LLM

llm = LLM(
    "/home/kyle/llm-compressor/pixtral-12b-W4A16-G128",
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)

sjuxax (Author) commented Feb 9, 2025

Thanks, adding head_dim: 128 to the text_config got the quant working here. Hopefully this will get fixed upstream soon, but grateful for the workaround!
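
If it saves anyone a trip into the file, the same patch can be applied with a short script (hypothetical helper using my local path; adjust as needed):

# One-off fix: write the missing head_dim into the quant's config.json.
import json
from pathlib import Path

cfg_path = Path("/intnvme/models/pixtral-12b-W4A16-G128/config.json")
cfg = json.loads(cfg_path.read_text())
cfg.setdefault("text_config", {})["head_dim"] = 128  # Mistral-NeMo head size
cfg_path.write_text(json.dumps(cfg, indent=2))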
