P0: 4-bit quantized inference produces incorrect output on RDNA 3.5 (gfx1151) — affine_dequantize_packed_kernel miscompile

## Problem

4-bit quantized models on ROCm RDNA 3.5 (gfx1151) produce garbage inference output. FP16 on the same GPU passes, and the same 4-bit weights pass on CPU and on macOS.

### Symptom

```
Prompt: "What is 2+2? Reply with just the number."
Expected: 4
Actual: "nipple 2 2 2 2" (repeating garbage)
```

Server log:
```
[hipBLASLt] first call
[hipBLASLt] M=13 N=248320 K=1024 ta=0 tb=1 lda=1024 ldb=1024 ldc=248320
```

### Affected Models

- Qwen3.5-0.8B-4bit
- Qwen3-Coder-30B-A3B-Instruct-4bit
- Gemma-4-26B/31B-it-4bit
- Qwen3.6-35B-A3B-4bit
- All MoE 4-bit models on ROCm gfx1151

### Unaffected

| Configuration | Status |
|---------------|--------|
| FP16 on ROCm gfx1151 | PASS |
| 4-bit on CPU | PASS |
| 4-bit on macOS (Apple Silicon) | PASS |

---

## Root Cause

**NOT in hipBLASLt** — the corruption happens in `affine_dequantize_packed_kernel` (in NripeshN/mlx) BEFORE hipBLASLt touches the data.

### Execution Path

1. `QuantizedMatmul::eval_gpu` → dequant+GEMM path
2. `affine_dequantize()` → launches **`affine_dequantize_packed_kernel`** ← **corruption here**
3. `dequant_rocblas_gemm()` → rocBLAS/hipBLASLt (receives already-corrupted data)

### Proof

| Model | Path | hipBLASLt? | Works? |
|-------|------|------------|--------|
| FP16 | `Matmul → gemm_rocblas` | Yes | Yes |
| 4-bit | `QMM → dequant → gemm_rocblas` | Yes | No |

Both paths end in the same hipBLASLt GEMM, yet only the 4-bit path fails. The one step unique to the failing path is the dequantize kernel.
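One way to check the dequantize step in isolation is to copy the kernel's output buffer back to the host and compare it with a scalar CPU reference. The sketch below is only a reference under stated assumptions, not the code in NripeshN/mlx: it assumes eight 4-bit values per `uint32` word with the lowest nibble first, one scale/bias pair per group, and `w = scale * q + bias`. The function name `dequantize_4bit_reference` is hypothetical, and it writes `float` rather than fp16/bf16 to stay self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for 4-bit affine dequantization: w = scale * q + bias.
// Assumed layout (not taken from the kernel source): eight 4-bit values per
// uint32 word, lowest nibble first, one scale/bias pair per group.
std::vector<float> dequantize_4bit_reference(
    const std::vector<std::uint32_t>& packed,
    const std::vector<float>& scales,
    const std::vector<float>& biases,
    std::size_t group_size) {
  constexpr std::size_t kValuesPerWord = 8;
  assert(group_size > 0 && group_size % kValuesPerWord == 0);
  const std::size_t size = packed.size() * kValuesPerWord;
  assert(scales.size() * group_size >= size);
  assert(biases.size() == scales.size());

  std::vector<float> out(size);
  for (std::size_t i = 0; i < size; ++i) {
    const std::uint32_t word = packed[i / kValuesPerWord];
    // Extract the i-th nibble of the word as an unsigned value in [0, 15].
    const std::uint32_t q = (word >> (4 * (i % kValuesPerWord))) & 0xFu;
    const std::size_t group = i / group_size;
    out[i] = scales[group] * static_cast<float>(q) + biases[group];
  }
  return out;
}
```

If the GPU buffer differs from this reference before any GEMM has run, the corruption is in the dequantize kernel rather than in hipBLASLt.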

### The Bug

The `#pragma unroll` in `affine_dequantize_packed_kernel` triggers an LLVM 23 / hipcc 7.13 codegen bug on RDNA 3.5 that emits incorrectly optimized vectorized stores, corrupting the dequantized fp16/bf16 weight data.

This is the same root cause as the uint4 load fix in `qdequant.hpp` (commit e15fcef9 on NripeshN/mlx).

---

## Fix

[NripeshN/mlx PR #10](https://github.com/NripeshN/mlx/pull/10) fixes the kernel:

| Change | Detail |
|--------|--------|
| Remove `#pragma unroll` | LLVM 23 miscompiles vectorized stores |
| Explicit scalar stores | Same throughput, no miscompile |
| Boundary guard | `if (oindex + i >= size) break;` |
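The shape of these three changes can be sketched on the host as the work one thread does for one packed word. This is an illustration, not the PR's code: `store_word_scalar` is a hypothetical name, the nibble layout and `scale * q + bias` formula are assumptions, and `float` stands in for fp16/bf16. Only the guard expression `oindex + i >= size` is taken from the table above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side stand-in for the work one GPU thread does on one packed word.
// Hypothetical helper; `oindex` and `size` mirror the guard in the fix table.
inline void store_word_scalar(std::uint32_t word, float scale, float bias,
                              float* out, std::size_t oindex,
                              std::size_t size) {
  assert(out != nullptr);
  // Plain loop, deliberately without `#pragma unroll`.
  for (std::size_t i = 0; i < 8; ++i) {
    if (oindex + i >= size) break;  // boundary guard for a partial tail
    const std::uint32_t q = (word >> (4 * i)) & 0xFu;
    out[oindex + i] = scale * static_cast<float>(q) + bias;  // one scalar store
  }
}
```

The guard matters when `size` is not a multiple of eight: the last word is only partially written, and elements past `size` are left untouched.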

### Immediate Workaround

```
MLX_ROCM_QMM_DEQUANT_GEMM=0
```

Setting this environment variable disables the dequant+GEMM path and falls back to the QMV tiled kernel, which has already been fixed.

---

## Related

- [NripeshN/mlx PR #10](https://github.com/NripeshN/mlx/pull/10) — kernel fix
- [PR #35](https://github.com/lemonade-sdk/lemon-mlx-engine/pull/35) — ROCm 4-bit quantization issue analysis
- [Issue #36](https://github.com/lemonade-sdk/lemon-mlx-engine/issues/36) — ROCm 4-bit inference investigation
