Skip to content

P0: 4-bit quantized inference produces incorrect output on RDNA 3.5 (gfx1151) — affine_dequantize_packed_kernel miscompile #37

@antmikinka

Description

@antmikinka

Problem

4-bit quantized models on ROCm RDNA 3.5 (gfx1151) produce garbage inference output. FP16, CPU, and macOS all pass with the same weights.

Symptom

Prompt: "What is 2+2? Reply with just the number."
Expected: 4
Actual: "nipple 2 2 2 2" (repeating garbage)

Server log:

[hipBLASLt] first call
[hipBLASLt] M=13 N=248320 K=1024 ta=0 tb=1 lda=1024 ldb=1024 ldc=248320

Affected Models

  • Qwen3.5-0.8B-4bit
  • Qwen3-Coder-30B-A3B-Instruct-4bit
  • Gemma-4-26B/31B-it-4bit
  • Qwen3.6-35B-A3B-4bit
  • All MoE 4-bit models on ROCm gfx1151

Unaffected

Configuration Status
FP16 on ROCm gfx1151 PASS
4-bit on CPU PASS
4-bit on macOS (Apple Silicon) PASS

Root Cause

NOT in hipBLASLt — the corruption happens in affine_dequantize_packed_kernel (in NripeshN/mlx) BEFORE hipBLASLt touches the data.

Execution Path

  1. QuantizedMatmul::eval_gpu → dequant+GEMM path
  2. affine_dequantize() → launches affine_dequantize_packed_kernelcorruption here
  3. dequant_rocblas_gemm() → rocBLAS/hipBLASLt (receives already-corrupted data)

Proof

Model Path hipBLASLt? Works?
FP16 Matmul → gemm_rocblas Yes Yes
4-bit QMM → dequant → gemm_rocblas Yes No

Same hipBLASLt — FP16 works, 4-bit doesn't. The difference is the dequantize kernel.

The Bug

The #pragma unroll in affine_dequantize_packed_kernel triggers an LLVM 23 / hipcc 7.13 codegen bug on RDNA 3.5 that emits incorrectly optimized vectorized stores, corrupting the dequantized fp16/bf16 weight data.

Same root cause as the uint4 load fix in qdequant.hpp (commit e15fcef9 on NripeshN/mlx).


Fix

NripeshN/mlx PR #10 fixes the kernel:

Change Detail
Remove #pragma unroll LLVM 23 miscompiles vectorized stores
Explicit scalar stores Same throughput, no miscompile
Boundary guard if (oindex + i >= size) break;

Immediate Workaround

MLX_ROCM_QMM_DEQUANT_GEMM=0

This disables the dequant+GEMM path, falling back to QMV tiled kernel (already fixed).


Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions