Problem
4-bit quantized models on ROCm RDNA 3.5 (gfx1151) produce garbage inference output. FP16, CPU, and macOS all pass with the same weights.
Symptom
Prompt: "What is 2+2? Reply with just the number."
Expected: 4
Actual: "nipple 2 2 2 2" (repeating garbage)
Server log:
[hipBLASLt] first call
[hipBLASLt] M=13 N=248320 K=1024 ta=0 tb=1 lda=1024 ldb=1024 ldc=248320
Affected Models
- Qwen3.5-0.8B-4bit
- Qwen3-Coder-30B-A3B-Instruct-4bit
- Gemma-4-26B/31B-it-4bit
- Qwen3.6-35B-A3B-4bit
- All MoE 4-bit models on ROCm gfx1151
Unaffected
| Configuration |
Status |
| FP16 on ROCm gfx1151 |
PASS |
| 4-bit on CPU |
PASS |
| 4-bit on macOS (Apple Silicon) |
PASS |
Root Cause
NOT in hipBLASLt — the corruption happens in affine_dequantize_packed_kernel (in NripeshN/mlx) BEFORE hipBLASLt touches the data.
Execution Path
QuantizedMatmul::eval_gpu → dequant+GEMM path
affine_dequantize() → launches affine_dequantize_packed_kernel ← corruption here
dequant_rocblas_gemm() → rocBLAS/hipBLASLt (receives already-corrupted data)
Proof
| Model |
Path |
hipBLASLt? |
Works? |
| FP16 |
Matmul → gemm_rocblas |
Yes |
Yes |
| 4-bit |
QMM → dequant → gemm_rocblas |
Yes |
No |
Same hipBLASLt — FP16 works, 4-bit doesn't. The difference is the dequantize kernel.
The Bug
The #pragma unroll in affine_dequantize_packed_kernel triggers an LLVM 23 / hipcc 7.13 codegen bug on RDNA 3.5 that emits incorrectly optimized vectorized stores, corrupting the dequantized fp16/bf16 weight data.
Same root cause as the uint4 load fix in qdequant.hpp (commit e15fcef9 on NripeshN/mlx).
Fix
NripeshN/mlx PR #10 fixes the kernel:
| Change |
Detail |
Remove #pragma unroll |
LLVM 23 miscompiles vectorized stores |
| Explicit scalar stores |
Same throughput, no miscompile |
| Boundary guard |
if (oindex + i >= size) break; |
Immediate Workaround
MLX_ROCM_QMM_DEQUANT_GEMM=0
This disables the dequant+GEMM path, falling back to QMV tiled kernel (already fixed).
Related
Problem
4-bit quantized models on ROCm RDNA 3.5 (gfx1151) produce garbage inference output. FP16, CPU, and macOS all pass with the same weights.
Symptom
Server log:
Affected Models
Unaffected
Root Cause
NOT in hipBLASLt — the corruption happens in
affine_dequantize_packed_kernel(in NripeshN/mlx) BEFORE hipBLASLt touches the data.Execution Path
QuantizedMatmul::eval_gpu→ dequant+GEMM pathaffine_dequantize()→ launchesaffine_dequantize_packed_kernel← corruption heredequant_rocblas_gemm()→ rocBLAS/hipBLASLt (receives already-corrupted data)Proof
Matmul → gemm_rocblasQMM → dequant → gemm_rocblasSame hipBLASLt — FP16 works, 4-bit doesn't. The difference is the dequantize kernel.
The Bug
The
#pragma unrollinaffine_dequantize_packed_kerneltriggers an LLVM 23 / hipcc 7.13 codegen bug on RDNA 3.5 that emits incorrectly optimized vectorized stores, corrupting the dequantized fp16/bf16 weight data.Same root cause as the uint4 load fix in
qdequant.hpp(commit e15fcef9 on NripeshN/mlx).Fix
NripeshN/mlx PR #10 fixes the kernel:
#pragma unrollif (oindex + i >= size) break;Immediate Workaround
This disables the dequant+GEMM path, falling back to QMV tiled kernel (already fixed).
Related