RFC: Planar3 KV-cache quantization for compressed attention (experimental) by hexxyan · Pull Request #265 · antirez/ds4

hexxyan · 2026-05-27T11:53:38Z

Thanks @antirez for ds4 — a remarkably clean and well-structured inference engine.

Summary

This PR adds an opt-in, experimental Planar3 KV-cache quantization based on the PlanarQuant approach. 2D Givens rotations decorrelate 128-dim blocks of KV cache rows, then Lloyd-Max 3-bit centroids quantize the rotated coefficients. For ds4's 512-dim heads, each compressed row is 200 bytes (4 blocks × 50 bytes) — a 5.12× density ratio vs FP16.

What's Included

Core Codec (`ds4_planar_quant.c/h`)

CPU quantize/dequantize with inverse Givens rotation
ds4_row_planar3 struct: 200 bytes per 512-dim row (4 × ds4_block_planar3 at 50 bytes each)
Subnormal-safe FP16 rounding
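
To make the 50-byte and 200-byte figures concrete, here is one possible layout that adds up to them. Only the sizes come from this PR; the struct and field names, the field order, and the bit-packing order are assumptions made for illustration (the real definitions live in `ds4_planar_quant.h`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_DIM  128 /* values per block */
#define SKETCH_ROW_DIM    512 /* ds4 head dimension */
#define SKETCH_ROW_BLOCKS (SKETCH_ROW_DIM / SKETCH_BLOCK_DIM)

/* Hypothetical layout: 2 + 32 + 16 = 50 bytes per block. */
typedef struct {
    uint16_t norm;                         /* block norm as FP16 bits */
    uint8_t  idx_lo[SKETCH_BLOCK_DIM / 4]; /* low 2 bits of each 3-bit index */
    uint8_t  idx_hi[SKETCH_BLOCK_DIM / 8]; /* high bit of each 3-bit index */
} sketch_block_planar3;

typedef struct {
    sketch_block_planar3 blocks[SKETCH_ROW_BLOCKS];
} sketch_row_planar3;

_Static_assert(sizeof(sketch_block_planar3) == 50, "block must be 50 bytes");
_Static_assert(sizeof(sketch_row_planar3) == 200, "row must be 200 bytes");

/* Store the 3-bit centroid index of element i (0..127). */
static void sketch_set_index(sketch_block_planar3 *b, int i, unsigned idx) {
    assert(i >= 0 && i < SKETCH_BLOCK_DIM && idx < 8);
    int sh  = (i & 3) * 2; /* 4 two-bit fields per byte */
    int bit = i & 7;       /* 8 one-bit fields per byte */
    b->idx_lo[i >> 2] = (uint8_t)((b->idx_lo[i >> 2] & ~(3u << sh)) |
                                  ((idx & 3u) << sh));
    b->idx_hi[i >> 3] = (uint8_t)((b->idx_hi[i >> 3] & ~(1u << bit)) |
                                  (((idx >> 2) & 1u) << bit));
}

/* Load the 3-bit centroid index of element i. */
static unsigned sketch_get_index(const sketch_block_planar3 *b, int i) {
    unsigned lo = (b->idx_lo[i >> 2] >> ((i & 3) * 2)) & 3u;
    unsigned hi = (b->idx_hi[i >> 3] >> (i & 7)) & 1u;
    return lo | (hi << 2);
}
```

The static assertions fail the build if padding ever changes the size, which matters because planar-only checkpoints write these bytes directly.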

Metal GPU Path

GPU quantize kernel (kernel_planar3_quantize_row) — cooperative 128-thread per row
Planar3→F16 dequant kernel (kernel_planar3_dequant_to_f16_rows) — pre-stage for FlashAttention
Indexed attention: cooperative dequant inline in kernel_dsv4_indexed_mixed_attention_heads8 and _rb16
FlashAttention: Planar3→F16 pre-dequant in all FA encoder paths (gathered_heads, decode_mixed_batch, prefill_static_mixed nonvec+vec)

CPU Hot-Path

Per-layer planar_staging buffer for dequant (no per-token malloc)
comp_kv_for_attn() transparent dequant from Planar3
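
The "no per-token malloc" point can be illustrated with a grow-only scratch buffer owned by the layer cache. This is a sketch of the general pattern only: the struct, field, and function names below are hypothetical, and the PR may size the buffer once at allocation time rather than growing it.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-layer state; not the PR's ds4_layer_cache. */
typedef struct {
    float *planar_staging; /* reused dequant scratch */
    size_t staging_cap;    /* capacity in floats */
} sketch_layer_cache;

/* Return a scratch buffer holding at least n_floats values. The buffer is
 * reallocated only when the request exceeds the current capacity, so a
 * steady-state decode loop performs no allocation per token. */
static float *sketch_staging_reserve(sketch_layer_cache *lc, size_t n_floats) {
    if (n_floats > lc->staging_cap) {
        float *p = realloc(lc->planar_staging, n_floats * sizeof(float));
        if (p == NULL) return NULL; /* old buffer stays valid and owned by lc */
        lc->planar_staging = p;
        lc->staging_cap = n_floats;
    }
    return lc->planar_staging;
}

/* Release the scratch buffer when the layer cache is torn down. */
static void sketch_staging_free(sketch_layer_cache *lc) {
    free(lc->planar_staging);
    lc->planar_staging = NULL;
    lc->staging_cap = 0;
}
```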

Planar-Only Mode (`--planar-kv-cache-only`)

Skips FP16/FP32 compressed cache allocation entirely
Only Planar3 compressed cache is allocated (actual memory savings)
Checkpoint save/restore: writes/reads Planar3 bytes directly
Implies --planar-kv-cache

Checkpoint Compatibility

Restore (dual-cache mode): read back the FP16/FP32 comp cache → quantize on CPU → upload Planar3 to the GPU cache
Planar-only: read/write Planar3 directly, no FP16/FP32 staging

Offline Quality Evaluator (`tools/planar_eval`)

Random-normal, worst-case, and file-based test modes
Cosine similarity, MSE, per-row and batch metrics

Tests (`tests/planar_quant_test.c`)

Block size, roundtrip (random/basis/large-norm), batch, compression ratio, block independence, dim mismatch, zero-norm, single-element — 10/10 passing

Status

Experimental / RFC. Implementation and static/unit verification are solid. The codec round-trips random vectors at cosine similarity ~0.985 avg. End-to-end quality validation on a real 80GB+ DS4 model with compressed KV dumps is needed before recommending production use.

Memory Savings

With --planar-kv-cache-only, each compressed KV row is 200 bytes instead of 1024 bytes (FP16). The dual-cache overhead is eliminated — only Planar3 bytes are stored persistently. FA paths use a transient F16 scratch buffer for dequant.
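
As a worked version of the arithmetic above, the following sketch computes persistent compressed-cache bytes for the three configurations this PR describes, assuming an FP16 baseline. The enum and function names are hypothetical, and the transient F16 scratch buffer used by the FA paths is deliberately not counted.

```c
#include <stdint.h>

enum {
    SKETCH_ROW_DIM           = 512,                /* ds4 head dimension */
    SKETCH_FP16_ROW_BYTES    = SKETCH_ROW_DIM * 2, /* 1024 bytes */
    SKETCH_PLANAR3_ROW_BYTES = 200                 /* 4 blocks x 50 bytes */
};

typedef enum {
    SKETCH_FP16_ONLY,   /* no Planar3 */
    SKETCH_DUAL,        /* --planar-kv-cache: FP16 cache kept alongside Planar3 */
    SKETCH_PLANAR_ONLY  /* --planar-kv-cache-only */
} sketch_cache_mode;

/* Persistent bytes needed for n_rows compressed KV rows in the given mode. */
static uint64_t sketch_comp_cache_bytes(uint64_t n_rows, sketch_cache_mode mode) {
    switch (mode) {
    case SKETCH_FP16_ONLY:
        return n_rows * SKETCH_FP16_ROW_BYTES;
    case SKETCH_DUAL:
        return n_rows * (SKETCH_FP16_ROW_BYTES + SKETCH_PLANAR3_ROW_BYTES);
    case SKETCH_PLANAR_ONLY:
        return n_rows * SKETCH_PLANAR3_ROW_BYTES;
    }
    return 0;
}
```

Under these assumptions the dual-cache mode stores 1224 bytes per row, more than the FP16 baseline, which is why only the planar-only mode yields actual savings.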

Usage

# Enable Planar3 quantization (keeps FP16/FP32 cache too)
./ds4 -m model.bin --planar-kv-cache -p "Hello"

# Planar-only mode (actual memory savings)
./ds4 -m model.bin --planar-kv-cache-only -p "Hello"

References & Related Work

Planar3 sits within a family of rotation-based KV-cache compression methods that reduce the d×d random orthogonal projection (originally from TurboQuant) to progressively lighter structures:

Core Methods

| Method | Rotation | Params/row | FMA/row | PPL (Llama 3.1 8B, 3-bit) | Source |
| --- | --- | --- | --- | --- | --- |
| TurboQuant | Dense d×d Walsh-Hadamard | 16,384 | ~16K | 7.07 | Google, ICLR 2026 |
| RotorQuant | Cl(3,0) Clifford SO(3) rotor | 372 | ~2,400 | 7.05* | scrya-com/rotorquant |
| IsoQuant | 4D quaternion SO(4) isoclinic | 256 | 1,024 | 6.91 | ParaMind2025/isoquant, arXiv:2603.28430 |
| PlanarQuant (= IsoQuant-2D) | 2D Givens rotation | 128 | 256 | 7.05 | experolk/planar-llama |

* RotorQuant PPL from production llama.cpp benchmarks on RTX 5090.

Evolution: TurboQuant (dense WHT) → RotorQuant (3D Clifford, by John D. Pope) → IsoQuant (4D quaternion) / PlanarQuant (2D Givens, both by Zhongping Ji / ParaMind2025). RotorQuant's README credits ParaMind2025 for designing PlanarQuant and IsoQuant.

PlanarQuant is the lightest variant — 128 params, 256 FMAs — making it the best fit for fused GPU kernels in production inference engines. Its quality is competitive with TurboQuant at a fraction of the compute.

Community Projects

Multi-TurboQuant — Unified Python toolkit with 12 methods including all above, plus community Metal/MLX kernels.
HybridQuant — PlanarQuant + semantic activation indexing for layer-adaptive compression.
RotorQuant llama.cpp integration — Production reference for TurboQuant/RotorQuant/IsoQuant in llama.cpp with RTX 5090 benchmarks.

Pure C reference implementation of PlanarQuant (2D Givens rotation + Lloyd-Max 3-bit) for ds4's head_dim=512, adapted from experolk/planar-llama. Standalone, no ggml dependency.

- Block layout: 50 bytes per 128-dim block (norm FP16 + 2-bit indices + 1-bit QJL signs).
- Four blocks per 512-dim row = 200 bytes (5.12x vs FP16).
- d=512 roundtrip quality baseline (random vectors): cosine avg=0.985, MSE avg=0.010, norm preservation <0.03% error.

This is an RFC prototype for offline quality evaluation. Not yet wired into ds4's hot path or Metal/CUDA attention kernels.

Reference: ParaMind2025 PlanarQuant, RotorQuant paper

- ds4_planar3_dequantize: use n_per_row for output stride instead of hardcoded 512; add assert(n_per_row == 512) in both batch functions
- Rotation parameters: clarify "64 pairs per block, reused across 4 blocks" instead of misleading "256 pairs"
- Block signs field: correct comment from "QJL signs" to "high bit of 3-bit centroid index"

Standalone CLI tool that evaluates Planar3 quantization quality on synthetic KV-cache-like data. Supports four distribution modes: random_normal, random_uniform, sparse, ds4_realistic. Reports cosine similarity, MSE, max element error, relative norm error, and attention score drift (Pearson correlation + top-1 agreement).

d=512 quality baseline (10K rows, ds4_realistic mode):

- cosine: mean=0.981 p99=0.986 min=0.967
- MSE: mean=4.57e-02
- norm error: mean=3.17e-04 (< 0.04%)
- attention score corr: 0.981, top-1 preserved

Makefile targets: planar-eval, planar-quant-test.

- ds4.h: add planar_kv_cache and dump_comp_kv to engine options
- ds4.c: dual-cache strategy (FP32 + Planar3), comp_kv_for_attn() dequant helper, conditional planar staging in decode scratch, checkpoint load quantizes FP32→Planar3, dump-comp-kv tool
- ds4_cli.c: --planar-kv-cache and --dump-comp-kv flags
- ds4_planar_quant.c/h: soft asserts for dim contract, attribution
- metal/dsv4_misc.metal: Planar3 centroids/tables and dequant helper for Phase 2 inline attention dequant
- Makefile: ds4_planar_quant.o in all build variants
- tests/planar_quant_test.c: dim-mismatch edge case test
- tools/planar_eval.c: ds4_like mode, multi-query eval improvements

…tize

- metal/dsv4_misc.metal: rename pad0->comp_kv_planar in args struct, add cooperative Planar3 dequant path in both indexed attention kernels (heads8 and rb16), add kernel_planar3_quantize_row for GPU-side FP32->Planar3 conversion with midpoint-based fast quantization
- ds4_metal.m: thread comp_kv_planar through indexed mixed attention bridge, add ds4_gpu_planar3_quantize_tensor dispatch function, register planar3 quantize pipeline
- ds4_gpu.h: add comp_kv_planar param and planar3 quantize declaration
- ds4.c: allocate Planar3 GPU cache tensors per layer (layer_attn_comp_planar), add planar_kv_cache param through metal_graph_alloc_raw_cap, wire new parameter through all call sites

- Add metal_graph_quantize_attn_comp_planar() helper that reads from FP32 staging (F16 path) or FP32 cache, quantizes to Planar3 on GPU
- Call after all 4 compressed-row commit sites (single-row decode, batch prefill, chunked prefill, batch single-row)
- Add metal_graph_attn_comp_for_attention() / metal_graph_attn_comp_is_planar() helpers to select Planar3 cache when enabled
- Pass Planar3 cache and flag to all 4 indexed attention dispatch sites
- Rebuild Planar3 GPU cache after checkpoint restore (read-back via CPU, quantize, upload) for both F16 and FP32 comp cache paths
- Add Planar3 tensor to layer allocation validation check

- P0: Move planar_staging from ds4_cpu_decode_scratch to ds4_layer_cache, eliminating per-token xmalloc/free in CPU decode and prefill hot-paths.
- P0: Optimize _rb16 Planar3 dequant to use all 256 threads (2 rows/iteration).
- P1: Add staging capacity guard in metal_graph_quantize_attn_comp_planar.
- P1: Fix fp16_to_fp32 subnormal and infinity handling.
- P1: Document ds4_session_dump_comp_kv single-layer limitation.
- P2: Remove unused kv_cache_uses_planar (now per-layer staging).
- P2: Add zero-norm and single-element edge-case tests.
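
For context on the fp16_to_fp32 fix, a conversion that covers zeros, subnormals, infinities and NaNs typically looks like the sketch below. It is a generic IEEE 754 binary16 to binary32 widening written for illustration; the function name is hypothetical and the PR's implementation may be structured differently.

```c
#include <stdint.h>
#include <string.h>

/* Widen IEEE 754 binary16 bits to a binary32 float. */
static float sketch_fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign; /* signed zero */
        } else {
            /* Subnormal: shift the mantissa left until the implicit leading
             * one appears, lowering the exponent by one per extra shift. */
            int e = -1;
            do { man <<= 1; e++; } while ((man & 0x400u) == 0);
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13); /* infinity or NaN */
    } else {
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13); /* normal */
    }

    float f;
    memcpy(&f, &bits, sizeof f); /* type-pun without aliasing issues */
    return f;
}
```

The subnormal branch is the part that is easy to get wrong: treating such inputs as zero silently drops very small block norms.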

Add comp_kv_planar parameter to all three FA wrapper functions (decode_heads, decode_mixed_batch, prefill_static_mixed) and pass Planar3→F16 dequant path through the internal dispatch chain. FA encoder functions now branch: if comp_kv_planar, use the new kernel_planar3_dequant_to_f16_rows to decompress directly into g_flash_attn_kv_buffer; otherwise use existing copy_to_f16 path. ds4.c dispatch sites use metal_graph_attn_comp_for_attention() and metal_graph_attn_comp_is_planar() to select the correct tensor.

When --planar-kv-cache-only is enabled (implies --planar-kv-cache):

- GPU: skips layer_attn_comp_cache allocation, only allocates Planar3
- CPU: skips attn_comp_kv allocation, quantizes directly to Planar3
- Checkpoint save: writes Planar3 bytes directly instead of FP16/FP32
- Checkpoint restore: reads Planar3 directly, no CPU quantize step
- metal_graph_store_attn_comp_stage: returns true when comp cache is NULL
- kv_cache_push_comp: handles NULL rows pointer (planar-only path)

This eliminates the dual-cache memory overhead. Combined with the FA Planar3→F16 pre-dequant path, all attention paths work without the persistent FP16/FP32 compressed cache.

hexxyan changed the title ~~RFC: add Planar3 KV-cache quantization prototype and offline evaluator~~ RFC: Planar3 KV-cache quantization for compressed attention (experimental) May 27, 2026

hexxyan force-pushed the codex/ds4-planarquant-rfc branch from 51f333c to a192722 Compare May 27, 2026 17:31

hexxyan added 10 commits May 28, 2026 01:35

Add Planar3→F16 dequant kernel for FlashAttention pre-stage

31df1f1

hexxyan force-pushed the codex/ds4-planarquant-rfc branch from a192722 to 381959a Compare May 27, 2026 17:35

hexxyan wants to merge 10 commits into antirez:main from hexxyan:codex/ds4-planarquant-rfc
