
RFC: Planar3 KV-cache quantization for compressed attention (experimental) #265

Open
hexxyan wants to merge 10 commits into antirez:main from hexxyan:codex/ds4-planarquant-rfc

@hexxyan hexxyan commented May 27, 2026

Thanks @antirez for ds4 — a remarkably clean and well-structured inference engine.

Summary

This PR adds an opt-in, experimental Planar3 KV-cache quantization based on the PlanarQuant approach. 2D Givens rotations decorrelate 128-dim blocks of KV cache rows, then Lloyd-Max 3-bit centroids quantize the rotated coefficients. For ds4's 512-dim heads, each compressed row is 200 bytes (4 blocks × 50 bytes) — a 5.12× density ratio vs FP16.

What's Included

Core Codec (ds4_planar_quant.c/h)

  • CPU quantize/dequantize with inverse Givens rotation
  • ds4_row_planar3 struct: 200 bytes per 512-dim row (4 × ds4_block_planar3 at 50 bytes each)
  • Subnormal-safe FP16 rounding
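
As a sanity check on the byte counts above, here is a hedged sketch of a layout that adds up to 50 bytes per block and 200 bytes per row. The struct and field names are hypothetical stand-ins for ds4_block_planar3 and ds4_row_planar3, and the bit order inside each byte is an assumption; only the field sizes (FP16 norm, 2-bit low indices, 1-bit high index) follow the commit messages.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t norm;      /* block norm as raw FP16 bits:            2 bytes */
    uint8_t  qs[32];    /* low 2 bits of each index, 4 per byte:  32 bytes */
    uint8_t  signs[16]; /* high bit of each index, 8 per byte:    16 bytes */
} sketch_block_planar3;

typedef struct {
    sketch_block_planar3 blocks[4]; /* 4 x 128 dims = one 512-dim row */
} sketch_row_planar3;

_Static_assert(sizeof(sketch_block_planar3) == 50, "block must be 50 bytes");
_Static_assert(sizeof(sketch_row_planar3) == 200, "row must be 200 bytes");

/* Store the 3-bit index idx for element i of a block. */
static void sketch_set_index(sketch_block_planar3 *b, int i, unsigned idx) {
    assert(i >= 0 && i < 128 && idx < 8);
    unsigned shift = (unsigned)(i % 4) * 2;
    unsigned bit = (unsigned)(i % 8);
    b->qs[i / 4] = (uint8_t)((b->qs[i / 4] & ~(3u << shift)) |
                             ((idx & 3u) << shift));
    b->signs[i / 8] = (uint8_t)((b->signs[i / 8] & ~(1u << bit)) |
                                ((idx >> 2) << bit));
}

/* Load the 3-bit index for element i of a block. */
static unsigned sketch_get_index(const sketch_block_planar3 *b, int i) {
    assert(i >= 0 && i < 128);
    unsigned lo = (b->qs[i / 4] >> ((i % 4) * 2)) & 3u;
    unsigned hi = (b->signs[i / 8] >> (i % 8)) & 1u;
    return lo | (hi << 2);
}
```

With this layout 1024 bytes of FP16 shrink to 200 bytes, which is where the 5.12× ratio comes from.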

Metal GPU Path

  • GPU quantize kernel (kernel_planar3_quantize_row): 128 cooperating threads per row
  • Planar3→F16 dequant kernel (kernel_planar3_dequant_to_f16_rows) — pre-stage for FlashAttention
  • Indexed attention: cooperative dequant inline in kernel_dsv4_indexed_mixed_attention_heads8 and _rb16
  • FlashAttention: Planar3→F16 pre-dequant in all FA encoder paths (gathered_heads, decode_mixed_batch, prefill_static_mixed nonvec+vec)

CPU Hot-Path

  • Per-layer planar_staging buffer for dequant (no per-token malloc)
  • comp_kv_for_attn() transparent dequant from Planar3

Planar-Only Mode (--planar-kv-cache-only)

  • Skips FP16/FP32 compressed cache allocation entirely
  • Only Planar3 compressed cache is allocated (actual memory savings)
  • Checkpoint save/restore: writes/reads Planar3 bytes directly
  • Implies --planar-kv-cache

Checkpoint Compatibility

  • Quantize on restore (dual-cache mode): read back the FP16/FP32 comp cache → CPU quantize → upload Planar3 to the GPU
  • Planar-only: read/write Planar3 directly, no FP16/FP32 staging

Offline Quality Evaluator (tools/planar_eval)

  • Random-normal, worst-case, and file-based test modes
  • Cosine similarity, MSE, per-row and batch metrics

Tests (tests/planar_quant_test.c)

  • Block size, roundtrip (random/basis/large-norm), batch, compression ratio, block independence, dim mismatch, zero-norm, single-element — 10/10 passing

Status

Experimental / RFC. The implementation is complete, and static and unit verification pass (10/10 unit tests). The codec round-trips random vectors at an average cosine similarity of about 0.985. End-to-end quality validation on a real 80GB+ DS4 model with compressed KV dumps is still needed before this can be recommended for production use.

Memory Savings

With --planar-kv-cache-only, each compressed KV row is 200 bytes instead of 1024 bytes (FP16). The dual-cache overhead is eliminated — only Planar3 bytes are stored persistently. FA paths use a transient F16 scratch buffer for dequant.
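
The difference between the two flags is easiest to see as per-row arithmetic. The sketch below uses hypothetical names and assumes the FP16 variant of the compressed cache (an FP32 cache would double the 1024-byte term); it counts persistent bytes only and ignores the transient F16 scratch buffer.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_cache_mode {
    SKETCH_FP16_ONLY,   /* neither flag set */
    SKETCH_DUAL_CACHE,  /* --planar-kv-cache: FP16 cache plus Planar3 */
    SKETCH_PLANAR_ONLY  /* --planar-kv-cache-only: Planar3 only */
};

/* Persistent bytes stored per compressed 512-dim KV row. */
static size_t sketch_persistent_row_bytes(enum sketch_cache_mode mode) {
    const size_t fp16_row = 512 * 2;   /* 1024 bytes */
    const size_t planar_row = 4 * 50;  /*  200 bytes */
    switch (mode) {
    case SKETCH_FP16_ONLY:   return fp16_row;
    case SKETCH_DUAL_CACHE:  return fp16_row + planar_row;
    case SKETCH_PLANAR_ONLY: return planar_row;
    }
    assert(!"unknown cache mode");
    return 0;
}

/* Persistent bytes for a cache holding `rows` compressed rows. */
static size_t sketch_persistent_cache_bytes(enum sketch_cache_mode mode,
                                            size_t rows) {
    return rows * sketch_persistent_row_bytes(mode);
}
```

Under these assumptions the dual-cache mode stores 1224 bytes per row, more than the 1024-byte baseline, so only the planar-only mode reduces memory.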

Usage

# Enable Planar3 quantization (keeps FP16/FP32 cache too)
./ds4 -m model.bin --planar-kv-cache -p "Hello"

# Planar-only mode (actual memory savings)
./ds4 -m model.bin --planar-kv-cache-only -p "Hello"

References & Related Work

Planar3 sits within a family of rotation-based KV-cache compression methods that reduce the d×d random orthogonal projection (originally from TurboQuant) to progressively lighter structures:

Core Methods

| Method | Rotation | Params/row | FMA/row | PPL (Llama 3.1 8B, 3-bit) | Source |
|---|---|---|---|---|---|
| TurboQuant | Dense d×d Walsh-Hadamard | 16,384 | ~16K | 7.07 | Google, ICLR 2026 |
| RotorQuant | Cl(3,0) Clifford SO(3) rotor | 372 | ~2,400 | 7.05* | scrya-com/rotorquant |
| IsoQuant | 4D quaternion SO(4) isoclinic | 256 | 1,024 | 6.91 | ParaMind2025/isoquant, arXiv:2603.28430 |
| PlanarQuant (= IsoQuant-2D) | 2D Givens rotation | 128 | 256 | 7.05 | experolk/planar-llama |

* RotorQuant PPL from production llama.cpp benchmarks on RTX 5090.

Evolution: TurboQuant (dense WHT) → RotorQuant (3D Clifford, by John D. Pope) → IsoQuant (4D quaternion) / PlanarQuant (2D Givens, both by Zhongping Ji / ParaMind2025). RotorQuant's README credits ParaMind2025 for designing PlanarQuant and IsoQuant.

PlanarQuant is the lightest variant in the table (128 params and 256 FMAs per row), which makes it the most natural fit for fused GPU kernels in production inference engines. Its reported perplexity (7.05) is close to TurboQuant's (7.07) at roughly 1/64 of the FMAs per row.

Community Projects

  • Multi-TurboQuant — Unified Python toolkit with 12 methods including all above, plus community Metal/MLX kernels.
  • HybridQuant — PlanarQuant + semantic activation indexing for layer-adaptive compression.
  • RotorQuant llama.cpp integration — Production reference for TurboQuant/RotorQuant/IsoQuant in llama.cpp with RTX 5090 benchmarks.

@hexxyan hexxyan changed the title RFC: add Planar3 KV-cache quantization prototype and offline evaluator RFC: Planar3 KV-cache quantization for compressed attention (experimental) May 27, 2026
@hexxyan hexxyan force-pushed the codex/ds4-planarquant-rfc branch from 51f333c to a192722 May 27, 2026 17:31
hexxyan added 10 commits May 28, 2026 01:35
Pure C reference implementation of PlanarQuant (2D Givens rotation +
Lloyd-Max 3-bit) for ds4's head_dim=512, adapted from
experolk/planar-llama. Standalone — no ggml dependency.

Block layout: 50 bytes per 128-dim block (norm FP16 + 2-bit indices +
1-bit QJL signs). Four blocks per 512-dim row = 200 bytes (5.12x vs
FP16).

d=512 roundtrip quality baseline (random vectors):
  cosine avg=0.985, MSE avg=0.010, norm preservation <0.03% error

This is an RFC prototype for offline quality evaluation. Not yet wired
into ds4's hot path or Metal/CUDA attention kernels.

Reference: ParaMind2025 PlanarQuant, RotorQuant paper
- ds4_planar3_dequantize: use n_per_row for output stride instead of
  hardcoded 512; add assert(n_per_row == 512) in both batch functions
- Rotation parameters: clarify "64 pairs per block, reused across 4
  blocks" instead of misleading "256 pairs"
- Block signs field: correct comment from "QJL signs" to "high bit of
  3-bit centroid index"
Standalone CLI tool that evaluates Planar3 quantization quality on
synthetic KV-cache-like data. Supports four distribution modes:
random_normal, random_uniform, sparse, ds4_realistic.

Reports cosine similarity, MSE, max element error, relative norm error,
and attention score drift (Pearson correlation + top-1 agreement).

d=512 quality baseline (10K rows, ds4_realistic mode):
  cosine: mean=0.981 p99=0.986 min=0.967
  MSE:    mean=4.57e-02
  norm error: mean=3.17e-04 (< 0.04%)
  attention score corr: 0.981, top-1 preserved

Makefile targets: planar-eval, planar-quant-test.
- ds4.h: add planar_kv_cache and dump_comp_kv to engine options
- ds4.c: dual-cache strategy (FP32 + Planar3), comp_kv_for_attn()
  dequant helper, conditional planar staging in decode scratch,
  checkpoint load quantizes FP32→Planar3, dump-comp-kv tool
- ds4_cli.c: --planar-kv-cache and --dump-comp-kv flags
- ds4_planar_quant.c/h: soft asserts for dim contract, attribution
- metal/dsv4_misc.metal: Planar3 centroids/tables and dequant helper
  for Phase 2 inline attention dequant
- Makefile: ds4_planar_quant.o in all build variants
- tests/planar_quant_test.c: dim-mismatch edge case test
- tools/planar_eval.c: ds4_like mode, multi-query eval improvements
…tize

- metal/dsv4_misc.metal: rename pad0->comp_kv_planar in args struct,
  add cooperative Planar3 dequant path in both indexed attention kernels
  (heads8 and rb16), add kernel_planar3_quantize_row for GPU-side
  FP32->Planar3 conversion with midpoint-based fast quantization
- ds4_metal.m: thread comp_kv_planar through indexed mixed attention
  bridge, add ds4_gpu_planar3_quantize_tensor dispatch function,
  register planar3 quantize pipeline
- ds4_gpu.h: add comp_kv_planar param and planar3 quantize declaration
- ds4.c: allocate Planar3 GPU cache tensors per layer (layer_attn_comp_planar),
  add planar_kv_cache param through metal_graph_alloc_raw_cap, wire
  new parameter through all call sites
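
One common reading of "midpoint-based fast quantization" is sketched below in C rather than Metal: with sorted centroids, the nearest centroid can be found by counting how many midpoints between adjacent centroids the value exceeds, which avoids a distance search and maps well to branch-free GPU code. Whether kernel_planar3_quantize_row does exactly this is an assumption; the names and the centroid table (standard unit-Gaussian Lloyd-Max levels) are illustrative.

```c
#include <assert.h>
#include <math.h>

/* Sorted 8-level centroid table (assumed values, unit-variance Gaussian). */
static const float sketch_levels[8] = {
    -2.1520f, -1.3439f, -0.7560f, -0.2451f,
     0.2451f,  0.7560f,  1.3439f,  2.1520f
};

/* Midpoint method: the index equals the number of midpoints below x. */
static int sketch_quantize_midpoint(float x) {
    assert(!isnan(x));
    int idx = 0;
    for (int i = 0; i < 7; i++) {
        float mid = 0.5f * (sketch_levels[i] + sketch_levels[i + 1]);
        idx += (x > mid);
    }
    return idx;
}

/* Reference method: explicit nearest-centroid search. */
static int sketch_quantize_search(float x) {
    int best = 0;
    float best_d = fabsf(x - sketch_levels[0]);
    for (int i = 1; i < 8; i++) {
        float d = fabsf(x - sketch_levels[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```

Both functions return the same index for any input that is not exactly on a midpoint; the midpoint form needs only seven comparisons and no absolute values.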
- Add metal_graph_quantize_attn_comp_planar() helper that reads from
  FP32 staging (F16 path) or FP32 cache, quantizes to Planar3 on GPU
- Call after all 4 compressed-row commit sites (single-row decode,
  batch prefill, chunked prefill, batch single-row)
- Add metal_graph_attn_comp_for_attention() / metal_graph_attn_comp_is_planar()
  helpers to select Planar3 cache when enabled
- Pass Planar3 cache and flag to all 4 indexed attention dispatch sites
- Rebuild Planar3 GPU cache after checkpoint restore (read-back via
  CPU, quantize, upload) for both F16 and FP32 comp cache paths
- Add Planar3 tensor to layer allocation validation check
P0: Move planar_staging from ds4_cpu_decode_scratch to ds4_layer_cache,
    eliminating per-token xmalloc/free in CPU decode and prefill hot-paths.
P0: Optimize _rb16 Planar3 dequant to use all 256 threads (2 rows/iteration).
P1: Add staging capacity guard in metal_graph_quantize_attn_comp_planar.
P1: Fix fp16_to_fp32 subnormal and infinity handling.
P1: Document ds4_session_dump_comp_kv single-layer limitation.
P2: Remove unused kv_cache_uses_planar (now per-layer staging).
P2: Add zero-norm and single-element edge-case tests.
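
For context on the fp16_to_fp32 fix above, a conversion that handles subnormals and infinities explicitly can look like the sketch below. It is a generic IEEE 754 binary16 decoder under a hypothetical name, not the PR's function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Decode raw IEEE 754 binary16 bits into a float. */
static float sketch_fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t man  = (uint32_t)h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                      /* signed zero */
        } else {
            /* Subnormal half: shift the mantissa up until the implicit
             * bit appears, lowering the exponent once per shift. */
            uint32_t shift = 0;
            while ((man & 0x400u) == 0) { man <<= 1; shift++; }
            man &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13); /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* normal */
    }

    float f;
    memcpy(&f, &bits, sizeof f);  /* bit cast without aliasing issues */
    return f;
}
```

The constants 112 and 113 come from the exponent bias difference (127 - 15) and the subnormal exponent offset of one. The smallest subnormal half (0x0001) decodes to 2^-24, and 0x7C00 decodes to positive infinity.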
Add comp_kv_planar parameter to all three FA wrapper functions
(decode_heads, decode_mixed_batch, prefill_static_mixed) and pass
Planar3→F16 dequant path through the internal dispatch chain.

FA encoder functions now branch: if comp_kv_planar, use the new
kernel_planar3_dequant_to_f16_rows to decompress directly into
g_flash_attn_kv_buffer; otherwise use existing copy_to_f16 path.

ds4.c dispatch sites use metal_graph_attn_comp_for_attention() and
metal_graph_attn_comp_is_planar() to select the correct tensor.
When --planar-kv-cache-only is enabled (implies --planar-kv-cache):
- GPU: skips layer_attn_comp_cache allocation, only allocates Planar3
- CPU: skips attn_comp_kv allocation, quantizes directly to Planar3
- Checkpoint save: writes Planar3 bytes directly instead of FP16/FP32
- Checkpoint restore: reads Planar3 directly, no CPU quantize step
- metal_graph_store_attn_comp_stage: returns true when comp cache is NULL
- kv_cache_push_comp: handles NULL rows pointer (planar-only path)

This eliminates the dual-cache memory overhead. Combined with the FA
Planar3→F16 pre-dequant path, all attention paths work without the
persistent FP16/FP32 compressed cache.
@hexxyan hexxyan force-pushed the codex/ds4-planarquant-rfc branch from a192722 to 381959a May 27, 2026 17:35