Add HISA hierarchical indexer for long-context decode #258

Open
TheTom wants to merge 1 commit into antirez:main from TheTom:hisa-indexer

TheTom commented May 26, 2026

Adds a HISA-style hierarchical indexer (arxiv 2603.28458) to the
decode-token path of the existing per-layer compressed indexer. The
flat indexer walks every compressed row for each decoded token, which
becomes the dominant decode cost at long context. HISA replaces it with
a block-coarse pass over mean-pooled block representatives, a top-m
block selection, and a token-refine pass restricted to the selected
blocks.

Output uses the same per-row scores layout (non-candidate rows are
-INF) so the existing top-K kernel runs unchanged. No new flags;
the dispatch decides at runtime which indexer to use.
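
A minimal CPU sketch of the two-stage walk described above, for orientation only. It is not the CUDA code in this PR: the names `hisa_block_reps`, `hisa_score`, and `HISA_BLOCK` are hypothetical, and a plain dot product stands in for the real per-row indexer score. It shows the mean-pooled representatives, the coarse block scores, a simple top-m selection, and the refine pass that leaves every non-candidate row at -INF.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

enum { HISA_BLOCK = 128 };  /* rows per block, as stated in the PR text */

/* Plain dot product; an assumed stand-in for the real per-row score. */
static float dot(const float *a, const float *b, int dim) {
    float s = 0.0f;
    for (int i = 0; i < dim; i++) s += a[i] * b[i];
    return s;
}

/* Mean-pool each block of rows into one representative.
 * rows is n_rows x dim, reps is n_blocks x dim; the last block may be partial. */
static void hisa_block_reps(const float *rows, int n_rows, int dim, float *reps) {
    int n_blocks = (n_rows + HISA_BLOCK - 1) / HISA_BLOCK;
    for (int b = 0; b < n_blocks; b++) {
        int lo = b * HISA_BLOCK;
        int hi = lo + HISA_BLOCK < n_rows ? lo + HISA_BLOCK : n_rows;
        for (int d = 0; d < dim; d++) {
            float acc = 0.0f;
            for (int r = lo; r < hi; r++) acc += rows[(size_t)r * dim + d];
            reps[(size_t)b * dim + d] = acc / (float)(hi - lo);
        }
    }
}

/* Coarse block scores, top-m block selection, then token refine.
 * scores (n_rows floats) ends up -INFINITY outside the selected blocks.
 * block_scores is caller-provided scratch of n_blocks floats. */
static void hisa_score(const float *q, const float *rows, const float *reps,
                       int n_rows, int dim, int top_m,
                       float *block_scores, float *scores) {
    int n_blocks = (n_rows + HISA_BLOCK - 1) / HISA_BLOCK;
    assert(top_m > 0);
    for (int r = 0; r < n_rows; r++) scores[r] = -INFINITY;
    for (int b = 0; b < n_blocks; b++)
        block_scores[b] = dot(q, reps + (size_t)b * dim, dim);
    int m = top_m < n_blocks ? top_m : n_blocks;
    for (int k = 0; k < m; k++) {
        /* Pick the best remaining block (simple O(m * n_blocks) selection). */
        int best = 0;
        for (int b = 1; b < n_blocks; b++)
            if (block_scores[b] > block_scores[best]) best = b;
        block_scores[best] = -INFINITY;  /* mark as taken */
        int lo = best * HISA_BLOCK;
        int hi = lo + HISA_BLOCK < n_rows ? lo + HISA_BLOCK : n_rows;
        for (int r = lo; r < hi; r++)
            scores[r] = dot(q, rows + (size_t)r * dim, dim);
    }
}
```

Because the output keeps one score per row, a downstream top-K over `scores` needs no knowledge of which stage produced each value.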

Runtime gate

HISA fires when n_index_comp >= 49152 (roughly 196K context at ratio
4). Below that the block-rep rebuild plus the top-m selection cost
exceed the refine savings, so the dispatch routes through the existing
flat path and behavior is identical to main.
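
The gate can be pictured as a single comparison. The sketch below is illustrative: `HISA_GATE_MIN_ROWS`, `hisa_should_run`, and `approx_index_comp` are hypothetical names, and only the 49152 threshold and the ratio of 4 come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold quoted in the PR text; the constant name is hypothetical. */
enum { HISA_GATE_MIN_ROWS = 49152 };

/* Approximate compressed-row count for a context length at a given
 * compression ratio. The measured counts in the bench run a few rows
 * higher than this estimate. */
static uint32_t approx_index_comp(uint32_t ctx_tokens, uint32_t ratio) {
    assert(ratio > 0);
    return ctx_tokens / ratio;
}

/* True when the hierarchical indexer should replace the flat scan. */
static bool hisa_should_run(uint32_t n_index_comp) {
    return n_index_comp >= HISA_GATE_MIN_ROWS;
}
```

With these definitions, `approx_index_comp(196608, 4)` lands exactly on the threshold, which is where the "roughly 196K context" figure comes from.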

Bench

GB10 (ASUS Ascent, sm_121, 128 GB), Qwen3.6-A3B IQ2XXS,
ds4-bench --backend cuda --kv-cache turbo3 --comp-cache turbo3 with the
inline-dequant comp_kv path on. Raw --csv from this branch is at
speed-bench/hisa/gb10_spark.csv:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
65536,65536,341.41,32,11.01,262379536
262144,262144,236.89,16,7.61,1004230672

| ctx | n_index_comp | HISA dispatch | gen_tps |
|---|---|---|---|
| 65536 | 16394 | dormant (under gate, flat indexer runs) | 11.01 |
| 262144 | 65542 | active (over gate, HISA runs) | 7.61 |

The 64K row confirms zero regression when the gate keeps HISA off;
the 256K row is the long-context point where HISA replaces the flat
scan. The parent-commit baseline at the same 256K,
--kv-cache turbo3 --comp-cache turbo3 config measured gen_tps = 7.47 in
the same session, so the on/off delta at 256K is +1.9% (7.61 vs 7.47).
Companion before/after CSVs across fp8, turbo4, and the canonical
--gen-tokens 128 --step-incr 16384 sweep are queued and will be added
to speed-bench/hisa/.

prefill_tps is unchanged at both ctx points; HISA is a decode-token
optimization and the prefill batched-attention path is untouched.

Quality

Teacher-forced PPL on the same model and prompt over 64 tokens (63 scored):

ds4-bench: PPL teacher-forced  kv_cache=turbo3  tokens=64  scored=63  elapsed=4.06s
ds4-bench:   nll_avg=4.674636  ppl=107.193523

Identical to the parent-commit baseline at the same configuration. The
HISA paper's >99% top-K IoU vs flat held in every configuration tested
in this session.

What's in the diff

  • ds4_cuda.cu: five kernels (block-rep mean-pool, block scores, top-m
    selection, refine scores, scores init) plus two launchers
    (ds4_gpu_hisa_block_rep_update_tensor, ds4_gpu_hisa_score_one_tensor).
    Block-size and head-dim constants are added to the existing
    DS4_CUDA_* enum. Gate threshold and top-m count live with the
    dispatch in ds4.c.
  • ds4_gpu.h: API declarations.
  • ds4_metal.m: stubs that return zero so the Metal backend falls back
    to the flat indexer; the Metal port is deferred to a follow-up PR.
  • ds4.c: graph state (layer_hisa_block_reps[], hisa_sel_blocks,
    hisa_block_scores), allocation alongside the comp cache, free, and
    the decode-token dispatch site that routes through HISA when
    n_index_comp is over the gate.
  • speed-bench/hisa/: raw --csv output from this branch plus a
    README describing the runs and the queued follow-up sweep.

Memory cost

block_reps is ceil(layer_comp_cap / 128) * 128 floats per layer,
plus a shared sel_blocks[128] uint32 and one
block_scores[n_blocks_max] float scratch. At 256K ctx cap that is
roughly 256 KB per layer and about 6 MB across all layers, negligible
against the comp cache itself.
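
The per-layer figure can be checked with a few lines of arithmetic. This sketch assumes a 256K context cap at ratio 4 gives layer_comp_cap = 65536 and that a float is 4 bytes; the function name `hisa_block_reps_bytes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one layer's block representatives:
 * ceil(layer_comp_cap / 128) blocks, each holding 128 floats. */
static size_t hisa_block_reps_bytes(size_t layer_comp_cap) {
    const size_t block_rows = 128;      /* rows pooled into one block */
    const size_t floats_per_rep = 128;  /* floats stored per representative */
    assert(layer_comp_cap > 0);
    size_t n_blocks = (layer_comp_cap + block_rows - 1) / block_rows;
    return n_blocks * floats_per_rep * sizeof(float);
}
```

For a cap of 65536 rows this gives 512 blocks and 262144 bytes, which is the 256 KB per layer quoted above. The ~6 MB total would then correspond to about 24 such layers, which is an inference from the two figures rather than a number stated in this PR.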

Implementation notes

  • Block size is 128 rows. At 65K compressed rows this gives ~512
    block-scores in the coarse stage and 64 selected blocks for refine
    (top-m = 64, recommended by the paper).
  • The block top-m kernel force-includes block 0 and the most recent
    visible block per the HISA recency rule.
  • The math inside each per-row dot is identical to
    indexer_score_one_direct_kernel (per-head ReLU dot, per-head
    weight, scale), so quality matches paper expectations.
  • v1 recomputes all block reps on every dispatch. An incremental
    update covering only the last partial block on each compressor
    emit is the natural follow-up; the rebuild cost at 256K is already
    a small fraction of the refine savings.
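
The force-include rule from the notes above can be sketched on the CPU as follows. This is an illustration, not the PR's kernel: `hisa_select_blocks` is a hypothetical name, and "most recent visible block" is assumed here to mean the last block index.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Select up to m block indices into sel[] and return how many were chosen.
 * Block 0 and the last block are always included; the remaining slots are
 * filled in descending score order. block_scores is overwritten as scratch. */
static int hisa_select_blocks(float *block_scores, int n_blocks, int m,
                              uint32_t *sel) {
    assert(n_blocks > 0 && m > 0);
    int n = 0;
    int last = n_blocks - 1;

    sel[n++] = 0;                       /* always keep the first block */
    block_scores[0] = -INFINITY;
    if (last != 0 && n < m) {           /* always keep the newest block */
        sel[n++] = (uint32_t)last;
        block_scores[last] = -INFINITY;
    }

    while (n < m && n < n_blocks) {
        int best = -1;
        for (int b = 0; b < n_blocks; b++) {
            if (block_scores[b] > -INFINITY &&
                (best < 0 || block_scores[b] > block_scores[best]))
                best = b;
        }
        if (best < 0) break;            /* nothing left to pick */
        sel[n++] = (uint32_t)best;
        block_scores[best] = -INFINITY; /* mark as taken */
    }
    return n;
}
```

Reserving the two forced slots before the score-ordered fill means a low coarse score can never evict the first or the newest block, at the cost of two fewer score-driven picks.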

Tests

make clean && make
./ds4_test --server          # OK
./ds4_test --metal-kernels   # OK

Mac Metal exercises the flat-indexer fallback (HISA launchers stub to
zero) so the Metal build and kernel checks cover the unchanged path.

make cuda-spark
make cuda-regression
./ds4_test --logprob-vectors
./ds4_test --long-context

The GB10 --long-context test is the one that actually exercises HISA at
runtime: its fact-recall prompt drives n_index_comp past 49152 on the
deeper layers, so it runs through the new dispatch path rather than the
flat indexer.

Status

Draft for a first review pass. CUDA-only; the Metal stubs intentionally
return zero so the existing flat indexer continues to handle Metal.
The full per-dtype before/after CSV sweep is queued and will land in
speed-bench/hisa/ before this leaves draft.

Related: #243 (TurboQuant+ 3-bit KV cache). This PR's bench is taken
with --kv-cache turbo3 --comp-cache turbo3 on a build that includes
#243; HISA itself is dtype-agnostic since the indexer operates on the
float index_comp cache, unchanged across KV dtypes.

TheTom force-pushed the hisa-indexer branch 4 times, most recently from c49488a to 7132b83 May 26, 2026 14:55

The indexer score scan walks every compressed row at decode-token cost
O(n_comp).  At 256K context that is ~65K rows per layer per token and
the scan starts to outweigh the actual sparse attention behind it.

HISA (arxiv 2603.28458) replaces the flat scan with a two-stage walk:
score block representatives (mean of 128 consecutive rows) coarsely,
pick the top-m blocks, then refine inside those blocks.  Cost drops to
O(n_blocks + m * 128).  Output uses the same per-row scores layout
(non-candidate rows are -INF) so the downstream top-K kernel runs
unchanged.
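
As a rough illustration of the O(n_blocks + m * 128) figure, the sketch below counts rows scored per decode token by each approach. The function names are hypothetical, and the count ignores the block-rep rebuild and the top-m selection, so it is an upper bound on the saving rather than a prediction of throughput.

```c
#include <assert.h>

/* Rows scored per decode token by the flat scan. */
static long flat_rows_scored(long n_comp) {
    return n_comp;
}

/* Rows scored by the two-stage walk: one representative per block,
 * then every row of the m selected blocks. */
static long hisa_rows_scored(long n_comp, long block_rows, long m) {
    assert(block_rows > 0);
    long n_blocks = (n_comp + block_rows - 1) / block_rows;
    if (m > n_blocks) m = n_blocks;  /* cannot select more blocks than exist */
    return n_blocks + m * block_rows;
}
```

At n_comp = 65536 with 128-row blocks and m = 64, the two-stage walk scores 512 + 64 * 128 = 8704 rows against 65536 for the flat scan.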

CUDA kernels live alongside the existing indexer kernels; Metal stubs
return zero so the backend falls through to the flat indexer.  The
per-layer block_reps buffers and the shared sel_blocks / block_scores
scratches are allocated alongside the comp cache; the cost is small
(roughly 256 KB per layer at 256K ctx cap, ~6 MB across all layers).

Dispatch gates on n_index_comp >= 49152 (about 196K context at ratio
4).  Below that the rebuild and top-m fixed costs exceed the refine
savings; rough numbers on a GB10 Spark with Qwen3.6-A3B IQ2XXS:

  64K  (n_comp ~16K) HISA  -4.7%   (gate skips it, flat indexer runs)
  128K (n_comp ~32K) HISA  -1.8%   (gate skips it, flat indexer runs)
  256K (n_comp ~65K) HISA  +2.3% turbo3+comp, +2.7% fp8, +1.7% turbo4

Perplexity over 64 tokens (63 scored) is unchanged (107.19 with and
without HISA) and the >99% top-K IoU claim from the paper holds across
all KV dtypes tested.

The implementation is v1 simple: block reps are recomputed in full on
every dispatch.  An incremental update covering only the last partial
block on each compressor emit is a straightforward follow-up.

TheTom marked this pull request as ready for review May 26, 2026 16:15