Add HISA hierarchical indexer for long-context decode by TheTom · Pull Request #258 · antirez/ds4

TheTom · 2026-05-26T14:27:36Z

Adds a HISA-style hierarchical indexer (arxiv 2603.28458) to the
decode-token path of the existing per-layer compressed indexer. The
flat indexer walks every compressed row at decode-token, which becomes
the dominant decode cost at long context. HISA replaces it with a
block-coarse pass over mean-pooled block representatives, a top-m block
selection, and a token-refine pass restricted to the selected blocks.

Output uses the same per-row scores layout (non-candidate rows are
-INF) so the existing top-K kernel runs unchanged. No new flags;
the dispatch decides at runtime which indexer to use.
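
As a reading aid, the coarse-then-refine flow described above can be sketched as a single-threaded CPU reference. This is not the CUDA code in the diff: the hisa_ref_* names are hypothetical, a plain dot product stands in for the real per-row score, and the forced-block rule is left out.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

enum { HISA_BLOCK = 128 };   /* rows per block, as stated in this PR */

/* Stand-in score: plain dot product (assumption; the real kernels use a
 * per-head ReLU dot with per-head weights and a scale). */
static float dot(const float *a, const float *b, int dim) {
    float s = 0.0f;
    for (int i = 0; i < dim; i++) s += a[i] * b[i];
    return s;
}

/* Mean-pool each block of rows into one representative.
 * reps must hold ceil(n_rows / HISA_BLOCK) * dim floats.
 * Returns the number of blocks. */
int hisa_ref_block_reps(const float *rows, int n_rows, int dim, float *reps) {
    int n_blocks = (n_rows + HISA_BLOCK - 1) / HISA_BLOCK;
    for (int b = 0; b < n_blocks; b++) {
        int lo = b * HISA_BLOCK;
        int hi = lo + HISA_BLOCK < n_rows ? lo + HISA_BLOCK : n_rows;
        for (int d = 0; d < dim; d++) {
            float acc = 0.0f;
            for (int r = lo; r < hi; r++) acc += rows[(size_t)r * dim + d];
            reps[(size_t)b * dim + d] = acc / (float)(hi - lo);
        }
    }
    return n_blocks;
}

/* Score the representatives, pick the top_m blocks, then score only the
 * rows inside those blocks. scores[] has n_rows entries; rows outside the
 * selected blocks keep the -INF written by the init loop. */
void hisa_ref_score(const float *q, const float *rows, const float *reps,
                    int n_rows, int dim, int top_m,
                    float *block_scores, int *sel_blocks, float *scores) {
    int n_blocks = (n_rows + HISA_BLOCK - 1) / HISA_BLOCK;
    assert(top_m > 0);
    if (top_m > n_blocks) top_m = n_blocks;

    for (int r = 0; r < n_rows; r++) scores[r] = -INFINITY;
    for (int b = 0; b < n_blocks; b++)
        block_scores[b] = dot(q, reps + (size_t)b * dim, dim);

    /* Repeated argmax, O(top_m * n_blocks): fine for a reference. */
    for (int k = 0; k < top_m; k++) {
        int best = -1;
        for (int b = 0; b < n_blocks; b++) {
            int taken = 0;
            for (int j = 0; j < k; j++) if (sel_blocks[j] == b) taken = 1;
            if (!taken && (best < 0 || block_scores[b] > block_scores[best]))
                best = b;
        }
        sel_blocks[k] = best;
    }

    for (int k = 0; k < top_m; k++) {
        int lo = sel_blocks[k] * HISA_BLOCK;
        int hi = lo + HISA_BLOCK < n_rows ? lo + HISA_BLOCK : n_rows;
        for (int r = lo; r < hi; r++)
            scores[r] = dot(q, rows + (size_t)r * dim, dim);
    }
}
```

Because unselected rows are left at -INF, a downstream top-K over scores[] can only return rows from the selected blocks, which is why the existing top-K kernel needs no changes.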

Runtime gate

HISA fires when n_index_comp >= 49152 (roughly 196K context at ratio
4). Below that the block-rep rebuild plus the top-m selection cost
exceed the refine savings, so the dispatch routes through the existing
flat path and behavior is identical to main.
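
A minimal sketch of the gate as a predicate follows. The constant name, the enum, and choose_indexer are hypothetical; only the 49152 threshold and the Metal fallback behavior come from this PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Gate threshold quoted in this PR; the macro name is illustrative. */
#define HISA_GATE_MIN_COMP_ROWS 49152u

typedef enum { INDEXER_FLAT, INDEXER_HISA } indexer_kind;

/* backend_has_hisa models a backend whose HISA launchers are stubs
 * (the Metal stubs return zero, so the flat indexer keeps running). */
indexer_kind choose_indexer(uint32_t n_index_comp, bool backend_has_hisa) {
    if (!backend_has_hisa) return INDEXER_FLAT;
    return n_index_comp >= HISA_GATE_MIN_COMP_ROWS ? INDEXER_HISA
                                                   : INDEXER_FLAT;
}
```

With the bench values in this description, choose_indexer(16394, true) selects the flat path and choose_indexer(65542, true) selects HISA.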

Bench

GB10 (ASUS Ascent, sm_121, 128 GB), Qwen3.6-A3B IQ2XXS,
ds4-bench --backend cuda --kv-cache turbo3 --comp-cache turbo3 with the
inline-dequant comp_kv path on. Raw --csv from this branch is at
speed-bench/hisa/gb10_spark.csv:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
65536,65536,341.41,32,11.01,262379536
262144,262144,236.89,16,7.61,1004230672

ctx	n_index_comp	HISA dispatch	gen_tps
65536	16394	dormant (under gate, flat indexer runs)	11.01
262144	65542	active (over gate, HISA runs)	7.61

The 64K row confirms zero regression when the gate keeps HISA off;
the 256K row is the long-context point where HISA replaces the flat
scan. The parent-commit baseline at the same 256K, --kv-cache turbo3
--comp-cache turbo3 config measured gen_tps = 7.47 in the same session,
so the on/off delta at 256K is +1.9% (7.61 vs 7.47). Companion
before/after CSVs across fp8, turbo4, and the canonical
--gen-tokens 128 --step-incr 16384 sweep are queued and will be added
to speed-bench/hisa/.

prefill_tps is unchanged at both ctx points; HISA is a decode-token
optimization and the prefill batched-attention path is untouched.

Quality

Teacher-forced PPL on the same model and prompt at 64 scored tokens:

ds4-bench: PPL teacher-forced  kv_cache=turbo3  tokens=64  scored=63  elapsed=4.06s
ds4-bench:   nll_avg=4.674636  ppl=107.193523

Identical to the parent-commit baseline at the same configuration. The
HISA paper's >99% top-K IoU vs the flat indexer held in every
configuration tested in this session.
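
For readers unfamiliar with the metric, top-K IoU is the size of the intersection of two top-K index sets divided by the size of their union. The sketch below uses that standard definition; topk_iou is a hypothetical helper, and this description does not show how the measurement was actually taken.

```c
#include <stddef.h>

/* Intersection-over-union of two top-K index sets.
 * Each array must hold distinct row indices; order does not matter.
 * O(na * nb), which is acceptable for an offline check. */
double topk_iou(const int *a, size_t na, const int *b, size_t nb) {
    size_t inter = 0;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j]) { inter++; break; }
    size_t uni = na + nb - inter;
    /* Two empty sets are treated as identical. */
    return uni == 0 ? 1.0 : (double)inter / (double)uni;
}
```

Passing the flat indexer's top-K indices as a and the HISA top-K indices as b for the same decode step gives 1.0 when the two selections match exactly.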

What's in the diff

ds4_cuda.cu: five kernels (block-rep mean-pool, block scores, top-m
selection, refine scores, scores init) plus two launchers
(ds4_gpu_hisa_block_rep_update_tensor, ds4_gpu_hisa_score_one_tensor).
Block-size and head-dim constants are added to the existing
DS4_CUDA_* enum. Gate threshold and top-m count live with the
dispatch in ds4.c.
ds4_gpu.h: API declarations.
ds4_metal.m: stubs that return zero so the Metal backend falls back
to the flat indexer; the Metal port is deferred to a follow-up PR.
ds4.c: graph state (layer_hisa_block_reps[], hisa_sel_blocks,
hisa_block_scores), allocation alongside the comp cache, free, and
the decode-token dispatch site that routes through HISA when
n_index_comp is over the gate.
speed-bench/hisa/: raw --csv output from this branch plus a
README describing the runs and the queued follow-up sweep.

Memory cost

block_reps is ceil(layer_comp_cap / 128) * 128 floats per layer, plus a
shared sel_blocks[128] uint32 and one block_scores[n_blocks_max] float
scratch. At 256K ctx cap that is roughly 256 KB per layer and about
6 MB across all layers, negligible against the comp cache itself.
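
The per-layer figure can be checked with a few lines of arithmetic. The sketch assumes 4-byte floats and layer_comp_cap = 65536 (256K context at ratio 4); the enum and function names are illustrative, and the second 128 is read as the number of floats per block representative.

```c
#include <stddef.h>

/* Both constants are 128 in the formula above; names are illustrative. */
enum { HISA_BLOCK_ROWS = 128, HISA_REP_FLOATS = 128 };

/* Bytes needed for one layer's block_reps buffer. */
size_t hisa_block_reps_bytes(size_t layer_comp_cap) {
    size_t n_blocks = (layer_comp_cap + HISA_BLOCK_ROWS - 1) / HISA_BLOCK_ROWS;
    return n_blocks * HISA_REP_FLOATS * sizeof(float);
}
```

hisa_block_reps_bytes(65536) is 512 blocks * 128 floats * 4 bytes = 262144 bytes, i.e. the 256 KB quoted above. The 6 MB total then corresponds to roughly two dozen layers carrying the buffer.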

Implementation notes

Block size is 128 rows. At 65K compressed rows this gives ~512
block-scores in the coarse stage and 64 selected blocks for refine
(top-m = 64, recommended by the paper).
The block topm kernel force-includes block 0 and the most recent
visible block per the HISA recency rule.
The math inside each per-row dot is identical to
indexer_score_one_direct_kernel (per-head ReLU dot, per-head
weight, scale), so quality matches paper expectations.
v1 recomputes all block reps on every dispatch. An incremental
update covering only the last partial block on each compressor
emit is the natural follow-up; the rebuild cost at 256K is already
a small fraction of the refine savings.
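
The forced-include rule from the notes above can be illustrated with a small CPU sketch. It assumes block 0 and the most recent visible block count against the top-m budget, which this description does not state; hisa_ref_select_blocks is a hypothetical name, not the kernel in ds4_cuda.cu.

```c
#include <assert.h>

/* Select up to top_m block indices into sel[]; returns the count.
 * Block 0 and the last visible block are always taken first; the
 * remaining slots go to the highest-scoring blocks in between. */
int hisa_ref_select_blocks(const float *block_scores, int n_blocks,
                           int top_m, int *sel) {
    assert(n_blocks > 0 && top_m > 0);
    int n = 0;
    int last = n_blocks - 1;

    sel[n++] = 0;                                 /* forced: first block */
    if (last != 0 && n < top_m) sel[n++] = last;  /* forced: newest block */

    while (n < top_m && n < n_blocks) {
        int best = -1;
        for (int b = 1; b < last; b++) {
            int taken = 0;
            for (int j = 0; j < n; j++)
                if (sel[j] == b) { taken = 1; break; }
            if (!taken &&
                (best < 0 || block_scores[b] > block_scores[best]))
                best = b;
        }
        if (best < 0) break;   /* no unselected interior block remains */
        sel[n++] = best;
    }
    return n;
}
```

With block scores {0.1, 5, 3, 9, 0.2} and top_m = 3, the selection is blocks 0, 4, and 3: the two forced blocks come first even though their own scores are the lowest.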

Tests

make clean && make
./ds4_test --server          # OK
./ds4_test --metal-kernels   # OK

Mac Metal exercises the flat-indexer fallback (HISA launchers stub to
zero) so the Metal build and kernel checks cover the unchanged path.

make cuda-spark
make cuda-regression
./ds4_test --logprob-vectors
./ds4_test --long-context

GB10 long-context is the gate that actually exercises HISA at runtime
since the fact-recall prompt drives n_index_comp past 49152 on the
deeper layers; it runs through the new dispatch path rather than the
flat indexer.

Status

Draft for a first review pass. CUDA-only; the Metal stubs intentionally
return zero so the existing flat indexer continues to handle Metal.
The full per-dtype before/after CSV sweep is queued and will land in
speed-bench/hisa/ before this leaves draft.

Related: #243 (TurboQuant+ 3-bit KV cache). This PR's bench is taken
with --kv-cache turbo3 --comp-cache turbo3 on a build that includes
#243; HISA itself is dtype-agnostic since the indexer operates on the
float index_comp cache, unchanged across KV dtypes.

The indexer score scan walks every compressed row at decode-token cost
O(n_comp). At 256K context that is ~65K rows per layer per token and
the scan starts to outweigh the actual sparse attention behind it.

HISA (arxiv 2603.28458) replaces the flat scan with a two-stage walk:
score block representatives (mean of 128 consecutive rows) coarsely,
pick the top-m blocks, then refine inside those blocks. Cost drops to
O(n_blocks + m * 128). Output uses the same per-row scores layout
(non-candidate rows are -INF) so the downstream top-K kernel runs
unchanged.

CUDA kernels live alongside the existing indexer kernels; Metal stubs
return zero so the backend falls through to the flat indexer. The
per-layer block_reps buffers and the shared sel_blocks / block_scores
scratches are allocated alongside the comp cache; the cost is small
(roughly 256 KB per layer at 256K ctx cap, ~6 MB across all layers).

Dispatch gates on n_index_comp >= 49152 (about 196K context at ratio
4). Below that the rebuild and top-m fixed costs exceed the refine
savings; rough numbers on a GB10 Spark with Qwen3.6-A3B IQ2XXS:

  64K  (n_comp ~16K)  HISA -4.7%  (gate skips it, flat indexer runs)
  128K (n_comp ~32K)  HISA -1.8%  (gate skips it, flat indexer runs)
  256K (n_comp ~65K)  HISA +2.3% turbo3+comp, +2.7% fp8, +1.7% turbo4

Perplexity at 64 scored tokens is unchanged (107.19 with and without
HISA) and the >99% top-K IoU claim from the paper holds across all KV
dtypes tested.

The implementation is v1 simple: block reps are recomputed in full on
every dispatch. An incremental update covering only the last partial
block on each compressor emit is a straightforward follow-up.
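
To make the O(n_blocks + m * 128) figure from the commit message concrete, the sketch below counts score evaluations per layer per decode token. The function names are hypothetical, and the count deliberately ignores the block-rep rebuild, which in v1 still reads every row.

```c
#include <stddef.h>

/* Flat indexer: one score per compressed row. */
size_t flat_scored(size_t n_comp) { return n_comp; }

/* HISA: one score per block representative, plus one score per row
 * inside each of the (at most top_m) selected blocks. */
size_t hisa_scored(size_t n_comp, size_t block, size_t top_m) {
    size_t n_blocks = (n_comp + block - 1) / block;
    size_t m = top_m < n_blocks ? top_m : n_blocks;
    return n_blocks + m * block;
}
```

At 65536 compressed rows with block = 128 and top_m = 64, hisa_scored returns 512 + 8192 = 8704 against 65536 for the flat scan. At 16384 rows it returns 8320 against 16384, a much smaller saving, which is consistent with the gate keeping HISA off at shorter contexts.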

TheTom force-pushed the hisa-indexer branch 4 times, most recently from c49488a to 7132b83 Compare May 26, 2026 14:55

TheTom force-pushed the hisa-indexer branch from 7132b83 to 917b2ec Compare May 26, 2026 15:08

TheTom marked this pull request as ready for review May 26, 2026 16:15

TheTom wants to merge 1 commit into antirez:main from TheTom:hisa-indexer
