Add suffix-tree speculative decoding for repetitive/agentic generation patterns #261
hexxyan wants to merge 9 commits into
Implements an opt-in, model-free suffix trie (arXiv:2411.04975) for speculative decoding in ds4, with probability estimation and adaptive draft caps ported from Snowflake ArcticInference.

New files:
- ds4_suffix_tree.h/c: bounded CPU-resident suffix trie with sorted-array children, frequency-aging pruning, memory budget enforcement, probability estimation, and cached best-child index (~500 lines)
- tests/suffix_tree_test.c: 8 unit tests covering continuation matching, probability estimation, confidence filtering, best-child caching, and pruning

Modified files:
- ds4.h: 6 new engine options (suffix_decoding, suffix_max_depth, suffix_memory_budget, suffix_spec_factor, suffix_spec_offset, suffix_min_prob); suffix stats telemetry with draft_score_total
- ds4.c: session lifecycle integration, incremental learning via suffix_learned_len, two-phase draft selection (match_depth then query), score-based draft gating, independent spec_logits allocation
- ds4_cli.c: 6 new CLI flags
- ds4_bench.c: suffix telemetry CSV columns, 3 new config options
- Makefile: suffix-tree-test target
- README.md: usage documentation
- CONTRIBUTING.md: telemetry column guidance
- speed-bench/README.md: MTP and suffix bench sweep examples

Disabled by default (--suffix-decoding flag required). No external dependencies, no training, no GPU kernels. Falls back gracefully to MTP or single-token decode when no suffix match is found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I tested this PR on an M5 Max 128GB with the DeepSeek V4 Flash q2 GGUF. Good news: the feature builds and works on real hardware.

Passed:
    ./tests/suffix_tree_test
    ./ds4_test --server
    make test

Main finding: the current default draft cap is too aggressive on my Metal runs. It often proposes long drafts, but the verifier only commits a short prefix. That makes default

Example, agent-style JSONL prompt, 8192 ctx, 512 gen tokens:
Repeated baseline vs conservative cap:
The useful tuning was:
    --suffix-spec-factor 0.01 --suffix-spec-offset 2

I opened a small follow-up PR against this branch with the benchmark details and default tuning:
Small follow-up: I rebased the tuning branch onto
Follow-up PR: hexxyan#1
Final M5 Max 128GB numbers:
Agent JSONL, 8192 ctx, 512 gen tokens
Code boilerplate, 2048 ctx, 512 gen tokens
Tests passed after rebase:
    ./tests/suffix_tree_test
    ./ds4_test --server
    make test

So the main point is: the current default over-drafts and can cause a big slowdown on Metal; the conservative default avoids that and keeps suffix decoding near baseline / slightly faster on the agent-style workload.
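To make the over-drafting point concrete, here is a minimal sketch of the draft-cap formula from the PR description, cap = match_len × factor + offset. The helper name `suffix_draft_cap`, the truncation toward zero, and the lack of an upper clamp are assumptions for illustration, not the ds4 code.

```c
#include <assert.h>

/* Hypothetical helper, not part of ds4: draft cap from the formula
 * cap = match_len * factor + offset given in the PR description.
 * Truncating toward zero is an assumption; the real engine may round
 * or clamp differently. */
static int suffix_draft_cap(int match_len, float factor, float offset) {
    assert(match_len >= 0);
    float cap = (float)match_len * factor + offset;
    return cap < 0.0f ? 0 : (int)cap;
}
```

Under these assumptions, a 24-token suffix match yields a 24-token draft with factor 1.0 and offset 0 (the alpha=1 cap), but only a 2-token draft with `--suffix-spec-factor 0.01 --suffix-spec-offset 2`. If the verifier usually commits only a short prefix, the shorter draft wastes far less verification work.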
Thanks for the thorough benchmarking work! Merging with merge commit to preserve authorship.
Thanks @nhwaani for the thorough M5 Max benchmarking: the conservative defaults (
Your commit is preserved with you as author, so when this PR lands you'll appear in the project's Contributors list as well. Appreciate the real-hardware testing!

Thanks, let me know if you see any concerns @hexxyan
Given the complexity and marginal performance benefits in these benchmarks, what's the reason to merge this? Is there a more representative benchmark for agentic coding?
- Fix sentinel defaults: use >= 0.0f check so --suffix-spec-offset 0 works
- Add speculative decode telemetry (accept rate, timing, hit/miss counts)
- Pre-allocate verifier scratch buffers in session (eliminate hot-path malloc)
- Enable decode2_exact for suffix tree N=2 drafts (faster verify path)
- Print telemetry summary on exit when suffix decoding is active
Suffix decoding fixes pushed
Just pushed a batch of fixes and improvements:
@nhwaani, would you mind re-testing with the updated branch? The telemetry output will now print at the end, which should make it easier to see what's happening with accept rates and verify timing. A quick run like: would be very helpful. Thanks!
Retested latest
Branch tested: 3d553db + 1800c4d
It fixes impossible replay timing like:
    replay=815247931.0ms
After the fix, telemetry is sane:
    verify=8316.1ms replay=3.1ms
Tests passed:
    make -j$(sysctl -n hw.ncpu) suffix-tree-test ds4_test ds4-bench
    ./tests/suffix_tree_test
    ./ds4_test --server
Model used: q2-q4-imatrix GGUF.
Agent JSONL, 8192 ctx, 512 gen
Code boilerplate, 2048 ctx, 512 gen
My read:
So I would not present this as a proven M5 Metal performance win today. The merge argument, if any, is:
I agree with @STRML that a more representative multi-turn coding-agent benchmark would be useful before making strong speedup claims.
- Revert decode2_exact for suffix tree (was causing slowdown, it's a correctness path not a fast path; suffix N=2 should use batch verify)
- Fix accept_rate formula: total_verified now counts draft_n (not draft_n-1) so the ratio correctly reflects draft token acceptance probability
- Fix factor sentinel: use > 0.0f so zero-initialized API opts get engine default (0.01) instead of literal 0.0
- Apply PR #2: verify_done timestamp captured unconditionally so replay telemetry doesn't print garbage when DS4_MTP_TIMING is off
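The two sentinel checks mentioned in these commits (>= 0.0f for the offset, > 0.0f for the factor) can be sketched as follows. This is an illustration only: the function and macro names are hypothetical, the 0.01 factor default comes from the commit message, and the 2.0 offset default is assumed from the conservative tuning discussed earlier in the thread.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the ds4 identifiers. */
#define SUFFIX_DEFAULT_FACTOR 0.01f   /* engine default named in the commit */
#define SUFFIX_DEFAULT_OFFSET 2.0f    /* assumed from the conservative tuning */
#define SUFFIX_UNSET          (-1.0f) /* sentinel: caller did not set a value */

/* Factor: 0.0 (for example from `opt = {0}`) and the negative sentinel
 * are both treated as unset and fall back to the engine default. */
static float resolve_spec_factor(float requested) {
    return requested > 0.0f ? requested : SUFFIX_DEFAULT_FACTOR;
}

/* Offset: zero is a legitimate explicit value, so only a negative
 * (sentinel) input falls back to the default. */
static float resolve_spec_offset(float requested) {
    return requested >= 0.0f ? requested : SUFFIX_DEFAULT_OFFSET;
}

/* Usage: the three cases the fixes are about. */
static void resolve_examples(void) {
    assert(resolve_spec_offset(0.0f) == 0.0f);                          /* explicit zero kept */
    assert(resolve_spec_offset(SUFFIX_UNSET) == SUFFIX_DEFAULT_OFFSET); /* unset -> default */
    assert(resolve_spec_factor(0.0f) == SUFFIX_DEFAULT_FACTOR);         /* zero-init -> default */
}
```

The asymmetry is the point of the sketch: a zero offset means "no constant term", whereas a zero factor is treated as "not configured" in this sketch.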
Regression fixes pushed
Good catch on all four issues. Just pushed a fix commit (
@nhwaani, could you re-run your benchmarks with this version? The decode2_exact revert should bring performance back to the 1.02x baseline.
Fix suffix telemetry replay timing
- Add total_verify_ms to all 5 generic batch verifier return points
- Add total_replay_ms to the 2 replay paths (exact replay + general replay)
- Always capture draft_query_ms (remove DS4_SUFFIX_SPEC_LOG gate)
- Add ds4_engine_spec_telemetry_reset() API for per-frontier reset
- Reset telemetry after each frontier in ds4-bench so CSV rows are independent snapshots, not cumulative

Now verify/replay/draft_query timing covers the suffix-only main path (generic batch verifier), not just the decode2_exact branch.
…ect N buckets
- draft_query_ms now always accumulates (removed DS4_SUFFIX_SPEC_LOG gate)
- micro_verify_done captured unconditionally, verify_ms accumulated once after verifier completes rather than at each return point
- total_replay_ms only counts post-verifier replay/restore cost
- accept_rate prints only when total_verified > 0
- N= bucket index fixed: i+1 instead of i+2 (N=1 means 1 draft token)
- first-draft miss now counts toward total_verified and partial_accept histogram so the two stay consistent
Telemetry completeness update pushed (
Retested latest
Build/tests passed:
    make -j$(sysctl -n hw.ncpu) suffix-tree-test ds4_test ds4-bench
    ./tests/suffix_tree_test
    ./ds4_test --server
    make test
3-run median results, 512 generated tokens:
Raw gen tok/s:
Telemetry now looks sane and stable:
    agent r1: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8524.6ms draft_query=1.0ms
    agent r2: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8708.5ms draft_query=0.9ms
    agent r3: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8498.9ms draft_query=0.9ms
    code r1: steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8834.8ms draft_query=0.8ms
    code r2: steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8465.3ms draft_query=0.8ms
    code r3: steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8497.3ms draft_query=1.0ms
I also did a single sanity run with the old aggressive cap (
So the latest fixes recover the expected behavior on my synthetic repetitive agent/code prompts: conservative suffix defaults are near-baseline to modestly faster, while avoiding the old aggressive over-drafting slowdown. I’d still avoid broad performance claims until there is a representative multi-turn coding-agent benchmark, but this version looks healthy on M5 Max Metal.
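As a cross-check on the telemetry lines above, the reported accept_rate values are consistent with committed / verified: 294 / 374 ≈ 0.786 for the agent runs and 291 / 329 ≈ 0.884 for the code runs. A minimal sketch of that ratio follows; the struct and function names are hypothetical and only mirror the printed counter names, and returning -1.0 for the "nothing verified" case is an assumption standing in for "do not print".

```c
#include <assert.h>

/* Hypothetical counters mirroring the names in the telemetry line. */
typedef struct {
    long committed;  /* draft tokens accepted by the verifier */
    long verified;   /* draft tokens checked by the verifier */
} spec_counts;

/* Returns committed / verified, or -1.0 when nothing was verified
 * (per the commit notes, accept_rate is only printed when
 * total_verified > 0). */
static double spec_accept_rate(spec_counts c) {
    assert(c.committed >= 0 && c.committed <= c.verified);
    if (c.verified <= 0) return -1.0;
    return (double)c.committed / (double)c.verified;
}
```

The counters are also internally consistent in the agent runs: first_hit + N=2 full accepts is 159 + 135 = 294, which equals committed.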
Thanks @antirez for this wonderful project.
Summary
This PR adds an opt-in, model-free suffix-tree draft source for speculative decoding, implementing the SuffixDecoding approach (arXiv:2411.04975) with probability estimation features ported from Snowflake ArcticInference.
The suffix trie learns repetitive token patterns from prompt and prior output, then proposes draft tokens at zero model cost. The target-model verifier accepts or rejects each draft, so output correctness is always guaranteed — the trie can only offer speedups, never change results.
When this helps
The suffix trie is most effective when generated text contains repetitive or predictable subsequences:
The SuffixDecoding paper (arXiv:2411.04975) reports 1.3–2.5× speedup on agentic benchmarks with their reference implementation. ds4's implementation uses the same core algorithm (suffix trie + frequency-based continuation) with a different internal data structure (sorted arrays vs hash maps), so actual speedup depends on the workload's repetition patterns and the model's baseline token latency.
Key properties
- --suffix-decoding flag required): zero impact on existing behavior

What's included
New files:
- ds4_suffix_tree.h/c: Bounded CPU-resident suffix trie with sorted-array children, frequency-aging pruning, and best-effort memory budget (~500 lines)
- tests/suffix_tree_test.c: 8 unit tests covering continuation matching, probability estimation, confidence filtering, best-child caching, and pruning

Modified files:
- ds4.h: 6 new engine options (suffix_decoding, suffix_max_depth, suffix_memory_budget, suffix_spec_factor, suffix_spec_offset, suffix_min_prob)
- ds4.c: Session lifecycle integration, incremental learning, two-phase draft selection (match → query), score-based draft gating, independent spec_logits allocation for suffix-only batch verification
- ds4_cli.c: 6 new CLI flags
- ds4_bench.c: Suffix telemetry CSV columns, new config options (note: CSV schema expanded from 6 columns to include context memory breakdown, MTP stats, and suffix telemetry; downstream CSV parsers will need updating)
- Makefile: suffix-tree-test target
- README.md: Usage documentation
- CONTRIBUTING.md: Telemetry column guidance
- speed-bench/README.md: MTP and suffix bench sweep examples

Architecture
The suffix trie learns repetitive token patterns from prompt, checkpoint, and accepted generation tokens. During speculative decode:
- ds4_suffix_tree_match_depth() finds the longest matching suffix in the trie
- cap = match_len × factor + offset
- ds4_suffix_tree_query() follows the highest-frequency continuation path with probability estimation (prob *= child_freq / parent_freq)
- min_prob confidence are filtered; score is used for quality gating

Incremental learning: the tree uses suffix_learned_len to track what's already been inserted, appending only new tokens per decode step instead of re-inserting the entire checkpoint. This avoids frequency inflation for older patterns and keeps per-step cost O(max_depth).

Pruning: frequency aging (decrement all by 1, clamp at 0) followed by zero-frequency leaf removal, with a configurable node budget (default 64 MB). Pruning runs up to 16 rounds per trigger and may temporarily exceed the budget under heavy insert load before converging back.
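The query step described above (follow the highest-frequency child, multiply the running probability by child_freq / parent_freq, stop below min_prob) can be sketched roughly as follows. This is not the ds4_suffix_tree code: the node layout, the function names, the linear scan in place of the cached best-child index, and the choice of score as the sum of path probabilities are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; the real ds4_suffix_tree structs may differ. */
typedef struct trie_node {
    int token;                    /* token id on the edge into this node */
    unsigned freq;                /* how often this path was observed */
    struct trie_node **children;  /* child pointers, sorted by token id */
    size_t n_children;
} trie_node;

/* Linear scan for the most frequent child; NULL for a leaf. The PR caches
 * this index instead of rescanning. */
static const trie_node *best_child(const trie_node *n) {
    const trie_node *best = NULL;
    for (size_t i = 0; i < n->n_children; i++)
        if (!best || n->children[i]->freq > best->freq)
            best = n->children[i];
    return best;
}

/* Greedy draft from `start` (the node reached by the suffix match).
 * Writes at most `cap` tokens to `out` and returns how many were written.
 * `*score` receives the sum of path probabilities (an assumed definition). */
static int draft_query(const trie_node *start, int cap, float min_prob,
                       int *out, float *score) {
    assert(start && out && score && cap >= 0);
    const trie_node *node = start;
    float prob = 1.0f;
    int n = 0;
    *score = 0.0f;
    while (n < cap && node->freq > 0) {
        const trie_node *next = best_child(node);
        if (!next) break;                               /* ran out of history */
        prob *= (float)next->freq / (float)node->freq;  /* child_freq / parent_freq */
        if (prob < min_prob) break;                     /* confidence filter */
        out[n++] = next->token;
        *score += prob;
        node = next;
    }
    return n;
}
```

For example, with a start node of frequency 8 whose best child has frequency 6, and that child's only child has frequency 3, the path probabilities are 0.75 and then 0.375. With min_prob = 0.5 the draft stops after one token; with min_prob = 0 and a cap of at least 2 it proposes both.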
Verification
Important caveats
- --suffix-spec-factor 1.0, --suffix-spec-offset 0.0, --suffix-min-prob 0.0), the implementation reproduces the original alpha=1 draft cap. The new parameters are configurable infrastructure for tuning.
- spec_logits is now allocated independently when --suffix-decoding is enabled (without requiring MTP). The structural prerequisite is in place but needs real-model validation.

Request for maintainers/community
If you have access to a machine with 80GB+ GPU running DeepSeek V4, help with the following would be appreciated:
- --suffix-decoding

AI assistance disclosure
This code was written with assistance from GPT-5.5-xhigh and GLM-5.1 models.
References
Follow-up fixes (commits after initial PR)
Build/test fixes
- --suffix-spec-factor 0.01 --suffix-spec-offset 2 avoids over-drafting on Metal, keeping suffix decoding near baseline on M5 Max

Bug fixes
- --suffix-spec-offset 0 now works: sentinel defaults (-1.0f) + correct engine-side checks so explicit zero is respected
- suffix_spec_factor zero-init safety: > 0.0f check so API callers with ds4_engine_options opt = {0} get engine defaults

Telemetry (speculative decode)
Added comprehensive telemetry printed on exit when --suffix-decoding is active:
- spec_steps, first_draft_hit/miss: how often speculative decode runs and how often the first draft matches
- accept_rate: committed draft tokens / verified draft tokens
- N=X:full=Y:partial=Z histogram: distribution of full vs partial accepts by draft depth
- verify_ms, replay_ms, draft_query_ms: timing breakdown across all verifier paths (decode2_exact, generic batch verifier, sequential fallback)

Telemetry covers all verifier paths, not just one branch. In ds4-bench, telemetry resets per frontier so each CSV row is an independent snapshot.

What we learned from M5 Max testing