Tune suffix decoding defaults for M5 Max Metal #276
Implements an opt-in, model-free suffix trie (arXiv:2411.04975) for speculative decoding in ds4, with probability estimation and adaptive draft caps ported from Snowflake ArcticInference.

New files:
- ds4_suffix_tree.h/c: bounded CPU-resident suffix trie with sorted-array children, frequency-aging pruning, memory budget enforcement, probability estimation, and cached best-child index (~500 lines)
- tests/suffix_tree_test.c: 8 unit tests covering continuation matching, probability estimation, confidence filtering, best-child caching, and pruning

Modified files:
- ds4.h: 6 new engine options (suffix_decoding, suffix_max_depth, suffix_memory_budget, suffix_spec_factor, suffix_spec_offset, suffix_min_prob); suffix stats telemetry with draft_score_total
- ds4.c: session lifecycle integration, incremental learning via suffix_learned_len, two-phase draft selection (match_depth then query), score-based draft gating, independent spec_logits allocation
- ds4_cli.c: 6 new CLI flags
- ds4_bench.c: suffix telemetry CSV columns, 3 new config options
- Makefile: suffix-tree-test target
- README.md: usage documentation
- CONTRIBUTING.md: telemetry column guidance
- speed-bench/README.md: MTP and suffix bench sweep examples

Disabled by default (--suffix-decoding flag required). No external dependencies, no training, no GPU kernels. Falls back gracefully to MTP or single-token decode when no suffix match is found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hi @nhwaani, thanks again for the excellent benchmark work on M5 Max. Your parameter tuning (factor 0.01, offset 2) is a valuable contribution and I've already incorporated it into PR #261, where the suffix-tree implementation is under review. Your commit is preserved with full authorship credit: when PR #261 lands, you'll appear in the Contributors list.

Since the core implementation in this PR (commit b097d85) is the same as #261, would you consider closing this PR to keep the review focused in one place? Your tuning is already included there. Thanks!
@hexxyan Happy to add some value with these clankers.
Context
This PR is related to the suffix-decoding work in #261.
Important note: the suffix-tree implementation itself is from the #261 branch. My contribution here is the M5 Max validation + default tuning that was merged into `hexxyan/suffix-decoding` in hexxyan#1.

So the preferred merge path can still be #261. This PR exists to make the M5 Max benchmark/tuning visible directly on `antirez/ds4`.

Summary
Suffix decoding is a model-free speculative decoder. It learns repeated token patterns from the prompt and previous output, then proposes likely next tokens from a suffix tree. The main model still verifies every proposed token, so correctness remains gated by the target model.
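To make the drafting idea concrete, here is a minimal sketch of proposing a draft from repeated history. It is not the ds4 implementation: it uses a naive linear scan over the token history instead of the bounded suffix trie, and `propose_draft` and its parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of ds4.
 * Finds the longest suffix of hist[0..n) (at most max_depth tokens) that
 * also ends at some earlier position, then copies the tokens that followed
 * that earlier occurrence into draft. Returns the number of draft tokens
 * written, or 0 when the suffix never occurred before. */
size_t propose_draft(const int *hist, size_t n, size_t max_depth,
                     int *draft, size_t max_draft)
{
    assert(hist != NULL && draft != NULL);
    size_t best_len = 0, best_end = 0;

    /* end is one past the last token of a candidate earlier occurrence. */
    for (size_t end = 1; end < n; end++) {
        size_t len = 0;
        while (len < max_depth && len < end &&
               hist[end - 1 - len] == hist[n - 1 - len])
            len++;
        /* Keep the longest match; on ties prefer the most recent one. */
        if (len > 0 && len >= best_len) {
            best_len = len;
            best_end = end;
        }
    }
    if (best_len == 0)
        return 0;

    /* The continuation of the earlier occurrence becomes the draft. */
    size_t k = 0;
    while (k < max_draft && best_end + k < n) {
        draft[k] = hist[best_end + k];
        k++;
    }
    return k;
}
```

For the history `1 2 3 4 1 2 3` with `max_draft = 2`, the suffix `1 2 3` matches the start of the history and the draft is `4 1`. The target model would still have to verify both tokens before committing them.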
On my M5 Max 128GB, the feature works, but the original default draft cap was too aggressive for the Metal verifier path.
The original cap often proposes long drafts. The verifier then pays to check the long draft, but often commits only a short prefix. In my benchmarks this caused a large slowdown.
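This tuning changes the default to a spec factor of 0.01 and a spec offset of 2 (the `suffix_spec_factor` and `suffix_spec_offset` engine options).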
In plain language: suffix decoding now tries short drafts first by default. Users can still increase the factor/offset for workloads with very high acceptance.
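As a rough sketch of how a factor/offset pair can bound draft length, assume the cap is linear in the suffix match depth, in the style of the ArcticInference rule this work is ported from. The exact rounding and clamping in ds4 may differ, and `draft_cap` and `hard_max` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed rule, not confirmed against ds4.c:
 *   cap = floor(factor * match_depth) + offset, clamped to hard_max.
 * A small factor makes the cap nearly independent of match depth,
 * so the offset alone decides how many tokens are drafted. */
size_t draft_cap(size_t match_depth, double factor, size_t offset,
                 size_t hard_max)
{
    assert(factor >= 0.0);
    size_t cap = (size_t)(factor * (double)match_depth) + offset;
    return cap < hard_max ? cap : hard_max;
}
```

Under this assumed rule, the factor 0.01 / offset 2 tuning gives `draft_cap(64, 0.01, 2, 16) == 2`, while a factor of 1.0 lets the same 64-token match draft up to the hard limit.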
Why this helps
The suffix-tree lookup is cheap. The expensive part is target-model verification.
Example debug timing from the same M5 Max setup:
So the safer default is not “draft as much as possible”. It is “draft a small amount unless the user opts into larger drafts”.
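The trade-off can be sketched with a toy cost model. This is an assumption for illustration only, not a measurement: it treats one verification step as a fixed cost plus a per-draft-token cost, and counts the accepted tokens plus the one token the target model always produces. `tokens_per_unit_cost` and its parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical cost model, not measured ds4 behaviour.
 * One verify step costs base_cost + per_draft_cost * draft_len and
 * commits accepted + 1 tokens (the accepted draft prefix plus the
 * token the target model emits itself). */
double tokens_per_unit_cost(unsigned accepted, unsigned draft_len,
                            double base_cost, double per_draft_cost)
{
    assert(accepted <= draft_len);
    assert(base_cost > 0.0 && per_draft_cost >= 0.0);
    double cost = base_cost + per_draft_cost * (double)draft_len;
    return ((double)accepted + 1.0) / cost;
}
```

With made-up costs of 1.0 per step and 0.5 per draft token, plain decoding yields 1.0 token per unit cost. A 2-token draft that is fully accepted yields 1.5. A 16-token draft that commits the same 2 tokens yields about 0.33, which is worse than not drafting at all.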
Benchmark setup
Hardware:
`./ds4-bench --warm-weights`

Branch tested:
`hexxyan/suffix-decoding` after merging "Tune suffix decoding default draft cap for Metal" (hexxyan/ds4#1, 1df9f1a)

Workloads:
These are repetitive workloads where suffix decoding should have a chance to help.
Results: agent-style JSONL prompt
Command shape:
Average of 2 runs for baseline/new default; 1 run for old default.
Visual summary:
Results: code-boilerplate prompt
Command shape:
Average of 2 runs for baseline/new default; 1 run for old default.
Visual summary:
Telemetry
Agent JSONL, 8192 ctx, 512 generated tokens
Code boilerplate, 2048 ctx, 512 generated tokens
The new default keeps almost the same number of accepted draft tokens, but avoids expensive long verification attempts.
What changed
Only defaults and docs changed:
- `ds4.c`: engine fallback defaults
- `ds4_bench.c`: bench defaults and help text
- `ds4_cli.c`: help text
- `README.md`: explain conservative default and when to increase it

No verifier logic changed.
Tests
All passed on M5 Max.
Caveat
This is tuning from one Metal machine and two repetitive workloads. It does not claim to be universally optimal. The goal is to make the opt-in default safe: avoid turning on a large slowdown by default, while keeping the knobs exposed for larger-draft experiments.