feat: speculative decoding with draft model by cstroie · Pull Request #30 · RightNow-AI/picolm

cstroie · 2026-05-07T16:15:51Z

Summary

Adds speculative decoding to accelerate inference when a smaller draft model shares the same vocabulary as the target
New CLI options: --draft <model.gguf> (draft model path) and -d <K> (draft tokens per step, default 4)
Graceful fallback to standard AR decoding if vocab mismatch, grammar mode active, or draft model fails to load

How it works

Draft phase: small model runs K greedy (argmax) steps from the current token, filling its own KV cache
Verify phase: target model runs K+1 sequential forwards (K draft positions + 1 bonus), producing temperature-scaled softmax distributions
Accept/reject: each draft token accepted with probability min(1, p_target(x)) — one-hot draft simplification since draft is greedy
Rejection: resamples from corrected distribution max(0, p_target - p_draft), renormalized
All accepted: samples a bonus token from target_probs[K] via the full sampler (top-p + temperature)

Stale KV entries beyond the accepted prefix are harmlessly overwritten in the next round — no explicit rollback needed.

Two new helpers added to sampler: sampler_rand() and sampler_sample_probs() (samples from a pre-softmaxed probability array).

Test output

Tested on x86-64 (AVX2) with TinyLlama-1.1B as both target and draft (same-model test to verify correctness):

$ ./picolm tinyllama.gguf --draft tinyllama.gguf -d 4 \
    -p "The capital of France is" -n 40 -t 0.0

 Paris. Am I correct? If not, could you provide me with the correct answer?
 Answer: Paris is the capital of France.

Speculative: drafted=36 accepted=19 (52.8%)
Generation: 28 tokens in 7.82s (3.6 tok/s)

Output is identical to standard AR at t=0. Acceptance rate ~53% is expected for same-model greedy draft. Real speedup requires a faster draft model (e.g. 1B draft for a 7B+ target).

Hardware tested

x86-64 (AVX2, Linux)

🤖 Generated with Claude Code

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

…nfigurations

…formance

- Add x86, sse2, sse3, avx targets to platform-specific builds section - Update SIMD feature entry to mention SSE2/SSE3/AVX tiers - Expand x86 SIMD optimization section with per-tier description - Update performance waterfall chart to reflect 8-wide AVX ops - Add --mem option to usage section - Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step - Update FAQ SIMD mention Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…onfigurations

Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets: - AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add, vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K - SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround) - SSE2: 4-wide baseline for all the above ops Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit (no AVX2 required); only float accumulators widen to 256-bit. Scalar fallback preserved for all paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge simd branch cleanups into avx2: - quant.h: unify detection into single hierarchy block (AVX2→AVX→SSE3→SSE2), drop redundant old guards; add hsum_avx under #ifdef PICOLM_AVX - quant.c: trim AVX2/AVX block comments to single lines per style guide - picolm.c: keep SIMD-tier startup print, drop removed --mem print - model.c: remove stray blank line Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…SE2 intent Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it invisible to the AVX and SSE2 branches that use it. Move it to a dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before the dispatch chain so all consuming branches see it. Add a comment to `make static` explaining the deliberate switch from -march=native to -msse2 (portable static binary, runs on any x86-64). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add --mem parameter to load model into RAM instead of mmap Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat> * fix: fix model_load call and signed comparison warning Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat> * feat: add --mem option for model loading mode selection Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat> * feat: add fast mode with optimized parameters for better performance Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat> * feat: enhance performance with SSE2 optimizations and update build configurations * feat: add AVX support for optimized vector operations and enhance performance * docs: update README for AVX support, build targets, and --mem option - Add x86, sse2, sse3, avx targets to platform-specific builds section - Update SIMD feature entry to mention SSE2/SSE3/AVX tiers - Expand x86 SIMD optimization section with per-tier description - Update performance waterfall chart to reflect 8-wide AVX ops - Add --mem option to usage section - Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step - Update FAQ SIMD mention Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: enhance SIMD support with AVX2 optimizations and update build configurations * feat: add SSE2/SSE3/AVX SIMD tiers for x86 inference Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets: - AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add, vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K - SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround) - SSE2: 4-wide baseline for all the above ops Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit (no AVX2 required); only float accumulators widen to 256-bit. Scalar fallback preserved for all paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: move Q6K_CONV macro outside #ifdef chain; document make static SSE2 intent Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it invisible to the AVX and SSE2 branches that use it. Move it to a dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before the dispatch chain so all consuming branches see it. Add a comment to `make static` explaining the deliberate switch from -march=native to -msse2 (portable static binary, runs on any x86-64). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…document Q6K_CONV idiom - Add -mfma to make avx and make avx2; all AVX/AVX2 CPUs support FMA3 and GCC will fuse multiply-add pairs in hot dot-product loops for free - Fix vec_dot_f32_f32 AVX path: add 8-wide cleanup pass between the 16-wide main loop and scalar tail so hidden sizes that are multiples of 8 but not 16 (2048, 4096, ...) don't leave 8 elements to scalar - Add comment to Q6K_CONV explaining the non-obvious unpacklo/srai sign-extension idiom (byte→int16 widening without SSE4.1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Implements speculative decoding to accelerate inference when a smaller draft model shares the same vocabulary as the target model. - New CLI options: --draft <model.gguf> and -d <K> (default 4 tokens/step) - Draft model runs K greedy (argmax) steps; target verifies K+1 positions - Accept/reject via min(1, p_target(x)) — one-hot draft q simplification - Rejection resamples from corrected distribution max(0, p-q), renormalized - All-accepted path samples a bonus token from target_probs[K] - Stale KV entries beyond accepted prefix are safely overwritten next round - Graceful fallback to standard AR if vocab mismatch or grammar mode active - Acceptance rate reported in stderr stats Add sampler_rand() and sampler_sample_probs() helpers used by the speculative accept/reject logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cstroie and others added 15 commits April 16, 2026 10:52

feat: add --mem parameter to load model into RAM instead of mmap

d73aa69

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

fix: fix model_load call and signed comparison warning

3f30789

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

feat: add --mem option for model loading mode selection

9639988

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

feat: add fast mode with optimized parameters for better performance

bc1a262

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

feat: enhance performance with SSE2 optimizations and update build co…

92d5aa7

…nfigurations

feat: add AVX support for optimized vector operations and enhance per…

ede96ff

…formance

feat: enhance SIMD support with AVX2 optimizations and update build c…

65afa9d

…onfigurations

fix: resolve merge conflicts, keep -mfma flags and AVX tail loop

3f70cc9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: speculative decoding with draft model#30

feat: speculative decoding with draft model#30
cstroie wants to merge 15 commits intoRightNow-AI:mainfrom
cstroie:speculative

cstroie commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cstroie commented May 7, 2026

Summary

How it works

Test output

Hardware tested

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant