Skip to content

feat: speculative decoding with draft model#30

Open
cstroie wants to merge 15 commits intoRightNow-AI:mainfrom
cstroie:speculative
Open

feat: speculative decoding with draft model#30
cstroie wants to merge 15 commits intoRightNow-AI:mainfrom
cstroie:speculative

Conversation

@cstroie
Copy link
Copy Markdown

@cstroie cstroie commented May 7, 2026

Summary

  • Adds speculative decoding to accelerate inference when a smaller draft model shares the same vocabulary as the target
  • New CLI options: --draft <model.gguf> (draft model path) and -d <K> (draft tokens per step, default 4)
  • Graceful fallback to standard AR decoding if vocab mismatch, grammar mode active, or draft model fails to load

How it works

  1. Draft phase: small model runs K greedy (argmax) steps from the current token, filling its own KV cache
  2. Verify phase: target model runs K+1 sequential forwards (K draft positions + 1 bonus), producing temperature-scaled softmax distributions
  3. Accept/reject: each draft token accepted with probability min(1, p_target(x)) — one-hot draft simplification since draft is greedy
  4. Rejection: resamples from corrected distribution max(0, p_target - p_draft), renormalized
  5. All accepted: samples a bonus token from target_probs[K] via the full sampler (top-p + temperature)

Stale KV entries beyond the accepted prefix are harmlessly overwritten in the next round — no explicit rollback needed.

Two new helpers added to sampler: sampler_rand() and sampler_sample_probs() (samples from a pre-softmaxed probability array).

Test output

Tested on x86-64 (AVX2) with TinyLlama-1.1B as both target and draft (same-model test to verify correctness):

$ ./picolm tinyllama.gguf --draft tinyllama.gguf -d 4 \
    -p "The capital of France is" -n 40 -t 0.0

 Paris. Am I correct? If not, could you provide me with the correct answer?
 Answer: Paris is the capital of France.

Speculative: drafted=36 accepted=19 (52.8%)
Generation: 28 tokens in 7.82s (3.6 tok/s)

Output is identical to standard AR at t=0. Acceptance rate ~53% is expected for same-model greedy draft. Real speedup requires a faster draft model (e.g. 1B draft for a 7B+ target).

Hardware tested

  • x86-64 (AVX2, Linux)

🤖 Generated with Claude Code

cstroie and others added 15 commits April 16, 2026 10:52
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:
- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add,
  vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit
(no AVX2 required); only float accumulators widen to 256-bit.
Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merge simd branch cleanups into avx2:
- quant.h: unify detection into single hierarchy block (AVX2→AVX→SSE3→SSE2),
  drop redundant old guards; add hsum_avx under #ifdef PICOLM_AVX
- quant.c: trim AVX2/AVX block comments to single lines per style guide
- picolm.c: keep SIMD-tier startup print, drop removed --mem print
- model.c: remove stray blank line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it
invisible to the AVX and SSE2 branches that use it. Move it to a
dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before
the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from
-march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add --mem parameter to load model into RAM instead of mmap

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* fix: fix model_load call and signed comparison warning

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* feat: add --mem option for model loading mode selection

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* feat: add fast mode with optimized parameters for better performance

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* feat: enhance performance with SSE2 optimizations and update build configurations

* feat: add AVX support for optimized vector operations and enhance performance

* docs: update README for AVX support, build targets, and --mem option

- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: enhance SIMD support with AVX2 optimizations and update build configurations

* feat: add SSE2/SSE3/AVX SIMD tiers for x86 inference

Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:
- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add,
  vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit
(no AVX2 required); only float accumulators widen to 256-bit.
Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: move Q6K_CONV macro outside #ifdef chain; document make static SSE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it
invisible to the AVX and SSE2 branches that use it. Move it to a
dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before
the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from
-march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…document Q6K_CONV idiom

- Add -mfma to make avx and make avx2; all AVX/AVX2 CPUs support FMA3
  and GCC will fuse multiply-add pairs in hot dot-product loops for free
- Fix vec_dot_f32_f32 AVX path: add 8-wide cleanup pass between the
  16-wide main loop and scalar tail so hidden sizes that are multiples
  of 8 but not 16 (2048, 4096, ...) don't leave 8 elements to scalar
- Add comment to Q6K_CONV explaining the non-obvious unpacklo/srai
  sign-extension idiom (byte→int16 widening without SSE4.1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements speculative decoding to accelerate inference when a smaller
draft model shares the same vocabulary as the target model.

- New CLI options: --draft <model.gguf> and -d <K> (default 4 tokens/step)
- Draft model runs K greedy (argmax) steps; target verifies K+1 positions
- Accept/reject via min(1, p_target(x)) — one-hot draft q simplification
- Rejection resamples from corrected distribution max(0, p-q), renormalized
- All-accepted path samples a bonus token from target_probs[K]
- Stale KV entries beyond accepted prefix are safely overwritten next round
- Graceful fallback to standard AR if vocab mismatch or grammar mode active
- Acceptance rate reported in stderr stats

Add sampler_rand() and sampler_sample_probs() helpers used by the
speculative accept/reject logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant