feat: speculative decoding with draft model #30
Open

cstroie wants to merge 15 commits into RightNow-AI:main from
Conversation
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
docs: update README for AVX support, build targets, and --mem option

- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: add SSE2/SSE3/AVX SIMD tiers for x86 inference

Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:

- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add, vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit (no AVX2 required); only float accumulators widen to 256-bit. Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
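As an illustration of the 8-wide AVX pattern these ops share (wide multiply-accumulate, horizontal sum, scalar tail), here is a minimal sketch; the names hsum256_ps and vec_dot_f32_avx are illustrative stand-ins, not identifiers from the diff.

```c
#include <immintrin.h>

/* Horizontal sum of a 256-bit float vector (the role an hsum helper plays). */
static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);               /* 8 -> 4 lanes */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));       /* 4 -> 2 lanes */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));   /* 2 -> 1 lane  */
    return _mm_cvtss_f32(s);
}

/* 8-wide AVX dot product with a scalar tail for n not divisible by 8. */
float vec_dot_f32_avx(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    float sum = hsum256_ps(acc);
    for (; i < n; i++)   /* scalar fallback preserved for the remainder */
        sum += a[i] * b[i];
    return sum;
}
```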
Merge simd branch cleanups into avx2:

- quant.h: unify detection into single hierarchy block (AVX2→AVX→SSE3→SSE2), drop redundant old guards; add hsum_avx under #ifdef PICOLM_AVX
- quant.c: trim AVX2/AVX block comments to single lines per style guide
- picolm.c: keep SIMD-tier startup print, drop removed --mem print
- model.c: remove stray blank line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
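A plausible shape for that unified detection block, assuming the tiers are selected exclusively from the compiler's own feature macros (set by -mavx2, -mavx, -msse3, -msse2 or -march=native); the real quant.h may nest these differently.

```c
/* Single detection hierarchy: highest available x86 tier wins. */
#if defined(__AVX2__)
#  define PICOLM_AVX2
#elif defined(__AVX__)
#  define PICOLM_AVX
#elif defined(__SSE3__)
#  define PICOLM_SSE3
#elif defined(__SSE2__)
#  define PICOLM_SSE2
#endif
```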
fix: move Q6K_CONV macro outside #ifdef chain; document make static SSE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it invisible to the AVX and SSE2 branches that use it. Move it to a dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from -march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
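Structurally, the fix looks like this sketch (macro body elided; the widening idiom itself is sketched after a later commit). The branch comments are assumptions about the dispatch chain, not quoted code.

```c
/* Shared helper sits in front of the dispatch chain, guarded so every
 * branch that expands it can see it. */
#if defined(PICOLM_AVX) || defined(PICOLM_SSE2)
#  define Q6K_CONV(v) /* byte -> int16 sign extension, body elided */
#endif

#ifdef PICOLM_AVX2
  /* 256-bit integer path; defining Q6K_CONV only here hid it from below */
#elif defined(PICOLM_AVX)
  /* 128-bit nibble extraction, 256-bit float accumulators: uses Q6K_CONV */
#elif defined(PICOLM_SSE2)
  /* 4-wide baseline: uses Q6K_CONV */
#endif
```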
* feat: add --mem parameter to load model into RAM instead of mmap
* fix: fix model_load call and signed comparison warning
* feat: add --mem option for model loading mode selection
* feat: add fast mode with optimized parameters for better performance
* feat: enhance performance with SSE2 optimizations and update build configurations
* feat: add AVX support for optimized vector operations and enhance performance
* docs: update README for AVX support, build targets, and --mem option (full message above)
* feat: enhance SIMD support with AVX2 optimizations and update build configurations
* feat: add SSE2/SSE3/AVX SIMD tiers for x86 inference (full message above)
* fix: move Q6K_CONV macro outside #ifdef chain; document make static SSE2 intent (full message above)

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
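As a rough illustration of what the --mem toggle trades off (one upfront read into RAM versus demand paging through mmap), here is a hedged sketch; load_model_bytes and its signature are hypothetical stand-ins, not picolm's actual model_load.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical loader: use_mem != 0 copies the model into RAM (--mem);
 * use_mem == 0 maps it and lets the OS page it in on demand. */
void *load_model_bytes(const char *path, size_t *size_out, int use_mem) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    *size_out = (size_t)st.st_size;

    void *data;
    if (use_mem) {
        data = malloc(*size_out);
        /* cast avoids the signed/unsigned warning the fix commit mentions;
         * a production loader would also loop on short reads */
        if (data && read(fd, data, *size_out) != (ssize_t)*size_out) {
            free(data);
            data = NULL;
        }
    } else {
        data = mmap(NULL, *size_out, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) data = NULL;
    }
    close(fd);  /* both a private mapping and a RAM copy survive the close */
    return data;
}
```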
…document Q6K_CONV idiom

- Add -mfma to make avx and make avx2; all AVX/AVX2 CPUs support FMA3 and GCC will fuse multiply-add pairs in hot dot-product loops for free
- Fix vec_dot_f32_f32 AVX path: add 8-wide cleanup pass between the 16-wide main loop and scalar tail so hidden sizes that are multiples of 8 but not 16 (2048, 4096, ...) don't leave 8 elements to scalar
- Add comment to Q6K_CONV explaining the non-obvious unpacklo/srai sign-extension idiom (byte→int16 widening without SSE4.1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
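The unpacklo/srai idiom in isolation, as a sketch; widen_i8_to_i16 is an illustrative name, not the macro's real definition.

```c
#include <emmintrin.h>  /* SSE2 only: _mm_cvtepi8_epi16 would need SSE4.1 */

/* Sign-extend the low 8 signed bytes of v into eight int16 lanes.
 * Interleaving zeros below each byte parks the byte in the high half of
 * its 16-bit lane; an arithmetic shift right by 8 then drags the sign
 * bit down, yielding a correct int16 without SSE4.1. */
static inline __m128i widen_i8_to_i16(__m128i v) {
    return _mm_srai_epi16(_mm_unpacklo_epi8(_mm_setzero_si128(), v), 8);
}
```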
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements speculative decoding to accelerate inference when a smaller draft model shares the same vocabulary as the target model.

- New CLI options: --draft <model.gguf> and -d <K> (default 4 tokens/step)
- Draft model runs K greedy (argmax) steps; target verifies K+1 positions
- Accept/reject via min(1, p_target(x)) — one-hot draft q simplification
- Rejection resamples from corrected distribution max(0, p-q), renormalized
- All-accepted path samples a bonus token from target_probs[K]
- Stale KV entries beyond accepted prefix are safely overwritten next round
- Graceful fallback to standard AR if vocab mismatch or grammar mode active
- Acceptance rate reported in stderr stats

Add sampler_rand() and sampler_sample_probs() helpers used by the speculative accept/reject logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
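Since the diff itself isn't visible here, a minimal sketch of one speculative round as the bullets describe it. Everything named in it (speculative_step, the helper signatures, the n_vocab plumbing) is an assumption for illustration, not the PR's actual code.

```c
float sampler_rand(void);                          /* uniform in [0,1)   */
int   sampler_sample_probs(const float *p, int n); /* sample from probs  */

/* One speculative round. target_probs[i] points at the target's softmaxed
 * distribution over n_vocab tokens at position i; K+1 rows come out of one
 * verification pass. Writes accepted/resampled tokens to out_tok and
 * returns how many were produced. */
int speculative_step(float **target_probs, int n_vocab,
                     const int *draft_tok, int K, int *out_tok) {
    int n = 0;
    for (int i = 0; i < K; i++) {
        int x = draft_tok[i];
        /* Draft is greedy, so its q is one-hot: accept with probability
         * min(1, p_target(x) / q(x)) = min(1, p_target(x)). */
        if (sampler_rand() < target_probs[i][x]) {
            out_tok[n++] = x;
            continue;
        }
        /* Reject: resample from max(0, p - q), renormalized. With one-hot
         * q this zeroes the rejected token and rescales everything else. */
        target_probs[i][x] = 0.0f;
        float z = 0.0f;
        for (int t = 0; t < n_vocab; t++) z += target_probs[i][t];
        for (int t = 0; t < n_vocab; t++) target_probs[i][t] /= z;
        out_tok[n++] = sampler_sample_probs(target_probs[i], n_vocab);
        return n;  /* stop at the first rejection */
    }
    /* All K accepted: bonus token from the target's extra position (the PR
     * routes this through the full top-p/temperature sampler). */
    out_tok[n++] = sampler_sample_probs(target_probs[K], n_vocab);
    return n;
}
```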
Summary
Adds speculative decoding: a smaller draft model proposes tokens that the target model then verifies. New CLI options: --draft <model.gguf> (draft model path) and -d <K> (draft tokens per step, default 4).

How it works
- The draft model runs K greedy (argmax) steps; the target then verifies K+1 positions
- Accept/reject via min(1, p_target(x)) — one-hot draft simplification since draft is greedy
- On rejection, resample from the corrected distribution max(0, p_target - p_draft), renormalized
- If all K tokens are accepted, sample a bonus token from target_probs[K] via the full sampler (top-p + temperature)
- Stale KV entries beyond the accepted prefix are harmlessly overwritten in the next round — no explicit rollback needed
Two new helpers added to sampler: sampler_rand() and sampler_sample_probs() (samples from a pre-softmaxed probability array).
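The helper bodies aren't shown in this scrape, but a pre-softmaxed probability sampler is conventionally an inverse-CDF walk; a sketch under that assumption, with guessed signatures matching the round sketch above.

```c
#include <stdlib.h>

/* Uniform float in [0, 1); one plausible sampler_rand implementation. */
float sampler_rand(void) {
    return (float)rand() / ((float)RAND_MAX + 1.0f);
}

/* Sample an index from a pre-softmaxed probability array via its CDF. */
int sampler_sample_probs(const float *probs, int n) {
    float u = sampler_rand();
    float cdf = 0.0f;
    for (int i = 0; i < n; i++) {
        cdf += probs[i];
        if (u < cdf) return i;
    }
    return n - 1;  /* guard: rounding can leave u past the last bucket */
}
```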
Test output
Tested on x86-64 (AVX2) with TinyLlama-1.1B as both target and draft (same-model test to verify correctness):
Output is identical to standard AR at t=0. Acceptance rate ~53% is expected for a same-model greedy draft. Real speedup requires a faster draft model (e.g. a 1B draft for a 7B+ target).

Hardware tested
🤖 Generated with Claude Code