feat: add SSE2/SSE3/AVX/AVX2 SIMD tiers for x86 inference #29
Open
cstroie wants to merge 12 commits into RightNow-AI:main from
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:
- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add, vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit (no AVX2 required); only float accumulators widen to 256-bit. Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
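The difference between the SSE3 and SSE2 RoPE paths described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `rope_rotate_pairs`, the interleaved `[a0 b0 a1 b1 ...]` layout, and the per-pair `cos_t`/`sin_t` tables are assumptions made for the example. With SSE3, `_mm_addsub_ps` subtracts in even lanes and adds in odd lanes in one instruction; with only SSE2, the same effect needs a sign mask XORed into the even lanes before a plain add.

```c
#include <assert.h>

#if defined(__SSE3__)
#  include <pmmintrin.h>   /* SSE3: _mm_addsub_ps */
#elif defined(__SSE2__)
#  include <emmintrin.h>   /* SSE2 baseline */
#endif

/* Hypothetical helper: rotates n_pairs interleaved (a, b) pairs in place.
 * out_a = a*cos - b*sin, out_b = a*sin + b*cos. Layout is an assumption. */
void rope_rotate_pairs(float *x, const float *cos_t, const float *sin_t, int n_pairs)
{
    assert(n_pairs >= 0);
    int p = 0;
#if defined(__SSE3__) || defined(__SSE2__)
    for (; p + 2 <= n_pairs; p += 2) {
        __m128 v  = _mm_loadu_ps(x + 2 * p);                        /* [a0 b0 a1 b1] */
        __m128 sw = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));  /* [b0 a0 b1 a1] */
        __m128 c  = _mm_setr_ps(cos_t[p], cos_t[p], cos_t[p + 1], cos_t[p + 1]);
        __m128 s  = _mm_setr_ps(sin_t[p], sin_t[p], sin_t[p + 1], sin_t[p + 1]);
        __m128 vc = _mm_mul_ps(v, c);    /* [a0c b0c a1c b1c] */
        __m128 ss = _mm_mul_ps(sw, s);   /* [b0s a0s b1s a1s] */
#  if defined(__SSE3__)
        /* Even lanes: vc - ss, odd lanes: vc + ss. */
        __m128 r = _mm_addsub_ps(vc, ss);
#  else
        /* SSE2 workaround: flip the sign of the even lanes, then add. */
        const __m128 neg_even = _mm_setr_ps(-0.0f, 0.0f, -0.0f, 0.0f);
        __m128 r = _mm_add_ps(vc, _mm_xor_ps(ss, neg_even));
#  endif
        _mm_storeu_ps(x + 2 * p, r);
    }
#endif
    /* Scalar fallback and tail. */
    for (; p < n_pairs; p++) {
        float a = x[2 * p], b = x[2 * p + 1];
        x[2 * p]     = a * cos_t[p] - b * sin_t[p];
        x[2 * p + 1] = a * sin_t[p] + b * cos_t[p];
    }
}
```

Built without `-msse3`, the same source falls back to the sign-mask variant, and on non-x86 targets to the scalar loop, which mirrors the "scalar fallback preserved" intent stated in the commit message.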
Merge simd branch cleanups into avx2:
- quant.h: unify detection into single hierarchy block (AVX2→AVX→SSE3→SSE2), drop redundant old guards; add hsum_avx under #ifdef PICOLM_AVX
- quant.c: trim AVX2/AVX block comments to single lines per style guide
- picolm.c: keep SIMD-tier startup print, drop removed --mem print
- model.c: remove stray blank line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
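A single detection hierarchy of the kind this commit describes might look like the sketch below. It is an assumption-laden illustration rather than the contents of `quant.h`: here the tier macros are derived from the compiler's predefined `__AVX2__`/`__AVX__`/`__SSE3__`/`__SSE2__` macros, `PICOLM_SSE3` is a guessed name (the PR text only names `PICOLM_SSE2`, `PICOLM_AVX`, and `PICOLM_AVX2`), and `simd_tier_name` and the body of `hsum_avx` are hypothetical.

```c
/* Each tier implies every tier below it, so later code can test one macro. */
#if defined(__AVX2__)
#  define PICOLM_AVX2 1
#endif
#if defined(PICOLM_AVX2) || defined(__AVX__)
#  define PICOLM_AVX 1
#endif
#if defined(PICOLM_AVX) || defined(__SSE3__)
#  define PICOLM_SSE3 1   /* assumed name, not confirmed by the PR text */
#endif
#if defined(PICOLM_SSE3) || defined(__SSE2__)
#  define PICOLM_SSE2 1
#endif

#ifdef PICOLM_AVX
#  include <immintrin.h>
/* Horizontal sum of 8 floats; one plausible shape for an hsum_avx helper. */
static inline float hsum_avx(__m256 v)
{
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(v), _mm256_extractf128_ps(v, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));       /* lanes 0,1 += lanes 2,3 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));   /* lane 0 += lane 1 */
    return _mm_cvtss_f32(s);
}
#endif

/* Hypothetical helper for a startup print: highest tier compiled in. */
const char *simd_tier_name(void)
{
#if defined(PICOLM_AVX2)
    return "AVX2";
#elif defined(PICOLM_AVX)
    return "AVX";
#elif defined(PICOLM_SSE3)
    return "SSE3";
#elif defined(PICOLM_SSE2)
    return "SSE2";
#else
    return "scalar";
#endif
}
```

Testing from the highest tier downward in every `#if`/`#elif` chain keeps the dispatch order identical everywhere, which is the main benefit of collapsing scattered guards into one block.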
…SE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it invisible to the AVX and SSE2 branches that use it. Move it to a dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from -march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
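The placement fix can be illustrated with a small stand-in. The sketch below does not reproduce `Q6K_CONV` (its body is not shown in this PR text); `DEMO_AVX`, `DEMO_SSE2`, `DEMO_WIDEN_LO_I8`, and `i8x8_to_f32` are hypothetical names. The point is only the structure: a helper macro shared by several branches gets its own guard ahead of the `#if`/`#elif` dispatch chain, so no branch depends on another branch having been compiled.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__AVX__)
#  include <immintrin.h>
#  define DEMO_AVX 1
#endif
#if defined(__SSE2__)
#  include <emmintrin.h>
#  define DEMO_SSE2 1
#endif

/* Shared helper in its own guard, before the dispatch chain.
 * Sign-extends the low 8 bytes of v to 8 x int16 using SSE2 only:
 * duplicating each byte into both halves of a 16-bit lane and shifting
 * right arithmetically by 8 leaves the sign-extended byte. */
#if defined(DEMO_AVX) || defined(DEMO_SSE2)
#  define DEMO_WIDEN_LO_I8(v) _mm_srai_epi16(_mm_unpacklo_epi8((v), (v)), 8)
#endif

/* Converts 8 signed bytes to 8 floats. */
void i8x8_to_f32(const int8_t *src, float *dst)
{
    assert(src && dst);
#if defined(DEMO_AVX)
    /* Integer work stays 128-bit; only the float result is 256-bit. */
    __m128i b  = _mm_loadl_epi64((const __m128i *)src);
    __m128i w  = DEMO_WIDEN_LO_I8(b);
    __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(w, w), 16);
    __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(w, w), 16);
    __m256i all = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    _mm256_storeu_ps(dst, _mm256_cvtepi32_ps(all));
#elif defined(DEMO_SSE2)
    __m128i b  = _mm_loadl_epi64((const __m128i *)src);
    __m128i w  = DEMO_WIDEN_LO_I8(b);
    __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(w, w), 16);
    __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(w, w), 16);
    _mm_storeu_ps(dst,     _mm_cvtepi32_ps(lo));
    _mm_storeu_ps(dst + 4, _mm_cvtepi32_ps(hi));
#else
    for (int i = 0; i < 8; i++) dst[i] = (float)src[i];
#endif
}
```

Had the macro been defined inside a hypothetical AVX2-only branch, the AVX and SSE2 branches above would fail to compile whenever AVX2 was not enabled, which is the class of bug the commit describes.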
…document Q6K_CONV idiom

- Add -mfma to make avx and make avx2; all AVX/AVX2 CPUs support FMA3 and GCC will fuse multiply-add pairs in hot dot-product loops for free
- Fix vec_dot_f32_f32 AVX path: add 8-wide cleanup pass between the 16-wide main loop and scalar tail so hidden sizes that are multiples of 8 but not 16 (2048, 4096, ...) don't leave 8 elements to scalar
- Add comment to Q6K_CONV explaining the non-obvious unpacklo/srai sign-extension idiom (byte→int16 widening without SSE4.1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
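The three-phase loop structure from the `vec_dot_f32_f32` fix can be sketched as below. This is an illustration under assumptions, not the PR's function: `dot_f32` is a hypothetical name, the AVX phases compile only when the compiler defines `__AVX__`, and the multiply and add are written separately so that a compiler given `-mfma` is free to contract them.

```c
#include <assert.h>

#if defined(__AVX__)
#  include <immintrin.h>
#endif

/* Hypothetical f32 dot product: 16-wide main loop, one 8-wide cleanup
 * pass, then a scalar tail for the remaining 0..7 elements. */
float dot_f32(const float *x, const float *y, int n)
{
    assert(n >= 0);
    int i = 0;
    float sum = 0.0f;
#if defined(__AVX__)
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    /* Phase 1: two independent 8-wide accumulators per iteration. */
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(_mm256_loadu_ps(x + i),
                                                 _mm256_loadu_ps(y + i)));
        acc1 = _mm256_add_ps(acc1, _mm256_mul_ps(_mm256_loadu_ps(x + i + 8),
                                                 _mm256_loadu_ps(y + i + 8)));
    }
    /* Phase 2: at most one leftover block of 8 stays vectorized. */
    if (i + 8 <= n) {
        acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(_mm256_loadu_ps(x + i),
                                                 _mm256_loadu_ps(y + i)));
        i += 8;
    }
    /* Reduce both accumulators to one float. */
    acc0 = _mm256_add_ps(acc0, acc1);
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc0),
                          _mm256_extractf128_ps(acc0, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    sum = _mm_cvtss_f32(s);
#endif
    /* Phase 3: scalar tail (the whole loop when AVX is not compiled in). */
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

For a length such as 24, phase 1 handles 16 elements and phase 2 the remaining 8, so the scalar tail does no work; without phase 2 those 8 elements would be processed one at a time.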
Summary
- SIMD tiers in `quant.c`/`quant.h` and `picolm.c` via `PICOLM_SSE2`, `PICOLM_AVX`, `PICOLM_AVX2` macros
- `make sse2`, `make avx`, `make avx2` build targets alongside existing `make native`/`make scalar`
- `-mfma` added to `avx`/`avx2` targets; GCC fuses multiply-add pairs in dot-product loops for free
- `Q6K_CONV` macro placement fixed (was inside AVX2 branch, invisible to AVX/SSE2 consumers)
- `CONTRIBUTING.md` and `README.md` updated with SIMD tier hierarchy and new build targets

Performance (TinyLlama Q4_K_M, 22 tokens, Intel x86-64)
AVX2 is ~3× faster than scalar (12.0 vs 3.8 tok/s prefill, 11.6 vs 3.8 tok/s generation).
Test plan
- `make clean && make native` builds cleanly
- `make scalar`, `make sse2`, `make avx`, `make avx2` all build without warnings
- `./picolm model.gguf -p "The capital of France is" -n 20 -t 0` produces correct greedy output across all tiers
- … (`./picolm model.gguf -p "Hello" -n 10 2>&1 | grep Memory`)

🤖 Generated with Claude Code