
feat: add SSE2/SSE3/AVX/AVX2 SIMD tiers for x86 inference#29

Open
cstroie wants to merge 12 commits into RightNow-AI:main from cstroie:x86-simd

Conversation


@cstroie cstroie commented May 7, 2026

Summary

  • Add SSE2, SSE3, AVX, and AVX2 SIMD kernel tiers for x86 inference in quant.c/quant.h
  • Wire up tier detection and dispatch in picolm.c via PICOLM_SSE2, PICOLM_AVX, PICOLM_AVX2 macros
  • Add make sse2, make avx, make avx2 build targets alongside existing make native / make scalar
  • Add -mfma to avx/avx2 targets — GCC fuses multiply-add pairs in dot-product loops for free
  • Fix Q6K_CONV macro placement (was inside AVX2 branch, invisible to AVX/SSE2 consumers)
  • Update CONTRIBUTING.md and README.md with SIMD tier hierarchy and new build targets

Performance (TinyLlama Q4_K_M, 22 tokens, Intel x86-64)

| Build      | Prefill     | Generation  | Total  |
|------------|-------------|-------------|--------|
| Old native | 3.9 tok/s   | 4.0 tok/s   | 7.22 s |
| Scalar     | 3.8 tok/s   | 3.8 tok/s   | 7.60 s |
| SSE2       | 10.1 tok/s  | 9.8 tok/s   | 2.93 s |
| SSE3       | 10.1 tok/s  | 9.8 tok/s   | 2.93 s |
| AVX        | 9.8 tok/s   | 9.7 tok/s   | 2.99 s |
| AVX2       | 12.0 tok/s  | 11.6 tok/s  | 2.47 s |

AVX2 is ~3× faster than scalar (12.0 vs 3.8 tok/s prefill, 11.6 vs 3.8 tok/s generation).

Test plan

  • make clean && make native builds cleanly
  • make scalar, make sse2, make avx, make avx2 all build without warnings
  • ./picolm model.gguf -p "The capital of France is" -n 20 -t 0 produces correct greedy output across all tiers
  • Memory usage unchanged vs baseline (./picolm model.gguf -p "Hello" -n 10 2>&1 | grep Memory)
  • Tested on: Intel x86-64 (Haswell+ for AVX2)

🤖 Generated with Claude Code

cstroie and others added 12 commits April 16, 2026 10:52
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:
- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add,
  vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit
(no AVX2 required); only float accumulators widen to 256-bit.
Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merge simd branch cleanups into avx2:
- quant.h: unify detection into single hierarchy block (AVX2→AVX→SSE3→SSE2),
  drop redundant old guards; add hsum_avx under #ifdef PICOLM_AVX
- quant.c: trim AVX2/AVX block comments to single lines per style guide
- picolm.c: keep SIMD-tier startup print, drop removed --mem print
- model.c: remove stray blank line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it
invisible to the AVX and SSE2 branches that use it. Move it to a
dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before
the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from
-march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…document Q6K_CONV idiom

- Add -mfma to make avx and make avx2; all AVX/AVX2 CPUs support FMA3
  and GCC will fuse multiply-add pairs in hot dot-product loops for free
- Fix vec_dot_f32_f32 AVX path: add an 8-wide cleanup pass between the
  16-wide main loop and the scalar tail, so sizes that are multiples of 8
  but not of 16 don't push 8 leftover elements through the scalar loop
- Add comment to Q6K_CONV explaining the non-obvious unpacklo/srai
  sign-extension idiom (byte→int16 widening without SSE4.1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>