
feat: add SSE2/SSE3/AVX/AVX2 SIMD tiers for x86 inference#29

Open
cstroie wants to merge 12 commits into RightNow-AI:main from cstroie:x86-simd

Conversation


@cstroie cstroie commented May 7, 2026

Summary

  • Add SSE2, SSE3, AVX, and AVX2 SIMD kernel tiers for x86 inference in quant.c/quant.h
  • Wire up tier detection and dispatch in picolm.c via PICOLM_SSE2, PICOLM_AVX, PICOLM_AVX2 macros
  • Add make sse2, make avx, make avx2 build targets alongside existing make native / make scalar
  • Add -mfma to avx/avx2 targets — GCC fuses multiply-add pairs in dot-product loops for free
  • Fix Q6K_CONV macro placement (was inside AVX2 branch, invisible to AVX/SSE2 consumers)
  • Update CONTRIBUTING.md and README.md with SIMD tier hierarchy and new build targets

Performance (TinyLlama Q4_K_M, 22 tokens, Intel x86-64)

| Build      | Prefill     | Generation  | Total  |
|------------|-------------|-------------|--------|
| Old native | 3.9 tok/s   | 4.0 tok/s   | 7.22 s |
| Scalar     | 3.8 tok/s   | 3.8 tok/s   | 7.60 s |
| SSE2       | 10.1 tok/s  | 9.8 tok/s   | 2.93 s |
| SSE3       | 10.1 tok/s  | 9.8 tok/s   | 2.93 s |
| AVX        | 9.8 tok/s   | 9.7 tok/s   | 2.99 s |
| AVX2       | 12.0 tok/s  | 11.6 tok/s  | 2.47 s |

AVX2 is ~3× faster than scalar (12.0 vs 3.8 tok/s prefill, 11.6 vs 3.8 tok/s generation).

Test plan

  • make clean && make native builds cleanly
  • make scalar, make sse2, make avx, make avx2 all build without warnings
  • ./picolm model.gguf -p "The capital of France is" -n 20 -t 0 produces correct greedy output across all tiers
  • Memory usage unchanged vs baseline (./picolm model.gguf -p "Hello" -n 10 2>&1 | grep Memory)
  • Tested on: Intel x86-64 (Haswell+ for AVX2)

🤖 Generated with Claude Code

cstroie and others added 12 commits April 16, 2026 10:52
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:
- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add,
  vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit
(no AVX2 required); only float accumulators widen to 256-bit.
Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merge simd branch cleanups into avx2:
- quant.h: unify detection into single hierarchy block (AVX2→AVX→SSE3→SSE2),
  drop redundant old guards; add hsum_avx under #ifdef PICOLM_AVX
- quant.c: trim AVX2/AVX block comments to single lines per style guide
- picolm.c: keep SIMD-tier startup print, drop removed --mem print
- model.c: remove stray blank line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it
invisible to the AVX and SSE2 branches that use it. Move it to a
dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before
the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from
-march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…document Q6K_CONV idiom

- Add -mfma to make avx and make avx2; all AVX/AVX2 CPUs support FMA3
  and GCC will fuse multiply-add pairs in hot dot-product loops for free
- Fix vec_dot_f32_f32 AVX path: add an 8-wide cleanup pass between the
  16-wide main loop and the scalar tail, so sizes that are multiples of 8
  but not of 16 don't push 8 leftover elements through the scalar loop
- Add comment to Q6K_CONV explaining the non-obvious unpacklo/srai
  sign-extension idiom (byte→int16 widening without SSE4.1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>