1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -50,3 +50,4 @@ picoclaw/

# Internal dev docs
picoclaw/PICOLM_INTEGRATION.md
.aider*
14 changes: 11 additions & 3 deletions CONTRIBUTING.md
@@ -12,7 +12,7 @@ Thanks for your interest in PicoLM! This project is intentionally small (~2,500
## What We Need Help With

### High Impact
- **SIMD kernels** — AVX2/AVX-512 for x86, optimized NEON for ARM
- **SIMD kernels** — AVX-512 for x86 server CPUs, optimized NEON for ARM
- **New quantization formats** — Q5_K fused dot product, IQ formats
- **New model architectures** — Mistral, Phi, Gemma (LLaMA-compatible)
- **Platform testing** — RISC-V boards, Pi Zero, exotic ARM SBCs
@@ -106,12 +106,20 @@ If you're adding SIMD code:
// ARM NEON path (Pi 3/4/5)
float32x4_t v = vld1q_f32(ptr);
...
#elif defined(PICOLM_AVX2)
// x86 AVX2 path (Haswell+, Excavator+ — 256-bit integer + float)
__m256i v = _mm256_loadu_si256((const __m256i *)ptr);
...
#elif defined(PICOLM_AVX)
// x86 AVX path (Sandy Bridge+, Bulldozer+ — 8-wide float)
__m256 v = _mm256_loadu_ps(ptr);
...
#elif defined(PICOLM_SSE2)
// x86 SSE2 path (Intel/AMD)
// x86 SSE2 path (any x86-64 — 4-wide float)
__m128 v = _mm_loadu_ps(ptr);
...
#endif
// Scalar fallback (always works)
// Scalar fallback (always works — also reachable via `make scalar`)
for (int i = 0; i < n; i++) { ... }
```
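For contributors new to this dispatch pattern, a minimal sketch of a complete kernel built around it may help: a dot product with a 4-wide SIMD body and a scalar tail. This is an illustration, not PicoLM's actual API; the function name is invented, and it tests the compiler's own `__SSE2__` macro where the project would test `PICOLM_SSE2`.

```c
#if defined(__SSE2__)            /* the project would test PICOLM_SSE2 */
#include <emmintrin.h>
#endif

/* Illustrative kernel: dot product with a 4-wide SIMD body and a scalar
   tail. The tail loop doubles as the portable fallback when no SIMD
   macro is defined, so every path produces the same result. */
static float dot_f32(const float *a, const float *b, int n) {
    float sum = 0.0f;
    int i = 0;
#if defined(__SSE2__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);   /* horizontal sum of the 4 lanes */
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; i++)           /* scalar tail / scalar fallback */
        sum += a[i] * b[i];
    return sum;
}
```

Whatever the kernel, the SIMD body must be bit-for-bit (or within float tolerance) equivalent to the scalar loop, since the scalar path is the reference implementation.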

30 changes: 23 additions & 7 deletions README.md
@@ -183,7 +183,7 @@ The model file (638MB) stays on disk. PicoLM **memory-maps** it and streams one
| **FP16 KV Cache** | Halves KV cache memory (44MB vs 88MB for 2048 context) |
| **Flash Attention** | Online softmax — no O(seq_len) attention buffer needed |
| **Pre-computed RoPE** | cos/sin lookup tables eliminate transcendentals from hot loop |
| **SIMD Acceleration** | ARM NEON (Pi 3/4/5) and x86 SSE2 (Intel/AMD) auto-detected |
| **SIMD Acceleration** | ARM NEON (Pi 3/4/5), x86 SSE2/SSE3/AVX/AVX2 — auto-detected at compile time |
| **Fused Dot Products** | Dequantize + dot-product in one pass — no intermediate buffer |
| **Multi-threaded matmul** | Parallel matrix-vector multiply across CPU cores |
| **Grammar-Constrained JSON** | `--json` flag forces valid JSON output (for tool calling) |
@@ -234,14 +234,23 @@ make model

```cmd
cd picolm
build.bat
build.bat :: SSE2 baseline (any x86-64)
build.bat avx2 :: AVX2 (Haswell+ / Excavator+, fastest)
build.bat avx :: AVX (Sandy Bridge+ / Bulldozer+)
build.bat scalar :: no SIMD (portable fallback)
picolm.exe model.gguf -p "Hello world" -n 50
```

### Platform-specific builds

```bash
make native # x86/ARM auto-detect (recommended for local machine)
make x86 # x86-64 safe default (SSE2 only — runs on any x86-64)
make sse2 # x86-64 SSE2 only (same as x86)
make sse3 # x86-64 SSE2+SSE3+SSSE3 (AMD Phenom/Athlon, older Intel)
make avx # x86-64 AVX (Sandy Bridge+, Bulldozer+ — wider SIMD, faster)
make avx2 # x86-64 AVX2 (Haswell+, Excavator+ — widest SIMD, fastest)
make scalar # No SIMD (portable scalar fallback, any architecture)
make pi # Raspberry Pi 3/4/5 (64-bit ARM + NEON SIMD)
make pi-arm32 # Pi Zero / Pi 1 (32-bit ARM)
make cross-pi # Cross-compile for Pi from x86 (static binary)
@@ -348,7 +357,7 @@ Measured on TinyLlama 1.1B Q4_K_M (638 MB model):
+ FP16 KV cache █████████████████░░░ (halve memory bandwidth)
+ Pre-computed RoPE ██████████████████░░ (no sin/cos in hot loop)
+ Flash attention ██████████████████░░ (no O(n) attention alloc)
+ NEON/SSE2 SIMD ███████████████████░ (4-wide vector ops)
+ NEON/SSE2/AVX SIMD ███████████████████░ (4-wide to 8-wide vector ops)
+ KV cache persistence ████████████████████ (skip prefill entirely)
```

@@ -477,9 +486,14 @@ PicoLM implements 9 optimizations that brought generation speed from **1.6 tok/s

4-wide float vector operations for all hot paths. Example: dequantizing Q4_K nibbles with `vmovl_u8` → `vmovl_u16` → `vcvtq_f32_u32`, and RoPE with interleaved `vld2q_f32` / `vst2q_f32`.
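As a scalar picture of what that widening chain computes per byte, two 4-bit weights are extracted, converted to float, and scaled. This sketch uses a deliberately simplified layout (one scale/min per run of bytes), not the real Q4_K format with its 256-weight super-blocks and 6-bit sub-scales; the function name and the `scale`/`min` convention are assumptions for illustration.

```c
#include <stdint.h>

/* Scalar equivalent of the NEON widening chain: each byte packs two
   4-bit quantized weights; vmovl_u8 -> vmovl_u16 -> vcvtq_f32_u32
   performs this u4 -> u32 -> float widening 4 lanes at a time.
   Layout simplified vs. the real Q4_K super-block format. */
static void dequant_nibbles(const uint8_t *q, int n_bytes,
                            float scale, float min, float *out) {
    for (int i = 0; i < n_bytes; i++) {
        out[2 * i]     = scale * (float)(q[i] & 0x0F) - min; /* low nibble  */
        out[2 * i + 1] = scale * (float)(q[i] >> 4)   - min; /* high nibble */
    }
}
```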

### 2. x86 SSE2 SIMD
### 2. x86 SIMD (SSE2 / SSE3 / AVX / AVX2)

Auto-detected on Intel/AMD. 4-wide `__m128` operations for dot products, RMSNorm, and vector operations.
Four compile-time tiers for Intel/AMD:

- **SSE2** (`make sse2` or `make x86`): 4-wide `__m128` operations for dot products, RMSNorm, softmax, RoPE, and element-wise ops. Safe baseline for all x86-64 CPUs.
- **SSE3** (`make sse3`): adds `_mm_addsub_ps` for a cleaner RoPE rotation kernel (no sign-mask workaround needed).
- **AVX** (`make avx`): 8-wide `__m256` float accumulators for all ops. Q4_K and Q6_K dot products widen the float accumulation stage while keeping integer nibble extraction at 128-bit (no AVX2 required). RoPE processes 4 complex pairs per iteration with `_mm256_addsub_ps`.
- **AVX2** (`make avx2`): adds 256-bit integer operations. Q4_0 nibble extraction uses `_mm256_cvtepu8_epi32` (8 nibbles → 8 int32 in 2 ops vs. 4-step unpack chain). Q6_K weight extraction uses `_mm256_cvtepi8_epi32` (8 int8 → 8 int32 in 2 ops vs. 4-instruction macro chain). Targets Haswell+ Intel and Excavator+ AMD.
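To make the "unpack chain" concrete, here is a hedged SSE2 sketch that widens four unsigned bytes to four int32 lanes with two zero-interleave steps, the per-lane work that AVX2's `_mm256_cvtepu8_epi32` collapses into one instruction (and doubles to 8 lanes). The helper name is illustrative, and the guard falls back to a scalar loop off x86.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Widen 4 unsigned bytes to 4 int32 values. On SSE2 this takes a
   zero-interleave chain (u8 -> u16 -> u32); on AVX2 a single
   _mm256_cvtepu8_epi32 zero-extends 8 bytes to 8 int32 lanes. */
static void widen_u8_to_i32(const uint8_t *src, int32_t *dst) {
#if defined(__SSE2__)
    int32_t raw;
    memcpy(&raw, src, 4);                         /* load 4 bytes safely */
    __m128i zero = _mm_setzero_si128();
    __m128i v8   = _mm_cvtsi32_si128(raw);
    __m128i v16  = _mm_unpacklo_epi8(v8, zero);   /* u8  -> u16 */
    __m128i v32  = _mm_unpacklo_epi16(v16, zero); /* u16 -> u32 */
    _mm_storeu_si128((__m128i *)dst, v32);
#else
    for (int i = 0; i < 4; i++)                   /* scalar fallback */
        dst[i] = src[i];
#endif
}
```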

### 3. FP16 KV Cache

@@ -636,7 +650,7 @@ A: llama.cpp is excellent but requires ~200MB+ for the runtime on small models,
A: TinyLlama 1.1B is a small model — it handles simple tasks (Q&A, summarization, basic reasoning, JSON generation) well. It won't match GPT-4, but it runs on a $10 board with no internet. For structured output, the `--json` grammar mode guarantees valid JSON regardless of model quality.

**Q: What about GPU acceleration?**
A: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86/ARM CPUs, SIMD (NEON/SSE2) provides meaningful speedup.
A: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86/ARM CPUs, SIMD (NEON/SSE2/AVX) provides meaningful speedup.

**Q: Can I use a different model?**
A: Any LLaMA-architecture GGUF model works. Download from [HuggingFace](https://huggingface.co/models?search=gguf) and point PicoLM at it. Recommended quantizations: Q4_K_M (best quality/size balance) or Q2_K (smallest, lower quality).
@@ -645,7 +659,9 @@

## Roadmap

- [ ] AVX2/AVX-512 kernels for x86 (2-4x generation speed on modern CPUs)
- [x] AVX kernels for x86 (`make avx` — 8-wide float ops, ~2x vs SSE2)
- [x] AVX2 kernels for x86 (`make avx2` — 256-bit integer ops for Q4_0 and Q6_K quantized paths)
- [ ] AVX-512 kernels for x86 (512-bit ops for server CPUs)
- [ ] Speculative decoding with a draft model
- [ ] Context sliding window (infinite generation beyond max_seq_len)
- [ ] Weight pruning for further memory reduction
30 changes: 27 additions & 3 deletions picolm/Makefile
@@ -1,5 +1,5 @@
CC = gcc
CFLAGS = -O2 -std=c11 -D_GNU_SOURCE -Wall -Wextra -Wpedantic
CFLAGS = -O3 -std=c11 -D_GNU_SOURCE -Wall -Wextra -Wpedantic
LDFLAGS = -lm -lpthread
SRCS = picolm.c model.c tensor.c quant.c tokenizer.c sampler.c grammar.c
TARGET = picolm
@@ -11,11 +11,35 @@ MODEL_DIR ?= /opt/picolm/models
native: CFLAGS += -march=native
native: $(TARGET)

# --- x86-64 default (SSE2 only, safe for all x86-64) ---
x86: sse2

# --- No SIMD (scalar fallback, portable to any architecture) ---
scalar: CFLAGS += -mno-sse2 -mno-avx
scalar: $(TARGET)

# --- x86-64 with SSE2 only ---
sse2: CFLAGS += -msse2
sse2: $(TARGET)

# --- x86-64 with SSE2+SSE3+SSSE3 (covers AMD Phenom/Athlon and similar without AVX) ---
sse3: CFLAGS += -msse2 -msse3 -mssse3 -mpopcnt
sse3: $(TARGET)

# --- x86-64 with AVX (Sandy Bridge and newer Intel; Bulldozer and newer AMD) ---
avx: CFLAGS += -mavx -mfma -mpopcnt
avx: $(TARGET)

# --- x86-64 with AVX2 (Haswell and newer Intel; Excavator and newer AMD) ---
avx2: CFLAGS += -mavx2 -mfma -mpopcnt
avx2: $(TARGET)

$(TARGET): $(SRCS)
$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)

# --- Static build for single-binary deployment ---
static: CFLAGS += -march=native
# Uses SSE2 (not -march=native) so the binary runs on any x86-64, not just the build machine.
static: CFLAGS += -msse2
static: LDFLAGS += -static
static: $(TARGET)

@@ -70,4 +94,4 @@ model:
clean:
rm -f $(TARGET) $(TARGET).exe *.obj *.o

.PHONY: native static pi pi-arm32 cross-pi riscv cross-riscv debug install model clean
.PHONY: native x86 scalar sse2 sse3 avx avx2 static pi pi-arm32 cross-pi riscv cross-riscv debug install model clean
23 changes: 21 additions & 2 deletions picolm/build.bat
@@ -1,7 +1,26 @@
@echo off
REM PicoLM Windows build script (MSVC)
REM
REM SIMD targets:
REM build.bat -- SSE2 baseline (safe for any x86-64)
REM build.bat avx2 -- AVX2 (Haswell+ / Excavator+, fastest)
REM build.bat avx -- AVX (Sandy Bridge+ / Bulldozer+)
REM build.bat scalar -- no SIMD (portable scalar fallback)

call "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1
echo Compiling...
cl /O2 /W3 /Fe:picolm.exe picolm.c model.c tensor.c quant.c tokenizer.c sampler.c grammar.c

set SIMD_FLAG=
if /I "%1"=="avx2" set SIMD_FLAG=/arch:AVX2
if /I "%1"=="avx" set SIMD_FLAG=/arch:AVX
if /I "%1"=="scalar" set SIMD_FLAG=/d2archSSE42-

if "%SIMD_FLAG%"=="" (
echo Building: SSE2 baseline
) else (
echo Building: %1 ^(%SIMD_FLAG%^)
)

cl /O2 /W3 %SIMD_FLAG% /Fe:picolm.exe picolm.c model.c tensor.c quant.c tokenizer.c sampler.c grammar.c
if %ERRORLEVEL% neq 0 (
echo BUILD FAILED
) else (
1 change: 0 additions & 1 deletion picolm/model.c
@@ -406,7 +406,6 @@ static int parse_gguf(model_t *m, int max_seq_len) {
fprintf(stderr, " n_layers=%d, vocab_size=%d, max_seq=%d\n",
cfg->n_layers, cfg->vocab_size, cfg->max_seq_len);
fprintf(stderr, " head_dim=%d, rope_base=%.1f\n", cfg->head_dim, cfg->rope_freq_base);

free(tinfos);
return 0;
}