1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -50,3 +50,4 @@ picoclaw/

# Internal dev docs
picoclaw/PICOLM_INTEGRATION.md
.aider*
14 changes: 11 additions & 3 deletions CONTRIBUTING.md
@@ -12,7 +12,7 @@ Thanks for your interest in PicoLM! This project is intentionally small (~2,500
## What We Need Help With

### High Impact
- **SIMD kernels** — AVX2/AVX-512 for x86, optimized NEON for ARM
- **SIMD kernels** — AVX-512 for x86 server CPUs, optimized NEON for ARM
- **New quantization formats** — Q5_K fused dot product, IQ formats
- **New model architectures** — Mistral, Phi, Gemma (LLaMA-compatible)
- **Platform testing** — RISC-V boards, Pi Zero, exotic ARM SBCs
@@ -106,12 +106,20 @@ If you're adding SIMD code:
// ARM NEON path (Pi 3/4/5)
float32x4_t v = vld1q_f32(ptr);
...
#elif defined(PICOLM_AVX2)
// x86 AVX2 path (Haswell+, Excavator+ — 256-bit integer + float)
__m256i v = _mm256_loadu_si256((const __m256i *)ptr);
...
#elif defined(PICOLM_AVX)
// x86 AVX path (Sandy Bridge+, Bulldozer+ — 8-wide float)
__m256 v = _mm256_loadu_ps(ptr);
...
#elif defined(PICOLM_SSE2)
// x86 SSE2 path (Intel/AMD)
// x86 SSE2 path (any x86-64 — 4-wide float)
__m128 v = _mm_loadu_ps(ptr);
...
#endif
// Scalar fallback (always works)
// Scalar fallback (always works — also reachable via `make scalar`)
for (int i = 0; i < n; i++) { ... }
```
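For contributors new to this dispatch pattern, a minimal sketch of a complete kernel built around it may help: a dot product with a 4-wide SIMD body and a scalar tail. This is an illustration, not PicoLM's actual API; the function name is invented, and it tests the compiler's own `__SSE2__` macro where the project would test `PICOLM_SSE2`.

```c
#if defined(__SSE2__)            /* the project would test PICOLM_SSE2 */
#include <emmintrin.h>
#endif

/* Illustrative kernel: dot product with a 4-wide SIMD body and a scalar
   tail. The tail loop doubles as the portable fallback when no SIMD
   macro is defined, so every path produces the same result. */
static float dot_f32(const float *a, const float *b, int n) {
    float sum = 0.0f;
    int i = 0;
#if defined(__SSE2__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);   /* horizontal sum of the 4 lanes */
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; i++)           /* scalar tail / scalar fallback */
        sum += a[i] * b[i];
    return sum;
}
```

Whatever the kernel, the SIMD body must be bit-for-bit (or within float tolerance) equivalent to the scalar loop, since the scalar path is the reference implementation.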

30 changes: 23 additions & 7 deletions README.md
@@ -183,7 +183,7 @@ The model file (638MB) stays on disk. PicoLM **memory-maps** it and streams one
| **FP16 KV Cache** | Halves KV cache memory (44MB vs 88MB for 2048 context) |
| **Flash Attention** | Online softmax — no O(seq_len) attention buffer needed |
| **Pre-computed RoPE** | cos/sin lookup tables eliminate transcendentals from hot loop |
| **SIMD Acceleration** | ARM NEON (Pi 3/4/5) and x86 SSE2 (Intel/AMD) auto-detected |
| **SIMD Acceleration** | ARM NEON (Pi 3/4/5), x86 SSE2/SSE3/AVX/AVX2 — auto-detected at compile time |
| **Fused Dot Products** | Dequantize + dot-product in one pass — no intermediate buffer |
| **Multi-threaded matmul** | Parallel matrix-vector multiply across CPU cores |
| **Grammar-Constrained JSON** | `--json` flag forces valid JSON output (for tool calling) |
@@ -234,14 +234,23 @@ make model

```cmd
cd picolm
build.bat
build.bat :: SSE2 baseline (any x86-64)
build.bat avx2 :: AVX2 (Haswell+ / Excavator+, fastest)
build.bat avx :: AVX (Sandy Bridge+ / Bulldozer+)
build.bat scalar :: no SIMD (portable fallback)
picolm.exe model.gguf -p "Hello world" -n 50
```

### Platform-specific builds

```bash
make native # x86/ARM auto-detect (recommended for local machine)
make x86 # x86-64 safe default (SSE2 only — runs on any x86-64)
make sse2 # x86-64 SSE2 only (same as x86)
make sse3 # x86-64 SSE2+SSE3+SSSE3 (AMD Phenom/Athlon, older Intel)
make avx # x86-64 AVX (Sandy Bridge+, Bulldozer+ — wider SIMD, faster)
make avx2 # x86-64 AVX2 (Haswell+, Excavator+ — widest SIMD, fastest)
make scalar # No SIMD (portable scalar fallback, any architecture)
make pi # Raspberry Pi 3/4/5 (64-bit ARM + NEON SIMD)
make pi-arm32 # Pi Zero / Pi 1 (32-bit ARM)
make cross-pi # Cross-compile for Pi from x86 (static binary)
@@ -348,7 +357,7 @@ Measured on TinyLlama 1.1B Q4_K_M (638 MB model):
+ FP16 KV cache █████████████████░░░ (halve memory bandwidth)
+ Pre-computed RoPE ██████████████████░░ (no sin/cos in hot loop)
+ Flash attention ██████████████████░░ (no O(n) attention alloc)
+ NEON/SSE2 SIMD ███████████████████░ (4-wide vector ops)
+ NEON/SSE2/AVX SIMD ███████████████████░ (4-wide to 8-wide vector ops)
+ KV cache persistence ████████████████████ (skip prefill entirely)
```

@@ -477,9 +486,14 @@ PicoLM implements 9 optimizations that brought generation speed from **1.6 tok/s

4-wide float vector operations for all hot paths. Example: dequantizing Q4_K nibbles with `vmovl_u8` → `vmovl_u16` → `vcvtq_f32_u32`, and RoPE with interleaved `vld2q_f32` / `vst2q_f32`.
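As a scalar picture of what that widening chain computes per byte, two 4-bit weights are extracted, converted to float, and scaled. This sketch uses a deliberately simplified layout (one scale/min per run of bytes), not the real Q4_K format with its 256-weight super-blocks and 6-bit sub-scales; the function name and the `scale`/`min` convention are assumptions for illustration.

```c
#include <stdint.h>

/* Scalar equivalent of the NEON widening chain: each byte packs two
   4-bit quantized weights; vmovl_u8 -> vmovl_u16 -> vcvtq_f32_u32
   performs this u4 -> u32 -> float widening 4 lanes at a time.
   Layout simplified vs. the real Q4_K super-block format. */
static void dequant_nibbles(const uint8_t *q, int n_bytes,
                            float scale, float min, float *out) {
    for (int i = 0; i < n_bytes; i++) {
        out[2 * i]     = scale * (float)(q[i] & 0x0F) - min; /* low nibble  */
        out[2 * i + 1] = scale * (float)(q[i] >> 4)   - min; /* high nibble */
    }
}
```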

### 2. x86 SSE2 SIMD
### 2. x86 SIMD (SSE2 / SSE3 / AVX / AVX2)

Auto-detected on Intel/AMD. 4-wide `__m128` operations for dot products, RMSNorm, and vector operations.
Four compile-time tiers for Intel/AMD:

- **SSE2** (`make sse2` or `make x86`): 4-wide `__m128` operations for dot products, RMSNorm, softmax, RoPE, and element-wise ops. Safe baseline for all x86-64 CPUs.
- **SSE3** (`make sse3`): adds `_mm_addsub_ps` for a cleaner RoPE rotation kernel (no sign-mask workaround needed).
- **AVX** (`make avx`): 8-wide `__m256` float accumulators for all ops. Q4_K and Q6_K dot products widen the float accumulation stage while keeping integer nibble extraction at 128-bit (no AVX2 required). RoPE processes 4 complex pairs per iteration with `_mm256_addsub_ps`.
- **AVX2** (`make avx2`): adds 256-bit integer operations. Q4_0 nibble extraction uses `_mm256_cvtepu8_epi32` (8 nibbles → 8 int32 in 2 ops vs. 4-step unpack chain). Q6_K weight extraction uses `_mm256_cvtepi8_epi32` (8 int8 → 8 int32 in 2 ops vs. 4-instruction macro chain). Targets Haswell+ Intel and Excavator+ AMD.
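To make the "unpack chain" concrete, here is a hedged SSE2 sketch that widens four unsigned bytes to four int32 lanes with two zero-interleave steps, the per-lane work that AVX2's `_mm256_cvtepu8_epi32` collapses into one instruction (and doubles to 8 lanes). The helper name is illustrative, and the guard falls back to a scalar loop off x86.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Widen 4 unsigned bytes to 4 int32 values. On SSE2 this takes a
   zero-interleave chain (u8 -> u16 -> u32); on AVX2 a single
   _mm256_cvtepu8_epi32 zero-extends 8 bytes to 8 int32 lanes. */
static void widen_u8_to_i32(const uint8_t *src, int32_t *dst) {
#if defined(__SSE2__)
    int32_t raw;
    memcpy(&raw, src, 4);                         /* load 4 bytes safely */
    __m128i zero = _mm_setzero_si128();
    __m128i v8   = _mm_cvtsi32_si128(raw);
    __m128i v16  = _mm_unpacklo_epi8(v8, zero);   /* u8  -> u16 */
    __m128i v32  = _mm_unpacklo_epi16(v16, zero); /* u16 -> u32 */
    _mm_storeu_si128((__m128i *)dst, v32);
#else
    for (int i = 0; i < 4; i++)                   /* scalar fallback */
        dst[i] = src[i];
#endif
}
```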

### 3. FP16 KV Cache

@@ -636,7 +650,7 @@ A: llama.cpp is excellent but requires ~200MB+ for the runtime on small models,
A: TinyLlama 1.1B is a small model — it handles simple tasks (Q&A, summarization, basic reasoning, JSON generation) well. It won't match GPT-4, but it runs on a $10 board with no internet. For structured output, the `--json` grammar mode guarantees valid JSON regardless of model quality.

**Q: What about GPU acceleration?**
A: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86/ARM CPUs, SIMD (NEON/SSE2) provides meaningful speedup.
A: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86/ARM CPUs, SIMD (NEON/SSE2/AVX) provides meaningful speedup.

**Q: Can I use a different model?**
A: Any LLaMA-architecture GGUF model works. Download from [HuggingFace](https://huggingface.co/models?search=gguf) and point PicoLM at it. Recommended quantizations: Q4_K_M (best quality/size balance) or Q2_K (smallest, lower quality).
@@ -645,7 +659,9 @@

## Roadmap

- [ ] AVX2/AVX-512 kernels for x86 (2-4x generation speed on modern CPUs)
- [x] AVX kernels for x86 (`make avx` — 8-wide float ops, ~2x vs SSE2)
- [x] AVX2 kernels for x86 (`make avx2` — 256-bit integer ops for Q4_0 and Q6_K quantized paths)
- [ ] AVX-512 kernels for x86 (512-bit ops for server CPUs)
- [ ] Speculative decoding with a draft model
- [ ] Context sliding window (infinite generation beyond max_seq_len)
- [ ] Weight pruning for further memory reduction
30 changes: 27 additions & 3 deletions picolm/Makefile
@@ -1,5 +1,5 @@
CC = gcc
CFLAGS = -O2 -std=c11 -D_GNU_SOURCE -Wall -Wextra -Wpedantic
CFLAGS = -O3 -std=c11 -D_GNU_SOURCE -Wall -Wextra -Wpedantic
LDFLAGS = -lm -lpthread
SRCS = picolm.c model.c tensor.c quant.c tokenizer.c sampler.c grammar.c
TARGET = picolm
@@ -11,11 +11,35 @@ MODEL_DIR ?= /opt/picolm/models
native: CFLAGS += -march=native
native: $(TARGET)

# --- x86-64 default (SSE2 only, safe for all x86-64) ---
x86: sse2

# --- No SIMD (scalar fallback, portable to any architecture) ---
scalar: CFLAGS += -mno-sse2 -mno-avx
scalar: $(TARGET)

# --- x86-64 with SSE2 only ---
sse2: CFLAGS += -msse2
sse2: $(TARGET)

# --- x86-64 with SSE2+SSE3+SSSE3 (covers AMD Phenom/Athlon and similar without AVX) ---
sse3: CFLAGS += -msse2 -msse3 -mssse3 -mpopcnt
sse3: $(TARGET)

# --- x86-64 with AVX (Sandy Bridge and newer Intel; Bulldozer and newer AMD) ---
avx: CFLAGS += -mavx -mfma -mpopcnt
avx: $(TARGET)

# --- x86-64 with AVX2 (Haswell and newer Intel; Excavator and newer AMD) ---
avx2: CFLAGS += -mavx2 -mfma -mpopcnt
avx2: $(TARGET)

$(TARGET): $(SRCS)
$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)

# --- Static build for single-binary deployment ---
static: CFLAGS += -march=native
# Uses SSE2 (not -march=native) so the binary runs on any x86-64, not just the build machine.
static: CFLAGS += -msse2
static: LDFLAGS += -static
static: $(TARGET)

@@ -70,4 +94,4 @@ model:
clean:
rm -f $(TARGET) $(TARGET).exe *.obj *.o

.PHONY: native static pi pi-arm32 cross-pi riscv cross-riscv debug install model clean
.PHONY: native x86 scalar sse2 sse3 avx avx2 static pi pi-arm32 cross-pi riscv cross-riscv debug install model clean
23 changes: 21 additions & 2 deletions picolm/build.bat
@@ -1,7 +1,26 @@
@echo off
REM PicoLM Windows build script (MSVC)
REM
REM SIMD targets:
REM build.bat -- SSE2 baseline (safe for any x86-64)
REM build.bat avx2 -- AVX2 (Haswell+ / Excavator+, fastest)
REM build.bat avx -- AVX (Sandy Bridge+ / Bulldozer+)
REM build.bat scalar -- no SIMD (portable scalar fallback)

call "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1
echo Compiling...
cl /O2 /W3 /Fe:picolm.exe picolm.c model.c tensor.c quant.c tokenizer.c sampler.c grammar.c

set SIMD_FLAG=
if /I "%1"=="avx2" set SIMD_FLAG=/arch:AVX2
if /I "%1"=="avx" set SIMD_FLAG=/arch:AVX
if /I "%1"=="scalar" set SIMD_FLAG=/d2archSSE42-

if "%SIMD_FLAG%"=="" (
echo Building: SSE2 baseline
) else (
echo Building: %1 ^(%SIMD_FLAG%^)
)

cl /O2 /W3 %SIMD_FLAG% /Fe:picolm.exe picolm.c model.c tensor.c quant.c tokenizer.c sampler.c grammar.c
if %ERRORLEVEL% neq 0 (
echo BUILD FAILED
) else (
1 change: 0 additions & 1 deletion picolm/model.c
@@ -406,7 +406,6 @@ static int parse_gguf(model_t *m, int max_seq_len) {
fprintf(stderr, " n_layers=%d, vocab_size=%d, max_seq=%d\n",
cfg->n_layers, cfg->vocab_size, cfg->max_seq_len);
fprintf(stderr, " head_dim=%d, rope_base=%.1f\n", cfg->head_dim, cfg->rope_freq_base);

free(tinfos);
return 0;
}