Skip to content

gdassori/carbon

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

83 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

carbon — local LLMs on an Intel ThinkPad

Run a 30B-class language model, vision models, image generation, and a local coding agent on a Lenovo ThinkPad — on the Intel Arc iGPU, no cloud, no Docker, no Ollama.

license python platform engine host-native made for

carbon is a single, reproducible CLI that turns a 2025–26 Intel Core Ultra ThinkPad into a serious local-LLM workstation. It builds the right inference engine, downloads the right models, diagnoses and fixes the host, and serves an OpenAI-compatible API you can point your editor or a coding agent at — and every command prints the exact host command it runs, so nothing is a black box.

Who this is for. ThinkPad owners with an Intel Core Ultra chip (Arrow Lake-H / Lunar Lake) and an Arc iGPU. This repo is deliberately tuned for one machine (see portability is a non-goal) — but it's a worked, measured blueprint for what's actually achievable, and the design ports cleanly to any similar box.


Table of contents


The machine

carbon targets one laptop — the Lenovo ThinkPad Carbon X13 Aura — and is validated on it:

Component Value Why it matters
Laptop Lenovo ThinkPad Carbon X13 Aura the target machine (repo codename carbon-x13-aura)
CPU Intel Core Ultra 7 255H (Arrow Lake-H), 16 cores the host
iGPU Intel Arc 140T (Xe2, 8 Xe-cores) the real LLM engine — most bandwidth + matrix cores
NPU NPU 3720 (~11–13 TOPS, Meteor-Lake-gen silicon) low-power ≤8B / prefill / embeddings (roadmap)
RAM 30 GiB unified LPDDR5x (~100 GB/s) decode is bandwidth-bound → this is the ceiling
OS Ubuntu 25.04, kernel 6.14

The two facts that shape every decision in this repo: decode speed is limited by memory bandwidth, not TOPS (so the iGPU beats the NPU for big models), and you have lots of RAM but slow-ish bandwidth (so a Mixture-of-Experts model is the sweet spot). The full reasoning is in docs/why-these-choices.md.


Quickstart

# 0. clone, then install the CLI (uv creates the venv + installs typer/rich, ~2 s)
pip install uv && uv sync

# 1. is the host ready? (read-only checks; --fix applies the host setup with sudo)
uv run carbon doctor
uv run carbon doctor --fix          # render group, oneAPI, Level-Zero, swap …

# 2. give big models a safety margin (this laptop ships with only 512 MB of swap)
uv run carbon swap 16G

# 3. build the from-source SYCL engine (only needed for newest archs / long context;
#    the default IPEX engine ships prebuilt — see "The engines")
uv run carbon build

# 4. download a model, then run it
uv run carbon pull qwen3-coder-30b
uv run carbon run  qwen3-coder-30b -p "Write a haiku about ThinkPads"

# 5. or serve an OpenAI-compatible API on http://localhost:8080
uv run carbon serve qwen3-coder-30b
#    -> point your editor / agent at http://localhost:8080/v1  (see "local coding agent")
uv run carbon stop

Everything is declared in carbon.toml (models + engines) — add your own there and the CLI picks them up. Want to see exactly what a command will do before it touches your machine? Append --dry-run to any command:

$ uv run carbon run qwen3-coder-30b --dry-run -p "Write a haiku about ThinkPads"
$ bash -c 'export ZES_ENABLE_SYSMAN=1; export SYCL_CACHE_PERSISTENT=1; \
    exec ~/.cache/carbon-llm/ipex-cpp/.../llama-cli \
    -m ~/.cache/carbon-llm/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
    -c 16384 -ngl 99 -p 'Write a haiku about ThinkPads' -no-cnv'

carbon doctor — is my ThinkPad ready?

carbon doctor checks the whole stack and tells you the one-line fix for anything red. Real output on the target machine (all green):

                     carbon doctor — host (common)
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ check        ┃ status ┃ detail                                      ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ igpu         │   ✓    │ /dev/dri/renderD128 present (Intel Arc)     │
│ render-group │   ✓    │ user in the render group (can use /dev/dri) │
│ gpu-runtime  │   ✓    │ Level-Zero loader present (libze_loader)    │
│ git-curl     │   ✓    │ git + curl present                          │
│ npu-device   │   ✓    │ /dev/accel/accel0 present (NPU, roadmap)    │
│ npu-driver   │   ✓    │ intel_vpu kernel module loaded              │
│ ram          │   ✓    │ 30.8 GiB total, 24.5 GiB available          │
│ swap         │   ✓    │ 16.0 GiB of swap                            │
│ disk         │   ✓    │ 126 GB free                                 │
└──────────────┴────────┴─────────────────────────────────────────────┘
       carbon doctor — LLM engine — llama.cpp / SYCL
│ cmake    │ ✓ │ cmake present                                              │
│ oneapi   │ ✓ │ oneAPI toolchain present (SYCL build available)            │
│ ipex-llm │ ✓ │ IPEX-LLM portable present (~2x decode on Arc, default)     │
       carbon doctor — image generation — OpenVINO
│ openvino │ ✓ │ OpenVINO GenAI venv present                                │
       carbon doctor — agentic coding — OpenCode + OpenSpec
│ node     │ ✓ │ Node.js present (manage with mise)                         │
│ opencode │ ✓ │ opencode present                                           │
│ openspec │ ✓ │ openspec present                                           │
✓ no host action needed

Run uv run carbon doctor --fix and carbon will, for each failing check, run the exact remediation (adding you to the render group, installing the Level-Zero runtime, bumping swap, …).


What you actually get (honest numbers)

These are measured decode rates (carbon benchtg128) on the Arc 140T, Q4 weights, not marketing figures. The headline: an MoE gives you ~30B-class quality at small-model speed, while a dense 14B+ on this iGPU is slow.

Model Kind Footprint measured tok/s Use it for
qwen3-coder-30b (MoE, 3.3B active) ⭐ coding/agentic 17.7 GB 15.6 the coding daily driver
qwen3-30b-a3b (MoE) general 18.6 GB 12.8 max-quality chat/reasoning
ernie-4.5-21b (MoE) reasoning 13.5 GB 14.0 thinking
gpt-oss-20b (MoE, MXFP4) general 12.1 GB 10.8 open reasoning
granite-4-h-tiny (hybrid Mamba-2) fast/light 4.3 GB 21.6 snappy chat, low power
qwen3-8b fast/light 5.0 GB 10.9 lean chat, the NPU-class model
qwen3-14b (dense) general 9.0 GB 5.8 solid dense fallback
deepseek-r1-14b (dense) deep reasoning 10.5 GB 4.2 competition math (slow)
moondream2 (VLM) vision 3.7 GB 23.8 fast image captioning
qwen3-vl-8b / qwen25-vl-7b (VLM) vision/OCR ~6 GB 9.8 / 10.7 documents, OCR

Why the MoE wins: Qwen3-30B-A3B has 30B total parameters but only ~3.3B active per token, so decode (bandwidth-bound) reads ~3.3B — it runs ~2× faster than a dense 14B despite 2× the total size. That's the whole thesis of this project (TD-004).

A note on the engine tradeoff (don't be surprised): the default IPEX-LLM engine gives the fastest decode, but its flash-attention garbles MoE output, so on IPEX carbon drops -fa + quantized KV and caps context to 16K (f16 KV). For long context (up to 64K) with quantized KV, use the from-source SYCL build or the OpenVINO server. See The engines.

The honest ceiling. As a fully autonomous coding agent, no model that fits in 30 GiB reliably converges "fix-until-green" on a hard task (best ≈ 6/10 one-shot; see docs/research/agentic-coding-benchmark-2026.md). Local on this class of laptop is a copilot you supervise, not an autopilot. carbon is built so the same workflow scales, RAM-permitting, to a frontier-scale MoE on a bigger box.


The model catalog

carbon ships 25 curated, ready-to-pull models — LLMs, vision (VLM), image generation, and embeddings — each with measured perf and the right engine pre-selected. uv run carbon models:

Full carbon models output (click to expand)
                                            configured models
┃ name                   ┃ local ┃    size ┃ ctx ┃ params ┃ tk/s iGPU ┃ tags
│ qwen3-14b              │   ✓   │  9.0 GB │ 32K │    14B │       5.8 │ chat, coding, reasoning
│ qwen3-14b-spec         │   ✓   │  9.0 GB │ 32K │    14B │       —   │ chat, coding, reasoning, speculative
│ qwen3-8b               │   ✓   │  5.0 GB │ 32K │     8B │      10.9 │ chat, coding, fast, low-power
│ deepseek-r1-14b        │   ✓   │ 10.5 GB │ 32K │    14B │       4.2 │ reasoning, coding, thinking, slow
│ qwen3-30b-a3b          │   ✓   │ 18.6 GB │ 32K │    30B │      12.8 │ chat, coding, reasoning, moe, long-context
│ gpt-oss-20b            │   ✓   │ 12.1 GB │ 32K │    21B │      10.8 │ chat, reasoning, moe, mxfp4
│ qwen3-coder-30b        │   ✓   │ 17.7 GB │ 32K │    30B │      15.6 │ coding, agentic, moe, long-context
│ qwen3-30b-thinking     │   ✓   │ 17.7 GB │ 32K │    30B │      ~15  │ reasoning, thinking, moe, agentic, coding
│ qwen3-coder-30b-ov     │   –   │ 16.3 GB │ 32K │    30B │       —   │ coding, agentic, moe, openvino
│ deepseek-coder-v2-lite │   ✓   │ 10.4 GB │ 32K │    16B │       —   │ coding, agentic, moe, fast
│ glm-4.7-flash          │   ✓   │ 17.5 GB │ 32K │    30B │       —   │ coding, agentic, moe, tool-use
│ ernie-4.5-21b          │   ✓   │ 13.5 GB │ 32K │    21B │      14.0 │ reasoning, thinking, moe
│ granite-4-h-tiny       │   ✓   │  4.3 GB │ 32K │     7B │      21.6 │ chat, fast, moe, mamba, low-power
│ mistral-small-3.2      │   ✓   │ 14.3 GB │ 32K │    24B │       3.8 │ chat, multilingual, function-calling
│ gemma-3-27b            │   ✓   │ 16.5 GB │ 32K │    27B │       3.1 │ chat, multilingual, vision-capable
│ gemma-3-12b            │   ✓   │  7.3 GB │ 32K │    12B │       6.2 │ chat, multilingual, vision-capable, fast
│ qwen3-vl-8b            │   ✓   │  6.3 GB │ 32K │     8B │       9.8 │ vision, ocr, documents
│ qwen25-vl-7b           │   ✓   │  6.0 GB │ 32K │     7B │      10.7 │ vision, ocr
│ moondream2             │   ✓   │  3.7 GB │  8K │     2B │      23.8 │ vision, captioning, fast
│ sdxl-turbo             │   –   │    —    │  —  │     3B │       —   │ image, fast
│ flux1-schnell          │   –   │  9.2 GB │  —  │    12B │       —   │ image, high-quality, slow
│ qwen3-embed-0.6b       │   –   │  0.6 GB │ 32K │   0.6B │       —   │ embeddings, fast
│ qwen3-embed-4b         │   –   │  2.5 GB │ 32K │     4B │       —   │ embeddings
│ bge-m3                 │   –   │  0.6 GB │  8K │   0.6B │       —   │ embeddings, multilingual
│ nomic-embed            │   –   │  0.1 GB │  8K │   0.1B │       —   │ embeddings, fast

uv run carbon show <model> drills into one (paths, server args, perf, notes).


Command reference

Every command accepts --dry-run (print the host command, change nothing).

Command What it does
carbon doctor [--fix] Diagnose the host (iGPU, render group, Level-Zero, oneAPI, NPU, RAM, swap, agent tools); --fix applies remediations.
carbon swap [SIZE] [--resize/--new] [--force] Create/enable a swapfile (the laptop ships with 512 MB); a resize is OOM-guarded.
carbon build [-e ENGINE] [--no-cache] Build llama.cpp from source with SYCL. No -e → builds only source engines; prebuilt (IPEX) is a no-op.
carbon models List configured models: presence, size, context, params, measured tok/s, tags.
carbon show MODEL Full detail for one model.
carbon pull MODEL [-b] Download a model (GGUF + projector + draft) with resume; -b detaches.
carbon run MODEL [-p PROMPT] Host-native inference (chat if no -p); auto-routes LLM / VLM / embedding / image.
carbon serve MODEL [--port] Start an OpenAI-compatible server (/v1) for a model.
carbon warm MODEL Serve + prefill & KV-cache the project system prompt (skips prefill on reuse).
carbon bench MODEL [-s] Benchmark with llama-bench; -s writes the measured tok/s back to carbon.toml.
carbon imatrix MODEL --calib FILE Generate an importance matrix (guides low-bit quant).
carbon quantize MODEL [--recipe] Asymmetric MoE quant (routed experts low-bit, attention/output high).
carbon stop Stop the running server(s).
carbon agent install / config / run Install OpenCode+OpenSpec, point them at a served model, or drive a model headless on a task.
carbon version Print the version.

The engines

carbon is engine-pluggable (TD-011); each model in carbon.toml picks the right one.

Engine What it is When carbon uses it
llamacpp-ipex (default) Intel's Xe-optimized prebuilt llama.cpp (IPEX-LLM portable) — ~2× decode on the Arc 140T. Self-contained, nothing to build. The speed default for most models. Caveat: tracks a mid-2025 llama.cpp (newest archs go to SYCL) and its flash-attn garbles MoE → no -fa/quant-KV, 16K context. Intel archived IPEX-LLM in Jan 2026, but the binaries still give the best decode, so it stays the default.
llamacpp-sycl llama.cpp built from source with SYCL/oneAPI. Correct flash-attention + quantized KV → up to 64K context. Newest architectures, long context, speculative decoding, and anything IPEX can't load.
llm-openvino OpenVINO GenAI as an OpenAI server with tool-calling and FA-independent int8/int4 KV. The maintained long-context agentic backend (TD-022).
image-openvino OpenVINO GenAI image generation (SDXL-Turbo, FLUX.1-schnell). carbon run <image-model> -p "..." → PNG.

Use it as a local coding agent

carbon integrates with OpenCode so a locally-served model can drive your editor/agent — fully offline.

uv run carbon agent install               # OpenCode + OpenSpec (via npm)
uv run carbon serve qwen3-coder-30b       # OpenAI API on :8080
uv run carbon agent config qwen3-coder-30b   # writes opencode.json pointing at the local server
# ... now use OpenCode normally, or drive it headlessly:
uv run carbon agent run qwen3-coder-30b "implement the spec in SPEC.md until tests pass" --dir ./task

The served API is plain OpenAI /v1, so any tool that speaks OpenAI works — point it at http://localhost:8080/v1 with any API key (e.g. sk-local). A sample opencode.json is included. (Tool-calling works on both llama.cpp and the OpenVINO engine; see the agentic benchmark for the realistic capability ceiling.)


How it works

  • Host-native, no Docker, no Ollama (TD-016). carbon is a thin orchestrator over host processes: it builds llama.cpp, manages llama-server/llama-cli (and the OpenVINO server) directly, and accesses the Arc 140T through /dev/dri. No container indirection between you and the inference flags.
  • Reproducible by construction. Every command prints the exact shell line it runs; --dry-run shows it without executing. Dependencies are pinned with uv (uv.lock committed).
  • Declarative. Models and engines live in carbon.toml, read with the stdlib tomllib — the only runtime deps are typer + rich (TD-013, TD-014).
  • Tuned to the silicon. Q4_K_M weights + Q8 KV + flash-attention for ~32K comfortable context; the quant floor and context levers are measured, not guessed (TD-006, TD-007, TD-017).

The decision record

Every architectural choice is captured as a Technical Decision in docs/decisions/ — 31 of them, with context, rationale, consequences, and the alternatives rejected. The measurements behind them are in docs/research/.

TD Decision Status
002 iGPU (Arc 140T) is the primary engine; NPU is secondary accepted
003 No dense 14B on the NPU 3720 (it's a dead-end) accepted
004 Daily driver = MoE Qwen3-30B-A3B accepted
005 · 006 · 007 2507 variant · Q8 KV / 32K · Q4_K_M default accepted
008 llama.cpp/SYCL over Ollama (matrix cores + flag control) accepted
011 · 013 · 014 Pluggable engines · TOML config · uv accepted
016 Host-native execution, remove Docker (the pivot) accepted
017 · 018 Q4_K is the quant floor; flash-attn is the context lever accepted / phase-0
022 OpenVINO engine + OpenAI serving for agentic coding in progress
001 · 009 · 010 Docker / Ollama / custom images superseded by TD-016
019 · 021 Heterogeneous kernels · port IPEX kernels blocked (documented)
012 · 020 NPU INT4 export · NPU embeddings for RAG proposed (roadmap)
023 carbon warm — system-prompt KV-root sharing (prefill once, reuse) accepted
024 "Agent Mind" — repo digest injected into the warmed prefix proposed (roadmap)
025 · 026 · 027 KV-segment composition · semantic cache · agentic trace cache proposed (R&D)
028 Frontier MoEs (Kimi K2, Qwen3-235B) out of scope at 30 GiB accepted
029 On-device weight RL / fine-tuning out of scope accepted
030 · 031 Optimize the agent OS, not the weights · first-party trace+reward loop proposed (R&D)

The narrative arc — from "can I run a 14B on the NPU?" to the final architecture, including the dead-ends (IPEX archived, the Vulkan classifier bug, sub-Q4 being compute-bound) — is told in docs/why-these-choices.md.


This is tuned for one machine, on purpose

Portability is an explicit non-goal (TD-016). carbon configures one box, once — that's exactly why it can drop the container/abstraction layers that exist to make software run on arbitrary hosts, and instead expose llama.cpp's flags directly, pick per-model engines by hand, and bake in measured numbers.

That said, the design travels even though the tuning doesn't. To adapt it to your own ThinkPad:

  1. Run uv run carbon doctor — it'll tell you what your silicon is missing.
  2. Edit carbon.toml: set bin_dir/oneapi_setvars for your engine, and add models (each is ~10 lines: engine, gguf/hf_repo, server_args, context_length).
  3. Re-measure with carbon bench <model> -s — the crossover points (which engine/quant/context wins) are bandwidth-dependent and will differ from this machine's. Don't trust these numbers on a box with different memory bandwidth; trust your own carbon bench.

FAQ & gotchas

  • "NPU not detected." Almost always: you're not in the render group, or a Level-Zero/OpenVINO ABI mismatch. carbon doctor checks both. The NPU is on the roadmap (embeddings/RAG); today the iGPU does the heavy lifting.
  • Default context is 16K, not 32K? On the default IPEX engine, yes — its flash-attn garbles MoE output, so carbon drops -fa/quant-KV and caps context to 16K (f16 KV). Use llamacpp-sycl or llm-openvino for 32–64K with quantized KV.
  • Big MoE near the RAM ceiling. ~18 GB of weights + KV + OS sits at the edge of 30 GiB → keep KV in Q8 and run carbon swap 16G so a spike can't OOM.
  • First SYCL run is slow. Kernels compile on first use; SYCL_CACHE_PERSISTENT=1 (set by the engine) caches them across runs.

License

MIT © Guido Dassori. Built for, and validated on, a Lenovo ThinkPad Carbon X13 Aura (Intel Core Ultra 7 255H · Arc 140T · NPU 3720). Model weights and the inference engines (llama.cpp, IPEX-LLM, OpenVINO) are the property of their respective authors under their own licenses.

About

Run a 30B-class LLM, vision, image generation and a local coding agent on an Intel ThinkPad (Arc 140T iGPU) — host-native, reproducible, no Docker.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors