carbon — local LLMs on an Intel ThinkPad

Run a 30B-class language model, vision models, image generation, and a local coding agent on a Lenovo ThinkPad — on the Intel Arc iGPU, no cloud, no Docker, no Ollama.

carbon is a single, reproducible CLI that turns a 2025–26 Intel Core Ultra ThinkPad into a serious local-LLM workstation. It builds the right inference engine, downloads the right models, diagnoses and fixes the host, and serves an OpenAI-compatible API you can point your editor or a coding agent at — and every command prints the exact host command it runs, so nothing is a black box.

Who this is for. ThinkPad owners with an Intel Core Ultra chip (Arrow Lake-H / Lunar Lake) and an Arc iGPU. This repo is deliberately tuned for one machine (see portability is a non-goal) — but it's a worked, measured blueprint for what's actually achievable, and the design ports cleanly to any similar box.

The machine

carbon targets one laptop — the Lenovo ThinkPad Carbon X13 Aura — and is validated on it:

Component	Value	Why it matters
Laptop	Lenovo ThinkPad Carbon X13 Aura	the target machine (repo codename `carbon-x13-aura`)
CPU	Intel Core Ultra 7 255H (Arrow Lake-H), 16 cores	the host
iGPU	Intel Arc 140T (Xe2, 8 Xe-cores)	the real LLM engine — most bandwidth + matrix cores
NPU	NPU 3720 (~11–13 TOPS, Meteor-Lake-gen silicon)	low-power ≤8B / prefill / embeddings (roadmap)
RAM	30 GiB unified LPDDR5x (~100 GB/s)	decode is bandwidth-bound → this is the ceiling
OS	Ubuntu 25.04, kernel 6.14

The two facts that shape every decision in this repo: decode speed is limited by memory bandwidth, not TOPS (so the iGPU beats the NPU for big models), and you have lots of RAM but slow-ish bandwidth (so a Mixture-of-Experts model is the sweet spot). The full reasoning is in docs/why-these-choices.md.

Quickstart

# 0. clone, then install the CLI (uv creates the venv + installs typer/rich, ~2 s)
pip install uv && uv sync

# 1. is the host ready? (read-only checks; --fix applies the host setup with sudo)
uv run carbon doctor
uv run carbon doctor --fix          # render group, oneAPI, Level-Zero, swap …

# 2. give big models a safety margin (this laptop ships with only 512 MB of swap)
uv run carbon swap 16G

# 3. build the from-source SYCL engine (only needed for newest archs / long context;
#    the default IPEX engine ships prebuilt — see "The engines")
uv run carbon build

# 4. download a model, then run it
uv run carbon pull qwen3-coder-30b
uv run carbon run  qwen3-coder-30b -p "Write a haiku about ThinkPads"

# 5. or serve an OpenAI-compatible API on http://localhost:8080
uv run carbon serve qwen3-coder-30b
#    -> point your editor / agent at http://localhost:8080/v1  (see "local coding agent")
uv run carbon stop

Everything is declared in carbon.toml (models + engines) — add your own there and the CLI picks them up. Want to see exactly what a command will do before it touches your machine? Append --dry-run to any command:

$ uv run carbon run qwen3-coder-30b --dry-run -p "Write a haiku about ThinkPads"
$ bash -c 'export ZES_ENABLE_SYSMAN=1; export SYCL_CACHE_PERSISTENT=1; \
    exec ~/.cache/carbon-llm/ipex-cpp/.../llama-cli \
    -m ~/.cache/carbon-llm/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
    -c 16384 -ngl 99 -p 'Write a haiku about ThinkPads' -no-cnv'

`carbon doctor` — is my ThinkPad ready?

carbon doctor checks the whole stack and tells you the one-line fix for anything red. Real output on the target machine (all green):

                     carbon doctor — host (common)
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ check        ┃ status ┃ detail                                      ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ igpu         │   ✓    │ /dev/dri/renderD128 present (Intel Arc)     │
│ render-group │   ✓    │ user in the render group (can use /dev/dri) │
│ gpu-runtime  │   ✓    │ Level-Zero loader present (libze_loader)    │
│ git-curl     │   ✓    │ git + curl present                          │
│ npu-device   │   ✓    │ /dev/accel/accel0 present (NPU, roadmap)    │
│ npu-driver   │   ✓    │ intel_vpu kernel module loaded              │
│ ram          │   ✓    │ 30.8 GiB total, 24.5 GiB available          │
│ swap         │   ✓    │ 16.0 GiB of swap                            │
│ disk         │   ✓    │ 126 GB free                                 │
└──────────────┴────────┴─────────────────────────────────────────────┘
       carbon doctor — LLM engine — llama.cpp / SYCL
│ cmake    │ ✓ │ cmake present                                              │
│ oneapi   │ ✓ │ oneAPI toolchain present (SYCL build available)            │
│ ipex-llm │ ✓ │ IPEX-LLM portable present (~2x decode on Arc, default)     │
       carbon doctor — image generation — OpenVINO
│ openvino │ ✓ │ OpenVINO GenAI venv present                                │
       carbon doctor — agentic coding — OpenCode + OpenSpec
│ node     │ ✓ │ Node.js present (manage with mise)                         │
│ opencode │ ✓ │ opencode present                                           │
│ openspec │ ✓ │ openspec present                                           │
✓ no host action needed

Run uv run carbon doctor --fix and carbon will, for each failing check, run the exact remediation (adding you to the render group, installing the Level-Zero runtime, bumping swap, …).

What you actually get (honest numbers)

These are measured decode rates (carbon bench → tg128) on the Arc 140T, Q4 weights, not marketing figures. The headline: an MoE gives you ~30B-class quality at small-model speed, while a dense 14B+ on this iGPU is slow.

Model	Kind	Footprint	measured tok/s	Use it for
qwen3-coder-30b (MoE, 3.3B active) ⭐	coding/agentic	17.7 GB	15.6	the coding daily driver
qwen3-30b-a3b (MoE)	general	18.6 GB	12.8	max-quality chat/reasoning
ernie-4.5-21b (MoE)	reasoning	13.5 GB	14.0	thinking
gpt-oss-20b (MoE, MXFP4)	general	12.1 GB	10.8	open reasoning
granite-4-h-tiny (hybrid Mamba-2)	fast/light	4.3 GB	21.6	snappy chat, low power
qwen3-8b	fast/light	5.0 GB	10.9	lean chat, the NPU-class model
qwen3-14b (dense)	general	9.0 GB	5.8	solid dense fallback
deepseek-r1-14b (dense)	deep reasoning	10.5 GB	4.2	competition math (slow)
moondream2 (VLM)	vision	3.7 GB	23.8	fast image captioning
qwen3-vl-8b / qwen25-vl-7b (VLM)	vision/OCR	~6 GB	9.8 / 10.7	documents, OCR

Why the MoE wins: Qwen3-30B-A3B has 30B total parameters but only ~3.3B active per token, so decode (bandwidth-bound) reads ~3.3B — it runs ~2× faster than a dense 14B despite 2× the total size. That's the whole thesis of this project (TD-004).

A note on the engine tradeoff (don't be surprised): the default IPEX-LLM engine gives the fastest decode, but its flash-attention garbles MoE output, so on IPEX carbon drops -fa + quantized KV and caps context to 16K (f16 KV). For long context (up to 64K) with quantized KV, use the from-source SYCL build or the OpenVINO server. See The engines.

The honest ceiling. As a fully autonomous coding agent, no model that fits in 30 GiB reliably converges "fix-until-green" on a hard task (best ≈ 6/10 one-shot; see docs/research/agentic-coding-benchmark-2026.md). Local on this class of laptop is a copilot you supervise, not an autopilot. carbon is built so the same workflow scales, RAM-permitting, to a frontier-scale MoE on a bigger box.

The model catalog

carbon ships 25 curated, ready-to-pull models — LLMs, vision (VLM), image generation, and embeddings — each with measured perf and the right engine pre-selected. uv run carbon models:

Full carbon models output (click to expand)

                                            configured models
┃ name                   ┃ local ┃    size ┃ ctx ┃ params ┃ tk/s iGPU ┃ tags
│ qwen3-14b              │   ✓   │  9.0 GB │ 32K │    14B │       5.8 │ chat, coding, reasoning
│ qwen3-14b-spec         │   ✓   │  9.0 GB │ 32K │    14B │       —   │ chat, coding, reasoning, speculative
│ qwen3-8b               │   ✓   │  5.0 GB │ 32K │     8B │      10.9 │ chat, coding, fast, low-power
│ deepseek-r1-14b        │   ✓   │ 10.5 GB │ 32K │    14B │       4.2 │ reasoning, coding, thinking, slow
│ qwen3-30b-a3b          │   ✓   │ 18.6 GB │ 32K │    30B │      12.8 │ chat, coding, reasoning, moe, long-context
│ gpt-oss-20b            │   ✓   │ 12.1 GB │ 32K │    21B │      10.8 │ chat, reasoning, moe, mxfp4
│ qwen3-coder-30b        │   ✓   │ 17.7 GB │ 32K │    30B │      15.6 │ coding, agentic, moe, long-context
│ qwen3-30b-thinking     │   ✓   │ 17.7 GB │ 32K │    30B │      ~15  │ reasoning, thinking, moe, agentic, coding
│ qwen3-coder-30b-ov     │   –   │ 16.3 GB │ 32K │    30B │       —   │ coding, agentic, moe, openvino
│ deepseek-coder-v2-lite │   ✓   │ 10.4 GB │ 32K │    16B │       —   │ coding, agentic, moe, fast
│ glm-4.7-flash          │   ✓   │ 17.5 GB │ 32K │    30B │       —   │ coding, agentic, moe, tool-use
│ ernie-4.5-21b          │   ✓   │ 13.5 GB │ 32K │    21B │      14.0 │ reasoning, thinking, moe
│ granite-4-h-tiny       │   ✓   │  4.3 GB │ 32K │     7B │      21.6 │ chat, fast, moe, mamba, low-power
│ mistral-small-3.2      │   ✓   │ 14.3 GB │ 32K │    24B │       3.8 │ chat, multilingual, function-calling
│ gemma-3-27b            │   ✓   │ 16.5 GB │ 32K │    27B │       3.1 │ chat, multilingual, vision-capable
│ gemma-3-12b            │   ✓   │  7.3 GB │ 32K │    12B │       6.2 │ chat, multilingual, vision-capable, fast
│ qwen3-vl-8b            │   ✓   │  6.3 GB │ 32K │     8B │       9.8 │ vision, ocr, documents
│ qwen25-vl-7b           │   ✓   │  6.0 GB │ 32K │     7B │      10.7 │ vision, ocr
│ moondream2             │   ✓   │  3.7 GB │  8K │     2B │      23.8 │ vision, captioning, fast
│ sdxl-turbo             │   –   │    —    │  —  │     3B │       —   │ image, fast
│ flux1-schnell          │   –   │  9.2 GB │  —  │    12B │       —   │ image, high-quality, slow
│ qwen3-embed-0.6b       │   –   │  0.6 GB │ 32K │   0.6B │       —   │ embeddings, fast
│ qwen3-embed-4b         │   –   │  2.5 GB │ 32K │     4B │       —   │ embeddings
│ bge-m3                 │   –   │  0.6 GB │  8K │   0.6B │       —   │ embeddings, multilingual
│ nomic-embed            │   –   │  0.1 GB │  8K │   0.1B │       —   │ embeddings, fast

uv run carbon show <model> drills into one (paths, server args, perf, notes).

Command reference

Every command accepts --dry-run (print the host command, change nothing).

Command	What it does
`carbon doctor [--fix]`	Diagnose the host (iGPU, render group, Level-Zero, oneAPI, NPU, RAM, swap, agent tools); `--fix` applies remediations.
`carbon swap [SIZE] [--resize/--new] [--force]`	Create/enable a swapfile (the laptop ships with 512 MB); a resize is OOM-guarded.
`carbon build [-e ENGINE] [--no-cache]`	Build llama.cpp from source with SYCL. No `-e` → builds only source engines; prebuilt (IPEX) is a no-op.
`carbon models`	List configured models: presence, size, context, params, measured tok/s, tags.
`carbon show MODEL`	Full detail for one model.
`carbon pull MODEL [-b]`	Download a model (GGUF + projector + draft) with resume; `-b` detaches.
`carbon run MODEL [-p PROMPT]`	Host-native inference (chat if no `-p`); auto-routes LLM / VLM / embedding / image.
`carbon serve MODEL [--port]`	Start an OpenAI-compatible server (`/v1`) for a model.
`carbon warm MODEL`	Serve + prefill & KV-cache the project system prompt (skips prefill on reuse).
`carbon bench MODEL [-s]`	Benchmark with `llama-bench`; `-s` writes the measured tok/s back to `carbon.toml`.
`carbon imatrix MODEL --calib FILE`	Generate an importance matrix (guides low-bit quant).
`carbon quantize MODEL [--recipe]`	Asymmetric MoE quant (routed experts low-bit, attention/output high).
`carbon stop`	Stop the running server(s).
`carbon agent install / config / run`	Install OpenCode+OpenSpec, point them at a served model, or drive a model headless on a task.
`carbon version`	Print the version.

The engines

carbon is engine-pluggable (TD-011); each model in carbon.toml picks the right one.

Engine	What it is	When `carbon` uses it
`llamacpp-ipex` (default)	Intel's Xe-optimized prebuilt llama.cpp (IPEX-LLM portable) — ~2× decode on the Arc 140T. Self-contained, nothing to build.	The speed default for most models. Caveat: tracks a mid-2025 llama.cpp (newest archs go to SYCL) and its flash-attn garbles MoE → no `-fa`/quant-KV, 16K context. Intel archived IPEX-LLM in Jan 2026, but the binaries still give the best decode, so it stays the default.
`llamacpp-sycl`	llama.cpp built from source with SYCL/oneAPI. Correct flash-attention + quantized KV → up to 64K context.	Newest architectures, long context, speculative decoding, and anything IPEX can't load.
`llm-openvino`	OpenVINO GenAI as an OpenAI server with tool-calling and FA-independent int8/int4 KV.	The maintained long-context agentic backend (TD-022).
`image-openvino`	OpenVINO GenAI image generation (SDXL-Turbo, FLUX.1-schnell).	`carbon run <image-model> -p "..."` → PNG.

Use it as a local coding agent

carbon integrates with OpenCode so a locally-served model can drive your editor/agent — fully offline.

uv run carbon agent install               # OpenCode + OpenSpec (via npm)
uv run carbon serve qwen3-coder-30b       # OpenAI API on :8080
uv run carbon agent config qwen3-coder-30b   # writes opencode.json pointing at the local server
# ... now use OpenCode normally, or drive it headlessly:
uv run carbon agent run qwen3-coder-30b "implement the spec in SPEC.md until tests pass" --dir ./task

The served API is plain OpenAI /v1, so any tool that speaks OpenAI works — point it at http://localhost:8080/v1 with any API key (e.g. sk-local). A sample opencode.json is included. (Tool-calling works on both llama.cpp and the OpenVINO engine; see the agentic benchmark for the realistic capability ceiling.)

How it works

Host-native, no Docker, no Ollama (TD-016). carbon is a thin orchestrator over host processes: it builds llama.cpp, manages llama-server/llama-cli (and the OpenVINO server) directly, and accesses the Arc 140T through /dev/dri. No container indirection between you and the inference flags.
Reproducible by construction. Every command prints the exact shell line it runs; --dry-run shows it without executing. Dependencies are pinned with uv (uv.lock committed).
Declarative. Models and engines live in carbon.toml, read with the stdlib tomllib — the only runtime deps are typer + rich (TD-013, TD-014).
Tuned to the silicon. Q4_K_M weights + Q8 KV + flash-attention for ~32K comfortable context; the quant floor and context levers are measured, not guessed (TD-006, TD-007, TD-017).

The decision record

Every architectural choice is captured as a Technical Decision in docs/decisions/ — 31 of them, with context, rationale, consequences, and the alternatives rejected. The measurements behind them are in docs/research/.

TD	Decision	Status
002	iGPU (Arc 140T) is the primary engine; NPU is secondary	accepted
003	No dense 14B on the NPU 3720 (it's a dead-end)	accepted
004	Daily driver = MoE Qwen3-30B-A3B	accepted
005 · 006 · 007	2507 variant · Q8 KV / 32K · Q4_K_M default	accepted
008	llama.cpp/SYCL over Ollama (matrix cores + flag control)	accepted
011 · 013 · 014	Pluggable engines · TOML config · uv	accepted
016	Host-native execution, remove Docker (the pivot)	accepted
017 · 018	Q4_K is the quant floor; flash-attn is the context lever	accepted / phase-0
022	OpenVINO engine + OpenAI serving for agentic coding	in progress
001 · 009 · 010	Docker / Ollama / custom images	superseded by TD-016
019 · 021	Heterogeneous kernels · port IPEX kernels	blocked (documented)
012 · 020	NPU INT4 export · NPU embeddings for RAG	proposed (roadmap)
023	`carbon warm` — system-prompt KV-root sharing (prefill once, reuse)	accepted
024	"Agent Mind" — repo digest injected into the warmed prefix	proposed (roadmap)
025 · 026 · 027	KV-segment composition · semantic cache · agentic trace cache	proposed (R&D)
028	Frontier MoEs (Kimi K2, Qwen3-235B) out of scope at 30 GiB	accepted
029	On-device weight RL / fine-tuning out of scope	accepted
030 · 031	Optimize the agent OS, not the weights · first-party trace+reward loop	proposed (R&D)

The narrative arc — from "can I run a 14B on the NPU?" to the final architecture, including the dead-ends (IPEX archived, the Vulkan classifier bug, sub-Q4 being compute-bound) — is told in docs/why-these-choices.md.

This is tuned for one machine, on purpose

Portability is an explicit non-goal (TD-016). carbon configures one box, once — that's exactly why it can drop the container/abstraction layers that exist to make software run on arbitrary hosts, and instead expose llama.cpp's flags directly, pick per-model engines by hand, and bake in measured numbers.

That said, the design travels even though the tuning doesn't. To adapt it to your own ThinkPad:

Run uv run carbon doctor — it'll tell you what your silicon is missing.
Edit carbon.toml: set bin_dir/oneapi_setvars for your engine, and add models (each is ~10 lines: engine, gguf/hf_repo, server_args, context_length).
Re-measure with carbon bench <model> -s — the crossover points (which engine/quant/context wins) are bandwidth-dependent and will differ from this machine's. Don't trust these numbers on a box with different memory bandwidth; trust your own carbon bench.

FAQ & gotchas

"NPU not detected." Almost always: you're not in the render group, or a Level-Zero/OpenVINO ABI mismatch. carbon doctor checks both. The NPU is on the roadmap (embeddings/RAG); today the iGPU does the heavy lifting.
Default context is 16K, not 32K? On the default IPEX engine, yes — its flash-attn garbles MoE output, so carbon drops -fa/quant-KV and caps context to 16K (f16 KV). Use llamacpp-sycl or llm-openvino for 32–64K with quantized KV.
Big MoE near the RAM ceiling. ~18 GB of weights + KV + OS sits at the edge of 30 GiB → keep KV in Q8 and run carbon swap 16G so a spike can't OOM.
First SYCL run is slow. Kernels compile on first use; SYCL_CACHE_PERSISTENT=1 (set by the engine) caches them across runs.

License

MIT © Guido Dassori. Built for, and validated on, a Lenovo ThinkPad Carbon X13 Aura (Intel Core Ultra 7 255H · Arc 140T · NPU 3720). Model weights and the inference engines (llama.cpp, IPEX-LLM, OpenVINO) are the property of their respective authors under their own licenses.

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
bench		bench
carbon_llm		carbon_llm
docs		docs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
carbon.toml		carbon.toml
opencode.json		opencode.json
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

carbon — local LLMs on an Intel ThinkPad

Table of contents

The machine

Quickstart

`carbon doctor` — is my ThinkPad ready?

What you actually get (honest numbers)

The model catalog

Command reference

The engines

Use it as a local coding agent

How it works

The decision record

This is tuned for one machine, on purpose

FAQ & gotchas

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

carbon — local LLMs on an Intel ThinkPad

Table of contents

The machine

Quickstart

carbon doctor — is my ThinkPad ready?

What you actually get (honest numbers)

The model catalog

Command reference

The engines

Use it as a local coding agent

How it works

The decision record

This is tuned for one machine, on purpose

FAQ & gotchas

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`carbon doctor` — is my ThinkPad ready?

Packages