Run a 30B-class language model, vision models, image generation, and a local coding agent on a Lenovo ThinkPad — on the Intel Arc iGPU, no cloud, no Docker, no Ollama.
carbon is a single, reproducible CLI that turns a 2025–26 Intel Core Ultra ThinkPad into a
serious local-LLM workstation. It builds the right inference engine, downloads the right models,
diagnoses and fixes the host, and serves an OpenAI-compatible API you can point your editor or
a coding agent at — and every command prints the exact host command it runs, so nothing is a
black box.
Who this is for. ThinkPad owners with an Intel Core Ultra chip (Arrow Lake-H / Lunar Lake) and an Arc iGPU. This repo is deliberately tuned for one machine (see portability is a non-goal) — but it's a worked, measured blueprint for what's actually achievable, and the design ports cleanly to any similar box.
- The machine
- Quickstart
carbon doctor— is my ThinkPad ready?- What you actually get (honest numbers)
- The model catalog
- Command reference
- The engines
- Use it as a local coding agent
- How it works
- The decision record (31 TDs)
- This is tuned for one machine, on purpose
- FAQ & gotchas
- License
carbon targets one laptop — the Lenovo ThinkPad Carbon X13 Aura — and is validated on it:
| Component | Value | Why it matters |
|---|---|---|
| Laptop | Lenovo ThinkPad Carbon X13 Aura | the target machine (repo codename carbon-x13-aura) |
| CPU | Intel Core Ultra 7 255H (Arrow Lake-H), 16 cores | the host |
| iGPU | Intel Arc 140T (Xe2, 8 Xe-cores) | the real LLM engine — most bandwidth + matrix cores |
| NPU | NPU 3720 (~11–13 TOPS, Meteor-Lake-gen silicon) | low-power ≤8B / prefill / embeddings (roadmap) |
| RAM | 30 GiB unified LPDDR5x (~100 GB/s) | decode is bandwidth-bound → this is the ceiling |
| OS | Ubuntu 25.04, kernel 6.14 |
The two facts that shape every decision in this repo: decode speed is limited by memory bandwidth, not TOPS (so the iGPU beats the NPU for big models), and you have lots of RAM but slow-ish bandwidth (so a Mixture-of-Experts model is the sweet spot). The full reasoning is in docs/why-these-choices.md.
# 0. clone, then install the CLI (uv creates the venv + installs typer/rich, ~2 s)
pip install uv && uv sync
# 1. is the host ready? (read-only checks; --fix applies the host setup with sudo)
uv run carbon doctor
uv run carbon doctor --fix # render group, oneAPI, Level-Zero, swap …
# 2. give big models a safety margin (this laptop ships with only 512 MB of swap)
uv run carbon swap 16G
# 3. build the from-source SYCL engine (only needed for newest archs / long context;
# the default IPEX engine ships prebuilt — see "The engines")
uv run carbon build
# 4. download a model, then run it
uv run carbon pull qwen3-coder-30b
uv run carbon run qwen3-coder-30b -p "Write a haiku about ThinkPads"
# 5. or serve an OpenAI-compatible API on http://localhost:8080
uv run carbon serve qwen3-coder-30b
# -> point your editor / agent at http://localhost:8080/v1 (see "local coding agent")
uv run carbon stopEverything is declared in carbon.toml (models + engines) — add your own there and
the CLI picks them up. Want to see exactly what a command will do before it touches your machine?
Append --dry-run to any command:
$ uv run carbon run qwen3-coder-30b --dry-run -p "Write a haiku about ThinkPads"
$ bash -c 'export ZES_ENABLE_SYSMAN=1; export SYCL_CACHE_PERSISTENT=1; \
exec ~/.cache/carbon-llm/ipex-cpp/.../llama-cli \
-m ~/.cache/carbon-llm/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
-c 16384 -ngl 99 -p 'Write a haiku about ThinkPads' -no-cnv'
carbon doctor checks the whole stack and tells you the one-line fix for anything red. Real output
on the target machine (all green):
carbon doctor — host (common)
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ check ┃ status ┃ detail ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ igpu │ ✓ │ /dev/dri/renderD128 present (Intel Arc) │
│ render-group │ ✓ │ user in the render group (can use /dev/dri) │
│ gpu-runtime │ ✓ │ Level-Zero loader present (libze_loader) │
│ git-curl │ ✓ │ git + curl present │
│ npu-device │ ✓ │ /dev/accel/accel0 present (NPU, roadmap) │
│ npu-driver │ ✓ │ intel_vpu kernel module loaded │
│ ram │ ✓ │ 30.8 GiB total, 24.5 GiB available │
│ swap │ ✓ │ 16.0 GiB of swap │
│ disk │ ✓ │ 126 GB free │
└──────────────┴────────┴─────────────────────────────────────────────┘
carbon doctor — LLM engine — llama.cpp / SYCL
│ cmake │ ✓ │ cmake present │
│ oneapi │ ✓ │ oneAPI toolchain present (SYCL build available) │
│ ipex-llm │ ✓ │ IPEX-LLM portable present (~2x decode on Arc, default) │
carbon doctor — image generation — OpenVINO
│ openvino │ ✓ │ OpenVINO GenAI venv present │
carbon doctor — agentic coding — OpenCode + OpenSpec
│ node │ ✓ │ Node.js present (manage with mise) │
│ opencode │ ✓ │ opencode present │
│ openspec │ ✓ │ openspec present │
✓ no host action needed
Run uv run carbon doctor --fix and carbon will, for each failing check, run the exact remediation
(adding you to the render group, installing the Level-Zero runtime, bumping swap, …).
These are measured decode rates (carbon bench → tg128) on the Arc 140T, Q4 weights, not
marketing figures. The headline: an MoE gives you ~30B-class quality at small-model speed, while a
dense 14B+ on this iGPU is slow.
| Model | Kind | Footprint | measured tok/s | Use it for |
|---|---|---|---|---|
| qwen3-coder-30b (MoE, 3.3B active) ⭐ | coding/agentic | 17.7 GB | 15.6 | the coding daily driver |
| qwen3-30b-a3b (MoE) | general | 18.6 GB | 12.8 | max-quality chat/reasoning |
| ernie-4.5-21b (MoE) | reasoning | 13.5 GB | 14.0 | thinking |
| gpt-oss-20b (MoE, MXFP4) | general | 12.1 GB | 10.8 | open reasoning |
| granite-4-h-tiny (hybrid Mamba-2) | fast/light | 4.3 GB | 21.6 | snappy chat, low power |
| qwen3-8b | fast/light | 5.0 GB | 10.9 | lean chat, the NPU-class model |
| qwen3-14b (dense) | general | 9.0 GB | 5.8 | solid dense fallback |
| deepseek-r1-14b (dense) | deep reasoning | 10.5 GB | 4.2 | competition math (slow) |
| moondream2 (VLM) | vision | 3.7 GB | 23.8 | fast image captioning |
| qwen3-vl-8b / qwen25-vl-7b (VLM) | vision/OCR | ~6 GB | 9.8 / 10.7 | documents, OCR |
Why the MoE wins: Qwen3-30B-A3B has 30B total parameters but only ~3.3B active per token, so decode (bandwidth-bound) reads ~3.3B — it runs ~2× faster than a dense 14B despite 2× the total size. That's the whole thesis of this project (TD-004).
A note on the engine tradeoff (don't be surprised): the default IPEX-LLM engine gives the
fastest decode, but its flash-attention garbles MoE output, so on IPEX carbon drops -fa +
quantized KV and caps context to 16K (f16 KV). For long context (up to 64K) with quantized KV,
use the from-source SYCL build or the OpenVINO server. See The engines.
The honest ceiling. As a fully autonomous coding agent, no model that fits in 30 GiB reliably converges "fix-until-green" on a hard task (best ≈ 6/10 one-shot; see docs/research/agentic-coding-benchmark-2026.md). Local on this class of laptop is a copilot you supervise, not an autopilot.
carbonis built so the same workflow scales, RAM-permitting, to a frontier-scale MoE on a bigger box.
carbon ships 25 curated, ready-to-pull models — LLMs, vision (VLM), image generation, and
embeddings — each with measured perf and the right engine pre-selected. uv run carbon models:
Full carbon models output (click to expand)
configured models
┃ name ┃ local ┃ size ┃ ctx ┃ params ┃ tk/s iGPU ┃ tags
│ qwen3-14b │ ✓ │ 9.0 GB │ 32K │ 14B │ 5.8 │ chat, coding, reasoning
│ qwen3-14b-spec │ ✓ │ 9.0 GB │ 32K │ 14B │ — │ chat, coding, reasoning, speculative
│ qwen3-8b │ ✓ │ 5.0 GB │ 32K │ 8B │ 10.9 │ chat, coding, fast, low-power
│ deepseek-r1-14b │ ✓ │ 10.5 GB │ 32K │ 14B │ 4.2 │ reasoning, coding, thinking, slow
│ qwen3-30b-a3b │ ✓ │ 18.6 GB │ 32K │ 30B │ 12.8 │ chat, coding, reasoning, moe, long-context
│ gpt-oss-20b │ ✓ │ 12.1 GB │ 32K │ 21B │ 10.8 │ chat, reasoning, moe, mxfp4
│ qwen3-coder-30b │ ✓ │ 17.7 GB │ 32K │ 30B │ 15.6 │ coding, agentic, moe, long-context
│ qwen3-30b-thinking │ ✓ │ 17.7 GB │ 32K │ 30B │ ~15 │ reasoning, thinking, moe, agentic, coding
│ qwen3-coder-30b-ov │ – │ 16.3 GB │ 32K │ 30B │ — │ coding, agentic, moe, openvino
│ deepseek-coder-v2-lite │ ✓ │ 10.4 GB │ 32K │ 16B │ — │ coding, agentic, moe, fast
│ glm-4.7-flash │ ✓ │ 17.5 GB │ 32K │ 30B │ — │ coding, agentic, moe, tool-use
│ ernie-4.5-21b │ ✓ │ 13.5 GB │ 32K │ 21B │ 14.0 │ reasoning, thinking, moe
│ granite-4-h-tiny │ ✓ │ 4.3 GB │ 32K │ 7B │ 21.6 │ chat, fast, moe, mamba, low-power
│ mistral-small-3.2 │ ✓ │ 14.3 GB │ 32K │ 24B │ 3.8 │ chat, multilingual, function-calling
│ gemma-3-27b │ ✓ │ 16.5 GB │ 32K │ 27B │ 3.1 │ chat, multilingual, vision-capable
│ gemma-3-12b │ ✓ │ 7.3 GB │ 32K │ 12B │ 6.2 │ chat, multilingual, vision-capable, fast
│ qwen3-vl-8b │ ✓ │ 6.3 GB │ 32K │ 8B │ 9.8 │ vision, ocr, documents
│ qwen25-vl-7b │ ✓ │ 6.0 GB │ 32K │ 7B │ 10.7 │ vision, ocr
│ moondream2 │ ✓ │ 3.7 GB │ 8K │ 2B │ 23.8 │ vision, captioning, fast
│ sdxl-turbo │ – │ — │ — │ 3B │ — │ image, fast
│ flux1-schnell │ – │ 9.2 GB │ — │ 12B │ — │ image, high-quality, slow
│ qwen3-embed-0.6b │ – │ 0.6 GB │ 32K │ 0.6B │ — │ embeddings, fast
│ qwen3-embed-4b │ – │ 2.5 GB │ 32K │ 4B │ — │ embeddings
│ bge-m3 │ – │ 0.6 GB │ 8K │ 0.6B │ — │ embeddings, multilingual
│ nomic-embed │ – │ 0.1 GB │ 8K │ 0.1B │ — │ embeddings, fast
uv run carbon show <model> drills into one (paths, server args, perf, notes).
Every command accepts --dry-run (print the host command, change nothing).
| Command | What it does |
|---|---|
carbon doctor [--fix] |
Diagnose the host (iGPU, render group, Level-Zero, oneAPI, NPU, RAM, swap, agent tools); --fix applies remediations. |
carbon swap [SIZE] [--resize/--new] [--force] |
Create/enable a swapfile (the laptop ships with 512 MB); a resize is OOM-guarded. |
carbon build [-e ENGINE] [--no-cache] |
Build llama.cpp from source with SYCL. No -e → builds only source engines; prebuilt (IPEX) is a no-op. |
carbon models |
List configured models: presence, size, context, params, measured tok/s, tags. |
carbon show MODEL |
Full detail for one model. |
carbon pull MODEL [-b] |
Download a model (GGUF + projector + draft) with resume; -b detaches. |
carbon run MODEL [-p PROMPT] |
Host-native inference (chat if no -p); auto-routes LLM / VLM / embedding / image. |
carbon serve MODEL [--port] |
Start an OpenAI-compatible server (/v1) for a model. |
carbon warm MODEL |
Serve + prefill & KV-cache the project system prompt (skips prefill on reuse). |
carbon bench MODEL [-s] |
Benchmark with llama-bench; -s writes the measured tok/s back to carbon.toml. |
carbon imatrix MODEL --calib FILE |
Generate an importance matrix (guides low-bit quant). |
carbon quantize MODEL [--recipe] |
Asymmetric MoE quant (routed experts low-bit, attention/output high). |
carbon stop |
Stop the running server(s). |
carbon agent install / config / run |
Install OpenCode+OpenSpec, point them at a served model, or drive a model headless on a task. |
carbon version |
Print the version. |
carbon is engine-pluggable (TD-011); each
model in carbon.toml picks the right one.
| Engine | What it is | When carbon uses it |
|---|---|---|
llamacpp-ipex (default) |
Intel's Xe-optimized prebuilt llama.cpp (IPEX-LLM portable) — ~2× decode on the Arc 140T. Self-contained, nothing to build. | The speed default for most models. Caveat: tracks a mid-2025 llama.cpp (newest archs go to SYCL) and its flash-attn garbles MoE → no -fa/quant-KV, 16K context. Intel archived IPEX-LLM in Jan 2026, but the binaries still give the best decode, so it stays the default. |
llamacpp-sycl |
llama.cpp built from source with SYCL/oneAPI. Correct flash-attention + quantized KV → up to 64K context. | Newest architectures, long context, speculative decoding, and anything IPEX can't load. |
llm-openvino |
OpenVINO GenAI as an OpenAI server with tool-calling and FA-independent int8/int4 KV. | The maintained long-context agentic backend (TD-022). |
image-openvino |
OpenVINO GenAI image generation (SDXL-Turbo, FLUX.1-schnell). | carbon run <image-model> -p "..." → PNG. |
carbon integrates with OpenCode so a locally-served model can drive your
editor/agent — fully offline.
uv run carbon agent install # OpenCode + OpenSpec (via npm)
uv run carbon serve qwen3-coder-30b # OpenAI API on :8080
uv run carbon agent config qwen3-coder-30b # writes opencode.json pointing at the local server
# ... now use OpenCode normally, or drive it headlessly:
uv run carbon agent run qwen3-coder-30b "implement the spec in SPEC.md until tests pass" --dir ./taskThe served API is plain OpenAI /v1, so any tool that speaks OpenAI works — point it at
http://localhost:8080/v1 with any API key (e.g. sk-local). A sample opencode.json
is included. (Tool-calling works on both llama.cpp and the OpenVINO engine; see the agentic benchmark
for the realistic capability ceiling.)
- Host-native, no Docker, no Ollama (TD-016).
carbonis a thin orchestrator over host processes: it builds llama.cpp, managesllama-server/llama-cli(and the OpenVINO server) directly, and accesses the Arc 140T through/dev/dri. No container indirection between you and the inference flags. - Reproducible by construction. Every command prints the exact shell line it runs;
--dry-runshows it without executing. Dependencies are pinned with uv (uv.lockcommitted). - Declarative. Models and engines live in
carbon.toml, read with the stdlibtomllib— the only runtime deps aretyper+rich(TD-013, TD-014). - Tuned to the silicon. Q4_K_M weights + Q8 KV + flash-attention for ~32K comfortable context; the quant floor and context levers are measured, not guessed (TD-006, TD-007, TD-017).
Every architectural choice is captured as a Technical Decision in
docs/decisions/ — 31 of them, with context, rationale, consequences, and the
alternatives rejected. The measurements behind them are in docs/research/.
| TD | Decision | Status |
|---|---|---|
| 002 | iGPU (Arc 140T) is the primary engine; NPU is secondary | accepted |
| 003 | No dense 14B on the NPU 3720 (it's a dead-end) | accepted |
| 004 | Daily driver = MoE Qwen3-30B-A3B | accepted |
| 005 · 006 · 007 | 2507 variant · Q8 KV / 32K · Q4_K_M default | accepted |
| 008 | llama.cpp/SYCL over Ollama (matrix cores + flag control) | accepted |
| 011 · 013 · 014 | Pluggable engines · TOML config · uv | accepted |
| 016 | Host-native execution, remove Docker (the pivot) | accepted |
| 017 · 018 | Q4_K is the quant floor; flash-attn is the context lever | accepted / phase-0 |
| 022 | OpenVINO engine + OpenAI serving for agentic coding | in progress |
| 001 · 009 · 010 | Docker / Ollama / custom images | superseded by TD-016 |
| 019 · 021 | Heterogeneous kernels · port IPEX kernels | blocked (documented) |
| 012 · 020 | NPU INT4 export · NPU embeddings for RAG | proposed (roadmap) |
| 023 | carbon warm — system-prompt KV-root sharing (prefill once, reuse) |
accepted |
| 024 | "Agent Mind" — repo digest injected into the warmed prefix | proposed (roadmap) |
| 025 · 026 · 027 | KV-segment composition · semantic cache · agentic trace cache | proposed (R&D) |
| 028 | Frontier MoEs (Kimi K2, Qwen3-235B) out of scope at 30 GiB | accepted |
| 029 | On-device weight RL / fine-tuning out of scope | accepted |
| 030 · 031 | Optimize the agent OS, not the weights · first-party trace+reward loop | proposed (R&D) |
The narrative arc — from "can I run a 14B on the NPU?" to the final architecture, including the dead-ends (IPEX archived, the Vulkan classifier bug, sub-Q4 being compute-bound) — is told in docs/why-these-choices.md.
Portability is an explicit non-goal (TD-016).
carbon configures one box, once — that's exactly why it can drop the container/abstraction layers
that exist to make software run on arbitrary hosts, and instead expose llama.cpp's flags directly,
pick per-model engines by hand, and bake in measured numbers.
That said, the design travels even though the tuning doesn't. To adapt it to your own ThinkPad:
- Run
uv run carbon doctor— it'll tell you what your silicon is missing. - Edit
carbon.toml: setbin_dir/oneapi_setvarsfor your engine, and add models (each is ~10 lines:engine,gguf/hf_repo,server_args,context_length). - Re-measure with
carbon bench <model> -s— the crossover points (which engine/quant/context wins) are bandwidth-dependent and will differ from this machine's. Don't trust these numbers on a box with different memory bandwidth; trust your owncarbon bench.
- "NPU not detected." Almost always: you're not in the
rendergroup, or a Level-Zero/OpenVINO ABI mismatch.carbon doctorchecks both. The NPU is on the roadmap (embeddings/RAG); today the iGPU does the heavy lifting. - Default context is 16K, not 32K? On the default IPEX engine, yes — its flash-attn garbles
MoE output, so
carbondrops-fa/quant-KV and caps context to 16K (f16 KV). Usellamacpp-syclorllm-openvinofor 32–64K with quantized KV. - Big MoE near the RAM ceiling. ~18 GB of weights + KV + OS sits at the edge of 30 GiB → keep KV
in Q8 and run
carbon swap 16Gso a spike can't OOM. - First SYCL run is slow. Kernels compile on first use;
SYCL_CACHE_PERSISTENT=1(set by the engine) caches them across runs.
MIT © Guido Dassori. Built for, and validated on, a Lenovo ThinkPad Carbon X13 Aura (Intel Core Ultra 7 255H · Arc 140T · NPU 3720). Model weights and the inference engines (llama.cpp, IPEX-LLM, OpenVINO) are the property of their respective authors under their own licenses.