Running Qwen3.6-35B-A3B (Unsloth UD-Q4_K_M, 21 GB MoE) on a single 16 GB
consumer GPU with 128K context using llama.cpp.
This repo is a thin layer over llama.cpp:
run-server.sh— launch the OpenAI-compatible HTTP server with the tuned flagsrun-bench.sh— reproduce the prefill/decode throughput numbersopencode.json— OpenCode provider config pointing at the local server
The model and llama.cpp itself live next to these files at runtime but
aren't checked in (see Setup).
Requirements: NVIDIA GPU with CUDA 12+ (tested on CUDA 13.2 / Blackwell
sm_120), ~24 GB free disk, ~24 GB free RAM.
# 1. Clone llama.cpp alongside these scripts and build with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j
cd ..
# 2. Download the Qwen3.6-35B-A3B UD-Q4_K_M GGUF (~21 GB)
mkdir -p models
wget -c -O models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_M.ggufReplace native with your compute capability (e.g. 120 for Blackwell,
89 for Ada, 86 for Ampere) if native doesn't work.
./run-server.sh # desktop-friendly default: leaves VRAM+threads for Plasma/IDE/browser
PORT=9000 ./run-server.sh # override port
HOST=0.0.0.0 ./run-server.sh # expose on LAN (no auth — trusted networks only)
FIT_TARGET=512 THREADS=12 ./run-server.sh # dedicated/max throughput (~10% more tok/s, no concurrent apps)
FIT_TARGET=256 ./run-server.sh # squeeze +1 MoE layer onto GPU (OOM-prone)
MODEL=/abs/path.gguf ./run-server.sh # if multiple .gguf in models/
./run-bench.sh # pp512/pp2048/pp4096 + tg128, 3 reps eachHit http://localhost:8033 for the built-in chat UI or point any OpenAI
client at http://localhost:8033/v1. The server binds to 127.0.0.1 by
default — the OpenAI-compatible endpoint has no authentication, so exposing
it with HOST=0.0.0.0 is only safe on a network you control.
Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35 B total parameters, but
only ~3 B activate per token. The naive --cpu-moe flag pushes every expert
to CPU, leaving the 16 GB GPU almost idle (only ~1.9 GB of attention weights
stay on it). --n-cpu-moe N is the right knob: keep the first N layers'
experts on CPU, put the rest on GPU. For a ~22 GB model on a 16 GB GPU the
sweet spot is around N=19–20.
┌──────────────────────────────┐
│ Qwen3.6-35B-A3B, 40 layers │
└──────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
Attention weights MoE expert weights
(always on GPU) (split per layer)
~1.9 GB VRAM ~0.53 GB VRAM per layer
│
┌──────────────────────────┴──────────┐
▼ ▼
Layers 0 .. N-1 → CPU Layers N .. 39 → GPU
(RAM, runs on 5600X cores) (VRAM, runs on RTX 5070 Ti)
── per token, the router picks ~8 of 128 experts in each layer ──
┌────────────────────────────────────────────────────────┐
│ Ryzen 5 5600X ◀── PCIe ──▶ RTX 5070 Ti (16 GB) │
│ 32 GB DDR4 Blackwell, CUDA 13.2 │
└────────────────────────────────────────────────────────┘
--fit on --fit-ctx 128000 --fit-target 1536 auto-probes VRAM at startup and
picks N for you. The 1536 MiB headroom is sized for concurrent desktop use
(Plasma compositor + Chrome/Firefox HW accel + IDE GPU surfaces). Measured on
this host with an active KDE/Wayland 4K desktop it converges to N≈26 (desktop
already using ~1.8 GB of the 16 GB, so ~7 GB of MoE surplus fits on GPU).
Pass FIT_TARGET=512 to recover a chunk of throughput when you're not using
the machine for anything else. Skip manual --n-cpu-moe tuning unless you
want to pin a specific value.
| Flag | Purpose |
|---|---|
--fit on |
Auto-tune expert placement to fill VRAM |
--fit-ctx 128000 |
Guarantee --fit reserves room for 128K context (bare --fit silently falls back to 4K) |
--fit-target 1536 |
VRAM headroom in MiB. 1536 is the desktop-friendly default — leaves room for Plasma/Chrome/IDE on your GPU. FIT_TARGET=512 ./run-server.sh recovers ~2 MoE layers (~10% more tok/s) for dedicated inference. FIT_TARGET=256 pushes one layer further but is OOM-prone |
-np 1 |
Single user slot. Default is 4, wastes ~190 MB on recurrent state for Qwen's Gated Delta Net layers |
-fa on |
FlashAttention |
--no-mmap --mlock |
Load once into locked RAM pages; don't let the kernel page them out |
-b 2048 -ub 2048 |
Larger logical and physical batch for prefill — 59 % faster on 2K prompts vs. the 512 default |
-t 10 -tb 10 |
Reserve 1 physical core (2 SMT threads) on the 5600X for the desktop. Near-zero throughput hit, much better Plasma/IDE responsiveness during decode. Override with THREADS=12 for dedicated use |
-ctk q8_0 -ctv q8_0 |
Quantise KV cache to 8-bit. Near-lossless, halves KV memory. 128K ctx only costs ~1.4 GB |
--reasoning-budget -1 |
No hard cap on reasoning tokens |
--chat-template-kwargs '{"preserve_thinking": true}' |
Carry thinking traces across all historical turns — better tool-use consistency and KV reuse in agent loops |
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 |
Unsloth's "Precise Coding" preset. Swap to --temp 1.0 --presence-penalty 1.5 for general chat |
Only 10 of the 40 layers use standard attention; the rest are Gated Delta Net
(a recurrent variant). That's why the KV cache is tiny even at 128K and why
-np 1 matters — each extra slot duplicates recurrent state, not just KV.
Each MoE layer on GPU costs ~530 MB VRAM. Attention weights are ~1.9 GB fixed.
| VRAM | Strategy |
|---|---|
| 8 GB | --cpu-moe (all experts on CPU) |
| 12 GB | FIT_TARGET=512 ./run-server.sh → ~N=26 |
| 16 GB, concurrent desktop use | ./run-server.sh (default) → ~N=24–26 (depends on how much VRAM your compositor/browser is already using) |
| 16 GB, dedicated / headless GPU | FIT_TARGET=512 THREADS=12 ./run-server.sh → ~N=19 |
| 24 GB | ./run-server.sh → ~N=10 |
If you want to pin a value, override with -ncmoe N instead of --fit.
The CPU matters more than you'd think. During prefill and decode, tokens flow through every layer — so MoE experts placed on CPU affect both. A CPU without 3D V-Cache sees meaningfully lower prefill, not just lower decode.
Measured on Ryzen 5 5600X + 5070 Ti (this host) vs. upstream reference
(9800X3D + 5070 Ti). Two columns for this host: the default
desktop-friendly preset (N_CPU_MOE=26, THREADS=10, ~1.5 GB VRAM + 1
physical core reserved for Plasma/IDE/browser) and the dedicated preset
(FIT_TARGET=512 THREADS=12 ./run-server.sh → N_CPU_MOE=19, all threads).
| Test | default / concurrent | dedicated / headless | 9800X3D reference (N=19) |
|---|---|---|---|
| pp512 | 630 | 808 | ~2770 |
| pp2048 | 1707 | 2091 | ~4450 |
| pp4096 | 1703 | 2097 | ~4420 |
| tg128 | 59.9 | 76.5 | ~98 |
The concurrent preset costs ~22% decode and ~19–25% prefill vs. dedicated on
this host — that's the real price of leaving VRAM and cores for the desktop.
The 5600X costs most at pp512 (small-batch kernels are more sensitive to
CPU-side latency for the offloaded MoE path) and least at tg128 relative
to the 9800X3D reference. 3D V-Cache on the reference CPU helps decode more
than you'd expect from the clock-speed delta alone.
Run ./run-bench.sh to measure on your host. If it OOMs — usually
because another process is eating VRAM — either free GPU memory or bump
N_CPU_MOE / drop UBATCH (see Troubleshooting).
mlock failed: cannot allocate memory — the kernel's locked-memory
limit is too low. Check with ulimit -l; if it's not unlimited add to
/etc/security/limits.conf:
<your-user> hard memlock unlimited
<your-user> soft memlock unlimited
Log out and back in. Or launch the server without locking by editing
run-server.sh to drop --mlock --no-mmap (costs a bit of first-token
latency but won't fail).
Note: the CPU-side weights (CUDA_Host model buffer) are page-locked by
CUDA itself as part of host-pinned allocation, so the warning is largely
cosmetic for this config — the pages that matter are already pinned.
Fixing the ulimit still removes the warning and makes the lock state
unambiguous.
OOM at model load or during prefill — --fit-target 512 is the safe
default but a desktop compositor or other GPU consumer can still push it
over. Bump headroom: FIT_TARGET=1024 ./run-server.sh. To squeeze one more
MoE layer onto the GPU at the cost of fragility: FIT_TARGET=256.
llama-bench OOMs on pp2048 with CUDA pool error — bench has no --fit
so it can't reserve room for the compute buffer dynamically. Increase
N_CPU_MOE for bench only (default is 22; try 24 if 22 still OOMs) or lower
UBATCH: UBATCH=1024 ./run-bench.sh.
bind: address already in use — something else is on port 8033. Either
kill it (ss -tlnp | grep 8033) or PORT=9000 ./run-server.sh. Remember to
update opencode.json or ANTHROPIC_BASE_URL to match.
OpenCode / Claude Code connects but no model listed — the model ID your
config declares must match what the server advertises. Check with
curl http://localhost:8033/v1/models and edit the models key in
opencode.json accordingly.
Different quant / different model file — run-server.sh auto-picks the
single .gguf in models/. With multiple files, pass MODEL=/abs/path.gguf
as an env var.
This build exposes /v1/messages natively, so no proxy is needed:
export ANTHROPIC_BASE_URL=http://localhost:8033
export ANTHROPIC_API_KEY=dummy # required to be set, not checked locally
claudeUse /model inside the session to pick the local model. Expect degraded
agentic quality vs. real Claude — Qwen3.6-35B is capable but gets tripped up
on multi-step tool use more often, and prompt caching (which Claude Code
leans on) doesn't apply to a local backend.
OpenCode talks OpenAI-compatible natively and is tuned around local models, so it's usually a better fit for Qwen.
Install (Arch / CachyOS, via AUR — needs your sudo password):
paru -S opencode-binThe package in the official extra repo is pinned to an old 1.4.7; the AUR
opencode-bin tracks upstream (currently 1.14.x) — same project
(anomalyco/opencode), just current.
Config — this repo already ships opencode.json pointing at
http://127.0.0.1:8033/v1. Either run OpenCode from this directory so it
picks up the local config, or install it globally:
# per-project: run it from this repo
./run-server.sh & # keep the server running
opencode # picks up ./opencode.json
# or install globally
mkdir -p ~/.config/opencode
cp opencode.json ~/.config/opencode/opencode.jsonOn first launch, select the llama-local provider. The model ID in
opencode.json is the full GGUF filename (Qwen3.6-35B-A3B-UD-Q4_K_M.gguf)
because that's exactly what /v1/models returns. If you use a different
quant, update the key under models to match
curl http://localhost:8033/v1/models.
- Upstream llama.cpp: https://github.com/ggml-org/llama.cpp
- Model card: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
- r/LocalLLaMA tuning thread this config is derived from (Qwen3.6 +
--n-cpu-moeon 16 GB GPUs)
MIT. See LICENSE. The scripts here are original to this repo;
llama.cpp and the Qwen3.6 weights have their own upstream licenses (MIT
and Apache-2.0 respectively).