localQwen

Running Qwen3.6-35B-A3B (Unsloth UD-Q4_K_M, 21 GB MoE) on a single 16 GB consumer GPU with 128K context using llama.cpp.

This repo is a thin layer over llama.cpp:

run-server.sh — launch the OpenAI-compatible HTTP server with the tuned flags
run-bench.sh — reproduce the prefill/decode throughput numbers
opencode.json — OpenCode provider config pointing at the local server

The model and llama.cpp itself live next to these files at runtime but aren't checked in (see Setup).

Setup

Requirements: NVIDIA GPU with CUDA 12+ (tested on CUDA 13.2 / Blackwell sm_120), ~24 GB free disk, ~24 GB free RAM.

# 1. Clone llama.cpp alongside these scripts and build with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j
cd ..

# 2. Download the Qwen3.6-35B-A3B UD-Q4_K_M GGUF (~21 GB)
mkdir -p models
wget -c -O models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

Replace native with your compute capability (e.g. 120 for Blackwell, 89 for Ada, 86 for Ampere) if native doesn't work.

Quick start

./run-server.sh            # desktop-friendly default: leaves VRAM+threads for Plasma/IDE/browser
PORT=9000 ./run-server.sh  # override port
HOST=0.0.0.0 ./run-server.sh                # expose on LAN (no auth — trusted networks only)
FIT_TARGET=512 THREADS=12 ./run-server.sh   # dedicated/max throughput (~10% more tok/s, no concurrent apps)
FIT_TARGET=256 ./run-server.sh              # squeeze +1 MoE layer onto GPU (OOM-prone)
MODEL=/abs/path.gguf ./run-server.sh        # if multiple .gguf in models/
./run-bench.sh             # pp512/pp2048/pp4096 + tg128, 3 reps each

Hit http://localhost:8033 for the built-in chat UI or point any OpenAI client at http://localhost:8033/v1. The server binds to 127.0.0.1 by default — the OpenAI-compatible endpoint has no authentication, so exposing it with HOST=0.0.0.0 is only safe on a network you control.

Why this config

Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35 B total parameters, but only ~3 B activate per token. The naive --cpu-moe flag pushes every expert to CPU, leaving the 16 GB GPU almost idle (only ~1.9 GB of attention weights stay on it). --n-cpu-moe N is the right knob: keep the first N layers' experts on CPU, put the rest on GPU. For a ~22 GB model on a 16 GB GPU the sweet spot is around N=19–20.

How the MoE split works

                     ┌──────────────────────────────┐
                     │  Qwen3.6-35B-A3B, 40 layers  │
                     └──────────────────────────────┘
                                    │
                  ┌─────────────────┴─────────────────┐
                  ▼                                   ▼
          Attention weights                   MoE expert weights
          (always on GPU)                     (split per layer)
          ~1.9 GB VRAM                        ~0.53 GB VRAM per layer
                                                     │
                          ┌──────────────────────────┴──────────┐
                          ▼                                     ▼
             Layers 0 .. N-1 → CPU                Layers N .. 39 → GPU
             (RAM, runs on 5600X cores)           (VRAM, runs on RTX 5070 Ti)

               ── per token, the router picks ~8 of 128 experts in each layer ──

             ┌────────────────────────────────────────────────────────┐
             │  Ryzen 5 5600X  ◀── PCIe ──▶  RTX 5070 Ti (16 GB)      │
             │  32 GB DDR4                   Blackwell, CUDA 13.2     │
             └────────────────────────────────────────────────────────┘

--fit on --fit-ctx 128000 --fit-target 1536 auto-probes VRAM at startup and picks N for you. The 1536 MiB headroom is sized for concurrent desktop use (Plasma compositor + Chrome/Firefox HW accel + IDE GPU surfaces). Measured on this host with an active KDE/Wayland 4K desktop it converges to N≈26 (desktop already using ~1.8 GB of the 16 GB, so ~7 GB of MoE surplus fits on GPU). Pass FIT_TARGET=512 to recover a chunk of throughput when you're not using the machine for anything else. Skip manual --n-cpu-moe tuning unless you want to pin a specific value.

What each flag does

Flag	Purpose
`--fit on`	Auto-tune expert placement to fill VRAM
`--fit-ctx 128000`	Guarantee `--fit` reserves room for 128K context (bare `--fit` silently falls back to 4K)
`--fit-target 1536`	VRAM headroom in MiB. 1536 is the desktop-friendly default — leaves room for Plasma/Chrome/IDE on your GPU. `FIT_TARGET=512 ./run-server.sh` recovers ~2 MoE layers (~10% more tok/s) for dedicated inference. `FIT_TARGET=256` pushes one layer further but is OOM-prone
`-np 1`	Single user slot. Default is 4, wastes ~190 MB on recurrent state for Qwen's Gated Delta Net layers
`-fa on`	FlashAttention
`--no-mmap` `--mlock`	Load once into locked RAM pages; don't let the kernel page them out
`-b 2048` `-ub 2048`	Larger logical and physical batch for prefill — 59 % faster on 2K prompts vs. the 512 default
`-t 10 -tb 10`	Reserve 1 physical core (2 SMT threads) on the 5600X for the desktop. Near-zero throughput hit, much better Plasma/IDE responsiveness during decode. Override with `THREADS=12` for dedicated use
`-ctk q8_0` `-ctv q8_0`	Quantise KV cache to 8-bit. Near-lossless, halves KV memory. 128K ctx only costs ~1.4 GB
`--reasoning-budget -1`	No hard cap on reasoning tokens
`--chat-template-kwargs '{"preserve_thinking": true}'`	Carry thinking traces across all historical turns — better tool-use consistency and KV reuse in agent loops
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0`	Unsloth's "Precise Coding" preset. Swap to `--temp 1.0 --presence-penalty 1.5` for general chat

Qwen3.6 architectural note

Only 10 of the 40 layers use standard attention; the rest are Gated Delta Net (a recurrent variant). That's why the KV cache is tiny even at 128K and why -np 1 matters — each extra slot duplicates recurrent state, not just KV.

Tuning for a different GPU

Each MoE layer on GPU costs ~530 MB VRAM. Attention weights are ~1.9 GB fixed.

VRAM	Strategy
8 GB	`--cpu-moe` (all experts on CPU)
12 GB	`FIT_TARGET=512 ./run-server.sh` → ~N=26
16 GB, concurrent desktop use	`./run-server.sh` (default) → ~N=24–26 (depends on how much VRAM your compositor/browser is already using)
16 GB, dedicated / headless GPU	`FIT_TARGET=512 THREADS=12 ./run-server.sh` → ~N=19
24 GB	`./run-server.sh` → ~N=10

If you want to pin a value, override with -ncmoe N instead of --fit.

Expected throughput

The CPU matters more than you'd think. During prefill and decode, tokens flow through every layer — so MoE experts placed on CPU affect both. A CPU without 3D V-Cache sees meaningfully lower prefill, not just lower decode.

Measured on Ryzen 5 5600X + 5070 Ti (this host) vs. upstream reference (9800X3D + 5070 Ti). Two columns for this host: the default desktop-friendly preset (N_CPU_MOE=26, THREADS=10, ~1.5 GB VRAM + 1 physical core reserved for Plasma/IDE/browser) and the dedicated preset (FIT_TARGET=512 THREADS=12 ./run-server.sh → N_CPU_MOE=19, all threads).

Test	default / concurrent	dedicated / headless	9800X3D reference (N=19)
pp512	630	808	~2770
pp2048	1707	2091	~4450
pp4096	1703	2097	~4420
tg128	59.9	76.5	~98

The concurrent preset costs ~22% decode and ~19–25% prefill vs. dedicated on this host — that's the real price of leaving VRAM and cores for the desktop. The 5600X costs most at pp512 (small-batch kernels are more sensitive to CPU-side latency for the offloaded MoE path) and least at tg128 relative to the 9800X3D reference. 3D V-Cache on the reference CPU helps decode more than you'd expect from the clock-speed delta alone.

Run ./run-bench.sh to measure on your host. If it OOMs — usually because another process is eating VRAM — either free GPU memory or bump N_CPU_MOE / drop UBATCH (see Troubleshooting).

Troubleshooting

mlock failed: cannot allocate memory — the kernel's locked-memory limit is too low. Check with ulimit -l; if it's not unlimited add to /etc/security/limits.conf:

<your-user>  hard  memlock  unlimited
<your-user>  soft  memlock  unlimited

Log out and back in. Or launch the server without locking by editing run-server.sh to drop --mlock --no-mmap (costs a bit of first-token latency but won't fail).

Note: the CPU-side weights (CUDA_Host model buffer) are page-locked by CUDA itself as part of host-pinned allocation, so the warning is largely cosmetic for this config — the pages that matter are already pinned. Fixing the ulimit still removes the warning and makes the lock state unambiguous.

OOM at model load or during prefill — --fit-target 512 is the safe default but a desktop compositor or other GPU consumer can still push it over. Bump headroom: FIT_TARGET=1024 ./run-server.sh. To squeeze one more MoE layer onto the GPU at the cost of fragility: FIT_TARGET=256.

llama-bench OOMs on pp2048 with CUDA pool error — bench has no --fit so it can't reserve room for the compute buffer dynamically. Increase N_CPU_MOE for bench only (default is 22; try 24 if 22 still OOMs) or lower UBATCH: UBATCH=1024 ./run-bench.sh.

bind: address already in use — something else is on port 8033. Either kill it (ss -tlnp | grep 8033) or PORT=9000 ./run-server.sh. Remember to update opencode.json or ANTHROPIC_BASE_URL to match.

OpenCode / Claude Code connects but no model listed — the model ID your config declares must match what the server advertises. Check with curl http://localhost:8033/v1/models and edit the models key in opencode.json accordingly.

Different quant / different model file — run-server.sh auto-picks the single .gguf in models/. With multiple files, pass MODEL=/abs/path.gguf as an env var.

Using it from a coding agent

Claude Code

This build exposes /v1/messages natively, so no proxy is needed:

export ANTHROPIC_BASE_URL=http://localhost:8033
export ANTHROPIC_API_KEY=dummy    # required to be set, not checked locally
claude

Use /model inside the session to pick the local model. Expect degraded agentic quality vs. real Claude — Qwen3.6-35B is capable but gets tripped up on multi-step tool use more often, and prompt caching (which Claude Code leans on) doesn't apply to a local backend.

OpenCode

OpenCode talks OpenAI-compatible natively and is tuned around local models, so it's usually a better fit for Qwen.

Install (Arch / CachyOS, via AUR — needs your sudo password):

paru -S opencode-bin

The package in the official extra repo is pinned to an old 1.4.7; the AUR opencode-bin tracks upstream (currently 1.14.x) — same project (anomalyco/opencode), just current.

Config — this repo already ships opencode.json pointing at http://127.0.0.1:8033/v1. Either run OpenCode from this directory so it picks up the local config, or install it globally:

# per-project: run it from this repo
./run-server.sh &                   # keep the server running
opencode                            # picks up ./opencode.json

# or install globally
mkdir -p ~/.config/opencode
cp opencode.json ~/.config/opencode/opencode.json

On first launch, select the llama-local provider. The model ID in opencode.json is the full GGUF filename (Qwen3.6-35B-A3B-UD-Q4_K_M.gguf) because that's exactly what /v1/models returns. If you use a different quant, update the key under models to match curl http://localhost:8033/v1/models.

References

Upstream llama.cpp: https://github.com/ggml-org/llama.cpp
Model card: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
r/LocalLLaMA tuning thread this config is derived from (Qwen3.6 + --n-cpu-moe on 16 GB GPUs)

License

MIT. See LICENSE. The scripts here are original to this repo; llama.cpp and the Qwen3.6 weights have their own upstream licenses (MIT and Apache-2.0 respectively).

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
opencode.json		opencode.json
run-bench.sh		run-bench.sh
run-server.sh		run-server.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

localQwen

Setup

Quick start

Why this config

How the MoE split works

What each flag does

Qwen3.6 architectural note

Tuning for a different GPU

Expected throughput

Troubleshooting

Using it from a coding agent

Claude Code

OpenCode

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

localQwen

Setup

Quick start

Why this config

How the MoE split works

What each flag does

Qwen3.6 architectural note

Tuning for a different GPU

Expected throughput

Troubleshooting

Using it from a coding agent

Claude Code

OpenCode

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages