Skip to content

n-n-code/qwen3-6-local-setup

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

localQwen

Running Qwen3.6-35B-A3B (Unsloth UD-Q4_K_M, 21 GB MoE) on a single 16 GB consumer GPU with 128K context using llama.cpp.

This repo is a thin layer over llama.cpp:

  • run-server.sh — launch the OpenAI-compatible HTTP server with the tuned flags
  • run-bench.sh — reproduce the prefill/decode throughput numbers
  • opencode.json — OpenCode provider config pointing at the local server

The model and llama.cpp itself live next to these files at runtime but aren't checked in (see Setup).

Setup

Requirements: NVIDIA GPU with CUDA 12+ (tested on CUDA 13.2 / Blackwell sm_120), ~24 GB free disk, ~24 GB free RAM.

# 1. Clone llama.cpp alongside these scripts and build with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j
cd ..

# 2. Download the Qwen3.6-35B-A3B UD-Q4_K_M GGUF (~21 GB)
mkdir -p models
wget -c -O models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

Replace native with your compute capability (e.g. 120 for Blackwell, 89 for Ada, 86 for Ampere) if native doesn't work.

Quick start

./run-server.sh            # desktop-friendly default: leaves VRAM+threads for Plasma/IDE/browser
PORT=9000 ./run-server.sh  # override port
HOST=0.0.0.0 ./run-server.sh                # expose on LAN (no auth — trusted networks only)
FIT_TARGET=512 THREADS=12 ./run-server.sh   # dedicated/max throughput (~10% more tok/s, no concurrent apps)
FIT_TARGET=256 ./run-server.sh              # squeeze +1 MoE layer onto GPU (OOM-prone)
MODEL=/abs/path.gguf ./run-server.sh        # if multiple .gguf in models/
./run-bench.sh             # pp512/pp2048/pp4096 + tg128, 3 reps each

Hit http://localhost:8033 for the built-in chat UI or point any OpenAI client at http://localhost:8033/v1. The server binds to 127.0.0.1 by default — the OpenAI-compatible endpoint has no authentication, so exposing it with HOST=0.0.0.0 is only safe on a network you control.

Why this config

Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35 B total parameters, but only ~3 B activate per token. The naive --cpu-moe flag pushes every expert to CPU, leaving the 16 GB GPU almost idle (only ~1.9 GB of attention weights stay on it). --n-cpu-moe N is the right knob: keep the first N layers' experts on CPU, put the rest on GPU. For a ~22 GB model on a 16 GB GPU the sweet spot is around N=19–20.

How the MoE split works

                     ┌──────────────────────────────┐
                     │  Qwen3.6-35B-A3B, 40 layers  │
                     └──────────────────────────────┘
                                    │
                  ┌─────────────────┴─────────────────┐
                  ▼                                   ▼
          Attention weights                   MoE expert weights
          (always on GPU)                     (split per layer)
          ~1.9 GB VRAM                        ~0.53 GB VRAM per layer
                                                     │
                          ┌──────────────────────────┴──────────┐
                          ▼                                     ▼
             Layers 0 .. N-1 → CPU                Layers N .. 39 → GPU
             (RAM, runs on 5600X cores)           (VRAM, runs on RTX 5070 Ti)

               ── per token, the router picks ~8 of 128 experts in each layer ──

             ┌────────────────────────────────────────────────────────┐
             │  Ryzen 5 5600X  ◀── PCIe ──▶  RTX 5070 Ti (16 GB)      │
             │  32 GB DDR4                   Blackwell, CUDA 13.2     │
             └────────────────────────────────────────────────────────┘

--fit on --fit-ctx 128000 --fit-target 1536 auto-probes VRAM at startup and picks N for you. The 1536 MiB headroom is sized for concurrent desktop use (Plasma compositor + Chrome/Firefox HW accel + IDE GPU surfaces). Measured on this host with an active KDE/Wayland 4K desktop it converges to N≈26 (desktop already using ~1.8 GB of the 16 GB, so ~7 GB of MoE surplus fits on GPU). Pass FIT_TARGET=512 to recover a chunk of throughput when you're not using the machine for anything else. Skip manual --n-cpu-moe tuning unless you want to pin a specific value.

What each flag does

Flag Purpose
--fit on Auto-tune expert placement to fill VRAM
--fit-ctx 128000 Guarantee --fit reserves room for 128K context (bare --fit silently falls back to 4K)
--fit-target 1536 VRAM headroom in MiB. 1536 is the desktop-friendly default — leaves room for Plasma/Chrome/IDE on your GPU. FIT_TARGET=512 ./run-server.sh recovers ~2 MoE layers (~10% more tok/s) for dedicated inference. FIT_TARGET=256 pushes one layer further but is OOM-prone
-np 1 Single user slot. Default is 4, wastes ~190 MB on recurrent state for Qwen's Gated Delta Net layers
-fa on FlashAttention
--no-mmap --mlock Load once into locked RAM pages; don't let the kernel page them out
-b 2048 -ub 2048 Larger logical and physical batch for prefill — 59 % faster on 2K prompts vs. the 512 default
-t 10 -tb 10 Reserve 1 physical core (2 SMT threads) on the 5600X for the desktop. Near-zero throughput hit, much better Plasma/IDE responsiveness during decode. Override with THREADS=12 for dedicated use
-ctk q8_0 -ctv q8_0 Quantise KV cache to 8-bit. Near-lossless, halves KV memory. 128K ctx only costs ~1.4 GB
--reasoning-budget -1 No hard cap on reasoning tokens
--chat-template-kwargs '{"preserve_thinking": true}' Carry thinking traces across all historical turns — better tool-use consistency and KV reuse in agent loops
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 Unsloth's "Precise Coding" preset. Swap to --temp 1.0 --presence-penalty 1.5 for general chat

Qwen3.6 architectural note

Only 10 of the 40 layers use standard attention; the rest are Gated Delta Net (a recurrent variant). That's why the KV cache is tiny even at 128K and why -np 1 matters — each extra slot duplicates recurrent state, not just KV.

Tuning for a different GPU

Each MoE layer on GPU costs ~530 MB VRAM. Attention weights are ~1.9 GB fixed.

VRAM Strategy
8 GB --cpu-moe (all experts on CPU)
12 GB FIT_TARGET=512 ./run-server.sh → ~N=26
16 GB, concurrent desktop use ./run-server.sh (default) → ~N=24–26 (depends on how much VRAM your compositor/browser is already using)
16 GB, dedicated / headless GPU FIT_TARGET=512 THREADS=12 ./run-server.sh → ~N=19
24 GB ./run-server.sh → ~N=10

If you want to pin a value, override with -ncmoe N instead of --fit.

Expected throughput

The CPU matters more than you'd think. During prefill and decode, tokens flow through every layer — so MoE experts placed on CPU affect both. A CPU without 3D V-Cache sees meaningfully lower prefill, not just lower decode.

Measured on Ryzen 5 5600X + 5070 Ti (this host) vs. upstream reference (9800X3D + 5070 Ti). Two columns for this host: the default desktop-friendly preset (N_CPU_MOE=26, THREADS=10, ~1.5 GB VRAM + 1 physical core reserved for Plasma/IDE/browser) and the dedicated preset (FIT_TARGET=512 THREADS=12 ./run-server.shN_CPU_MOE=19, all threads).

Test default / concurrent dedicated / headless 9800X3D reference (N=19)
pp512 630 808 ~2770
pp2048 1707 2091 ~4450
pp4096 1703 2097 ~4420
tg128 59.9 76.5 ~98

The concurrent preset costs ~22% decode and ~19–25% prefill vs. dedicated on this host — that's the real price of leaving VRAM and cores for the desktop. The 5600X costs most at pp512 (small-batch kernels are more sensitive to CPU-side latency for the offloaded MoE path) and least at tg128 relative to the 9800X3D reference. 3D V-Cache on the reference CPU helps decode more than you'd expect from the clock-speed delta alone.

Run ./run-bench.sh to measure on your host. If it OOMs — usually because another process is eating VRAM — either free GPU memory or bump N_CPU_MOE / drop UBATCH (see Troubleshooting).

Troubleshooting

mlock failed: cannot allocate memory — the kernel's locked-memory limit is too low. Check with ulimit -l; if it's not unlimited add to /etc/security/limits.conf:

<your-user>  hard  memlock  unlimited
<your-user>  soft  memlock  unlimited

Log out and back in. Or launch the server without locking by editing run-server.sh to drop --mlock --no-mmap (costs a bit of first-token latency but won't fail).

Note: the CPU-side weights (CUDA_Host model buffer) are page-locked by CUDA itself as part of host-pinned allocation, so the warning is largely cosmetic for this config — the pages that matter are already pinned. Fixing the ulimit still removes the warning and makes the lock state unambiguous.

OOM at model load or during prefill--fit-target 512 is the safe default but a desktop compositor or other GPU consumer can still push it over. Bump headroom: FIT_TARGET=1024 ./run-server.sh. To squeeze one more MoE layer onto the GPU at the cost of fragility: FIT_TARGET=256.

llama-bench OOMs on pp2048 with CUDA pool error — bench has no --fit so it can't reserve room for the compute buffer dynamically. Increase N_CPU_MOE for bench only (default is 22; try 24 if 22 still OOMs) or lower UBATCH: UBATCH=1024 ./run-bench.sh.

bind: address already in use — something else is on port 8033. Either kill it (ss -tlnp | grep 8033) or PORT=9000 ./run-server.sh. Remember to update opencode.json or ANTHROPIC_BASE_URL to match.

OpenCode / Claude Code connects but no model listed — the model ID your config declares must match what the server advertises. Check with curl http://localhost:8033/v1/models and edit the models key in opencode.json accordingly.

Different quant / different model filerun-server.sh auto-picks the single .gguf in models/. With multiple files, pass MODEL=/abs/path.gguf as an env var.

Using it from a coding agent

Claude Code

This build exposes /v1/messages natively, so no proxy is needed:

export ANTHROPIC_BASE_URL=http://localhost:8033
export ANTHROPIC_API_KEY=dummy    # required to be set, not checked locally
claude

Use /model inside the session to pick the local model. Expect degraded agentic quality vs. real Claude — Qwen3.6-35B is capable but gets tripped up on multi-step tool use more often, and prompt caching (which Claude Code leans on) doesn't apply to a local backend.

OpenCode

OpenCode talks OpenAI-compatible natively and is tuned around local models, so it's usually a better fit for Qwen.

Install (Arch / CachyOS, via AUR — needs your sudo password):

paru -S opencode-bin

The package in the official extra repo is pinned to an old 1.4.7; the AUR opencode-bin tracks upstream (currently 1.14.x) — same project (anomalyco/opencode), just current.

Config — this repo already ships opencode.json pointing at http://127.0.0.1:8033/v1. Either run OpenCode from this directory so it picks up the local config, or install it globally:

# per-project: run it from this repo
./run-server.sh &                   # keep the server running
opencode                            # picks up ./opencode.json

# or install globally
mkdir -p ~/.config/opencode
cp opencode.json ~/.config/opencode/opencode.json

On first launch, select the llama-local provider. The model ID in opencode.json is the full GGUF filename (Qwen3.6-35B-A3B-UD-Q4_K_M.gguf) because that's exactly what /v1/models returns. If you use a different quant, update the key under models to match curl http://localhost:8033/v1/models.

References

License

MIT. See LICENSE. The scripts here are original to this repo; llama.cpp and the Qwen3.6 weights have their own upstream licenses (MIT and Apache-2.0 respectively).

About

This is my how I've set up a qwen3-6 running locally with OpenCode

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages