Qwen-3.6 quants #1663

ikawrakow · 2026-04-19T16:48:01Z

ikawrakow
Apr 19, 2026
Maintainer

I was playing with Qwen3.6-35B-A3B quantization and comparing to the Unsloth quants. Unsloth, as we all know, produce superior quants

In the past that was very far from the truth, see for instance #359. I was curious to see if they have made progress since then.

I'll measure quantization error of quantization Q as PPL(Q)/PPL(bf16) - 1. I know, many will object that "PPL tells us nothing", and KLD is the one and only one true measure of quantization error. I leave KLD computations and comparisons as an exercise for the reader. The more educated reader will of course know that the correlation between ln(PPL(Q)/PPL(bf16) and KLD is close to 100%, so they will not waste their time doing that.

Unsloth publish many quants, so I downloaded a subset only. The following graph shows a comparison between superior Unsloth quants in red and my own quantization experiments in black. The x-axis is model size in GiB (and not GB, as GiB is the unit we use to measure RAM/VRAM). The y-axis is quantization error as defined above on a logarithmic scale.

This time around they did a reasonably good job at the low end of model sizes. Funny thing is that their so called "IQ1_M" quantization does not contain even a single IQ1_S or IQ1_M tensor, it is all IQ2_XXS with some other higher bpw quantization types sprinkled in. I guess, "dynamic" quants can "dynamically" mutate from 1- to 2-bit, and this time around it happened that they all decided to do that. Haha.

Things don't look so great at the higher end of the model size range. Qwen-3.6 quantizes exceptionally well, with the quantization error being just 0.14% for IQ4_KS (so, basically lossless). Unsloth needed 2.8 extra GiB to get to that points with their UD-Q4_K_XL quantization. Does this really matter? It depends. If you have a single 24 GB GPU, you can go up to a context of 32k tokens with u-batch size of 2048 (which maximizes PP performance). With the 3 GiB smaller IQ4_KT one can go up to 220k tokens. If one decreases the u-batch size to 1024, losing ~20% PP performance, one can get up to ~90k tokens with UD-Q4_K_XL, and enjoy the full 260k context with IQ4_KS or IQ4_KT. If one had more than one 24 GB GPU, then one would be using a higher bpw quantization in the first place (along with split mode graph).

The other quantization types I have picked allow running with full offload on smaller GPUs:

IQ1_KT and IQ2_KT - 12 GB GPU
IQ3_KT - 16 GB GPU

T0R0-xp · 2026-04-21T11:42:42Z

T0R0-xp
Apr 21, 2026

Draft — comment on ikawrakow/ik_llama.cpp #1663 "Qwen-3.6 quants"

Thanks for the benchmarks.

Context / use case

I'm running Qwen3.6-35B-A3B as a coding subagent inside a local knowledge-compiler pipeline (agentic, not interactive chat). The pipeline sends structured JSON task packets to llama-server via the OpenAI-compatible API, extracts code blocks from the response, and runs validation automatically. Single user, single agent slot, prompts in the 2–8K token range, expected output 1–5K tokens of code.

Hardware: RTX 4060 Laptop 8GB VRAM + 96GB DDR5 RAM.

Current config

llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 --n-cpu-moe 99 \
  -c 50000 -np 1 \
  -fa on \
  --cache-type-k q8_0 --cache-type-v turbo2 \
  --no-mmap --mlock \
  --ctx-checkpoints 1 --cache-ram 0 \
  -b 2048 -ub 2048 \
  --reasoning on --reasoning-budget -1 \
  --reasoning-budget-message "...budget reached. Write the complete solution now." \
  --jinja

Runtime: TurboQuant fork (llama-cpp-tq). Generation: ~10–12 tok/s with both Qwen and Gemma4 active on the same machine.

One documented trap: with --reasoning-budget -1 and bounded max_tokens, the model can spend the entire output budget on thinking and return no visible code. Fix: per-request thinking_budget_tokens via API. I posted the full details here if useful: [reddit link]

Questions

1. Quant choice for 8GB hybrid
Given your quality curve, would IQ3_KT or IQ4_KT make sense for a hybrid 8GB setup at 32–50k context? Q4_K_M currently works but I'm curious whether the IQK trellis types give a real quality or speed advantage in hybrid mode (most weights in RAM, only attention layers on GPU).

2. n-cpu-moe partial split
I'm using --n-cpu-moe 99 (all MoE to CPU). A commenter on that Reddit post reported 700+ tok/s PP on a 3060 Ti 8GB with --n-cpu-moe 38. Do you have intuition on whether a partial GPU/CPU split for MoE is generally better than full CPU offload on 8GB, or does it depend too much on the specific batch size and context?

0 replies

ikawrakow · 2026-04-21T13:00:29Z

ikawrakow
Apr 21, 2026
Maintainer Author

@T0R0-xp

The trellis quants have good performance on a GPU. On the CPU, it depends

Zen4 or better - performance is reasonable, but still lower than other quantization types
Vanilla AVX2 - noticeably slower than other quantization types
Apple Silicon - performance is pathetic

The IQ4_KS recipe from above will give you quantization accuracy comparable to Unsloth's UD-Q4_K_XL while being almost 3 GiB smaller. It should be better than UD-Q4_K_M (and also significantly smaller). The recipe uses IQ4_KS for all routed experts, Q6_0 for everything else.

When using IQ4_KS, one MoE layer is about 410 MiB. For the 50k tokens of context that you are using, the KV cache is about 600 MiB when quantized with Q8_0. The attention tensors are about 1500 MiB. The compute buffer is about 650 MiB for u-batch of 2048 using -wgt 1. So, 600 + 1500 + 650 = 2,750 MiB. Hence, you should be able to offload in the range of 10 MoE layers to the GPU. Here is what I get on my system (Ryzen-3995WX CPU, RTX-3090 but pretending it only has 8 GiB VRAM) for the above IQ4_KS

--cpu-moe

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	128	0	0.935	2189.33	1.591	80.44
2048	128	2048	0.909	2253.04	1.590	80.50
2048	128	4096	0.913	2243.58	1.648	77.65
2048	128	6144	0.934	2193.77	1.644	77.86
2048	128	8192	0.943	2172.11	1.664	76.94
2048	128	10240	0.939	2182.03	1.706	75.02
2048	128	12288	0.955	2143.89	1.728	74.06
2048	128	14336	0.958	2136.88	1.747	73.25
2048	128	16384	0.975	2100.57	1.737	73.70
2048	128	18432	0.983	2083.98	1.738	73.66
2048	128	20480	0.994	2060.73	1.775	72.13
2048	128	22528	0.992	2064.10	1.790	71.50
2048	128	24576	1.003	2042.70	1.805	70.91
2048	128	26624	1.021	2006.05	1.870	68.46
2048	128	28672	1.027	1994.88	1.836	69.72
2048	128	30720	1.032	1984.67	1.833	69.82
2048	128	32768	1.048	1954.08	1.860	68.81
2048	128	34816	1.059	1934.45	1.883	67.99
2048	128	36864	1.072	1909.98	1.917	66.78
2048	128	38912	1.080	1896.26	1.940	65.97
2048	128	40960	1.096	1868.24	1.960	65.31
2048	128	43008	1.093	1873.27	1.960	65.30
2048	128	45056	1.105	1853.72	1.957	65.40
2048	128	47104	1.121	1826.31	1.977	64.76

--n-cpu-moe 30

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	128	0	0.836	2451.20	1.414	90.52
2048	128	2048	0.805	2545.29	1.404	91.18
2048	128	4096	0.812	2521.21	1.416	90.41
2048	128	6144	0.829	2469.03	1.435	89.21
2048	128	8192	0.842	2432.96	1.455	87.95
2048	128	10240	0.843	2428.83	1.473	86.90
2048	128	12288	0.853	2401.05	1.505	85.08
2048	128	14336	0.860	2380.16	1.558	82.17
2048	128	16384	0.875	2340.87	1.539	83.18
2048	128	18432	0.884	2316.74	1.589	80.54
2048	128	20480	0.895	2287.82	1.572	81.44
2048	128	22528	0.900	2276.41	1.612	79.42
2048	128	24576	0.911	2248.78	1.658	77.20
2048	128	26624	0.923	2219.02	1.666	76.84
2048	128	28672	0.932	2196.38	1.666	76.83
2048	128	30720	0.942	2174.33	1.676	76.36
2048	128	32768	0.956	2141.94	1.690	75.75
2048	128	34816	0.963	2126.63	1.719	74.47
2048	128	36864	0.974	2102.96	1.798	71.18
2048	128	38912	0.982	2085.98	1.729	74.05
2048	128	40960	0.992	2064.99	1.779	71.93
2048	128	43008	0.998	2053.00	1.804	70.95
2048	128	45056	1.010	2027.50	1.812	70.65
2048	128	47104	1.030	1988.78	1.796	71.25

0 replies

M98M · 2026-05-16T12:45:45Z

M98M
May 16, 2026

Are these quants (Qwen3.6-35B-A3B) available in huggingface?

4 replies

ikawrakow May 16, 2026
Maintainer Author

No, I didn't publish on HF. Not possible to get gigabit connection where the computers on which I work are, so kind of painful to be uploading to HF. Ubergarm used to cook a lot of ik_llama.cpp specific quants, but he seems less active these days.

M98M May 16, 2026

No, I didn't publish on HF. Not possible to get gigabit connection where the computers on which I work are, so kind of painful to be uploading to HF. Ubergarm used to cook a lot of ik_llama.cpp specific quants, but he seems less active these days.

Yeah so I've heard ik_llama is faster than mainline in cpu/gpu hybrid. (e.g. MOE model like qwen 35b)
I've been trying to find out which quant types are recommended for such setups. (zen4 cpu if it matters)

ikawrakow May 16, 2026
Maintainer Author

Any quantized model that works with llama.cpp will also work with ik_llama.cpp.

3JlOy-PYCCKUi May 18, 2026

Are recipes for these quants available? I didn't find them (at least here)?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen-3.6 quants #1663

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Qwen-3.6 quants #1663

Uh oh!

ikawrakow Apr 19, 2026 Maintainer

Replies: 3 comments · 4 replies

Uh oh!

T0R0-xp Apr 21, 2026

Draft — comment on ikawrakow/ik_llama.cpp #1663 "Qwen-3.6 quants"

Uh oh!

ikawrakow Apr 21, 2026 Maintainer Author

--cpu-moe

--n-cpu-moe 30

Uh oh!

M98M May 16, 2026

Uh oh!

ikawrakow May 16, 2026 Maintainer Author

Uh oh!

M98M May 16, 2026

Uh oh!

ikawrakow May 16, 2026 Maintainer Author

Uh oh!

Uh oh!

3JlOy-PYCCKUi May 18, 2026

ikawrakow
Apr 19, 2026
Maintainer

Replies: 3 comments 4 replies

T0R0-xp
Apr 21, 2026

ikawrakow
Apr 21, 2026
Maintainer Author

M98M
May 16, 2026

ikawrakow May 16, 2026
Maintainer Author

ikawrakow May 16, 2026
Maintainer Author