Replies: 3 comments 4 replies
-
Draft — comment on ikawrakow/ik_llama.cpp #1663 "Qwen-3.6 quants"Thanks for the benchmarks. Context / use case I'm running Qwen3.6-35B-A3B as a coding subagent inside a local knowledge-compiler pipeline (agentic, not interactive chat). The pipeline sends structured JSON task packets to llama-server via the OpenAI-compatible API, extracts code blocks from the response, and runs validation automatically. Single user, single agent slot, prompts in the 2–8K token range, expected output 1–5K tokens of code. Hardware: RTX 4060 Laptop 8GB VRAM + 96GB DDR5 RAM. Current config Runtime: TurboQuant fork ( One documented trap: with Questions 1. Quant choice for 8GB hybrid 2. n-cpu-moe partial split |
Beta Was this translation helpful? Give feedback.
-
|
The trellis quants have good performance on a GPU. On the CPU, it depends
The When using --cpu-moe
--n-cpu-moe 30
|
Beta Was this translation helpful? Give feedback.
-
|
Are these quants (Qwen3.6-35B-A3B) available in huggingface? |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
I was playing with Qwen3.6-35B-A3B quantization and comparing to the Unsloth quants. Unsloth, as we all know, produce superior quants

In the past that was very far from the truth, see for instance #359. I was curious to see if they have made progress since then.
I'll measure quantization error of quantization
QasPPL(Q)/PPL(bf16) - 1. I know, many will object that "PPL tells us nothing", and KLD is the one and only one true measure of quantization error. I leave KLD computations and comparisons as an exercise for the reader. The more educated reader will of course know that the correlation betweenln(PPL(Q)/PPL(bf16)and KLD is close to 100%, so they will not waste their time doing that.Unsloth publish many quants, so I downloaded a subset only. The following graph shows a comparison between superior Unsloth quants in red and my own quantization experiments in black. The x-axis is model size in GiB (and not GB, as GiB is the unit we use to measure RAM/VRAM). The y-axis is quantization error as defined above on a logarithmic scale.
This time around they did a reasonably good job at the low end of model sizes. Funny thing is that their so called "IQ1_M" quantization does not contain even a single
IQ1_SorIQ1_Mtensor, it is allIQ2_XXSwith some other higher bpw quantization types sprinkled in. I guess, "dynamic" quants can "dynamically" mutate from 1- to 2-bit, and this time around it happened that they all decided to do that. Haha.Things don't look so great at the higher end of the model size range. Qwen-3.6 quantizes exceptionally well, with the quantization error being just 0.14% for
IQ4_KS(so, basically lossless). Unsloth needed 2.8 extra GiB to get to that points with their UD-Q4_K_XL quantization. Does this really matter? It depends. If you have a single 24 GB GPU, you can go up to a context of 32k tokens with u-batch size of 2048 (which maximizes PP performance). With the 3 GiB smallerIQ4_KTone can go up to 220k tokens. If one decreases the u-batch size to 1024, losing ~20% PP performance, one can get up to ~90k tokens with UD-Q4_K_XL, and enjoy the full 260k context withIQ4_KSorIQ4_KT. If one had more than one 24 GB GPU, then one would be using a higher bpw quantization in the first place (along with split modegraph).The other quantization types I have picked allow running with full offload on smaller GPUs:
IQ1_KTandIQ2_KT- 12 GB GPUIQ3_KT- 16 GB GPUBeta Was this translation helpful? Give feedback.
All reactions