Experimental review of TurboQuant-style KV-cache compression, vector quantization behavior, retrieval quality, and throughput implications on a single 96 GB Blackwell GPU. Technical overview on my blog: https://verbasik.github.io/warp-zone-folio/#/blog/turboquant
A full experimental run was completed for Qwen/Qwen3.6-27B in bfloat16 mode on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition GPU with 96 GB VRAM.
PASS = 14
WARN = 0
FAIL = 0
SKIP = 0
The full harness executed successfully:
- β model loading
- β theoretical TurboQuant checks
- β real Qwen KV-cache quantization proxy
- β Needle-in-a-Haystack
- β compact LongBench-E-style evaluation
- β throughput sanity check
- β ANN retrieval benchmark
Main takeaway: TurboQuant shows strong technical properties as a method for KV-cache compression and online vector quantization, but the experiments also confirm an important caveat: headline ratios such as 8x and 6.4x refer primarily to the bit budget of KV-cache values, not to total model memory and not to end-to-end inference speedup.
The experiments were run with the following model/runtime configuration:
{
"model_id": "Qwen/Qwen3.6-27B",
"device": "cuda:0",
"dtype": "bfloat16",
"model_class": "AutoModelForCausalLM",
"requested_attn_implementation": "sdpa",
"resolved_attn_implementation": "sdpa"
}The model was successfully loaded through AutoModelForCausalLM.
The run used the sdpa attention backend, which avoided an external dependency on flash-attn and allowed the full experiment suite to complete on a single GPU.
python3 turboquant.py \
--suite all \
--model_id Qwen/Qwen3.6-27B \
--load_model \
--local_files_only \
--torch_dtype bfloat16 \
--attn_implementation sdpa \
--needle_context_tokens 32768 \
--needle_required_bits 3.5 \
--needle_trials 1 \
--needle_positions 0.5 \
--run_longbench \
--longbench_required_bits 3.5 \
--longbench_max_examples 6 \
--longbench_max_prompt_tokens 32768 \
--longbench_local_dir data \
--run_throughput \
--throughput_new_tokens 128 \
--throughput_warmup 1 \
--throughput_repeats 3 \
--ann_use_qwen_embeddings \
--ann_required_scalar_bits 2,3,4 \
--ann_pq_index_speedup_threshold 10 \
--out_dir runs/tq_qwen36_27b_full_sdpa| Area | Result | Interpretation |
|---|---|---|
| Model loading | β PASS | Qwen/Qwen3.6-27B loaded successfully in bfloat16 |
| Compression arithmetic | β PASS | float32 β 4-bit = 8x is arithmetically correct |
| Random rotation | β PASS | Energy is spread uniformly across coordinates |
| TurboQuant-mse | β PASS | Distortion decreases approximately exponentially with bit-width |
| TurboQuant-prod / QJL | β PASS | Inner-product bias is reduced by almost 5x |
| KV-cache accounting | β PASS | KV-cache compresses strongly, but total memory improves much less |
| Real Qwen KV proxy | β PASS | 3.5-bit and 2.5-bit show controlled distortion growth |
| Needle-in-a-Haystack | β PASS | 3.5-bit preserved exact code retrieval at 32k context |
| LongBench-E-style subset | β PASS | 3.5-bit matched full precision on compact containment score |
| Throughput sanity check | β PASS | Python simulation is slower than full precision; no production speedup claimed |
| ANN retrieval | β PASS | TurboQuant beats scalar no-rotation at 2β4 bits and indexes much faster than PQ |
The first experiment checked the basic bit-budget arithmetic:
{
"float32_bits": 49152,
"float32_bytes": 6144,
"quant_bits": 6144,
"quant_bytes": 768,
"compression_32_to_4bit": 8.0
}For a vector of dimension 1536, moving from float32 to a 4-bit representation gives:
49152 bits / 6144 bits = 8x
So the statement:
32-bit β 4-bit = 8x
is arithmetically correct.
Important
This is a compression ratio for a numeric representation. It should not be interpreted as an 8x end-to-end inference speedup.
Real LLM inference includes many additional components:
- model weights
- activation buffers
- attention kernels
- memory bandwidth effects
- scheduler overhead
- framework-level runtime costs
Therefore, this experiment validates the arithmetic while also highlighting why headline compression ratios need careful interpretation.
The random_rotation_energy_spreading experiment validated the key geometric mechanism behind TurboQuant:
{
"dimension": 256,
"energy_cv_before": 16.0,
"energy_cv_after": 0.0,
"mean_max_abs_before": 1.0,
"mean_max_abs_after": 0.0625,
"coordinate_var_after": 0.0039062486,
"expected_var": 0.00390625,
"sqrt_d_scaled_var": 0.9999996424,
"sqrt_d_scaled_kurtosis": 2.9816138744
}Before rotation, the vector energy was concentrated in a small number of coordinates. After signed-Hadamard rotation, the energy became almost perfectly distributed.
The most important agreement is:
coordinate_var_after β expected_var = 1 / 256
The scaled kurtosis is also close to 3, which is consistent with an approximately normal coordinate distribution after scaling by sqrt(d).
Interpretation: random orthogonal rotation transforms an unfavorable coordinate geometry into a more homogeneous one. This makes simple coordinate-wise scalar quantization much more effective.
The turboquant_mse_4_power_minus_b experiment checked whether reconstruction distortion decreases approximately as 4^-b.
| Bits per coordinate | Measured distortion | Shannon lower bound 4^-b |
Constant factor |
|---|---|---|---|
| 1 | 0.3617 | 0.2500 | 1.45 |
| 2 | 0.1169 | 0.0625 | 1.87 |
| 3 | 0.0343 | 0.0156 | 2.19 |
| 4 | 0.0094 | 0.0039 | 2.41 |
The estimated log-scale slope was:
{
"log2_distortion_slope_per_bit": -1.757,
"target_slope": -2.0
}The ideal 4^-b decay corresponds to a slope of -2 in log2 scale. The measured value -1.757 is not perfect, but it clearly follows the same exponential trend.
The gap is expected in a finite-dimensional implementation with:
- finite block size;
- pure-PyTorch Lloyd-Max quantization;
- non-asymptotic constants.
Conclusion: TurboQuant-mse empirically follows the expected exponential distortion decay, with most of the gap appearing as a constant-factor loss relative to the lower bound.
The inner_product_bias_mse_vs_prod experiment tested the core idea behind TurboQuant-prod.
MSE-oriented quantization can reconstruct vectors well, but it may introduce systematic bias in inner products. QJL residual correction is intended to reduce this bias.
{
"mse_only_mean_abs_bias": 0.0099148,
"prod_qjl_mean_abs_bias": 0.0019973,
"mse_only_signed_bias": -0.0099148,
"prod_qjl_signed_bias": 0.0000376,
"bias_reduction_factor": 4.964
}The mean absolute bias was reduced by approximately:
0.0099148 / 0.0019973 β 4.96x
The signed bias was almost eliminated:
-0.0099148 β 0.0000376
This is one of the strongest positive results in the experiment suite.
Interpretation: QJL residual correction does not merely add random noise. It meaningfully compensates for systematic inner-product bias introduced by MSE quantization.
This matters directly for:
- attention;
- retrieval;
- nearest-neighbor search;
- ranking based on dot products.
The turboquant_prod_inner_product_rate experiment measured inner-product error across different bit-widths.
| Bits | Inner-product MSE | Proxy bound | Constant factor |
|---|---|---|---|
| 1 | 0.00613 | 0.00391 | 1.57 |
| 2 | 0.00219 | 0.00098 | 2.24 |
| 3 | 0.00071 | 0.00024 | 2.90 |
| 4 | 0.00021 | 0.00006 | 3.45 |
The estimated slope was:
{
"log2_ip_error_slope_per_bit": -1.622,
"target_slope_rough": -2.0
}The result is weaker than the ideal -2 slope, but it still shows stable exponential improvement as bit-width increases.
Practical meaning: each additional bit substantially improves inner-product accuracy, although finite-dimensional constants remain visible.
The qwen_kv_memory_accounting experiment tested a central practical question:
Does TurboQuant reduce total model memory by the same factor as it compresses KV-cache?
For Qwen/Qwen3.6-27B, the answer is no.
| Context tokens | KV bf16 | KV 2.5-bit | KV compression | Total bf16 model+KV | Total 2.5-bit model+KV | Total ratio |
|---|---|---|---|---|---|---|
| 32,768 | 2.0 GiB | 0.3125 GiB | 6.4x | 57.6 GiB | 55.91 GiB | 1.03x |
| 131,072 | 8.0 GiB | 1.25 GiB | 6.4x | 63.6 GiB | 56.85 GiB | 1.12x |
| 262,144 | 16.0 GiB | 2.5 GiB | 6.4x | 71.6 GiB | 58.1 GiB | 1.23x |
Formally, KV-cache compression is strong:
bf16 16-bit / 2.5-bit = 6.4x
But once model weights are included, the total memory ratio is much smaller:
32k context: total memory ratio β 1.03x
131k context: total memory ratio β 1.12x
262k context: total memory ratio β 1.23x
Note
TurboQuant can greatly reduce KV-cache memory, but it does not make a 27B model six times smaller.
The longer the context, the more important KV-cache becomes. For moderate context sizes, however, model weights remain the dominant memory component.
The qwen_kv_cache_quantization_proxy experiment used real past_key_values from the model, not synthetic tensors.
{
"source": "actual_model_past_key_values",
"seq_len": 2048
}| Effective bits | Reconstruction distortion | Attention inner-product sqerr | bf16 compression ratio |
|---|---|---|---|
| 3.5 | 0.0196 | 0.000076 | 4.57x |
| 2.5 | 0.0673 | 0.000261 | 6.4x |
The degradation is monotonic and expected:
3.5-bit: lower distortion, safer mode
2.5-bit: stronger compression, higher distortion
The error increase from 3.5-bit to 2.5-bit is approximately:
reconstruction distortion: 0.0673 / 0.0196 β 3.44x
attention IP sqerr: 0.000261 / 0.000076 β 3.41x
Interpretation: moving from 3.5-bit to 2.5-bit provides additional compression, but it increases reconstruction and attention inner-product error by roughly 3.4x.
The needle_in_haystack_qwen_generation experiment tested whether the model could retrieve a hidden code from a long context of approximately 32785 tokens.
{
"baseline_accuracy": 1.0,
"required_accuracy": {
"3.5": 1.0
}
}Both full precision and TurboQuant 3.5-bit retrieved the correct code:
expected_code = TQ-946382
full_precision_cache contains_code = true
turboquant_required_cache_3.5bit contains_code = true
Result: the conservative 3.5-bit KV-cache mode preserved exact retrieval in this 32k-context Needle-in-a-Haystack test.
Important
This was a single-trial smoke test at position 0.5. It is useful evidence, but not a statistically complete benchmark.
A stronger evaluation should include:
- multiple seeds;
- multiple needle positions;
- several hidden strings;
- repeated runs across prompt variants.
Also, this experiment validates quality under a Python-level simulated cache mutation. It does not prove real compressed-cache residency or production-level speedup.
The longbench_e_subset_qwen_generation experiment ran successfully on local LongBench data.
{
"source": "local_dir",
"configs": ["narrativeqa", "qasper", "hotpotqa"],
"n_examples": 6
}Scores:
{
"full_precision_cache": 0.5,
"turboquant_required_cache_3.5bit": 0.5
}Drop relative to full precision:
{
"turboquant_required_cache_3.5bit": 0.0
}So, on this compact subset, 3.5-bit did not reduce answer containment score relative to full precision.
Note
The absolute score of 0.5 means that the baseline itself answered only half of the examples correctly under this containment metric.
This should be interpreted as a relative sanity check:
3.5-bit KV-cache did not degrade compact LongBench-E-style containment score versus full precision.
It should not be treated as a full official LongBench reproduction, because the harness used a compact answer containment metric rather than the official task-specific scoring pipeline.
The qwen_throughput_sanity_check experiment used:
128generated tokens;1warmup run;3measured repeats.
| Mode | Decode tok/s | Speed ratio vs full precision |
|---|---|---|
| Full precision cache | 19.45 | 1.00 |
| Simulated 3.5-bit cache | 14.53 | 0.75 |
| Simulated 2.5-bit cache | 14.56 | 0.75 |
The Python-level simulated quantized cache was slower than full precision:
3.5-bit simulated decode speed β 75% of full precision
2.5-bit simulated decode speed β 75% of full precision
This is expected. The current harness performs additional KV-cache mutation at the Python/PyTorch level and does not use fused kernels.
Key lesson: bit-level KV-cache compression should not be automatically interpreted as end-to-end inference acceleration.
Real speedup requires a kernel-level implementation where the quantized cache is:
- stored compactly;
- accessed efficiently;
- dequantized or consumed with minimal overhead;
- integrated into the inference engine.
The ANN benchmark used Qwen hidden-state mean-pool embeddings:
{
"source": "qwen_hidden_state_mean_pool_embeddings",
"db_size": 1024,
"query_size": 128,
"projected_dim": 256,
"k": 10
}The required comparison tested TurboQuant against scalar quantization without rotation at 2, 3, and 4 bits.
| Bits | TurboQuant recall@10 | Scalar recall@10 | Result |
|---|---|---|---|
| 2 | 0.393 | 0.298 | TurboQuant wins |
| 3 | 0.666 | 0.546 | TurboQuant wins |
| 4 | 0.795 | 0.754 | TurboQuant wins |
Conclusion: at moderate bit-widths, TurboQuantβs random rotation improves recall over naive scalar quantization.
The exception was the 1-bit regime:
TurboQuant 1-bit: 0.184
scalar 1-bit: 0.238
The 1-bit setting was unstable on real Qwen mean-pool embeddings, so the required benchmark focused on 2, 3, and 4 bits.
PQ-kmeans achieved higher recall at every tested bit-width:
| Bits | TurboQuant recall@10 | PQ-kmeans recall@10 |
|---|---|---|
| 1 | 0.184 | 0.331 |
| 2 | 0.393 | 0.707 |
| 3 | 0.666 | 0.766 |
| 4 | 0.795 | 0.880 |
So, in this ANN benchmark, TurboQuant did not outperform the trained PQ baseline on pure recall.
However, TurboQuant was dramatically faster to build:
| Bits | TurboQuant index time | PQ index time | PQ / TurboQuant index cost |
|---|---|---|---|
| 1 | 0.00089 s | 0.10787 s | 121.8x |
| 2 | 0.00086 s | 0.12789 s | 149.5x |
| 3 | 0.00055 s | 0.19566 s | 357.4x |
| 4 | 0.00059 s | 0.27165 s | 460.3x |
Fair interpretation: TurboQuant is especially strong as an online indexing method. It is much cheaper to build than trained PQ and improves over scalar no-rotation at moderate bit-widths, but it does not necessarily beat trained PQ-kmeans on recall.
The full run supports the technical validity of the main TurboQuant mechanisms:
- Random rotation effectively spreads coordinate energy.
- TurboQuant-mse shows approximately exponential reconstruction distortion decay.
- TurboQuant-prod with QJL residual correction significantly reduces inner-product bias.
- Real Qwen KV-cache quantization shows controlled distortion growth.
- The 3.5-bit mode preserved quality in Needle-in-a-Haystack and compact LongBench-E-style checks.
- ANN experiments show TurboQuant improves over scalar no-rotation at 2β4 bits.
- TurboQuant has a very large index construction cost advantage over PQ-kmeans.
The same results also confirm several important limitations:
8xand6.4xare bit-level compression ratios, not end-to-end speedups.- TurboQuant compresses KV-cache, not model weights.
- Total model+KV memory improves much less than KV-cache memory alone.
- Python-level simulation is slower than full precision decode due to overhead.
- Production speedup requires fused kernels or inference-engine integration.
- TurboQuant did not beat PQ-kmeans on ANN recall in this benchmark.
Use this as the conservative quality-preserving mode.
3.5-bit KV-cache:
safer compression mode;
preserved Needle and compact LongBench-E-style scores in this run;
provides 4.57x KV-cache compression relative to bf16.
Use this as an aggressive compression mode.
2.5-bit KV-cache:
stronger compression mode;
provides 6.4x KV-cache compression relative to bf16;
introduces substantially higher reconstruction and attention inner-product error;
should be treated as a stress / memory-pressure mode, not guaranteed quality-preserving.
Use TurboQuant when online quantization cost matters.
ANN retrieval:
useful for online indexing without codebook training;
better than scalar no-rotation at 2β4 bits;
worse than PQ-kmeans on recall in this benchmark;
much faster than PQ-kmeans for index construction.
The experimental evaluation shows that TurboQuant is a technically meaningful method, not merely a marketing artifact. Its key components β random rotation, scalar quantization, and QJL residual correction β work reproducibly in synthetic tests, on real Qwen/Qwen3.6-27B KV-cache tensors, and in compact long-context quality checks.
At the same time, the experiments show that headline compression ratios require strict context. The 6.4x figure applies to KV-cache values when moving from bf16 to a 2.5-bit effective representation. It does not imply a sixfold reduction in total model memory.
For Qwen/Qwen3.6-27B, total model+KV memory improves by approximately:
32k context: ~3%
262k context: ~23%
Similarly, Python-level simulation does not demonstrate decode acceleration. Without specialized kernels, quantized-cache simulation can be slower than full precision.
Bottom line: TurboQuant is a strong academic and engineering method for KV-cache compression and online vector quantization, especially promising for long-context inference and online retrieval. Its real-world impact depends on context length, bit-width regime, and the quality of the kernel-level implementation.