TurboQuant on Qwen/Qwen3.6-27B

Experimental review of TurboQuant-style KV-cache compression, vector quantization behavior, retrieval quality, and throughput implications on a single 96 GB Blackwell GPU. Technical overview on my blog: https://verbasik.github.io/warp-zone-folio/#/blog/turboquant

📌 At a Glance

A full experimental run was completed for Qwen/Qwen3.6-27B in bfloat16 mode on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition GPU with 96 GB VRAM.

PASS = 14
WARN = 0
FAIL = 0
SKIP = 0

The full harness executed successfully:

✅ model loading
✅ theoretical TurboQuant checks
✅ real Qwen KV-cache quantization proxy
✅ Needle-in-a-Haystack
✅ compact LongBench-E-style evaluation
✅ throughput sanity check
✅ ANN retrieval benchmark

Main takeaway: TurboQuant shows strong technical properties as a method for KV-cache compression and online vector quantization. The experiments also confirm an important caveat: headline ratios such as 8x (float32 → 4-bit) and 6.4x (bf16 → 2.5-bit KV-cache values) describe the bit budget of the quantized vectors, not total model memory and not end-to-end inference speedup.

🧪 Experimental Setup

The experiments were run with the following model/runtime configuration:

{
  "model_id": "Qwen/Qwen3.6-27B",
  "device": "cuda:0",
  "dtype": "bfloat16",
  "model_class": "AutoModelForCausalLM",
  "requested_attn_implementation": "sdpa",
  "resolved_attn_implementation": "sdpa"
}

The model was successfully loaded through AutoModelForCausalLM.

The run used the sdpa attention backend, which avoided an external dependency on flash-attn and allowed the full experiment suite to complete on a single GPU.

Full Launch Command

python3 turboquant.py \
  --suite all \
  --model_id Qwen/Qwen3.6-27B \
  --load_model \
  --local_files_only \
  --torch_dtype bfloat16 \
  --attn_implementation sdpa \
  --needle_context_tokens 32768 \
  --needle_required_bits 3.5 \
  --needle_trials 1 \
  --needle_positions 0.5 \
  --run_longbench \
  --longbench_required_bits 3.5 \
  --longbench_max_examples 6 \
  --longbench_max_prompt_tokens 32768 \
  --longbench_local_dir data \
  --run_throughput \
  --throughput_new_tokens 128 \
  --throughput_warmup 1 \
  --throughput_repeats 3 \
  --ann_use_qwen_embeddings \
  --ann_required_scalar_bits 2,3,4 \
  --ann_pq_index_speedup_threshold 10 \
  --out_dir runs/tq_qwen36_27b_full_sdpa

📊 Summary of Results

Area	Result	Interpretation
Model loading	✅ PASS	`Qwen/Qwen3.6-27B` loaded successfully in `bfloat16`
Compression arithmetic	✅ PASS	`float32 → 4-bit = 8x` is arithmetically correct
Random rotation	✅ PASS	Energy is spread uniformly across coordinates
TurboQuant-mse	✅ PASS	Distortion decreases approximately exponentially with bit-width
TurboQuant-prod / QJL	✅ PASS	Inner-product bias is reduced by almost `5x`
KV-cache accounting	✅ PASS	KV-cache compresses strongly, but total memory improves much less
Real Qwen KV proxy	✅ PASS	3.5-bit and 2.5-bit show controlled distortion growth
Needle-in-a-Haystack	✅ PASS	3.5-bit preserved exact code retrieval at 32k context
LongBench-E-style subset	✅ PASS	3.5-bit matched full precision on compact containment score
Throughput sanity check	✅ PASS	Python simulation is slower than full precision; no production speedup claimed
ANN retrieval	✅ PASS	TurboQuant beats scalar no-rotation at 2–4 bits and indexes much faster than PQ

1. 🔢 Compression Arithmetic

The first experiment checked the basic bit-budget arithmetic:

{
  "float32_bits": 49152,
  "float32_bytes": 6144,
  "quant_bits": 6144,
  "quant_bytes": 768,
  "compression_32_to_4bit": 8.0
}

For a vector of dimension 1536, moving from float32 to a 4-bit representation gives:

49152 bits / 6144 bits = 8x

So the statement:

32-bit → 4-bit = 8x

is arithmetically correct.
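
The same arithmetic can be checked in a few lines. This is a minimal sketch rather than code from the harness: the helper name `compression_ratio` is hypothetical, and the dimension 1536 is taken from the example above.

```python
def compression_ratio(dim: int, src_bits: float, dst_bits: float) -> float:
    """Ratio of raw payload sizes for one vector.

    Ignores per-vector metadata such as scales or norms (an assumption).
    """
    return (dim * src_bits) / (dim * dst_bits)


dim = 1536
float32_bits = dim * 32   # raw float32 payload in bits
quant_bits = dim * 4      # raw 4-bit payload in bits
ratio = compression_ratio(dim, 32, 4)

print(float32_bits, float32_bits // 8)  # bits, bytes
print(quant_bits, quant_bits // 8)      # bits, bytes
print(ratio)
```

The dimension cancels out, so the ratio depends only on the two bit-widths.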

Important

This is a compression ratio for a numeric representation. It should not be interpreted as an 8x end-to-end inference speedup.

Real LLM inference includes many additional components:

model weights
activation buffers
attention kernels
memory bandwidth effects
scheduler overhead
framework-level runtime costs

Therefore, this experiment validates the arithmetic while also highlighting why headline compression ratios need careful interpretation.

2. 🌀 Random Rotation Spreads Coordinate Energy

The random_rotation_energy_spreading experiment validated the key geometric mechanism behind TurboQuant:

{
  "dimension": 256,
  "energy_cv_before": 16.0,
  "energy_cv_after": 0.0,
  "mean_max_abs_before": 1.0,
  "mean_max_abs_after": 0.0625,
  "coordinate_var_after": 0.0039062486,
  "expected_var": 0.00390625,
  "sqrt_d_scaled_var": 0.9999996424,
  "sqrt_d_scaled_kurtosis": 2.9816138744
}

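Before rotation, the vector energy was concentrated in very few coordinates (energy_cv_before = 16.0, mean_max_abs_before = 1.0). After signed-Hadamard rotation, the energy was spread almost perfectly evenly: energy_cv_after fell to 0.0 and the mean maximum coordinate magnitude dropped to 0.0625, which equals 1/sqrt(256).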

The most important agreement is:

coordinate_var_after ≈ expected_var = 1 / 256

The scaled kurtosis is also close to 3, which is consistent with an approximately normal coordinate distribution after scaling by sqrt(d).

Interpretation: random orthogonal rotation transforms an unfavorable coordinate geometry into a more homogeneous one. This makes simple coordinate-wise scalar quantization much more effective.
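
The effect can be reproduced on a toy input. The sketch below is not the harness code: it assumes a one-hot input (the most concentrated case) and implements the signed-Hadamard rotation as random sign flips followed by a normalized fast Walsh-Hadamard transform. The function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * scale for v in x]


def signed_hadamard(x, signs):
    """Flip signs coordinate-wise, then apply the normalized Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, x)])


def energy_cv(x):
    """Coefficient of variation of per-coordinate energy (sample std / mean)."""
    energy = [v * v for v in x]
    return stdev(energy) / mean(energy)


d = 256
rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

x = [0.0] * d
x[3] = 1.0  # all energy in a single coordinate (assumed toy input)
y = signed_hadamard(x, signs)

cv_before, cv_after = energy_cv(x), energy_cv(y)
max_abs_after = max(abs(v) for v in y)
total_energy_after = sum(v * v for v in y)
print(cv_before, cv_after, max_abs_after, total_energy_after)
```

Because the transform is orthogonal, the total energy is preserved while every coordinate ends up with magnitude 1/sqrt(d).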

3. 📉 TurboQuant-mse: Distortion Decreases Approximately as `4^-b`

The turboquant_mse_4_power_minus_b experiment checked whether reconstruction distortion decreases approximately as 4^-b.

Bits per coordinate	Measured distortion	Shannon lower bound `4^-b`	Constant factor
1	0.3617	0.2500	1.45
2	0.1169	0.0625	1.87
3	0.0343	0.0156	2.19
4	0.0094	0.0039	2.41

The estimated log-scale slope was:

{
  "log2_distortion_slope_per_bit": -1.757,
  "target_slope": -2.0
}

The ideal 4^-b decay corresponds to a slope of -2 in log2 scale. The measured slope of -1.757 is shallower, meaning each added bit cut distortion by about 3.4x rather than 4x, but it clearly follows the same exponential trend.

The gap is expected in a finite-dimensional implementation with:

finite block size;
pure-PyTorch Lloyd-Max quantization;
non-asymptotic constants.

Conclusion: TurboQuant-mse empirically follows the expected exponential distortion decay. The gap to the lower bound stays within a small factor, although that factor is not constant: it grows from 1.45x at 1 bit to 2.41x at 4 bits.
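
The slope and the gap to the bound can be recomputed from the table. This sketch assumes an ordinary least-squares fit of log2(distortion) against bit-width; the harness may estimate the slope differently, and the helper name `log2_slope` is hypothetical.

```python
import math

bits = [1, 2, 3, 4]
distortion = [0.3617, 0.1169, 0.0343, 0.0094]  # measured values from the table


def log2_slope(xs, ys):
    """Ordinary least-squares slope of log2(ys) against xs."""
    logs = [math.log2(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(logs) / len(logs)
    num = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


slope = log2_slope(bits, distortion)

# Ratio of measured distortion to the 4^-b bound at each bit-width.
gap = [d / 4.0 ** -b for b, d in zip(bits, distortion)]

print(round(slope, 3))
print([round(g, 2) for g in gap])
```

The same helper applies to any error-versus-bits table, including the inner-product MSE values reported for TurboQuant-prod.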

4. 🎯 QJL Residual Correction Reduces Inner-Product Bias

The inner_product_bias_mse_vs_prod experiment tested the core idea behind TurboQuant-prod.

MSE-oriented quantization can reconstruct vectors well, but it may introduce systematic bias in inner products. QJL residual correction is intended to reduce this bias.

{
  "mse_only_mean_abs_bias": 0.0099148,
  "prod_qjl_mean_abs_bias": 0.0019973,
  "mse_only_signed_bias": -0.0099148,
  "prod_qjl_signed_bias": 0.0000376,
  "bias_reduction_factor": 4.964
}

The mean absolute bias was reduced by approximately:

0.0099148 / 0.0019973 ≈ 4.96x

The signed bias was almost eliminated:

-0.0099148 → 0.0000376

This is one of the strongest positive results in the experiment suite.

Interpretation: QJL residual correction does not merely add random noise. It meaningfully compensates for systematic inner-product bias introduced by MSE quantization.

This matters directly for:

attention;
retrieval;
nearest-neighbor search;
ranking based on dot products.
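
To make the mechanism concrete, here is a minimal sketch of a QJL-style 1-bit estimator applied to a residual vector. It assumes i.i.d. Gaussian projections and relies on the identity E[sign(s·r)(s·y)] = sqrt(2/π)·⟨r, y⟩/‖r‖ for a Gaussian vector s. It is not the harness implementation, and all names are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def qjl_encode(residual, projections):
    """Keep one sign bit per projection, plus the residual norm."""
    signs = [
        1.0 if sum(s * r for s, r in zip(row, residual)) >= 0 else -1.0
        for row in projections
    ]
    norm = math.sqrt(sum(r * r for r in residual))
    return signs, norm


def qjl_inner_product(signs, norm, projections, query):
    """Unbiased estimate of <residual, query> from the stored sign bits."""
    acc = sum(
        sg * sum(s * q for s, q in zip(row, query))
        for sg, row in zip(signs, projections)
    )
    return math.sqrt(math.pi / 2.0) * norm * acc / len(projections)


rng = random.Random(0)
d, m = 16, 4096  # toy dimension and number of projections (assumed values)
projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
residual = unit([rng.uniform(-1.0, 1.0) for _ in range(d)])
query = unit([rng.uniform(-1.0, 1.0) for _ in range(d)])

exact = sum(r * q for r, q in zip(residual, query))
signs, norm = qjl_encode(residual, projections)
estimate = qjl_inner_product(signs, norm, projections, query)
print(exact, estimate)
```

The estimate is noisy for a single query, but its expectation equals the exact inner product, which is why the correction removes systematic bias rather than just adding variance.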

5. 📐 TurboQuant-prod Inner-Product Error Also Decays Exponentially

The turboquant_prod_inner_product_rate experiment measured inner-product error across different bit-widths.

Bits	Inner-product MSE	Proxy bound	Constant factor
1	0.00613	0.00391	1.57
2	0.00219	0.00098	2.24
3	0.00071	0.00024	2.90
4	0.00021	0.00006	3.45

The estimated slope was:

{
  "log2_ip_error_slope_per_bit": -1.622,
  "target_slope_rough": -2.0
}

The measured slope of -1.622 is shallower than the rough -2 target, which corresponds to about 3.1x less error per added bit instead of 4x, but it still shows stable exponential improvement as bit-width increases.

Practical meaning: each additional bit substantially improves inner-product accuracy, although finite-dimensional constants remain visible.

6. 🧠 KV-cache Compresses Strongly — Total Model Memory Does Not

The qwen_kv_memory_accounting experiment tested a central practical question:

Does TurboQuant reduce total model memory by the same factor as it compresses KV-cache?

For Qwen/Qwen3.6-27B, the answer is no.

Context tokens	KV bf16	KV 2.5-bit	KV compression	Total bf16 model+KV	Total 2.5-bit model+KV	Total ratio
32,768	2.0 GiB	0.3125 GiB	6.4x	57.6 GiB	55.91 GiB	1.03x
131,072	8.0 GiB	1.25 GiB	6.4x	63.6 GiB	56.85 GiB	1.12x
262,144	16.0 GiB	2.5 GiB	6.4x	71.6 GiB	58.1 GiB	1.23x

Formally, KV-cache compression is strong:

bf16 16-bit / 2.5-bit = 6.4x

But once model weights are included, the total memory ratio is much smaller:

32k context:  total memory ratio ≈ 1.03x
131k context: total memory ratio ≈ 1.12x
262k context: total memory ratio ≈ 1.23x

Note

TurboQuant can greatly reduce KV-cache memory, but it does not make a 27B model six times smaller.

The longer the context, the more important KV-cache becomes. Even at 262,144 tokens, however, the 16 GiB bf16 KV-cache is well below the roughly 55.6 GiB of model weights implied by the table, so weights remain the dominant memory component.
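
The table can be reproduced with simple accounting. In this sketch the weight footprint and the per-token KV size are inferred from the table itself (57.6 GiB total minus 2.0 GiB of KV at 32,768 tokens), not read from the model config; the function name is hypothetical.

```python
GIB = 1024 ** 3

# Assumptions derived from the table above, not from the model config:
WEIGHTS_GIB = 57.6 - 2.0                      # bf16 total minus bf16 KV at 32,768 tokens
KV_BF16_BYTES_PER_TOKEN = 2.0 * GIB / 32768   # 64 KiB per token


def memory_report(context_tokens: int, kv_bits: float) -> dict:
    """Compare bf16 and quantized KV-cache footprints, with and without weights."""
    kv_bf16 = context_tokens * KV_BF16_BYTES_PER_TOKEN / GIB
    kv_quant = kv_bf16 * kv_bits / 16.0
    total_bf16 = WEIGHTS_GIB + kv_bf16
    total_quant = WEIGHTS_GIB + kv_quant
    return {
        "kv_ratio": kv_bf16 / kv_quant,
        "total_ratio": total_bf16 / total_quant,
        "total_saving": 1.0 - total_quant / total_bf16,
    }


reports = {n: memory_report(n, 2.5) for n in (32768, 131072, 262144)}
for n, r in reports.items():
    print(n, round(r["kv_ratio"], 2), round(r["total_ratio"], 2),
          f'{r["total_saving"]:.1%}')
```

Note the difference between the ratio and the saving: a 1.23x total ratio corresponds to a footprint reduction of about 19%, not 23%.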

7. 🧩 Real Qwen KV-cache Quantization Shows Controlled Degradation

The qwen_kv_cache_quantization_proxy experiment used real past_key_values from the model, not synthetic tensors.

{
  "source": "actual_model_past_key_values",
  "seq_len": 2048
}

Effective bits	Reconstruction distortion	Attention inner-product sqerr	bf16 compression ratio
3.5	0.0196	0.000076	4.57x
2.5	0.0673	0.000261	6.4x

The degradation is monotonic and expected:

3.5-bit: lower distortion, safer mode
2.5-bit: stronger compression, higher distortion

The error increase from 3.5-bit to 2.5-bit is approximately:

reconstruction distortion: 0.0673 / 0.0196 ≈ 3.44x
attention IP sqerr:        0.000261 / 0.000076 ≈ 3.41x

Interpretation: moving from 3.5-bit to 2.5-bit provides additional compression, but it increases reconstruction and attention inner-product error by roughly 3.4x.

8. 🪡 Needle-in-a-Haystack: 3.5-bit Preserved Retrieval

The needle_in_haystack_qwen_generation experiment tested whether the model could retrieve a hidden code from a long context of approximately 32785 tokens.

{
  "baseline_accuracy": 1.0,
  "required_accuracy": {
    "3.5": 1.0
  }
}

Both full precision and TurboQuant 3.5-bit retrieved the correct code:

expected_code = TQ-946382
full_precision_cache contains_code = true
turboquant_required_cache_3.5bit contains_code = true

Result: the conservative 3.5-bit KV-cache mode preserved exact retrieval in this 32k-context Needle-in-a-Haystack test.

Important

This was a single-trial smoke test at position 0.5. It is useful evidence, but not a statistically complete benchmark.

A stronger evaluation should include:

multiple seeds;
multiple needle positions;
several hidden strings;
repeated runs across prompt variants.

Also, this experiment validates quality under a Python-level simulated cache mutation. It does not prove real compressed-cache residency or production-level speedup.
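
A stronger evaluation grid along the lines listed above could be set up as follows. This is a hypothetical sketch: the filler sentence, the needle wording, the code format, and all function names are assumptions, and the model generation call is omitted because it depends on the harness.

```python
import random


def build_needle_prompt(filler, needle, position, n_sentences):
    """Place `needle` at a fractional depth inside `n_sentences` filler sentences."""
    sentences = [filler] * n_sentences
    index = round(position * n_sentences)
    sentences.insert(index, needle)
    return " ".join(sentences), index


def contains_code(output: str, code: str) -> bool:
    """Exact-substring check, analogous to the contains_code flag in the report."""
    return code in output


def make_code(rng: random.Random) -> str:
    """Random hidden string in an assumed 'TQ-' plus six digits format."""
    return f"TQ-{rng.randrange(100000, 1000000)}"


grid = []
for seed in range(3):                      # multiple seeds / hidden strings
    rng = random.Random(seed)
    code = make_code(rng)
    for position in (0.1, 0.5, 0.9):       # multiple needle depths
        prompt, index = build_needle_prompt(
            "The sky was clear that day.",
            f"The secret code is {code}.",
            position,
            1000,
        )
        grid.append((seed, position, code, prompt, index))

print(len(grid))
```

Each prompt would then be run under both the full-precision and the quantized cache, and `contains_code` applied to the generated text to get an accuracy per position.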

9. 📚 LongBench-E-style Subset: 3.5-bit Matched Baseline

The longbench_e_subset_qwen_generation experiment ran successfully on local LongBench data.

{
  "source": "local_dir",
  "configs": ["narrativeqa", "qasper", "hotpotqa"],
  "n_examples": 6
}

Scores:

{
  "full_precision_cache": 0.5,
  "turboquant_required_cache_3.5bit": 0.5
}

Drop relative to full precision:

{
  "turboquant_required_cache_3.5bit": 0.0
}

So, on this compact subset, 3.5-bit did not reduce answer containment score relative to full precision.

Note

The absolute score of 0.5 means that the baseline itself answered only half of the examples correctly under this containment metric.

This should be interpreted as a relative sanity check:

3.5-bit KV-cache did not degrade compact LongBench-E-style containment score versus full precision.

It should not be treated as a full official LongBench reproduction, because the harness used a compact answer containment metric rather than the official task-specific scoring pipeline.
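
For reference, a compact containment metric of this kind might look like the sketch below. The normalization (lowercasing and whitespace collapsing) and the function names are assumptions, and the two toy examples are illustrative, not LongBench data.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def containment_score(predictions, references) -> float:
    """Fraction of examples whose prediction contains at least one reference answer."""
    hits = 0
    for pred, answers in zip(predictions, references):
        pred_norm = normalize(pred)
        if any(normalize(a) in pred_norm for a in answers):
            hits += 1
    return hits / len(predictions)


# Toy data for illustration only.
preds = ["The answer is Paris.", "I am not sure."]
refs = [["Paris"], ["Berlin", "Munich"]]
score = containment_score(preds, refs)
print(score)
```

A metric like this is coarser than task-specific scoring such as token-level F1, which is why equal scores here are only a relative sanity check.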

10. ⚡ Throughput Sanity Check: Python Simulation Does Not Speed Up Decode

The qwen_throughput_sanity_check experiment used:

128 generated tokens;
1 warmup run;
3 measured repeats.

Mode	Decode tok/s	Speed ratio vs full precision
Full precision cache	19.45	1.00
Simulated 3.5-bit cache	14.53	0.75
Simulated 2.5-bit cache	14.56	0.75

The Python-level simulated quantized cache was slower than full precision:

3.5-bit simulated decode speed ≈ 75% of full precision
2.5-bit simulated decode speed ≈ 75% of full precision

This is expected. The current harness performs additional KV-cache mutation at the Python/PyTorch level and does not use fused kernels.

Key lesson: bit-level KV-cache compression should not be automatically interpreted as end-to-end inference acceleration.

Real speedup requires a kernel-level implementation where the quantized cache is:

stored compactly;
accessed efficiently;
dequantized or consumed with minimal overhead;
integrated into the inference engine.
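
The measurement protocol (warmup, then repeated timed runs) can be sketched generically. This is not the harness code: `fake_generate` is a stand-in for a real generation call, and the sketch times the whole call, whereas the harness may separate prefill from decode.

```python
import time
from statistics import mean


def decode_tokens_per_second(generate, new_tokens, warmup=1, repeats=3):
    """Time `generate(new_tokens)` and return mean tokens/s over the measured repeats."""
    for _ in range(warmup):
        generate(new_tokens)          # warmup runs are not timed
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate(new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(new_tokens / elapsed)
    return mean(rates)


def fake_generate(n):
    """Stand-in for model generation (assumption).

    A real GPU measurement would also need torch.cuda.synchronize()
    before reading the clock, because CUDA kernels run asynchronously.
    """
    time.sleep(0.01)


rate = decode_tokens_per_second(fake_generate, new_tokens=128, warmup=1, repeats=3)

# Speed ratio from the reported numbers in the table above.
speed_ratio = 14.53 / 19.45
print(round(speed_ratio, 2))
```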

11. 🔎 ANN Retrieval: Faster Online Indexing, Lower Recall than PQ

The ANN benchmark used Qwen hidden-state mean-pool embeddings:

{
  "source": "qwen_hidden_state_mean_pool_embeddings",
  "db_size": 1024,
  "query_size": 128,
  "projected_dim": 256,
  "k": 10
}

TurboQuant vs Scalar No-Rotation

The required comparison tested TurboQuant against scalar quantization without rotation at 2, 3, and 4 bits.

Bits	TurboQuant recall@10	Scalar recall@10	Result
2	0.393	0.298	TurboQuant wins
3	0.666	0.546	TurboQuant wins
4	0.795	0.754	TurboQuant wins

Conclusion: at 2, 3, and 4 bits, TurboQuant's random rotation improves recall@10 over naive scalar quantization.

The 1-bit regime was the exception, with scalar no-rotation ahead:

TurboQuant 1-bit: 0.184
scalar 1-bit:     0.238

Because the 1-bit setting was unstable on real Qwen mean-pool embeddings, the required benchmark focused on 2, 3, and 4 bits.

TurboQuant vs PQ-kmeans

PQ-kmeans achieved higher recall at every tested bit-width:

Bits	TurboQuant recall@10	PQ-kmeans recall@10
1	0.184	0.331
2	0.393	0.707
3	0.666	0.766
4	0.795	0.880

So, in this ANN benchmark, TurboQuant did not outperform the trained PQ baseline on pure recall.

However, TurboQuant was dramatically faster to build:

Bits	TurboQuant index time	PQ index time	PQ / TurboQuant index cost
1	0.00089 s	0.10787 s	121.8x
2	0.00086 s	0.12789 s	149.5x
3	0.00055 s	0.19566 s	357.4x
4	0.00059 s	0.27165 s	460.3x

Fair interpretation: TurboQuant is especially strong as an online indexing method. It is much cheaper to build than trained PQ and improves over scalar no-rotation at moderate bit-widths, but it does not necessarily beat trained PQ-kmeans on recall.
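
For readers unfamiliar with the metric, recall@k compares the approximate top-k neighbors (found on quantized vectors) with the exact top-k. The sketch below assumes inner-product similarity and brute-force search; the benchmark's actual similarity measure and search code may differ, and the toy vectors are illustrative.

```python
def top_k(query, database, k):
    """Indices of the k database vectors with the largest inner product with `query`."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in database]
    return sorted(range(len(database)), key=lambda i: -scores[i])[:k]


def recall_at_k(queries, exact_db, approx_db, k):
    """Mean overlap between exact and approximate top-k neighbor sets."""
    total = 0.0
    for q in queries:
        exact = set(top_k(q, exact_db, k))
        approx = set(top_k(q, approx_db, k))
        total += len(exact & approx) / k
    return total / len(queries)


# Toy example: the "quantized" database swaps two vectors, so one of the
# two true neighbors is lost.
exact_db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
approx_db = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
queries = [[1.0, 0.0]]

recall = recall_at_k(queries, exact_db, approx_db, k=2)
print(recall)
```

In the benchmark setting, `approx_db` would hold the dequantized TurboQuant, scalar, or PQ reconstructions of the same 1024 database vectors, with k = 10.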

✅ What the Experiments Confirmed

The full run supports the technical validity of the main TurboQuant mechanisms:

Random rotation effectively spreads coordinate energy.
TurboQuant-mse shows approximately exponential reconstruction distortion decay.
TurboQuant-prod with QJL residual correction significantly reduces inner-product bias.
Real Qwen KV-cache quantization shows controlled distortion growth.
The 3.5-bit mode preserved quality in Needle-in-a-Haystack and compact LongBench-E-style checks.
ANN experiments show TurboQuant improves over scalar no-rotation at 2–4 bits.
TurboQuant has a very large index construction cost advantage over PQ-kmeans.

⚠️ What the Experiments Do Not Prove

The same results also confirm several important limitations:

8x and 6.4x are bit-level compression ratios, not end-to-end speedups.
TurboQuant compresses KV-cache, not model weights.
Total model+KV memory improves much less than KV-cache memory alone.
Python-level simulation is slower than full precision decode due to overhead.
Production speedup requires fused kernels or inference-engine integration.
TurboQuant did not beat PQ-kmeans on ANN recall in this benchmark.

🧭 Practical Interpretation

3.5-bit KV-cache

Use this as the conservative quality-preserving mode.

3.5-bit KV-cache:
  safer compression mode;
  preserved Needle and compact LongBench-E-style scores in this run;
  provides 4.57x KV-cache compression relative to bf16.

2.5-bit KV-cache

Use this as an aggressive compression mode.

2.5-bit KV-cache:
  stronger compression mode;
  provides 6.4x KV-cache compression relative to bf16;
  introduces substantially higher reconstruction and attention inner-product error;
  should be treated as a stress / memory-pressure mode, not guaranteed quality-preserving.

ANN Retrieval

Use TurboQuant when online quantization cost matters.

ANN retrieval:
  useful for online indexing without codebook training;
  better than scalar no-rotation at 2–4 bits;
  worse than PQ-kmeans on recall in this benchmark;
  much faster than PQ-kmeans for index construction.

🧾 Final Review Statement

The experimental evaluation shows that TurboQuant is a technically meaningful method, not merely a marketing artifact. Its key components — random rotation, scalar quantization, and QJL residual correction — work reproducibly in synthetic tests, on real Qwen/Qwen3.6-27B KV-cache tensors, and in compact long-context quality checks.

At the same time, the experiments show that headline compression ratios require strict context. The 6.4x figure applies to KV-cache values when moving from bf16 to a 2.5-bit effective representation. It does not imply a sixfold reduction in total model memory.

For Qwen/Qwen3.6-27B, the total model+KV memory footprint shrinks by approximately:

32k context:  ~3% (1.03x ratio)
262k context: ~19% (1.23x ratio)

Similarly, Python-level simulation does not demonstrate decode acceleration. Without specialized kernels, quantized-cache simulation can be slower than full precision.

Bottom line: TurboQuant is a strong academic and engineering method for KV-cache compression and online vector quantization, especially promising for long-context inference and online retrieval. Its real-world impact depends on context length, bit-width regime, and the quality of the kernel-level implementation.
