Verbasik/TurboQuant-True-or-False

TurboQuant on Qwen/Qwen3.6-27B

Experimental review of TurboQuant-style KV-cache compression, vector quantization behavior, retrieval quality, and throughput implications on a single 96 GB Blackwell GPU. Technical overview on my blog: https://verbasik.github.io/warp-zone-folio/#/blog/turboquant

📌 At a Glance

A full experimental run was completed for Qwen/Qwen3.6-27B in bfloat16 mode on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition GPU with 96 GB VRAM.

PASS = 14
WARN = 0
FAIL = 0
SKIP = 0

The full harness executed successfully:

  • ✅ model loading
  • ✅ theoretical TurboQuant checks
  • ✅ real Qwen KV-cache quantization proxy
  • ✅ Needle-in-a-Haystack
  • ✅ compact LongBench-E-style evaluation
  • ✅ throughput sanity check
  • ✅ ANN retrieval benchmark

Main takeaway: TurboQuant shows strong technical properties as a method for KV-cache compression and online vector quantization, but the experiments also confirm an important caveat: headline ratios such as 8x and 6.4x refer primarily to the bit budget of KV-cache values, not to total model memory and not to end-to-end inference speedup.


🧪 Experimental Setup

The experiments were run with the following model/runtime configuration:

{
  "model_id": "Qwen/Qwen3.6-27B",
  "device": "cuda:0",
  "dtype": "bfloat16",
  "model_class": "AutoModelForCausalLM",
  "requested_attn_implementation": "sdpa",
  "resolved_attn_implementation": "sdpa"
}

The model was successfully loaded through AutoModelForCausalLM.

The run used the sdpa attention backend, which avoided an external dependency on flash-attn and allowed the full experiment suite to complete on a single GPU.

Full Launch Command

python3 turboquant.py \
  --suite all \
  --model_id Qwen/Qwen3.6-27B \
  --load_model \
  --local_files_only \
  --torch_dtype bfloat16 \
  --attn_implementation sdpa \
  --needle_context_tokens 32768 \
  --needle_required_bits 3.5 \
  --needle_trials 1 \
  --needle_positions 0.5 \
  --run_longbench \
  --longbench_required_bits 3.5 \
  --longbench_max_examples 6 \
  --longbench_max_prompt_tokens 32768 \
  --longbench_local_dir data \
  --run_throughput \
  --throughput_new_tokens 128 \
  --throughput_warmup 1 \
  --throughput_repeats 3 \
  --ann_use_qwen_embeddings \
  --ann_required_scalar_bits 2,3,4 \
  --ann_pq_index_speedup_threshold 10 \
  --out_dir runs/tq_qwen36_27b_full_sdpa

📊 Summary of Results

| Area | Result | Interpretation |
| --- | --- | --- |
| Model loading | ✅ PASS | Qwen/Qwen3.6-27B loaded successfully in bfloat16 |
| Compression arithmetic | ✅ PASS | float32 → 4-bit = 8x is arithmetically correct |
| Random rotation | ✅ PASS | Energy is spread uniformly across coordinates |
| TurboQuant-mse | ✅ PASS | Distortion decreases approximately exponentially with bit-width |
| TurboQuant-prod / QJL | ✅ PASS | Inner-product bias is reduced by almost 5x |
| KV-cache accounting | ✅ PASS | KV-cache compresses strongly, but total memory improves much less |
| Real Qwen KV proxy | ✅ PASS | 3.5-bit and 2.5-bit show controlled distortion growth |
| Needle-in-a-Haystack | ✅ PASS | 3.5-bit preserved exact code retrieval at 32k context |
| LongBench-E-style subset | ✅ PASS | 3.5-bit matched full precision on compact containment score |
| Throughput sanity check | ✅ PASS | Python simulation is slower than full precision; no production speedup claimed |
| ANN retrieval | ✅ PASS | TurboQuant beats scalar no-rotation at 2–4 bits and indexes much faster than PQ |

1. 🔢 Compression Arithmetic

The first experiment checked the basic bit-budget arithmetic:

{
  "float32_bits": 49152,
  "float32_bytes": 6144,
  "quant_bits": 6144,
  "quant_bytes": 768,
  "compression_32_to_4bit": 8.0
}

For a vector of dimension 1536, moving from float32 to a 4-bit representation gives:

49152 bits / 6144 bits = 8x

So the statement:

32-bit → 4-bit = 8x

is arithmetically correct.
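The same arithmetic can be reproduced in a few lines. This is a minimal sketch, not the harness code: the function name `compression_ratio` is hypothetical, and it counts only the raw bit budget of the coordinates, ignoring any per-vector metadata such as scales or norms.

```python
def compression_ratio(dim: int, source_bits: float, quant_bits: float) -> dict:
    """Bit-budget arithmetic for one vector (coordinates only, no metadata)."""
    src = dim * source_bits
    quant = dim * quant_bits
    return {
        "source_bits": src,
        "source_bytes": src / 8,
        "quant_bits": quant,
        "quant_bytes": quant / 8,
        "ratio": src / quant,
    }


# float32 -> 4-bit for a 1536-dimensional vector, as in the report above.
fp32_to_4bit = compression_ratio(dim=1536, source_bits=32, quant_bits=4)
# bf16 -> 2.5 effective bits, the KV-cache headline ratio.
bf16_to_2p5 = compression_ratio(dim=1536, source_bits=16, quant_bits=2.5)

print(fp32_to_4bit["ratio"], bf16_to_2p5["ratio"])
```

The ratio depends only on the two bit-widths, which is why it says nothing about weights, kernels, or decode speed.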

Important

This is a compression ratio for a numeric representation. It should not be interpreted as an 8x end-to-end inference speedup.

Real LLM inference includes many additional components:

  • model weights
  • activation buffers
  • attention kernels
  • memory bandwidth effects
  • scheduler overhead
  • framework-level runtime costs

Therefore, this experiment validates the arithmetic while also highlighting why headline compression ratios need careful interpretation.


2. 🌀 Random Rotation Spreads Coordinate Energy

The random_rotation_energy_spreading experiment validated the key geometric mechanism behind TurboQuant:

{
  "dimension": 256,
  "energy_cv_before": 16.0,
  "energy_cv_after": 0.0,
  "mean_max_abs_before": 1.0,
  "mean_max_abs_after": 0.0625,
  "coordinate_var_after": 0.0039062486,
  "expected_var": 0.00390625,
  "sqrt_d_scaled_var": 0.9999996424,
  "sqrt_d_scaled_kurtosis": 2.9816138744
}

Before rotation, the vector energy was concentrated in a small number of coordinates. After signed-Hadamard rotation, the energy became almost perfectly distributed.

The most important agreement is:

coordinate_var_after ≈ expected_var = 1 / 256

The scaled kurtosis is also close to 3, which is consistent with an approximately normal coordinate distribution after scaling by sqrt(d).

Interpretation: random orthogonal rotation transforms an unfavorable coordinate geometry into a more homogeneous one. This makes simple coordinate-wise scalar quantization much more effective.
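The effect is easy to see on a worst-case input. The sketch below is an illustration in pure Python, not the harness implementation: it assumes a one-hot input vector (consistent with the reported mean_max_abs_before of 1.0, but not confirmed by the source), and the helper names `fwht` and `signed_hadamard` are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def signed_hadamard(vec, signs):
    """Flip signs with a fixed random +/-1 pattern, then apply the Hadamard transform."""
    return fwht([s * x for s, x in zip(signs, vec)])


d = 256
rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

one_hot = [0.0] * d
one_hot[7] = 1.0  # assumption: all energy starts in a single coordinate

rotated = signed_hadamard(one_hot, signs)
max_abs = max(abs(x) for x in rotated)   # every coordinate has magnitude 1/sqrt(d)
energy = sum(x * x for x in rotated)     # the rotation preserves the squared norm

print(max_abs, energy)
```

After rotation every coordinate carries the same energy 1/d, so a single scalar quantizer grid fits all coordinates equally well.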


3. 📉 TurboQuant-mse: Distortion Decreases Approximately as 4^-b

The turboquant_mse_4_power_minus_b experiment checked whether reconstruction distortion decreases approximately as 4^-b.

| Bits per coordinate | Measured distortion | Shannon lower bound 4^-b | Constant factor |
| --- | --- | --- | --- |
| 1 | 0.3617 | 0.2500 | 1.45 |
| 2 | 0.1169 | 0.0625 | 1.87 |
| 3 | 0.0343 | 0.0156 | 2.19 |
| 4 | 0.0094 | 0.0039 | 2.41 |

The estimated log-scale slope was:

{
  "log2_distortion_slope_per_bit": -1.757,
  "target_slope": -2.0
}

The ideal 4^-b decay corresponds to a slope of -2 in log2 scale. The measured value -1.757 is not perfect, but it clearly follows the same exponential trend.

The gap is expected in a finite-dimensional implementation with:

  • finite block size;
  • pure-PyTorch Lloyd-Max quantization;
  • non-asymptotic constants.

Conclusion: TurboQuant-mse empirically follows the expected exponential distortion decay, with the gap to the lower bound appearing as a multiplicative factor that grows from 1.45 at 1 bit to 2.41 at 4 bits.
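The reported slope can be recovered from the four table values alone. The sketch below fits an ordinary least-squares line to log2(distortion) against bit-width; that the harness uses exactly this fit is an assumption, and `ls_slope` is a hypothetical helper name.

```python
import math

bits = [1, 2, 3, 4]
distortion = [0.3617, 0.1169, 0.0343, 0.0094]  # measured values from the table above


def ls_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# An ideal 4^-b decay would give a slope of exactly -2 in log2 scale.
slope = ls_slope(bits, [math.log2(v) for v in distortion])

# Ratio of measured distortion to the 4^-b bound at each bit-width.
factors = [v / 4.0 ** -b for b, v in zip(bits, distortion)]

print(round(slope, 3), [round(f, 2) for f in factors])
```

Because the factor over the bound increases with bit-width in this range, the fitted slope comes out shallower than -2.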


4. 🎯 QJL Residual Correction Reduces Inner-Product Bias

The inner_product_bias_mse_vs_prod experiment tested the core idea behind TurboQuant-prod.

MSE-oriented quantization can reconstruct vectors well, but it may introduce systematic bias in inner products. QJL residual correction is intended to reduce this bias.

{
  "mse_only_mean_abs_bias": 0.0099148,
  "prod_qjl_mean_abs_bias": 0.0019973,
  "mse_only_signed_bias": -0.0099148,
  "prod_qjl_signed_bias": 0.0000376,
  "bias_reduction_factor": 4.964
}

The mean absolute bias was reduced by approximately:

0.0099148 / 0.0019973 ≈ 4.96x

The signed bias was almost eliminated:

-0.0099148 → 0.0000376

This is one of the strongest positive results in the experiment suite.

Interpretation: QJL residual correction does not merely add random noise. It meaningfully compensates for systematic inner-product bias introduced by MSE quantization.

This matters directly for:

  • attention;
  • retrieval;
  • nearest-neighbor search;
  • ranking based on dot products.
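To give a feel for why a sign-based projection can correct bias rather than just add noise, here is a minimal sketch of a 1-bit sign-projection inner-product estimator in the style of QJL. It is not the harness code: the function name `qjl_estimate` is hypothetical, the projection is a dense Gaussian matrix sampled row by row, and the estimator is applied to a whole vector for brevity, whereas TurboQuant-prod applies the correction to the quantization residual. It relies on the identity E[sign(s·x)(s·y)] = sqrt(2/π)·⟨x, y⟩/‖x‖ for a standard Gaussian vector s.

```python
import math
import random


def qjl_estimate(x, y, num_proj, rng):
    """Estimate <x, y> keeping only the sign of each projection of x."""
    norm_x = math.sqrt(sum(v * v for v in x))
    total = 0.0
    for _ in range(num_proj):
        s = [rng.gauss(0.0, 1.0) for _ in x]       # one Gaussian projection row
        sx = sum(a * b for a, b in zip(s, x))
        sy = sum(a * b for a, b in zip(s, y))
        total += (1.0 if sx >= 0 else -1.0) * sy   # x contributes 1 bit, y stays full precision
    # Rescale so that the expectation equals the true inner product.
    return math.sqrt(math.pi / 2.0) * norm_x * total / num_proj


x = [0.6, 0.8, 0.0, 0.0]
y = [0.8, 0.6, 0.0, 0.0]
exact = sum(a * b for a, b in zip(x, y))
estimate = qjl_estimate(x, y, num_proj=20000, rng=random.Random(0))

print(exact, estimate)
```

The estimate is noisy for a small number of projections but centered on the exact value, which is the property that matters for attention scores and dot-product ranking.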

5. TurboQuant-prod Inner-Product Error Also Decays Exponentially

The turboquant_prod_inner_product_rate experiment measured inner-product error across different bit-widths.

| Bits | Inner-product MSE | Proxy bound | Constant factor |
| --- | --- | --- | --- |
| 1 | 0.00613 | 0.00391 | 1.57 |
| 2 | 0.00219 | 0.00098 | 2.24 |
| 3 | 0.00071 | 0.00024 | 2.90 |
| 4 | 0.00021 | 0.00006 | 3.45 |

The estimated slope was:

{
  "log2_ip_error_slope_per_bit": -1.622,
  "target_slope_rough": -2.0
}

The result is weaker than the ideal -2 slope, but it still shows stable exponential improvement as bit-width increases.

Practical meaning: each additional bit substantially improves inner-product accuracy, although finite-dimensional constants remain visible.


6. 🧠 KV-cache Compresses Strongly; Total Model Memory Does Not

The qwen_kv_memory_accounting experiment tested a central practical question:

Does TurboQuant reduce total model memory by the same factor as it compresses KV-cache?

For Qwen/Qwen3.6-27B, the answer is no.

| Context tokens | KV bf16 | KV 2.5-bit | KV compression | Total bf16 model+KV | Total 2.5-bit model+KV | Total ratio |
| --- | --- | --- | --- | --- | --- | --- |
| 32,768 | 2.0 GiB | 0.3125 GiB | 6.4x | 57.6 GiB | 55.91 GiB | 1.03x |
| 131,072 | 8.0 GiB | 1.25 GiB | 6.4x | 63.6 GiB | 56.85 GiB | 1.12x |
| 262,144 | 16.0 GiB | 2.5 GiB | 6.4x | 71.6 GiB | 58.1 GiB | 1.23x |

Formally, KV-cache compression is strong:

bf16 16-bit / 2.5-bit = 6.4x

But once model weights are included, the total memory ratio is much smaller:

32k context:  total memory ratio ≈ 1.03x
131k context: total memory ratio ≈ 1.12x
262k context: total memory ratio ≈ 1.23x

Note

TurboQuant can greatly reduce KV-cache memory, but it does not make a 27B model six times smaller.

The longer the context, the more important KV-cache becomes. For moderate context sizes, however, model weights remain the dominant memory component.
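The accounting behind the table can be sketched as follows. The two constants are assumptions inferred from the table itself, not values read from the harness: 55.6 GiB of weights (57.6 GiB total minus 2.0 GiB of KV-cache) and 2.0 GiB of bf16 KV-cache per 32,768 tokens. The function name `memory_report` is hypothetical.

```python
GIB = 1024 ** 3

# Assumptions inferred from the table above, not taken from the harness.
WEIGHTS_GIB = 55.6
KV_BF16_BYTES_PER_TOKEN = 2.0 * GIB / 32768


def memory_report(context_tokens: int, kv_bits: float) -> dict:
    """KV-cache and total (weights + KV) memory for bf16 vs a quantized cache."""
    kv_bf16 = context_tokens * KV_BF16_BYTES_PER_TOKEN / GIB
    kv_quant = kv_bf16 * kv_bits / 16.0          # only the cache shrinks
    total_bf16 = WEIGHTS_GIB + kv_bf16
    total_quant = WEIGHTS_GIB + kv_quant         # weights stay in bf16
    return {
        "kv_bf16_gib": kv_bf16,
        "kv_quant_gib": kv_quant,
        "kv_ratio": kv_bf16 / kv_quant,
        "total_ratio": total_bf16 / total_quant,
    }


for tokens in (32768, 131072, 262144):
    r = memory_report(tokens, kv_bits=2.5)
    print(tokens, round(r["kv_ratio"], 2), round(r["total_ratio"], 2))
```

The KV ratio is constant at 6.4x, while the total ratio only approaches it when the cache is much larger than the weights.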


7. 🧩 Real Qwen KV-cache Quantization Shows Controlled Degradation

The qwen_kv_cache_quantization_proxy experiment used real past_key_values from the model, not synthetic tensors.

{
  "source": "actual_model_past_key_values",
  "seq_len": 2048
}

| Effective bits | Reconstruction distortion | Attention inner-product squared error | Compression ratio vs bf16 |
| --- | --- | --- | --- |
| 3.5 | 0.0196 | 0.000076 | 4.57x |
| 2.5 | 0.0673 | 0.000261 | 6.4x |

The degradation is monotonic and expected:

3.5-bit: lower distortion, safer mode
2.5-bit: stronger compression, higher distortion

The error increase from 3.5-bit to 2.5-bit is approximately:

reconstruction distortion: 0.0673 / 0.0196 ≈ 3.44x
attention IP sqerr:        0.000261 / 0.000076 ≈ 3.41x

Interpretation: moving from 3.5-bit to 2.5-bit provides additional compression, but it increases reconstruction and attention inner-product error by roughly 3.4x.


8. 🪡 Needle-in-a-Haystack: 3.5-bit Preserved Retrieval

The needle_in_haystack_qwen_generation experiment tested whether the model could retrieve a hidden code from a long context of approximately 32785 tokens.

{
  "baseline_accuracy": 1.0,
  "required_accuracy": {
    "3.5": 1.0
  }
}

Both full precision and TurboQuant 3.5-bit retrieved the correct code:

expected_code = TQ-946382
full_precision_cache contains_code = true
turboquant_required_cache_3.5bit contains_code = true

Result: the conservative 3.5-bit KV-cache mode preserved exact retrieval in this 32k-context Needle-in-a-Haystack test.

Important

This was a single-trial smoke test at position 0.5. It is useful evidence, but not a statistically complete benchmark.

A stronger evaluation should include:

  • multiple seeds;
  • multiple needle positions;
  • several hidden strings;
  • repeated runs across prompt variants.

Also, this experiment validates quality under a Python-level simulated cache mutation. It does not prove real compressed-cache residency or production-level speedup.
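A broader evaluation along those lines could generate its test cases as sketched below. This is an illustration only: all helper names (`make_code`, `build_haystack`, `contains_code`, `needle_cases`) and the needle sentence are hypothetical, and the harness's actual prompt template is not shown in this report.

```python
import random


def make_code(rng):
    """Random code in the same shape as the one in the report, e.g. TQ-946382."""
    return "TQ-" + "".join(rng.choice("0123456789") for _ in range(6))


def build_haystack(filler_sentences, needle_sentence, position):
    """Insert the needle at a relative position in [0, 1] among the filler sentences."""
    idx = round(position * len(filler_sentences))
    return " ".join(filler_sentences[:idx] + [needle_sentence] + filler_sentences[idx:])


def contains_code(generated_text, code):
    """Exact-match containment check on the model output."""
    return code in generated_text


def needle_cases(num_seeds, positions, filler_sentences):
    """Yield (seed, position, code, prompt) for every seed and needle position."""
    for seed in range(num_seeds):
        code = make_code(random.Random(seed))
        needle = f"The secret code is {code}."
        for pos in positions:
            yield seed, pos, code, build_haystack(filler_sentences, needle, pos)


filler = [f"Filler sentence number {i}." for i in range(100)]
cases = list(needle_cases(num_seeds=3, positions=[0.1, 0.5, 0.9], filler_sentences=filler))
print(len(cases))
```

Each prompt would then be run once with the full-precision cache and once with the quantized cache, and `contains_code` applied to both generations.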


9. 📚 LongBench-E-style Subset: 3.5-bit Matched Baseline

The longbench_e_subset_qwen_generation experiment ran successfully on local LongBench data.

{
  "source": "local_dir",
  "configs": ["narrativeqa", "qasper", "hotpotqa"],
  "n_examples": 6
}

Scores:

{
  "full_precision_cache": 0.5,
  "turboquant_required_cache_3.5bit": 0.5
}

Drop relative to full precision:

{
  "turboquant_required_cache_3.5bit": 0.0
}

So, on this compact subset, 3.5-bit did not reduce answer containment score relative to full precision.

Note

The absolute score of 0.5 means that the baseline itself answered only half of the examples correctly under this containment metric.

This should be interpreted as a relative sanity check:

3.5-bit KV-cache did not degrade compact LongBench-E-style containment score versus full precision.

It should not be treated as a full official LongBench reproduction, because the harness used a compact answer containment metric rather than the official task-specific scoring pipeline.
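For reference, a containment metric of this kind can be sketched in a few lines. The normalization below (lowercasing, stripping punctuation, collapsing whitespace) is an assumption; the report does not specify how the harness normalizes text, and the function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def containment_score(predictions, references):
    """Fraction of examples whose prediction contains at least one reference answer."""
    hits = 0
    for pred, answers in zip(predictions, references):
        norm_pred = normalize(pred)
        norm_answers = [normalize(a) for a in answers]
        if any(a and a in norm_pred for a in norm_answers):
            hits += 1
    return hits / len(predictions)


preds = ["The answer is Paris.", "I do not know."]
refs = [["Paris"], ["Berlin"]]
score = containment_score(preds, refs)
print(score)
```

A metric like this is lenient toward verbose answers and blind to paraphrases, which is one reason it cannot stand in for the official task-specific scorers.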


10. ⚡ Throughput Sanity Check: Python Simulation Does Not Speed Up Decode

The qwen_throughput_sanity_check experiment used:

  • 128 generated tokens;
  • 1 warmup run;
  • 3 measured repeats.

| Mode | Decode tok/s | Speed ratio vs full precision |
| --- | --- | --- |
| Full precision cache | 19.45 | 1.00 |
| Simulated 3.5-bit cache | 14.53 | 0.75 |
| Simulated 2.5-bit cache | 14.56 | 0.75 |

The Python-level simulated quantized cache was slower than full precision:

3.5-bit simulated decode speed ≈ 75% of full precision
2.5-bit simulated decode speed ≈ 75% of full precision

This is expected. The current harness performs additional KV-cache mutation at the Python/PyTorch level and does not use fused kernels.

Key lesson: bit-level KV-cache compression should not be automatically interpreted as end-to-end inference acceleration.

Real speedup requires a kernel-level implementation where the quantized cache is:

  • stored compactly;
  • accessed efficiently;
  • dequantized or consumed with minimal overhead;
  • integrated into the inference engine.
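The measurement protocol (warmup runs, then averaged repeats) can be sketched as below. This is a simplified illustration with hypothetical names: `generate` stands for any callable that produces the requested number of tokens, and the stand-in used here just sleeps. On a GPU the callable would also need to synchronize the device before each timestamp, and a careful harness would time prefill and decode separately.

```python
import time


def decode_tok_per_s(generate, new_tokens, warmup=1, repeats=3):
    """Run `generate(new_tokens)` with warmup, then return mean tokens per second."""
    for _ in range(warmup):
        generate(new_tokens)                 # warmup runs are not timed
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate(new_tokens)
        elapsed.append(time.perf_counter() - start)
    return new_tokens / (sum(elapsed) / len(elapsed))


def speed_ratio(mode_tok_s, baseline_tok_s):
    """Throughput of a cache mode relative to the full-precision baseline."""
    return mode_tok_s / baseline_tok_s


calls = []


def fake_generate(n):
    """Stand-in for model.generate; records the call and sleeps briefly."""
    calls.append(n)
    time.sleep(0.001)


measured = decode_tok_per_s(fake_generate, new_tokens=128, warmup=1, repeats=3)
print(round(speed_ratio(14.53, 19.45), 2), round(speed_ratio(14.56, 19.45), 2))
```

Applied to the reported numbers, both simulated modes land at about 0.75 of the full-precision throughput.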

11. 🔎 ANN Retrieval: Faster Online Indexing, Lower Recall than PQ

The ANN benchmark used Qwen hidden-state mean-pool embeddings:

{
  "source": "qwen_hidden_state_mean_pool_embeddings",
  "db_size": 1024,
  "query_size": 128,
  "projected_dim": 256,
  "k": 10
}

TurboQuant vs Scalar No-Rotation

The required comparison tested TurboQuant against scalar quantization without rotation at 2, 3, and 4 bits.

| Bits | TurboQuant recall@10 | Scalar recall@10 | Result |
| --- | --- | --- | --- |
| 2 | 0.393 | 0.298 | TurboQuant wins |
| 3 | 0.666 | 0.546 | TurboQuant wins |
| 4 | 0.795 | 0.754 | TurboQuant wins |

Conclusion: at moderate bit-widths, TurboQuant's random rotation improves recall over naive scalar quantization.

The exception was the 1-bit regime:

TurboQuant 1-bit: 0.184
scalar 1-bit:     0.238

The 1-bit setting was unstable on real Qwen mean-pool embeddings, so the required benchmark focused on 2, 3, and 4 bits.


TurboQuant vs PQ-kmeans

PQ-kmeans achieved higher recall at every tested bit-width:

| Bits | TurboQuant recall@10 | PQ-kmeans recall@10 |
| --- | --- | --- |
| 1 | 0.184 | 0.331 |
| 2 | 0.393 | 0.707 |
| 3 | 0.666 | 0.766 |
| 4 | 0.795 | 0.880 |

So, in this ANN benchmark, TurboQuant did not outperform the trained PQ baseline on pure recall.

However, TurboQuant was dramatically faster to build:

| Bits | TurboQuant index time | PQ index time | PQ / TurboQuant index cost |
| --- | --- | --- | --- |
| 1 | 0.00089 s | 0.10787 s | 121.8x |
| 2 | 0.00086 s | 0.12789 s | 149.5x |
| 3 | 0.00055 s | 0.19566 s | 357.4x |
| 4 | 0.00059 s | 0.27165 s | 460.3x |

Fair interpretation: TurboQuant is especially strong as an online indexing method. It is much cheaper to build than trained PQ and improves over scalar no-rotation at moderate bit-widths, but it does not necessarily beat trained PQ-kmeans on recall.
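For readers unfamiliar with the metric, recall@k compares the top-k neighbors found on the quantized database with the top-k found on the original vectors. The sketch below is a brute-force illustration with hypothetical names; it ranks by inner product, which is an assumption, since the report does not state which similarity the harness uses.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Indices of the k largest scores, best first (ties keep original order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(queries, db_exact, db_approx, k):
    """Mean overlap between exact and approximate top-k neighbor sets."""
    total = 0.0
    for q in queries:
        exact = set(top_k([dot(q, v) for v in db_exact], k))
        approx = set(top_k([dot(q, v) for v in db_approx], k))
        total += len(exact & approx) / k
    return total / len(queries)


db_exact = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
# A crude stand-in for a quantized database: the third vector collapses onto the second.
db_approx = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
queries = [[1.0, 0.0]]

print(recall_at_k(queries, db_exact, db_exact, k=2),
      recall_at_k(queries, db_exact, db_approx, k=2))
```

In the toy example the distorted vector drops out of the top 2, so recall falls from 1.0 to 0.5; the benchmark tables above report the same quantity at k = 10.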


✅ What the Experiments Confirmed

The full run supports the technical validity of the main TurboQuant mechanisms:

  1. Random rotation effectively spreads coordinate energy.
  2. TurboQuant-mse shows approximately exponential reconstruction distortion decay.
  3. TurboQuant-prod with QJL residual correction significantly reduces inner-product bias.
  4. Real Qwen KV-cache quantization shows controlled distortion growth.
  5. The 3.5-bit mode preserved quality in Needle-in-a-Haystack and compact LongBench-E-style checks.
  6. ANN experiments show TurboQuant improves over scalar no-rotation at 2–4 bits.
  7. TurboQuant has a very large index construction cost advantage over PQ-kmeans.

⚠️ What the Experiments Do Not Prove

The same results also confirm several important limitations:

  1. 8x and 6.4x are bit-level compression ratios, not end-to-end speedups.
  2. TurboQuant compresses KV-cache, not model weights.
  3. Total model+KV memory improves much less than KV-cache memory alone.
  4. Python-level simulation is slower than full precision decode due to overhead.
  5. Production speedup requires fused kernels or inference-engine integration.
  6. TurboQuant did not beat PQ-kmeans on ANN recall in this benchmark.

🧭 Practical Interpretation

3.5-bit KV-cache

Use this as the conservative quality-preserving mode.

3.5-bit KV-cache:
  safer compression mode;
  preserved Needle and compact LongBench-E-style scores in this run;
  provides 4.57x KV-cache compression relative to bf16.

2.5-bit KV-cache

Use this as an aggressive compression mode.

2.5-bit KV-cache:
  stronger compression mode;
  provides 6.4x KV-cache compression relative to bf16;
  introduces substantially higher reconstruction and attention inner-product error;
  should be treated as a stress / memory-pressure mode, not guaranteed quality-preserving.

ANN Retrieval

Use TurboQuant when online quantization cost matters.

ANN retrieval:
  useful for online indexing without codebook training;
  better than scalar no-rotation at 2–4 bits;
  worse than PQ-kmeans on recall in this benchmark;
  much faster than PQ-kmeans for index construction.

🧾 Final Review Statement

The experimental evaluation shows that TurboQuant is a technically meaningful method, not merely a marketing artifact. Its key components (random rotation, scalar quantization, and QJL residual correction) work reproducibly in synthetic tests, on real Qwen/Qwen3.6-27B KV-cache tensors, and in compact long-context quality checks.

At the same time, the experiments show that headline compression ratios require strict context. The 6.4x figure applies to KV-cache values when moving from bf16 to a 2.5-bit effective representation. It does not imply a sixfold reduction in total model memory.

For Qwen/Qwen3.6-27B, the total model+KV memory ratio with a 2.5-bit cache is approximately:

32k context:  1.03x (about 3% less total memory)
262k context: 1.23x (about 19% less total memory)

Similarly, Python-level simulation does not demonstrate decode acceleration. Without specialized kernels, quantized-cache simulation can be slower than full precision.

Bottom line: TurboQuant is a strong academic and engineering method for KV-cache compression and online vector quantization, especially promising for long-context inference and online retrieval. Its real-world impact depends on context length, bit-width regime, and the quality of the kernel-level implementation.
