
Question on Performance Comparison using Different Cache Bit Precision #46

Open
soumendukrg opened this issue Oct 19, 2024 · 0 comments


soumendukrg commented Oct 19, 2024

Testing the impact of KV cache quantization on the performance of the Llama 2 model shows a drop in tokens/sec as the cache bit width is reduced, even though the expected reduction in cache memory is observed.

Command (run once for each of --cache_bits 4, 8, and 16; see the sweep sketch below):
python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16
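
For reproducibility, a minimal sketch of the sweep, assuming the only flag that changes between runs is `--cache_bits` ("4/8/16" in the command above is shorthand for one run per bit width):

```python
import subprocess

# Hypothetical sweep: one generate.py run per cache bit width,
# reusing the exact flags from the command above.
for bits in (4, 8, 16):
    subprocess.run(
        [
            "python", "generate.py",
            "--cache_strategy", "full",
            "--prompt", "What is a cold compress?",
            "--checkpoint_path", "./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth",
            "--device", "cuda:0",
            "--cache_bits", str(bits),
        ],
        check=True,
    )
```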

Bits: 4

  • Decode tokens per sec: 13.57
  • Cache memory used: 0.07 GB

Bits: 8

  • Decode tokens per sec: 17.56
  • Cache memory used: 0.13 GB

Bits: 16

  • Decode tokens per sec: 26.09
  • Cache memory used: 0.26 GB

Is this reduction in decode throughput expected? Is it caused by the extra quantize-dequantize operations?
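
For context on the second question, here is a minimal sketch (not this repo's actual kernels; the helper names are hypothetical) of why a quantized KV cache can cost decode throughput: if the cache is stored as int4/int8 and dequantized on read, every decode step pays an elementwise dequantization over the entire cached K/V before the attention matmuls.

```python
import torch

def dequantize_cache(q_cache: torch.Tensor, scale: torch.Tensor,
                     zero_point: torch.Tensor) -> torch.Tensor:
    """Affine int dequantization: x ≈ (q - zero_point) * scale.

    Hypothetical helper for illustration; real implementations may fuse
    this into a custom attention kernel.
    """
    return (q_cache.float() - zero_point) * scale

def decode_step(query, q_keys, q_values, k_scale, k_zp, v_scale, v_zp):
    # Extra per-token work when the cache is quantized: the whole cached
    # K/V is dequantized before the two attention matmuls.
    keys = dequantize_cache(q_keys, k_scale, k_zp)      # [T, D]
    values = dequantize_cache(q_values, v_scale, v_zp)  # [T, D]
    scores = torch.softmax(query @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return scores @ values                              # [1, D]
```

If the dequantization is not fused into the attention kernel, this extra work is proportional to the cache length and runs at every decode step, which would be consistent with throughput falling at lower bit widths even while memory use drops.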
