
Question on Performance Comparison using Different Cache Bit Precision #46

Open
soumendukrg opened this issue Oct 19, 2024 · 0 comments


soumendukrg commented Oct 19, 2024

Testing the impact of KV cache quantization on the performance of the Llama 2 model shows a drop in tokens/sec as the cache bit width is reduced, even though the expected reduction in cache memory is observed.

Command (run once for each of --cache_bits 4, 8, and 16; see the sweep sketch below):
python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16
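
For reproducibility, a minimal sketch of the sweep, assuming the only flag that changes between runs is `--cache_bits` ("4/8/16" in the command above is shorthand for one run per bit width):

```python
import subprocess

# Hypothetical sweep: one generate.py run per cache bit width,
# reusing the exact flags from the command above.
for bits in (4, 8, 16):
    subprocess.run(
        [
            "python", "generate.py",
            "--cache_strategy", "full",
            "--prompt", "What is a cold compress?",
            "--checkpoint_path", "./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth",
            "--device", "cuda:0",
            "--cache_bits", str(bits),
        ],
        check=True,
    )
```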

Bits: 4

  • Decode tokens per sec: 13.57
  • Cache memory used: 0.07 GB

Bits: 8

  • Decode tokens per sec: 17.56
  • Cache memory used: 0.13 GB

Bits: 16

  • Decode tokens per sec: 26.09
  • Cache memory used: 0.26 GB

Is this reduction in decode throughput expected? Is it caused by the extra quantize-dequantize operations?
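
For context on the second question, here is a minimal sketch (not this repo's actual kernels; the helper names are hypothetical) of why a quantized KV cache can cost decode throughput: if the cache is stored as int4/int8 and dequantized on read, every decode step pays an elementwise dequantization over the entire cached K/V before the attention matmuls.

```python
import torch

def dequantize_cache(q_cache: torch.Tensor, scale: torch.Tensor,
                     zero_point: torch.Tensor) -> torch.Tensor:
    """Affine int dequantization: x ≈ (q - zero_point) * scale.

    Hypothetical helper for illustration; real implementations may fuse
    this into a custom attention kernel.
    """
    return (q_cache.float() - zero_point) * scale

def decode_step(query, q_keys, q_values, k_scale, k_zp, v_scale, v_zp):
    # Extra per-token work when the cache is quantized: the whole cached
    # K/V is dequantized before the two attention matmuls.
    keys = dequantize_cache(q_keys, k_scale, k_zp)      # [T, D]
    values = dequantize_cache(q_values, v_scale, v_zp)  # [T, D]
    scores = torch.softmax(query @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return scores @ values                              # [1, D]
```

If the dequantization is not fused into the attention kernel, this extra work is proportional to the cache length and runs at every decode step, which would be consistent with throughput falling at lower bit widths even while memory use drops.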
