Testing the impact of KV cache quantization on the performance of the Llama-2 model shows a decrease in tokens/sec as the cache bit width is reduced, although the expected reduction in cache memory is observed.
Command: python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16
| Bits | Decode tokens/sec | Cache memory used |
|------|-------------------|-------------------|
| 4    | 13.57             | 0.07 GB           |
| 8    | 17.56             | 0.13 GB           |
| 16   | 26.09             | 0.26 GB           |
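For reference, the memory numbers scale roughly as expected: going from 16-bit to 8-bit halves the cache (0.26 GB to 0.13 GB), while 4-bit lands at 0.07 GB rather than an exact quarter of 0.26 GB (~0.065 GB), presumably because quantization metadata such as scales/zero-points is stored alongside the packed values (that last part is my assumption, not something I verified in the code).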
Is this reduction in performance expected? Is it caused by the extra quantize/dequantize operations?
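To make the question concrete, here is a minimal, hypothetical sketch in plain PyTorch (not the actual generate.py / cache implementation) of what a quantized KV cache has to do on every decode step: quantize the newly generated K/V slice before storing it, and dequantize the stored cache back to fp16 before attention can read it. The `quantize_per_channel`/`dequantize` helpers, shapes, and 8-bit setting are illustrative assumptions.

```python
import torch

def quantize_per_channel(x, n_bits=8):
    """Symmetric per-channel quantization over the last dim (illustrative only)."""
    qmax = 2 ** (n_bits - 1) - 1
    x32 = x.float()
    scale = x32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / qmax
    q = torch.clamp(torch.round(x32 / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Expand the int8 cache back to fp16 so attention can consume it."""
    return (q.float() * scale).to(torch.float16)

# Toy decode loop: with a quantized cache, every step pays for (1) quantizing
# the new K/V entry and (2) dequantizing the stored cache before attention,
# work a plain fp16 cache never does.
k_cache_q, k_scales = [], []
for step in range(8):
    k_new = torch.randn(1, 32, 1, 128, dtype=torch.float16)  # (batch, heads, 1, head_dim)
    q, s = quantize_per_channel(k_new, n_bits=8)             # extra op vs. fp16 cache
    k_cache_q.append(q)
    k_scales.append(s)
    k_full = dequantize(torch.cat(k_cache_q, dim=2),         # extra op vs. fp16 cache
                        torch.cat(k_scales, dim=2))
    # ... attention would read k_full (and a matching v_full) here ...
```

If this roughly matches what the full cache strategy does internally, the throughput drop would come from these extra elementwise passes (and losing fused fp16 attention paths) rather than from the smaller memory footprint itself, but I would like to confirm that.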