Replies: 2 comments
-
If you're only running one request, your KV cache is unlikely to fill up. If you want to improve generation speed, you can consider FP8 quantization or speculative decoding.
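A minimal sketch of those two options using vLLM's offline `LLM` API. This is illustrative only: the exact flag names (`kv_cache_dtype`, `speculative_model`, `num_speculative_tokens`) vary between vLLM versions, the model names are placeholders, and running it requires a GPU with the weights downloaded.

```python
from vllm import LLM, SamplingParams

# Option 1: quantize the KV cache (and optionally weights) to FP8.
# Availability depends on vLLM version and GPU architecture.
llm = LLM(
    model="Qwen/Qwen1.5-14B-Chat",   # placeholder 14B model for illustration
    kv_cache_dtype="fp8",            # store KV cache entries in FP8
    gpu_memory_utilization=0.9,      # fraction of GPU memory vLLM may reserve
)

# Option 2 (alternative): speculative decoding with a small draft model.
# Argument names differ across vLLM releases; treat these as a sketch.
# llm = LLM(
#     model="Qwen/Qwen1.5-14B-Chat",
#     speculative_model="Qwen/Qwen1.5-0.5B-Chat",  # placeholder draft model
#     num_speculative_tokens=5,
# )

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Both techniques speed up single-request decoding, which is memory-bandwidth bound: FP8 shrinks the bytes moved per token, while speculative decoding amortizes the large model over several drafted tokens per step.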
-
@tammypi Hi, how did you measure your GPU KV cache usage? I'm running benchmark_throughput locally and want to check the KV cache usage.
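vLLM periodically logs a stats line like the one quoted in the original question, so one way to track KV cache usage during a benchmark is simply to parse those lines. A small sketch, assuming the log format shown in this thread (the exact wording can differ between vLLM versions):

```python
import re

# Sample stats line in the format vLLM prints (may vary across versions).
log_line = (
    "Avg prompt throughput: 120.1 tokens/s, "
    "Avg generation throughput: 41.9 tokens/s, "
    "Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, "
    "GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%."
)

def parse_kv_cache_usage(line):
    """Extract GPU/CPU KV cache usage percentages from a vLLM stats line."""
    m = re.search(
        r"GPU KV cache usage: ([\d.]+)%, CPU KV cache usage: ([\d.]+)%", line
    )
    if m is None:
        return None
    return {"gpu": float(m.group(1)), "cpu": float(m.group(2))}

print(parse_kv_cache_usage(log_line))  # {'gpu': 1.0, 'cpu': 0.0}
```

Alternatively, vLLM's OpenAI-compatible server exposes a Prometheus `/metrics` endpoint with cache-usage gauges, which avoids log scraping.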
-
Your current environment
How would you like to use vLLM
1. Model size: 14B
2. Performance: average prompt throughput: 120.1 tokens/s, average generation throughput: 41.9 tokens/s, running: 1 request, swapped: 0 requests, pending: 0 requests, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.
The generation throughput is too slow. I suspect the cause is the low GPU KV cache usage. How can I increase the GPU KV cache usage and improve the generation throughput?