Description
Hi Qwen Team,
Not sure if this is the right place to post this proposal, but I don't know of anywhere else to post it and have a discussion.
I'm testing Qwen3.5 with some agents that involve many tool calls and multi-turn interactions. One thing that catches my attention is that the prompt cache hit rate drops every time I send a new message.
I found that this is because the chat template does not render the reasoning parts that appear before the last user message into the final prompt. As a result, the rendered prompt for a new turn is no longer a prefix extension of the previous one, so those messages cannot hit the prompt cache. This strategy helps reduce context length, and it has little impact when there are few messages between two user queries.
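To make the cache behavior concrete, here is a minimal sketch. The `render` function and tags below are hypothetical simplifications of the actual Qwen chat template, but they illustrate why omitting earlier reasoning blocks breaks prefix caching:

```python
# Hypothetical, simplified renderer (NOT the real Qwen template) showing
# how dropping reasoning from history changes the rendered prompt prefix.

def render(messages, keep_history_reasoning):
    parts = []
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    for i, m in enumerate(messages):
        if m["role"] == "assistant":
            think = m.get("reasoning_content", "")
            # Mimics the current template behavior: reasoning that appears
            # before the last user message is omitted from the prompt.
            if think and (keep_history_reasoning or i > last_user):
                parts.append(
                    f"<|im_start|>assistant\n<think>\n{think}\n</think>\n"
                    f"{m['content']}<|im_end|>"
                )
            else:
                parts.append(f"<|im_start|>assistant\n{m['content']}<|im_end|>")
        else:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(parts)

turn1 = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello", "reasoning_content": "plan..."},
]
turn2 = turn1 + [{"role": "user", "content": "next question"}]

# With history reasoning stripped, turn 2's prompt no longer extends
# turn 1's prompt, so turn 1's KV cache cannot be reused.
print(render(turn2, False).startswith(render(turn1, False)))  # False: cache miss

# Keeping reasoning in history preserves the shared prefix.
print(render(turn2, True).startswith(render(turn1, True)))    # True: cache hit
```

In the first case the assistant turn was rendered with its `<think>` block while it was the latest turn, then re-rendered without it once a new user message arrived, which is exactly the prefix divergence that invalidates the cache.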
But in the Deep Agent case, where tool calls are much more frequent, each reasoning block tends to be small (perhaps tens of tokens), so the savings are marginal and the overhead of re-processing these messages becomes much harder to justify.
So, I think it's better to offer a choice between reducing history context length and reducing prefill latency.
I've already submitted a PR to fix the chat template here: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/discussions/60
I'd appreciate your feedback when you have a chance. Thanks!
Reproduction
1. Enable thinking in an agent that makes many tool calls across multiple turns.
2. Send a new message and observe the prompt cache hit rate drop.
Logs
Environment Information
Known Issue