
Proposal: new chat_template_arg enable_history_reasoning for reusing the prompt cache across queries within agents #112

@Abioy

Description


Hi Qwen Team,

Not sure if this is the right place to post this proposal, but I don't know of anywhere else to raise it for discussion.

I'm testing Qwen3.5 with some agents that involve many tool calls and multi-turn interactions. One thing that caught my attention is that the prompt cache hit rate drops every time I send a new message.

I found that this is because the chat template does not render the reasoning content of messages before the last user query into the final prompt, so those messages can no longer hit the prompt cache. This strategy helps reduce context length, and it has little impact when there are few messages between two user queries.

But in the Deep Agent case, tool calls are much more frequent and the reasoning content per message seems to be small (on the order of tens of tokens?), so the context-length savings are minimal while the overhead of re-processing these messages becomes hard to accept.

So, I think it's better to offer a choice between reducing history context length and reducing prefill latency.
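To make the trade-off concrete, here is a minimal Python sketch (not the real Qwen chat template — the tags and rendering logic are simplified stand-ins) showing why dropping historical reasoning breaks prefix reuse between turns, and why keeping it lets each new prompt extend the previous one verbatim:

```python
# Toy renderer: when keep_history_reasoning is False, reasoning blocks are
# kept only for messages after the last user turn, mimicking the default
# template behavior described above. This is a sketch, not the real template.

def render(messages, keep_history_reasoning):
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    parts = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and m.get("reasoning"):
            if keep_history_reasoning or i > last_user:
                parts.append(f"<think>{m['reasoning']}</think>")
        parts.append(f"<{m['role']}>{m['content']}</{m['role']}>")
    return "".join(parts)

turn1 = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "reasoning": "think about q1", "content": "a1"},
]
turn2 = turn1 + [{"role": "user", "content": "q2"}]

# Default behavior: the reasoning rendered in turn 1 disappears in turn 2
# (it now sits before the last user query), so the turn-2 prompt diverges
# from the cached turn-1 prompt right after the first user message.
p1 = render(turn1, keep_history_reasoning=False)
p2 = render(turn2, keep_history_reasoning=False)
assert not p2.startswith(p1)  # cache miss for everything after the divergence

# Proposed behavior: keep history reasoning, so the turn-2 prompt extends
# the turn-1 prompt verbatim and the whole turn-1 prefix stays cacheable.
q1 = render(turn1, keep_history_reasoning=True)
q2 = render(turn2, keep_history_reasoning=True)
assert q2.startswith(q1)  # full prefix reuse
```

Since prefix caches match on exact token prefixes, even a small dropped reasoning block early in the conversation invalidates the cache for every token that follows it.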

I've already submitted a PR to fix the chat template here: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/discussions/60

I'd appreciate your feedback when you have a chance. Thanks!

Reproduction

Enable thinking in an agent with multi-turn tool calls.
Send a new message and observe the prompt cache hit rate drop.

Logs

Environment Information

Known Issue

  • The issue hasn't already been addressed in Documentation, Issues, and Discussions.

Metadata


Assignees

No one assigned

    Labels

    enhancement (New feature or request)
