Add Node to update KV cache in Stateful LLM model #872
Force-pushed from 899feb5 to 5432bd4.
Pull request overview
This PR adds support for KV cache reordering in the OpenVINO stateful LLM model to enable tree-based speculative decoding. It introduces a new subgraph pattern (Gather + ScatterElementsUpdate) that allows OpenVINO to perform KV cache reordering during inference, which can be optimized out by the GPU if not needed.
Key changes:
- Adds new graph nodes (`src_idx`, `dst_idx` parameters and `Gather`/`ScatterElementsUpdate` operations) to enable KV cache manipulation
- Implements `ReorderKVCache` API across the backend stack with parsing logic for comma-separated index pairs
- Stores reorder indices in `StatefulOVInferRequest` for processing during inference
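To make the Gather + ScatterElementsUpdate pattern concrete, here is a minimal host-side sketch of the reorder semantics. It is not the code from this PR and does not use the OpenVINO API: the cache is simplified to a 2D `[seq_len][hidden]` array, and `ReorderRows`, `Row`, and `Cache` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Row = std::vector<float>;
using Cache = std::vector<Row>;  // simplified KV cache: one row per sequence position

// Hypothetical helper: copy the rows at src_idx into the positions in dst_idx.
// All rows are gathered before any row is written, so overlapping source and
// destination positions still read the original values.
Cache ReorderRows(const Cache& cache,
                  const std::vector<std::size_t>& src_idx,
                  const std::vector<std::size_t>& dst_idx) {
  if (src_idx.size() != dst_idx.size()) {
    throw std::invalid_argument("src_idx and dst_idx must have the same length");
  }

  // Gather step: read the selected rows from the unmodified cache.
  std::vector<Row> gathered;
  gathered.reserve(src_idx.size());
  for (std::size_t s : src_idx) {
    gathered.push_back(cache.at(s));  // at() throws std::out_of_range on a bad index
  }

  // Scatter step: write the gathered rows into their destination positions.
  Cache updated = cache;
  for (std::size_t i = 0; i < dst_idx.size(); ++i) {
    updated.at(dst_idx[i]) = gathered[i];
  }
  return updated;
}
```

With empty index lists the function returns the cache unchanged, which mirrors the case where no reorder is requested.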
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| ov_stateful_patch_utils.h | Adds opset12 include for ScatterElementsUpdate operation |
| ov_stateful_patch_utils.cc | Implements the new KV cache reorder subgraph with src_idx/dst_idx parameters and Gather/ScatterElementsUpdate nodes |
| ov_interface.h | Declares ReorderKVCache method and adds member variables for storing reorder indices |
| ov_interface.cc | Implements ReorderKVCache with index validation and tensor population logic using hardcoded shape values |
| openvino_execution_provider.cc | Adds kvcache_reorder option parsing to convert semicolon-delimited string format into index vectors |
| ibackend.h | Adds virtual ReorderKVCache method to IBackend interface |
| basic_backend.h/cc | Implements ReorderKVCache to propagate calls to inference request pool |
| backend_manager.h/cc | Implements ReorderKVCache as pass-through to concrete backend |
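The option parsing summarized for openvino_execution_provider.cc could look roughly like the sketch below. The exact string format is not shown on this page, so the sketch assumes `src,dst` pairs separated by semicolons (for example `2,1;3,2`); `ParseReorderPairs` and `ReorderIndices` are hypothetical names, not identifiers from the PR.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the two index vectors.
struct ReorderIndices {
  std::vector<std::size_t> src;
  std::vector<std::size_t> dst;
};

// Parse an ASSUMED format "src,dst;src,dst;..." into two parallel vectors.
// std::stoul throws std::invalid_argument when a field is not numeric.
ReorderIndices ParseReorderPairs(const std::string& text) {
  ReorderIndices out;
  std::istringstream stream(text);
  std::string pair;
  while (std::getline(stream, pair, ';')) {
    if (pair.empty()) continue;  // tolerate a trailing ';'
    const std::size_t comma = pair.find(',');
    if (comma == std::string::npos) {
      throw std::invalid_argument("expected 'src,dst' but got: " + pair);
    }
    out.src.push_back(std::stoul(pair.substr(0, comma)));
    out.dst.push_back(std::stoul(pair.substr(comma + 1)));
  }
  return out;
}
```

Keeping the two vectors parallel makes it straightforward to hand them to separate `src_idx` and `dst_idx` inputs later.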
move fuse flag in exenetwork.
Force-pushed from 4e7c895 to 1665865.
RyanMetcalfeInt8
left a comment
LGTM
Thanks @Kotomi-Du & @ZackyLake, I've approved, but please resolve one small typo for the Lint warning above.
Thanks, Ryan. We will address that and test the latest version again before merging!
Thanks @Kotomi-Du & @ZackyLake, can you also address some of those other Lint warnings? I think you'll just need to include a couple more headers. e.g.
Confirmed, the Phi-Silica tests on GPU pass with the latest change.
Sorry, @RyanMetcalfeInt8, I missed the comment you just updated and merged the PR. We can open a new PR for that or address this warning in a future PR.
Description
This PR adds a small subgraph, `Gather + ScatterElementsUpdate`, for the KV cache so that OpenVINO can reorder the KV cache during model inference. OV GPU optimizes this pattern out when no reorder information is provided (done in OV 33114).
The graph below shows how the PR changes an ONNX model when the makeStateful() path is triggered.
Motivation and Context
The Microsoft Phi-Silica application leverages tree-based speculative decoding to accelerate LLM inference. This technique requires frequent manipulation of past KV cache states (e.g. trimming, reordering). This is because only a single branch of the speculative draft tree is accepted after verification.
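As an illustration of why a reorder is needed after verification, the sketch below builds source and destination positions that compact an accepted draft branch into contiguous cache slots. The layout (draft tokens appended after `base_len` committed positions) and the names `BuildCompactionIndices` and `CompactionPlan` are assumptions for this example, not details taken from Phi-Silica or this PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type: parallel source and destination positions.
struct CompactionPlan {
  std::vector<std::size_t> src;
  std::vector<std::size_t> dst;
};

// base_len: number of cache positions already committed before the draft tree.
// accepted_slots: offsets (relative to base_len) of the accepted draft tokens,
// listed in sequence order. ASSUMED layout: draft slot k lives at base_len + k.
CompactionPlan BuildCompactionIndices(std::size_t base_len,
                                      const std::vector<std::size_t>& accepted_slots) {
  CompactionPlan plan;
  for (std::size_t i = 0; i < accepted_slots.size(); ++i) {
    const std::size_t from = base_len + accepted_slots[i];
    const std::size_t to = base_len + i;  // next contiguous position
    if (from != to) {                     // rows already in place need no move
      plan.src.push_back(from);
      plan.dst.push_back(to);
    }
  }
  return plan;
}
```

After the move, everything past `base_len + accepted_slots.size()` belongs to rejected branches and can be trimmed.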
On the other hand, the KV cache API currently available in OV is too slow to meet MSFT requirements; details are in CVS-174809. As the OV team suggested, the only way to support the reorder feature is to add specific nodes to the original graph, which is the purpose of this PR.
Open
Does the feature go into the new ABI?
Yes
Jira Ticket:
CVS-176367