Add Node to update KV cache in Stateful LLM model #872

Kotomi-Du · 2025-12-03T20:15:19Z

Description

This PR adds a small Gather + ScatterElementsUpdate subgraph for the KV cache so that OpenVINO can reorder the KV cache during model inference. The OV GPU plugin optimizes this pattern out when no reorder information is provided (done in OV 33114).

The graph below shows how the PR affects an ONNX model when the makeStateful() path is triggered.

Motivation and Context

The Microsoft Phi-Silica application leverages tree-based speculative decoding to accelerate LLM inference. This technique requires frequent manipulation of past KV cache states (e.g. trimming, reordering). This is because only a single branch of the speculative draft tree is accepted after verification.

On the other hand, the KV cache API currently available in OV is too slow to meet MSFT requirements (details in CVS-174809). As the OV team suggested, the only way to support the reorder feature is to add specific nodes to the original graph, which is what this PR does.

Open

If NPU does not want this path, a device-specific flag has to be added.

Does the feature go to the new ABI?

Yes

Jira Ticket:

CVS-176367

onnxruntime/core/providers/openvino/ov_interface.cc

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

Copilot

Pull request overview

This PR adds support for KV cache reordering in the OpenVINO stateful LLM model to enable tree-based speculative decoding. It introduces a new subgraph pattern (Gather + ScatterElementsUpdate) that allows OpenVINO to perform KV cache reordering during inference, which can be optimized out by the GPU if not needed.
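The effect of the Gather + ScatterElementsUpdate pattern can be illustrated on the CPU with a minimal sketch. It assumes one KV cache slice stored row-major as [seq_len, head_dim] and a reorder along the sequence axis; `ReorderRows` and its parameters are hypothetical names for illustration and are not taken from the PR or from the OpenVINO API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of the PR): emulates a gather of the rows
// listed in src_idx followed by a scatter of those rows to dst_idx, along
// the sequence axis of a cache slice laid out as [seq_len, head_dim].
std::vector<float> ReorderRows(const std::vector<float>& cache,
                               std::size_t seq_len,
                               std::size_t head_dim,
                               const std::vector<std::int64_t>& src_idx,
                               const std::vector<std::int64_t>& dst_idx) {
  if (cache.size() != seq_len * head_dim) {
    throw std::invalid_argument("cache size does not match seq_len * head_dim");
  }
  if (src_idx.size() != dst_idx.size()) {
    throw std::invalid_argument("src_idx and dst_idx must have equal length");
  }

  // Gather: read every source row from the unmodified cache first, so a
  // later write cannot change what an earlier pair intended to copy.
  std::vector<float> gathered(src_idx.size() * head_dim);
  for (std::size_t i = 0; i < src_idx.size(); ++i) {
    const std::int64_t s = src_idx[i];
    if (s < 0 || static_cast<std::size_t>(s) >= seq_len) {
      throw std::out_of_range("source index outside the cache");
    }
    std::copy_n(cache.begin() + static_cast<std::size_t>(s) * head_dim, head_dim,
                gathered.begin() + i * head_dim);
  }

  // Scatter: write the gathered rows to their destination positions.
  std::vector<float> result = cache;
  for (std::size_t i = 0; i < dst_idx.size(); ++i) {
    const std::int64_t d = dst_idx[i];
    if (d < 0 || static_cast<std::size_t>(d) >= seq_len) {
      throw std::out_of_range("destination index outside the cache");
    }
    std::copy_n(gathered.begin() + i * head_dim, head_dim,
                result.begin() + static_cast<std::size_t>(d) * head_dim);
  }
  return result;
}
```

With empty index lists the function returns an unchanged copy of the cache, which loosely mirrors the case where no reorder information is supplied.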

Key changes:

Adds new graph nodes (src_idx, dst_idx parameters and Gather/ScatterElementsUpdate operations) to enable KV cache manipulation
Implements ReorderKVCache API across the backend stack with parsing logic for comma-separated index pairs
Stores reorder indices in StatefulOVInferRequest for processing during inference
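The index-pair parsing mentioned above could look roughly like the following sketch. The exact string format is an assumption here: pairs written as `src,dst` and separated by `;` (for example `3,1;5,2`). `ReorderIndices`, `ParseIndex`, and `ParseReorderPairs` are hypothetical names and do not reflect the PR's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the parsed result (not from the PR).
struct ReorderIndices {
  std::vector<std::int64_t> src;
  std::vector<std::int64_t> dst;
};

// Parses one non-negative integer and rejects trailing characters.
std::int64_t ParseIndex(const std::string& token) {
  std::size_t used = 0;
  const long long value = std::stoll(token, &used);  // throws on non-numeric input
  if (used != token.size() || value < 0) {
    throw std::invalid_argument("invalid index '" + token + "'");
  }
  return static_cast<std::int64_t>(value);
}

// Assumed format: "src,dst;src,dst;...". Empty segments (such as one left
// by a trailing ';') are skipped.
ReorderIndices ParseReorderPairs(const std::string& text) {
  ReorderIndices out;
  std::size_t pos = 0;
  while (pos < text.size()) {
    std::size_t end = text.find(';', pos);
    if (end == std::string::npos) end = text.size();
    const std::string pair = text.substr(pos, end - pos);
    pos = end + 1;
    if (pair.empty()) continue;

    const std::size_t comma = pair.find(',');
    if (comma == std::string::npos) {
      throw std::invalid_argument("expected 'src,dst' but got '" + pair + "'");
    }
    out.src.push_back(ParseIndex(pair.substr(0, comma)));
    out.dst.push_back(ParseIndex(pair.substr(comma + 1)));
  }
  return out;
}
```

Under these assumptions, `ParseReorderPairs("3,1;5,2")` yields source indices 3 and 5 and destination indices 1 and 2. Keeping the two lists as separate vectors matches the way the summary describes separate src_idx and dst_idx inputs.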

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

File	Description
ov_stateful_patch_utils.h	Adds opset12 include for ScatterElementsUpdate operation
ov_stateful_patch_utils.cc	Implements the new KV cache reorder subgraph with src_idx/dst_idx parameters and Gather/ScatterElementsUpdate nodes
ov_interface.h	Declares ReorderKVCache method and adds member variables for storing reorder indices
ov_interface.cc	Implements ReorderKVCache with index validation and tensor population logic using hardcoded shape values
openvino_execution_provider.cc	Adds kvcache_reorder option parsing to convert semicolon-delimited string format into index vectors
ibackend.h	Adds virtual ReorderKVCache method to IBackend interface
basic_backend.h/cc	Implements ReorderKVCache to propagate calls to inference request pool
backend_manager.h/cc	Implements ReorderKVCache as pass-through to concrete backend

onnxruntime/core/providers/openvino/ov_stateful_patch_utils.cc

onnxruntime/core/providers/openvino/ov_interface.h

move fuse flag in exenetwork.

RyanMetcalfeInt8

LGTM

RyanMetcalfeInt8 · 2026-01-06T12:48:50Z

Thanks @Kotomi-Du & @ZackyLake, I've approved -- But please resolve one small typo for Lint warning above.

Kotomi-Du · 2026-01-06T17:49:13Z

> Thanks @Kotomi-Du & @ZackyLake, I've approved -- But please resolve one small typo for Lint warning above.

Thanks, Ryan. We will address that and test the latest version again before merging!

RyanMetcalfeInt8 · 2026-01-07T01:26:42Z

Thanks @Kotomi-Du & @ZackyLake, can you also address some of those other Lint warnings? I think you'll just need to include a couple more headers. e.g.

#include <vector>
#include <limits>

Kotomi-Du · 2026-01-07T01:43:38Z

Confirmed: the Phi-Silica tests on GPU pass with the latest change.

Kotomi-Du · 2026-01-07T01:46:06Z

> Thanks @Kotomi-Du & @ZackyLake, can you also address some of those other Lint warnings? I think you'll just need to include a couple more headers. e.g.
> #include <vector>
> #include <limits>

Sorry, @RyanMetcalfeInt8, I missed the comment you just updated and merged the PR. We can open a new PR for that, or address this warning in our future PRs.

Kotomi-Du marked this pull request as draft December 3, 2025 20:15

mdvoretc-intel reviewed Dec 4, 2025

Kotomi-Du force-pushed the update_kvcache_node branch 2 times, most recently from 899feb5 to 5432bd4 Compare December 6, 2025 01:32

Kotomi-Du marked this pull request as ready for review December 9, 2025 05:03

Kotomi-Du requested review from MayureshV1 and RyanMetcalfeInt8 December 9, 2025 05:03

RyanMetcalfeInt8 requested a review from Copilot December 9, 2025 17:16