[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H #5371
Conversation
Commits:
- …mba cache memory estimation
- …ench
- …CacheCreator
- …-bench throughput command
- …MambaHybridCacheManager)
- …Manager, and explicit call to MambaCacheManager and KVCacheManager functions in MambaHybridCacheManager to reduce confusion
- …esult of is_nemotron_hybrid to increase readability
Pull Request Overview
This PR fixes the estimation of maximum batch size and maximum token count for the KV cache when using hybrid models like Nemotron-H, ensuring that only attention layers are considered in the KV cache estimations and that mamba cache memory is also taken into account. Key changes include:
- Adjusting the byte-per-token calculation to count only attention layers using the hybrid override pattern (see the sketch after this list).
- Propagating the kv_cache_gpu_mem_fraction CLI argument and applying a conservative adjustment for mamba hybrid models.
- Refactoring resource manager methods to better handle mamba cache blocks and releasing cache memory upon shutdown.
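A minimal sketch of the first change above (a hypothetical helper, not the PR's code), assuming a Nemotron-H style `hybrid_override_pattern` in which `'*'` marks attention layers:

```python
# Hypothetical sketch: count only attention layers when estimating KV-cache
# bytes per token for a hybrid model such as Nemotron-H.
def kv_cache_bytes_per_token(hybrid_override_pattern: str | None,
                             num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             kv_dtype_bytes: int = 2) -> int:
    if hybrid_override_pattern:
        # Assumption: '*' denotes an attention layer in the override pattern.
        num_attention_layers = hybrid_override_pattern.count("*")
    else:
        # Non-hybrid models: every layer holds a KV cache.
        num_attention_layers = num_layers
    # Factor 2 covers the K and V tensors of each attention layer.
    return 2 * num_attention_layers * num_kv_heads * head_dim * kv_dtype_bytes
```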
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| tensorrt_llm/bench/build/tuning.py | Adjustments to KV cache estimations and logging for hybrid models. |
| tensorrt_llm/bench/build/dataclasses.py | Adding hybrid_override_pattern and mamba_config fields to model configurations. |
| tensorrt_llm/bench/build/build.py | Propagation of kv_cache_gpu_mem_fraction into benchmark engine settings. |
| tensorrt_llm/bench/benchmark/utils/general.py | Passing the new CLI argument for kv_cache memory fraction. |
| tensorrt_llm/_torch/pyexecutor/resource_manager.py | Renaming and refactoring resource methods and adding a shutdown method for mamba cache release. |
| tensorrt_llm/_torch/pyexecutor/config_utils.py | Updating hybrid check logic using getattr. |
| tensorrt_llm/_torch/pyexecutor/_util.py | Adjusting the cache size calculation using attention layers in hybrid models. |
Comments suppressed due to low confidence (1)
tensorrt_llm/bench/build/tuning.py:95
- Consider adding an inline comment to explain the rationale behind squaring kv_cache_gpu_mem_fraction for mamba hybrid models, as it improves clarity on why a more conservative memory fraction is applied.
kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
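As a worked example of that line, a fraction of 0.9 becomes 0.9 × 0.9 = 0.81 after squaring, so the estimate reserves noticeably more free-memory headroom on mamba hybrid models.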
Referenced code:

    num_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)
    mem_per_token *= num_attention_layers * head_dim

[nitpick] The variable 'num_attention_layers' is being reassigned to represent the number of mapped pipeline layers instead of its original meaning. Consider using a new variable name (e.g., 'mapped_attention_layers') to preserve clarity and avoid confusion.

Suggested change:

    mapped_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)
    mem_per_token *= mapped_attention_layers * head_dim
/bot run
PR_Github #9520 [ run ] triggered by Bot
/bot run
PR_Github #9532 [ run ] triggered by Bot
PR_Github #9520 [ run ] completed with state
PR_Github #9532 [ run ] completed with state
Throughout the code there are a few places where the number of available KV cache tokens and/or the maximum batch size is estimated. These estimates are based on the available free GPU memory and on the memory size of a single KV cache entry. They did not account for hybrid models like Nemotron-H, where not every layer is an attention layer that requires a KV cache.
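As a rough illustration of that budgeting (hypothetical names and simplified logic, not the code changed in this PR):

```python
# Hypothetical sketch of how the available KV-cache token budget could be
# estimated once hybrid models are handled: only attention layers contribute
# to the per-token cost, and the mamba cache is reserved up front.
def estimate_max_kv_tokens(free_gpu_mem_bytes: int,
                           kv_cache_gpu_mem_fraction: float,
                           kv_bytes_per_token: int,
                           mamba_cache_bytes_per_seq: int,
                           max_batch_size: int,
                           is_mamba_hybrid: bool) -> int:
    if is_mamba_hybrid:
        # More conservative fraction for hybrid models (squared, as in this PR).
        kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
    budget = free_gpu_mem_bytes * kv_cache_gpu_mem_fraction
    # The mamba cache is per sequence, so reserve it for the full batch.
    budget -= mamba_cache_bytes_per_seq * max_batch_size
    return max(int(budget // kv_bytes_per_token), 0)
```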
Changes in this PR:
- Count only attention layers in the KV cache size estimation of the `trtllm-bench throughput` command.
- Take mamba cache memory into account in the KV cache memory estimations in `trtllm-bench throughput` and in `KvCacheCreator`.
- Propagate `kv_cache_gpu_mem_fraction` from the `trtllm-bench throughput` CLI arg to the function that estimates the maximum batch size.
- Release mamba cache memory when `MambaHybridCacheManager` is shut down (+ a small refactor to increase readability of `MambaHybridCacheManager`); see the sketch below.
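A minimal sketch of that shutdown behavior (hypothetical class and attribute names, not the actual `MambaHybridCacheManager`): the manager drops its preallocated mamba state buffers so their GPU memory can be reclaimed.

```python
import torch

# Hypothetical sketch: a cache manager that frees its per-layer mamba state
# buffers on shutdown so the GPU memory can be reclaimed.
class MambaCacheSketch:
    def __init__(self, num_mamba_layers: int, max_batch_size: int,
                 state_dim: int, device: str = "cuda"):
        # One state buffer per mamba layer, sized for the whole batch.
        self._states = [
            torch.empty((max_batch_size, state_dim), device=device)
            for _ in range(num_mamba_layers)
        ]

    def shutdown(self) -> None:
        # Drop the references so the caching allocator can reuse the memory,
        # then return freed blocks to the driver.
        self._states.clear()
        torch.cuda.empty_cache()
```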