
[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H #5371

Open

tomeras91 wants to merge 12 commits into base: main

Conversation

tomeras91
Collaborator

In a few places throughout the code, the number of available KV cache tokens and/or the maximum batch size is estimated from the available free GPU memory and the memory size of a single KV cache entry. These estimates did not account for hybrid models like Nemotron-H, in which only some of the layers are attention layers that require a KV cache.

Changes in this PR:

  • Take the Mamba cache into account when estimating the max batch size in the trtllm-bench throughput command.
  • Count only attention layers when estimating the maximum number of KV cache tokens, both in trtllm-bench throughput and in KvCacheCreator (see the sketch after this list).
  • Propagate the kv_cache_gpu_mem_fraction CLI arg of trtllm-bench throughput to the function that estimates the maximum batch size.
  • Release Mamba cache memory when MambaHybridCacheManager is shut down (plus a small refactor to improve the readability of MambaHybridCacheManager).
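
To make the attention-layer counting concrete, here is a minimal sketch of the idea (not the PR's actual code in tuning.py or _util.py). It assumes the model config exposes hybrid_override_pattern with '*' marking attention layers, as Nemotron-H-style configs do, along with the usual num_hidden_layers / num_key_value_heads / hidden_size / num_attention_heads fields; mapping.pp_layers follows the usage shown in the review snippet further down.

```python
# Minimal sketch, not the PR's implementation. Assumes '*' marks attention
# layers in hybrid_override_pattern and that the config/mapping objects
# expose the attributes used below.

def estimate_kv_cache_bytes_per_token(model_config, mapping, kv_dtype_bytes=2):
    """Per-token KV cache size, counting only the layers that have a KV cache."""
    pattern = getattr(model_config, "hybrid_override_pattern", None)
    if pattern:
        # Hybrid model (e.g. Nemotron-H): only some layers are attention layers.
        num_attention_layers = sum(1 for layer in pattern if layer == "*")
    else:
        # Pure transformer: every layer holds a KV cache.
        num_attention_layers = model_config.num_hidden_layers

    # Only the attention layers assigned to this pipeline-parallel rank count.
    local_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)

    head_dim = model_config.hidden_size // model_config.num_attention_heads
    # K and V (factor 2), per KV head, per head dimension, per local attention layer.
    return 2 * kv_dtype_bytes * model_config.num_key_value_heads * head_dim * local_attention_layers
```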

…mba cache memory estimation

Signed-off-by: Tomer Asida <[email protected]>
…Manager, and explicit call to MambaCacheManager and KVCacheManager functions in MambaHybridCacheManager to reduce confusion

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
@tomeras91 tomeras91 requested a review from a team as a code owner June 19, 2025 12:03
Copilot

This comment was marked as outdated.

…esult of is_nemotron_hybrid to increase readability

Signed-off-by: Tomer Asida <[email protected]>
@tomeras91 tomeras91 requested a review from Copilot June 19, 2025 12:41
Copilot

This comment was marked as outdated.

Signed-off-by: Tomer Asida <[email protected]>
@tomeras91 tomeras91 requested a review from Copilot June 19, 2025 12:45
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes the estimation of maximum batch size and maximum token count for the KV cache when using hybrid models like Nemotron-H, ensuring that only attention layers are considered in the KV cache estimations and that mamba cache memory is also taken into account. Key changes include:

  • Adjusting the byte-per-token calculation to count only attention layers using the hybrid override pattern.
  • Propagating the kv_cache_gpu_mem_fraction CLI argument and applying a conservative adjustment for mamba hybrid models (illustrated in the sketch after this list).
  • Refactoring resource manager methods to better handle mamba cache blocks and releasing cache memory upon shutdown.
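
A rough sketch of the batch-size estimate described above, showing where the memory fraction and the per-request Mamba state enter. The argument names (mamba_bytes_per_request, kv_bytes_per_token, target_seq_len) are assumed for illustration; this is not the formula used in tuning.py.

```python
# Illustrative sketch only; argument names are assumptions, not TRT-LLM APIs.

def estimate_max_batch_size(free_gpu_mem_bytes: int,
                            kv_cache_gpu_mem_fraction: float,
                            mamba_bytes_per_request: int,
                            kv_bytes_per_token: int,
                            target_seq_len: int) -> int:
    """Upper bound on concurrent requests given the cache memory budget."""
    # Portion of the currently free GPU memory reserved for caches.
    cache_budget = free_gpu_mem_bytes * kv_cache_gpu_mem_fraction

    # Each in-flight request needs its per-request Mamba state plus KV cache
    # entries for roughly target_seq_len tokens (attention layers only).
    bytes_per_request = mamba_bytes_per_request + kv_bytes_per_token * target_seq_len
    return max(int(cache_budget // bytes_per_request), 1)
```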

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
tensorrt_llm/bench/build/tuning.py | Adjustments to KV cache estimations and logging for hybrid models.
tensorrt_llm/bench/build/dataclasses.py | Adding hybrid_override_pattern and mamba_config fields to model configurations.
tensorrt_llm/bench/build/build.py | Propagation of kv_cache_gpu_mem_fraction into benchmark engine settings.
tensorrt_llm/bench/benchmark/utils/general.py | Passing the new CLI argument for the KV cache memory fraction.
tensorrt_llm/_torch/pyexecutor/resource_manager.py | Renaming and refactoring resource methods and adding a shutdown method for Mamba cache release (see the sketch after this table).
tensorrt_llm/_torch/pyexecutor/config_utils.py | Updating the hybrid check logic using getattr.
tensorrt_llm/_torch/pyexecutor/_util.py | Adjusting the cache size calculation using attention layers in hybrid models.
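
The resource_manager.py row mentions releasing the Mamba cache on shutdown; below is a hedged sketch of what that can look like. The field names (ssm_states, conv_states) are assumptions for illustration, not the actual MambaHybridCacheManager attributes.

```python
import torch

class MambaCacheSketch:
    """Illustration only: field names are assumed, not the PR's actual code."""

    def __init__(self, ssm_states: torch.Tensor, conv_states: torch.Tensor):
        self.ssm_states = ssm_states    # per-request SSM state buffers
        self.conv_states = conv_states  # per-request conv state buffers

    def shutdown(self) -> None:
        # Drop references so the caching allocator can reclaim the buffers,
        # then return unused cached blocks to the driver.
        self.ssm_states = None
        self.conv_states = None
        torch.cuda.empty_cache()
```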
Comments suppressed due to low confidence (1)

tensorrt_llm/bench/build/tuning.py:95

  • Consider adding an inline comment to explain the rationale behind squaring kv_cache_gpu_mem_fraction for mamba hybrid models, as it improves clarity on why a more conservative memory fraction is applied.
        kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
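
As a quick numeric check of how conservative the squaring is: a fraction of 0.9 becomes 0.81 and 0.95 becomes roughly 0.90, so hybrid models hand a noticeably smaller share of free memory to the KV cache. The extra headroom presumably absorbs the Mamba cache, though the PR does not spell out the rationale.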

Comment on lines +90 to +92

    num_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)
    mem_per_token *= num_attention_layers * head_dim

Copilot AI Jun 19, 2025


[nitpick] The variable 'num_attention_layers' is being reassigned to represent the number of mapped pipeline layers instead of its original meaning. Consider using a new variable name (e.g., 'mapped_attention_layers') to preserve clarity and avoid confusion.

Suggested change:

    - num_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)
    - mem_per_token *= num_attention_layers * head_dim
    + mapped_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)
    + mem_per_token *= mapped_attention_layers * head_dim


@tomeras91
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9520 [ run ] triggered by Bot

@tomeras91
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9532 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9520 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #9532 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6994 completed with status: 'FAILURE'
