[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H #5371
Conversation
Commits:
- …mba cache memory estimation
- …ench
- …CacheCreator
- …-bench throughput command
- …MambaHybridCacheManager)
- …Manager, and explicit call to MambaCacheManager and KVCacheManager functions in MambaHybridCacheManager to reduce confusion
- …esult of is_nemotron_hybrid to increase readability
Pull Request Overview
This PR fixes the estimation of maximum batch size and maximum token count for the KV cache when using hybrid models like Nemotron-H, ensuring that only attention layers are considered in the KV cache estimations and that mamba cache memory is also taken into account. Key changes include:
- Adjusting the byte-per-token calculation to count only attention layers using the hybrid override pattern (see the sketch after this list).
- Propagating the kv_cache_gpu_mem_fraction CLI argument and applying a conservative adjustment for mamba hybrid models.
- Refactoring resource manager methods to better handle mamba cache blocks and releasing cache memory upon shutdown.
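A minimal sketch of the first change above (a hypothetical helper, not the PR's code), assuming a Nemotron-H style `hybrid_override_pattern` in which `'*'` marks attention layers:

```python
# Hypothetical sketch: count only attention layers when estimating KV-cache
# bytes per token for a hybrid model such as Nemotron-H.
def kv_cache_bytes_per_token(hybrid_override_pattern: str | None,
                             num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             kv_dtype_bytes: int = 2) -> int:
    if hybrid_override_pattern:
        # Assumption: '*' denotes an attention layer in the override pattern.
        num_attention_layers = hybrid_override_pattern.count("*")
    else:
        # Non-hybrid models: every layer holds a KV cache.
        num_attention_layers = num_layers
    # Factor 2 covers the K and V tensors of each attention layer.
    return 2 * num_attention_layers * num_kv_heads * head_dim * kv_dtype_bytes
```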
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| tensorrt_llm/bench/build/tuning.py | Adjustments to KV cache estimations and logging for hybrid models. |
| tensorrt_llm/bench/build/dataclasses.py | Adding hybrid_override_pattern and mamba_config fields to model configurations. |
| tensorrt_llm/bench/build/build.py | Propagation of kv_cache_gpu_mem_fraction into benchmark engine settings. |
| tensorrt_llm/bench/benchmark/utils/general.py | Passing the new CLI argument for kv_cache memory fraction. |
| tensorrt_llm/_torch/pyexecutor/resource_manager.py | Renaming and refactoring resource methods and adding a shutdown method for mamba cache release. |
| tensorrt_llm/_torch/pyexecutor/config_utils.py | Updating hybrid check logic using getattr. |
| tensorrt_llm/_torch/pyexecutor/_util.py | Adjusting the cache size calculation using attention layers in hybrid models. |
Comments suppressed due to low confidence (1)
tensorrt_llm/bench/build/tuning.py:95
- Consider adding an inline comment to explain the rationale behind squaring kv_cache_gpu_mem_fraction for mamba hybrid models, as it improves clarity on why a more conservative memory fraction is applied.
kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
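As a worked example of that line, a fraction of 0.9 becomes 0.9 × 0.9 = 0.81 after squaring, so the estimate reserves noticeably more free-memory headroom on mamba hybrid models.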
Referenced code:

    num_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)
    mem_per_token *= num_attention_layers * head_dim

[nitpick] The variable 'num_attention_layers' is being reassigned to represent the number of mapped pipeline layers instead of its original meaning. Consider using a new variable name (e.g., 'mapped_attention_layers') to preserve clarity and avoid confusion.

Suggested change:

    mapped_attention_layers = max(len(mapping.pp_layers(num_attention_layers)), 1)
    mem_per_token *= mapped_attention_layers * head_dim
/bot run
PR_Github #9520 [ run ] triggered by Bot
/bot run
PR_Github #9532 [ run ] triggered by Bot
PR_Github #9520 [ run ] completed with state
PR_Github #9532 [ run ] completed with state
Throughout the code there are a few places where the number of available KV cache tokens and/or the maximum batch size is estimated. These estimates are based on the available free GPU memory and on the memory size of a single KV cache entry. They did not account for hybrid models like Nemotron-H, where not every layer is an attention layer that requires a KV cache.
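As a rough illustration of that budgeting (hypothetical names and simplified logic, not the code changed in this PR):

```python
# Hypothetical sketch of how the available KV-cache token budget could be
# estimated once hybrid models are handled: only attention layers contribute
# to the per-token cost, and the mamba cache is reserved up front.
def estimate_max_kv_tokens(free_gpu_mem_bytes: int,
                           kv_cache_gpu_mem_fraction: float,
                           kv_bytes_per_token: int,
                           mamba_cache_bytes_per_seq: int,
                           max_batch_size: int,
                           is_mamba_hybrid: bool) -> int:
    if is_mamba_hybrid:
        # More conservative fraction for hybrid models (squared, as in this PR).
        kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
    budget = free_gpu_mem_bytes * kv_cache_gpu_mem_fraction
    # The mamba cache is per sequence, so reserve it for the full batch.
    budget -= mamba_cache_bytes_per_seq * max_batch_size
    return max(int(budget // kv_bytes_per_token), 0)
```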
Changes in this PR:
- Count only attention layers in the KV cache size estimation of the `trtllm-bench throughput` command.
- Take mamba cache memory into account in the KV cache memory estimations in `trtllm-bench throughput` and in `KvCacheCreator`.
- Propagate `kv_cache_gpu_mem_fraction` from the `trtllm-bench throughput` CLI arg to the function that estimates the maximum batch size.
- Release mamba cache memory when `MambaHybridCacheManager` is shut down (+ a small refactor to increase readability of `MambaHybridCacheManager`); see the sketch below.
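A minimal sketch of that shutdown behavior (hypothetical class and attribute names, not the actual `MambaHybridCacheManager`): the manager drops its preallocated mamba state buffers so their GPU memory can be reclaimed.

```python
import torch

# Hypothetical sketch: a cache manager that frees its per-layer mamba state
# buffers on shutdown so the GPU memory can be reclaimed.
class MambaCacheSketch:
    def __init__(self, num_mamba_layers: int, max_batch_size: int,
                 state_dim: int, device: str = "cuda"):
        # One state buffer per mamba layer, sized for the whole batch.
        self._states = [
            torch.empty((max_batch_size, state_dim), device=device)
            for _ in range(num_mamba_layers)
        ]

    def shutdown(self) -> None:
        # Drop the references so the caching allocator can reuse the memory,
        # then return freed blocks to the driver.
        self._states.clear()
        torch.cuda.empty_cache()
```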