Commit e16c1be

[fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation (#5343)
Signed-off-by: Hui Gao <[email protected]>
1 parent 58a8a8f commit e16c1be

File tree: 1 file changed, +6 −0 lines


tensorrt_llm/_torch/pyexecutor/_util.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -151,7 +151,13 @@ def _get_token_num_for_estimation(self) -> int:
         # estimate_max_kv_cache_tokens submits self._dummy_reqs
         num_cache_blocks = 0
         num_extra_tokens_per_seq = 1  # account for generated tokens
+        pytorch_backend_config = executor_config.pytorch_backend_config
         spec_cfg = executor_config.speculative_config
+        if not pytorch_backend_config.disable_overlap_scheduler:
+            num_extra_tokens_per_seq = num_extra_tokens_per_seq + 1
+            if spec_cfg is not None:
+                num_extra_tokens_per_seq += spec_cfg.max_draft_tokens
+
         if spec_cfg is not None:
             num_extra_tokens_per_seq += spec_cfg.max_draft_tokens
             num_extra_tokens_per_seq += spec_cfg.num_extra_kv_tokens
```
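To make the fix's arithmetic concrete, here is a minimal standalone sketch of the per-sequence extra-token accounting the commit introduces. It mirrors the names in the diff, but the function signature and the plain arguments standing in for `pytorch_backend_config` and `spec_cfg` are illustrative stand-ins, not the real TensorRT-LLM API.

```python
from typing import Optional


def extra_tokens_per_seq(disable_overlap_scheduler: bool,
                         max_draft_tokens: Optional[int] = None,
                         num_extra_kv_tokens: int = 0) -> int:
    """Sketch of the per-sequence token padding used during KV-cache
    memory estimation (names follow the diff; config objects are
    replaced by plain arguments for illustration)."""
    extra = 1  # account for generated tokens
    # With overlap scheduling enabled, one additional decoding step can
    # be in flight, so reserve one more token per sequence, plus the
    # draft tokens when speculative decoding is configured.
    if not disable_overlap_scheduler:
        extra += 1
        if max_draft_tokens is not None:
            extra += max_draft_tokens
    # Baseline speculative-decoding accounting (pre-existing code path).
    if max_draft_tokens is not None:
        extra += max_draft_tokens
        extra += num_extra_kv_tokens
    return extra


# No overlap, no speculation: just the one generated token.
print(extra_tokens_per_seq(disable_overlap_scheduler=True))        # 1
# Overlap enabled, no speculation: the commit's extra +1.
print(extra_tokens_per_seq(disable_overlap_scheduler=False))       # 2
# Overlap + speculation: draft tokens are counted twice (once for the
# in-flight overlap step), plus the extra KV tokens.
print(extra_tokens_per_seq(False, max_draft_tokens=3,
                           num_extra_kv_tokens=2))                  # 10
```

The double-counting of `max_draft_tokens` under overlap scheduling is the point of the fix: without it, the estimator under-reserved KV-cache tokens for the step that runs concurrently with the scheduler.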
