[Model Runner][Performance] Cache the judgement result of is_encoder_decoder to decrease framework overhead (#138)

In Model Runner, is_encoder_decoder is extracted from model_config to
determine whether vLLM is running enc-dec models. Obtaining this status
requires a long call stack, so the CPU overhead is high. This PR
therefore caches the status in __init__ of ModelInputForNPUBuilder.

Signed-off-by: hw_whx <[email protected]>
Co-authored-by: hw_whx <[email protected]>
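
For illustration, here is a minimal, self-contained sketch of the pattern this
commit applies, using hypothetical DummyModelConfig, DummyRunner, DummyBuilder,
and FakeHFConfig classes (not the actual vLLM types): resolve the flag once at
construction time, then read the cached attribute in the hot path instead of
re-walking the runner.model_config chain on every call, with a timeit
comparison of the two accesses.

import timeit

class DummyModelConfig:
    """Hypothetical stand-in for vLLM's ModelConfig."""

    def __init__(self, hf_config):
        self.hf_config = hf_config

    @property
    def is_encoder_decoder(self) -> bool:
        # A property that descends into a nested config stands in for
        # the "long call stack" the commit message describes.
        return getattr(self.hf_config, "is_encoder_decoder", False)

class DummyRunner:
    def __init__(self, model_config):
        self.model_config = model_config

class DummyBuilder:
    def __init__(self, runner):
        self.runner = runner
        # The pattern from this commit: evaluate the flag once here.
        self.is_encoder_decoder = runner.model_config.is_encoder_decoder

    def uncached(self) -> bool:
        # Per-call chain walk plus a property evaluation.
        return self.runner.model_config.is_encoder_decoder

    def cached(self) -> bool:
        # Single attribute read of the precomputed value.
        return self.is_encoder_decoder

class FakeHFConfig:
    is_encoder_decoder = False

builder = DummyBuilder(DummyRunner(DummyModelConfig(FakeHFConfig())))
print("uncached:", timeit.timeit(builder.uncached, number=1_000_000))
print("cached:  ", timeit.timeit(builder.cached, number=1_000_000))

On a typical CPython build the cached read should come out noticeably cheaper,
since it replaces two attribute lookups plus a property call with a single
instance-attribute read, which is the saving this PR targets.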
whx-sjtu and hw_whx authored Feb 21, 2025
1 parent d21b3be commit 386817b
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions vllm_ascend/model_runner.py
@@ -353,6 +353,7 @@ def __init__(self,
         self.multi_modal_input_mapper = self.runner.multi_modal_input_mapper
         self.finished_requests_ids = finished_requests_ids
         self.decode_only = True
+        self.is_encoder_decoder = self.runner.model_config.is_encoder_decoder
 
         # Attention metadata inputs.
         self.attn_metadata_builder = self.attn_backend.make_metadata_builder(
@@ -423,7 +424,7 @@ def add_seq_group(self, seq_group_metadata: SequenceGroupMetadata):
 
         encoder_seq_len = 0
 
-        if self.runner.model_config.is_encoder_decoder:
+        if self.is_encoder_decoder:
             encoder_seq_len = seq_group_metadata.encoder_seq_data.get_len()
 
         inter_data = self.init_cached_inter_data(
@@ -560,7 +561,7 @@ def _compute_lens(self, inter_data: InterDataForSeqGroup, seq_idx: int,
             context_len = seq_data.get_num_computed_tokens()
             seq_len = min(seq_len, context_len + token_chunk_size)
         elif self.runner.scheduler_config.is_multi_step or \
-                self.runner.model_config.is_encoder_decoder:
+                self.is_encoder_decoder:
             context_len = seq_len - 1
         else:
             context_len = seq_data.get_num_computed_tokens()
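
A design note: the eager cache in __init__ assumes model_config does not change
over the builder's lifetime. Under that same assumption, a lazier variant of
the sketch above could use functools.cached_property, which memoizes the first
access instead of paying the lookup at construction time; this is an
alternative pattern, not what the commit does.

from functools import cached_property

class LazyBuilder:
    """Hypothetical variant; not the vLLM implementation."""

    def __init__(self, runner):
        self.runner = runner

    @cached_property
    def is_encoder_decoder(self) -> bool:
        # Evaluated once on first access, then stored on the instance;
        # subsequent reads are plain attribute lookups.
        return self.runner.model_config.is_encoder_decoder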
