
feat/add latency support for trtllm bench #3730


Open
wants to merge 2 commits into main

Conversation

danielafrimi
Collaborator

@rakib-hasan
I'd like to get a review from you :)

Most of the changes are taken from your PR (it only covered throughput support).

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2941 [ run ] triggered by Bot

report_json: Path = params.pop("report_json")
iteration_log: Path = params.pop("iteration_log")
iteration_writer = IterationWriter(iteration_log)

# Initialize the HF tokenizer for the specified model.
# ignore_eos = True if runtime_config.decoding_config.decoding_mode == SpeculativeDecodingMode.NONE else False  # TODO (dafrimi): not sure where to locate this line since it requires the runtime config, but the tokenizer is used to get the dataset
Collaborator Author

@rakib-hasan WDYT? I'm not sure where to locate it.

Collaborator

It looks like ignore_eos is only used for getting the eos_id and pad_id.

Meanwhile, eos_id and pad_id are only used for creating the SamplingParams. So why not move lines 198, 200, and 201 to right before the creation of the SamplingParams (line ~290)? Then you could have:

ignore_eos = True if runtime_config.decoding_config.decoding_mode == SpeculativeDecodingMode.NONE else False

since you've already created the runtime config by that point.
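
A rough sketch of what that reordering could look like; the helper name, sentinel ids, and import path are assumptions for illustration, not the actual file contents:

from tensorrt_llm.llmapi import SamplingParams  # import path assumed


def build_sampling_params(tokenizer, runtime_config, speculative_none_mode, max_tokens):
    # Ignore EOS only when no speculative decoding mode is configured, so the
    # benchmark always generates the full requested output length.
    ignore_eos = runtime_config.decoding_config.decoding_mode == speculative_none_mode
    eos_id = -1 if ignore_eos else tokenizer.eos_token_id
    pad_id = -1 if ignore_eos else (tokenizer.pad_token_id or eos_id)
    # eos/pad ids feed SamplingParams as in the discussion above; the exact
    # keyword names may differ in the real API.
    return SamplingParams(end_id=eos_id, pad_id=pad_id, max_tokens=max_tokens)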

Collaborator

IIUC, this reordering is all because of runtime_config.decoding_config.decoding_mode, right?
In that case, can we have an extra API just to get that value early? That way runtime_config can be created once all the extra options are populated.
For the TRT case, it would come from the engine config (we need to read the config twice, which should be fine for now).
For the PyTorch case, it is currently empty for the throughput benchmark (possibly because SD is not used in the throughput use case, and PyTorch doesn't support the older SD techniques yet). In that case, we can just keep it empty for now, like the throughput case, and call that a limitation until the PyTorch path supports SD techniques.
Does that sound alright?
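
A minimal sketch of the kind of early-lookup helper being proposed; the function name, config file layout, and key names are hypothetical:

import json
from pathlib import Path


def peek_decoding_mode(engine_dir, backend):
    """Hypothetical helper: fetch the decoding mode before the full runtime
    config is built. The TRT path reads the engine config (reading it twice is
    acceptable for now); the PyTorch path returns None, mirroring the
    throughput benchmark, until it supports the older SD techniques."""
    if backend == "cpp" and engine_dir is not None:
        config = json.loads((Path(engine_dir) / "config.json").read_text())
        # The key path below is a guess at where the engine stores this setting.
        return config.get("build_config", {}).get("speculative_decoding_mode")
    return None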

Collaborator Author

@Naveassaf @rakib-hasan The reordering is needed because we need the tokenizer for create_dataset_from_stream, which outputs the metadata that is part of the exec_settings for the runtime config.

@rakib-hasan I've replaced this commented-out line with what you suggested. Thanks.

Collaborator

Just to confirm, with that extra API, this reorder shouldn't be needed, right?

Collaborator Author

danielafrimi commented Apr 22, 2025

Look at the comment I added (line 288): we won't need an extra API for this case. We still reorder the exec_settings, since we need to fetch the settings in the PyTorch flow, which uses the metadata (from create_dataset_from_stream). WDYT?

BTW, I checked and it works with the PyTorch flow as well (nothing changed for TRT).

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2943 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2941 [ run ] completed with state ABORTED

Collaborator

@Naveassaf left a comment

Nice change :)


# Engine configuration parsing for PyTorch backend
kwargs = {}
if backend and backend.lower() in ["pytorch", "autodeploy"]:
Collaborator

I'd consider factoring this out into a constant up top:

SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]

using it as the --backend option's choices, and making the default cpp. IMO it's confusing for the default not to be one of the options.

Collaborator

Agree on the refactor. Not just up top, but in a shared utils module so that both low_latency.py and throughput.py can use the same constant.
For the default, however, I would go with pytorch.

Collaborator Author

done

Collaborator

PyTorch is the correct default -- it just got changed here recently.
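
A minimal sketch of the refactor discussed in this thread, assuming the benchmark subcommands are click-based (the option help text and module layout are assumptions):

import click

# Would live in a shared utils module so low_latency.py and throughput.py
# use the same list of valid backends.
SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]


@click.command()
@click.option(
    "--backend",
    type=click.Choice(SUPPORTED_BACKENDS),
    default="pytorch",  # reviewers agreed PyTorch is the correct default
    help="Runtime backend used by the benchmark.",
)
def latency_command(backend):
    click.echo(f"Selected backend: {backend}")


if __name__ == "__main__":
    latency_command()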


# Runtime Options
kv_cache_percent = params.pop("kv_cache_free_gpu_mem_fraction")
medusa_choices = params.pop("medusa_choices")

# Reporting Options
# # Reporting Options
Collaborator

extra #


@tensorrt-cicd
Collaborator

PR_Github #2943 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2064 completed with status: 'FAILURE'

@rakib-hasan
Collaborator

@danielafrimi The other issue you mentioned, making the tokenizer optional for multimodal: can you add that change to both the low_latency and throughput cases? It should be a simple change to add. If it turns out to be complicated, then you can certainly create a new PR.

@danielafrimi
Collaborator Author

danielafrimi commented Apr 22, 2025

@rakib-hasan About making the tokenizer optional for multimodal: it can only be optional in the first phase, when we prepare the dataset. Looking at it now, for a real dataset the multimodal path doesn't use the tokenizer, but for synthetic datasets it fetches the vocab_size from it.

I'm not sure making the tokenizer optional is a good change. WDYT? We could add new fields like vocab_size, but that wouldn't make the API any nicer.

For running trtllm-bench on NVILA, I initialized the NVILA tokenizer the way it's done in modeling_vila.py (will create a separate PR).
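
One possible shape for that (purely illustrative; the function and parameter names are hypothetical) is to accept either a tokenizer or an explicit vocab_size when generating synthetic data:

import random


def synthetic_token_ids(num_tokens, tokenizer=None, vocab_size=None, seed=0):
    # Synthetic prompts only need a vocabulary size, while real multimodal
    # datasets skip tokenization entirely, so the tokenizer can stay optional.
    if vocab_size is None:
        if tokenizer is None:
            raise ValueError("Provide either a tokenizer or an explicit vocab_size")
        vocab_size = tokenizer.vocab_size
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(num_tokens)]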

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3055 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3055 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2135 completed with status: 'SUCCESS'

Comment on lines 25 to 26
SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]
PYTHON_SUPPORTED_BACKENDS = ["pytorch", "autodeploy"]
Collaborator

A couple of things:

  1. SUPPORTED_BACKENDS: can we rename this to something like ALL_SUPPORTED_BACKENDS?
  2. PYTHON_SUPPORTED_BACKENDS: do we need a separate constant for this? It might cause some confusion. The conditional check could just be backend != "cpp" instead.
  3. Can you add the same backend change in throughput.py?
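
A small sketch of items 1 and 2: one renamed constant plus a direct check against "cpp", instead of a second PYTHON_SUPPORTED_BACKENDS list (surrounding code is paraphrased for illustration):

# Single source of truth for valid --backend choices.
ALL_SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]


def uses_python_runtime(backend):
    # Everything except the C++ (TRT engine) backend takes the Python path,
    # so no separate PYTHON_SUPPORTED_BACKENDS list is needed.
    return backend is not None and backend.lower() != "cpp"


kwargs = {}
if uses_python_runtime("pytorch"):
    kwargs["backend"] = "pytorch"  # illustrative value only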

Collaborator Author

done

@danielafrimi
Collaborator Author

@rakib-hasan Can we merge it?

@rakib-hasan
Collaborator

Hi @danielafrimi, as Frank mentioned that the latency side of the benchmark is just legacy at this point, I will let him decide.

@FrankD412 do you want this PR to be merged or closed?

If merging, @danielafrimi, we need to resync and rerun the CI, as it has been 2 weeks since the last run.

@FrankD412
Collaborator

@rakib-hasan -- Sorry, I meant to schedule something with you all offline, but last week was a bit crazy with travel back from the East Coast.

I'm okay with this being merged, as it doesn't break the benchmark and updates it to use PyTorch. My only concern is that a lot of folks are using the throughput benchmark for low latency (we even added an EOS option for speculation), so anyone using this path isn't in line with the current methodology.

Let me also get something on the books for a sync; I think it would be useful to talk with you all about where I'd like to take this, and maybe we can join forces to help move that forward.

@rakib-hasan
Collaborator

Sounds good.

I am not sure whether @danielafrimi is planning to use the latency path or whether it was just for feature parity with the throughput path. If it is the latter, then this PR should be closed, and perhaps a new PR deprecating the latency path would reduce future confusion.

@danielafrimi
Collaborator Author

Hi @rakib-hasan @FrankD412 , regarding the latency part—I needed to benchmark a VLM for both throughput and latency. I'm currently working on supporting quantization of the NVIDIA model using TensorRT-LLM with the PyTorch backend.

My main goal here is to compare the results reported in the paper with those of the model running inside TensorRT-LLM. I created this PR as part of my modifications, building on top of the existing throughput script. That said, I’m not sure how we’ll be able to obtain metrics like TTFT, ITL, and other detailed latency numbers without proper latency support for VLMs—at least for now.

What do you think?

@rakib-hasan
Collaborator

Hey @danielafrimi. Apologies, I wrote a response but forgot to send it.
For latency numbers, you can enable streaming (--streaming) in the throughput command. Would that be sufficient for your use case?

If so, we can close this one, since the latency path won't be needed.

If not, we can certainly merge this and note what is missing, so that we can add it as part of the refactor Frank is going to work on.
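
For illustration, a streaming throughput run might look something like the line below; apart from --streaming, the exact flag spellings are assumptions, so check trtllm-bench throughput --help:

trtllm-bench --model <model-name-or-path> throughput --dataset <dataset.jsonl> --backend pytorch --streaming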

@FrankD412
Collaborator

@rakib-hasan -- just following up on this one. What's the current status looking like?

@tensorrt-cicd
Collaborator

PR_Github #9019 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6587 completed with status: 'FAILURE'

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9026 [ run ] triggered by Bot

@danielafrimi
Collaborator Author

/bot kill

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9027 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9028 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9027 [ kill ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #9026 [ run ] completed with state ABORTED

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9031 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9031 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6598 completed with status: 'FAILURE'

@danielafrimi
Collaborator Author

/bot run

@danielafrimi
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #9164 [ run ] triggered by Bot

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9169 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9169 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6716 completed with status: 'SUCCESS'

Daniel Afrimi added 2 commits June 23, 2025 14:05
Signed-off-by: Daniel Afrimi <[email protected]>

refactor

Signed-off-by: Daniel Afrimi <[email protected]>

fix review

Signed-off-by: Ubuntu <[email protected]>

refactor review

Signed-off-by: Ubuntu <[email protected]>

wip

Signed-off-by: Ubuntu <[email protected]>

wip

Signed-off-by: Ubuntu <[email protected]>

refactor

Signed-off-by: Ubuntu <[email protected]>

refactor

Signed-off-by: Ubuntu <[email protected]>
Signed-off-by: Ubuntu <[email protected]>
@shaharmor98
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9686 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9686 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7121 completed with status: 'FAILURE'

# If we're dealing with a model name, perform a snapshot download to
# make sure we have a local copy of the model.
if checkpoint_path is None:
if bench_env.checkpoint_path is None:
Collaborator

Re-iterating the comment from the other PR.

@danielafrimi I ran into an issue with the previous code where the model was already downloaded but the code still tried to re-download it, since bench_env.checkpoint_path is None even though bench_env.model contains the local path.

Can you please elaborate on why you think the current code is wrong?
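
For reference, a hedged sketch of the guard being discussed: skip the snapshot download when the model argument already points at a local directory (the helper name and the use of huggingface_hub are assumptions):

from pathlib import Path

from huggingface_hub import snapshot_download  # assumed download mechanism


def resolve_checkpoint(model, checkpoint_path):
    # An explicit checkpoint path wins; nothing to download.
    if checkpoint_path is not None:
        return Path(checkpoint_path)
    # If the model string is already a local directory, don't re-download.
    if Path(model).is_dir():
        return Path(model)
    # Otherwise treat `model` as a Hugging Face model id and fetch a local copy.
    return Path(snapshot_download(model))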

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9842 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9842 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7260 completed with status: 'FAILURE'

Labels
Community want to contribute (PRs initiated from Community), Performance (TRTLLM model inference speed, throughput, efficiency; latency, benchmarks, regressions, opts), triaged (Issue has been triaged by maintainers)
Projects
None yet
9 participants