
feat/add latency support for trtllm bench #3730


Open
wants to merge 2 commits into main

Conversation

danielafrimi
Collaborator

@rakib-hasan
I'd like to get a review from you :)

Most of the changes are taken from your PR (it only covered throughput support).

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2941 [ run ] triggered by Bot

report_json: Path = params.pop("report_json")
iteration_log: Path = params.pop("iteration_log")
iteration_writer = IterationWriter(iteration_log)

# Initialize the HF tokenizer for the specified model.
# ignore_eos = True if runtime_config.decoding_config.decoding_mode == SpeculativeDecodingMode.NONE else False  # TODO (dafrimi): not sure where to locate this line since it requires the runtime config, but the tokenizer is used to get the dataset
Collaborator Author

@rakib-hasan WDYT? I'm not sure where to locate it.

Collaborator

It looks like ignore_eos is only used for getting the eos_id and pad_id.

Meanwhile, eos_id and pad_id are only used for creating the SamplingParams. So why not move lines 198, 200, and 201 to right before the creation of the SamplingParams (line ~290)? Then you could have:

ignore_eos = True if runtime_config.decoding_config.decoding_mode == SpeculativeDecodingMode.NONE else False

since you've already created the runtime config by that point.
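
A rough sketch of what that reordering could look like; the helper name, sentinel ids, and import path are assumptions for illustration, not the actual file contents:

from tensorrt_llm.llmapi import SamplingParams  # import path assumed


def build_sampling_params(tokenizer, runtime_config, speculative_none_mode, max_tokens):
    # Ignore EOS only when no speculative decoding mode is configured, so the
    # benchmark always generates the full requested output length.
    ignore_eos = runtime_config.decoding_config.decoding_mode == speculative_none_mode
    eos_id = -1 if ignore_eos else tokenizer.eos_token_id
    pad_id = -1 if ignore_eos else (tokenizer.pad_token_id or eos_id)
    # eos/pad ids feed SamplingParams as in the discussion above; the exact
    # keyword names may differ in the real API.
    return SamplingParams(end_id=eos_id, pad_id=pad_id, max_tokens=max_tokens)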

Collaborator

IIUC, this reordering is all because of runtime_config.decoding_config.decoding_mode, right?
In that case, can we have an extra API just to get that value early? That way runtime_config can be created once all the extra options are populated.
For the TRT case, it would come from the engine config (we need to read the config twice, which should be fine for now).
For the PyTorch case, it is currently empty for the throughput benchmark (possibly because SD is not used in the throughput use case, and PyTorch doesn't support the older SD techniques yet). In that case, we can just keep it empty for now, like the throughput case, and call that a limitation until the PyTorch path supports SD techniques.
Does that sound alright?
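
A minimal sketch of the kind of early-lookup helper being proposed; the function name, config file layout, and key names are hypothetical:

import json
from pathlib import Path


def peek_decoding_mode(engine_dir, backend):
    """Hypothetical helper: fetch the decoding mode before the full runtime
    config is built. The TRT path reads the engine config (reading it twice is
    acceptable for now); the PyTorch path returns None, mirroring the
    throughput benchmark, until it supports the older SD techniques."""
    if backend == "cpp" and engine_dir is not None:
        config = json.loads((Path(engine_dir) / "config.json").read_text())
        # The key path below is a guess at where the engine stores this setting.
        return config.get("build_config", {}).get("speculative_decoding_mode")
    return None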

Collaborator Author

@Naveassaf @rakib-hasan The reordering is needed because we need the tokenizer for create_dataset_from_stream, which outputs the metadata that is part of the exec_settings for the runtime config.

@rakib-hasan I've replaced this commented-out line with what you suggested. Thanks.

Collaborator

Just to confirm, with that extra API, this reorder shouldn't be needed, right?

Collaborator Author

danielafrimi commented Apr 22, 2025

Look at the comment I added (line 288): we won't need an extra API for this case. We still reorder the exec_settings, since we need to fetch the settings in the PyTorch flow, which uses the metadata (from create_dataset_from_stream). WDYT?

BTW, I checked and it works with the PyTorch flow as well (nothing changed for TRT).

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2943 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2941 [ run ] completed with state ABORTED

Collaborator

@Naveassaf left a comment

Nice change :)


# Engine configuration parsing for PyTorch backend
kwargs = {}
if backend and backend.lower() in ["pytorch", "autodeploy"]:
Collaborator

I'd consider factoring this out into a constant up top:

SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]

using it as the --backend option's choices, and making the default cpp. IMO it's confusing for the default not to be one of the options.

Collaborator

Agree on the refactor. Not just up top, but in a shared utils module so that both low_latency.py and throughput.py can use the same constant.
For the default, however, I would go with pytorch.

Collaborator Author

done

Collaborator

PyTorch is the correct default -- it just got changed here recently.
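
A minimal sketch of the refactor discussed in this thread, assuming the benchmark subcommands are click-based (the option help text and module layout are assumptions):

import click

# Would live in a shared utils module so low_latency.py and throughput.py
# use the same list of valid backends.
SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]


@click.command()
@click.option(
    "--backend",
    type=click.Choice(SUPPORTED_BACKENDS),
    default="pytorch",  # reviewers agreed PyTorch is the correct default
    help="Runtime backend used by the benchmark.",
)
def latency_command(backend):
    click.echo(f"Selected backend: {backend}")


if __name__ == "__main__":
    latency_command()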


# Runtime Options
kv_cache_percent = params.pop("kv_cache_free_gpu_mem_fraction")
medusa_choices = params.pop("medusa_choices")

# Reporting Options
# # Reporting Options
Collaborator

extra #


@tensorrt-cicd
Collaborator

PR_Github #2943 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2064 completed with status: 'FAILURE'

@rakib-hasan
Collaborator

@danielafrimi The other issue you mentioned, making the tokenizer optional for multimodal: can you add that change to both the low_latency and throughput cases? It should be a simple change to add. If it turns out to be complicated, then you can certainly create a new PR.

@danielafrimi
Collaborator Author

danielafrimi commented Apr 22, 2025

@rakib-hasan About making the tokenizer optional for multimodal: it can only be optional in the first phase, when we prepare the dataset. Looking at it now, for a real dataset the multimodal path doesn't use the tokenizer, but for synthetic datasets it fetches the vocab_size from it.

I'm not sure making the tokenizer optional is a good change. WDYT? We could add new fields like vocab_size, but that wouldn't make the API any nicer.

For running trtllm-bench on NVILA, I initialized the NVILA tokenizer the way it's done in modeling_vila.py (will create a separate PR).
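
One possible shape for that (purely illustrative; the function and parameter names are hypothetical) is to accept either a tokenizer or an explicit vocab_size when generating synthetic data:

import random


def synthetic_token_ids(num_tokens, tokenizer=None, vocab_size=None, seed=0):
    # Synthetic prompts only need a vocabulary size, while real multimodal
    # datasets skip tokenization entirely, so the tokenizer can stay optional.
    if vocab_size is None:
        if tokenizer is None:
            raise ValueError("Provide either a tokenizer or an explicit vocab_size")
        vocab_size = tokenizer.vocab_size
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(num_tokens)]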

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3055 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3055 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2135 completed with status: 'SUCCESS'

Comment on lines 25 to 26
SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]
PYTHON_SUPPORTED_BACKENDS = ["pytorch", "autodeploy"]
Collaborator

A couple of things:

  1. SUPPORTED_BACKENDS: can we rename this to something like ALL_SUPPORTED_BACKENDS?
  2. PYTHON_SUPPORTED_BACKENDS: do we need a separate constant for this? It might cause some confusion. The conditional check could just be backend != "cpp" instead.
  3. Can you add the same backend change in throughput.py?
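
A small sketch of items 1 and 2: one renamed constant plus a direct check against "cpp", instead of a second PYTHON_SUPPORTED_BACKENDS list (surrounding code is paraphrased for illustration):

# Single source of truth for valid --backend choices.
ALL_SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]


def uses_python_runtime(backend):
    # Everything except the C++ (TRT engine) backend takes the Python path,
    # so no separate PYTHON_SUPPORTED_BACKENDS list is needed.
    return backend is not None and backend.lower() != "cpp"


kwargs = {}
if uses_python_runtime("pytorch"):
    kwargs["backend"] = "pytorch"  # illustrative value only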

Collaborator Author

done

@danielafrimi
Collaborator Author

@rakib-hasan Can we merge it?

@rakib-hasan
Collaborator

Hi @danielafrimi, as Frank mentioned that the latency side of the benchmark is just legacy at this point, I will let him decide.

@FrankD412 do you want this PR to be merged or closed?

If merging, @danielafrimi, we need to resync and rerun the CI, as it has been 2 weeks since the last run.

@FrankD412
Collaborator

@rakib-hasan -- Sorry, I meant to schedule something with you all offline, but last week was a bit crazy with travel back from the East Coast.

I'm okay with this being merged, as it doesn't break the benchmark and updates it to use PyTorch. My only concern is that a lot of folks are using the throughput benchmark for low latency (we even added an EOS option for speculation), so anyone using this path isn't in line with the current methodology.

Let me also get something on the books for a sync; I think it would be useful to talk with you all about where I'd like to take this, and maybe we can join forces to help move that forward.

@rakib-hasan
Collaborator

Sounds good.

I am not sure whether @danielafrimi is planning to use the latency path or whether it was just for feature parity with the throughput path. If it is the latter, then this PR should be closed, and perhaps a new PR deprecating the latency path would reduce future confusion.

@danielafrimi
Collaborator Author

Hi @rakib-hasan @FrankD412 , regarding the latency part—I needed to benchmark a VLM for both throughput and latency. I'm currently working on supporting quantization of the NVIDIA model using TensorRT-LLM with the PyTorch backend.

My main goal here is to compare the results reported in the paper with those of the model running inside TensorRT-LLM. I created this PR as part of my modifications, building on top of the existing throughput script. That said, I’m not sure how we’ll be able to obtain metrics like TTFT, ITL, and other detailed latency numbers without proper latency support for VLMs—at least for now.

What do you think?

@rakib-hasan
Collaborator

Hey @danielafrimi. Apologies, I wrote a response but forgot to send it.
For latency numbers, you can enable streaming (--streaming) in the throughput command. Would that be sufficient for your use case?

If so, we can close this one, since the latency path won't be needed.

If not, we can certainly merge this and note what is missing, so that we can add it as part of the refactor Frank is going to work on.
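
For illustration, a streaming throughput run might look something like the line below; apart from --streaming, the exact flag spellings are assumptions, so check trtllm-bench throughput --help:

trtllm-bench --model <model-name-or-path> throughput --dataset <dataset.jsonl> --backend pytorch --streaming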

@FrankD412
Collaborator

@rakib-hasan -- just following up on this one. What's the current status looking like?

@tensorrt-cicd
Collaborator

PR_Github #9019 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6587 completed with status: 'FAILURE'

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9026 [ run ] triggered by Bot

@danielafrimi
Collaborator Author

/bot kill

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9027 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9028 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9027 [ kill ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #9026 [ run ] completed with state ABORTED

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9031 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9031 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6598 completed with status: 'FAILURE'

@danielafrimi
Collaborator Author

/bot run

@danielafrimi
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #9164 [ run ] triggered by Bot

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9169 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9169 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6716 completed with status: 'SUCCESS'

Daniel Afrimi added 2 commits June 23, 2025 14:05
Signed-off-by: Daniel Afrimi <[email protected]>

refactor

Signed-off-by: Daniel Afrimi <[email protected]>

fix review

Signed-off-by: Ubuntu <[email protected]>

refactor review

Signed-off-by: Ubuntu <[email protected]>

wip

Signed-off-by: Ubuntu <[email protected]>

wip

Signed-off-by: Ubuntu <[email protected]>

refactor

Signed-off-by: Ubuntu <[email protected]>

refactor

Signed-off-by: Ubuntu <[email protected]>
Signed-off-by: Ubuntu <[email protected]>
@shaharmor98
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9686 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9686 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7121 completed with status: 'FAILURE'

# If we're dealing with a model name, perform a snapshot download to
# make sure we have a local copy of the model.
if checkpoint_path is None:
if bench_env.checkpoint_path is None:
Collaborator

Re-iterating the comment from the other PR.

@danielafrimi I ran into an issue with the previous code where the model was already downloaded but the code still tried to re-download it, since bench_env.checkpoint_path is None even though bench_env.model contains the local path.

Can you please elaborate on why you think the current code is wrong?
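
For reference, a hedged sketch of the guard being discussed: skip the snapshot download when the model argument already points at a local directory (the helper name and the use of huggingface_hub are assumptions):

from pathlib import Path

from huggingface_hub import snapshot_download  # assumed download mechanism


def resolve_checkpoint(model, checkpoint_path):
    # An explicit checkpoint path wins; nothing to download.
    if checkpoint_path is not None:
        return Path(checkpoint_path)
    # If the model string is already a local directory, don't re-download.
    if Path(model).is_dir():
        return Path(model)
    # Otherwise treat `model` as a Hugging Face model id and fetch a local copy.
    return Path(snapshot_download(model))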

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9842 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9842 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7260 completed with status: 'FAILURE'

Labels
Community want to contribute (PRs initiated from Community), Performance (TRTLLM model inference speed, throughput, efficiency; latency, benchmarks, regressions, opts), triaged (Issue has been triaged by maintainers)
Projects
None yet
9 participants