feat/add latency support for trtllm bench #3730
base: main
Conversation
/bot run
PR_Github #2941 [ run ] triggered by Bot
Force-pushed from 2d27088 to 903032b
report_json: Path = params.pop("report_json")
iteration_log: Path = params.pop("iteration_log")
iteration_writer = IterationWriter(iteration_log)

# Initialize the HF tokenizer for the specified model.
# ignore_eos = True if runtime_config.decoding_config.decoding_mode == SpeculativeDecodingMode.NONE else False  # TODO (dafrimi): not sure where to locate this line since it requires the runConfig, but the tokenizer is used to get the dataset
@rakib-hasan WDYT? not sure where to locate it
It looks like ignore_eos is only used for getting the eos_id and pad_id. Meanwhile, eos_id and pad_id are only used for creating the SamplingParams. So why not move lines 198, 200, and 201 to be right before the creation of the SamplingParams (line ~290)? Then you could have:
ignore_eos = True if runtime_config.decoding_config.decoding_mode == SpeculativeDecodingMode.NONE else False
since you've already created the runtime config.
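For reference, a minimal sketch of the reordering being discussed, assuming tokenizer and runtime_config already exist at that point in the script; the eos_id/pad_id fallbacks and the SamplingParams fields are illustrative and may not match the benchmark code exactly:

# Sketch only: compute ignore_eos from the already-built runtime config,
# then resolve eos_id/pad_id right before constructing the sampling params.
ignore_eos = runtime_config.decoding_config.decoding_mode == SpeculativeDecodingMode.NONE

eos_id = tokenizer.eos_token_id if not ignore_eos else -1
pad_id = tokenizer.pad_token_id if not ignore_eos else -1

sampling_params = SamplingParams(
    end_id=eos_id,
    pad_id=pad_id,
)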
IIUC, this reordering is all because of runtime_config.decoding_config.decoding_mode, right?
In that case, can we have an extra API just to get that early? That way runtime_config can be created once all the extra options are populated.
For the TRT case, it will come from the engine config (we need to read the config twice, which should be fine for now).
For the PyTorch case, it is currently empty for the throughput benchmark (possibly since SD is not used for the throughput use case, and since PyTorch doesn't support the older SD techniques yet). In that case, we can just keep it empty for now, like the throughput case, and note that it is a limitation until the PyTorch path supports SD techniques.
Does that sound alright?
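A rough sketch of what that early lookup could look like (purely hypothetical: the helper name, the config field, and the file layout are assumptions, not the actual trtllm-bench API):

import json
from pathlib import Path
from typing import Optional

def peek_decoding_mode(backend: str, engine_dir: Optional[Path]) -> Optional[str]:
    # Hypothetical helper: fetch the decoding mode before the full runtime
    # config is assembled.
    if backend == "cpp" and engine_dir is not None:
        # TRT path: read the engine config a second time just for this field
        # (acceptable overhead for now, per the discussion above).
        config = json.loads((engine_dir / "config.json").read_text())
        return config.get("build_config", {}).get("speculative_decoding_mode")
    # PyTorch path: no speculative decoding in this benchmark yet, so mirror
    # the throughput case and leave it unset.
    return None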
@Naveassaf @rakib-hasan the reordering is needed because we need the tokenizer for create_dataset_from_stream, which outputs metadata, which is part of the exec_settings for the runConfig.
@rakib-hasan I commented out this line and went with what you suggested, thanks.
Just to confirm, with that extra API, this reorder shouldn't be needed, right?
See the comment I added (line 288); we won't need an extra API for this case. We still reorder the exec_settings, since in the PyTorch flow we need to fetch the settings, which use the metadata (from create_dataset_from_stream). WDYT?
BTW, I checked and it works with the PyTorch flow as well (for TRT nothing changed).
/bot run
PR_Github #2943 [ run ] triggered by Bot
PR_Github #2941 [ run ] completed with state
Nice change :)
# Engine configuration parsing for PyTorch backend
kwargs = {}
if backend and backend.lower() in ["pytorch", "autodeploy"]:
I'd consider factoring this out to some const up top:
SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]
having this as the --backend's choices, and having the default be cpp. IMO it's confusing to have the default not be an option.
Agree on the refactor. Not just up top, but in some utils so that both low_latency.py and throughput.py can use the same. However, for the default, I would go with pytorch.
done
Pytorch is the correct default -- it just got changed here recently.
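As a reference point, a minimal sketch of the shared constant plus the --backend option, assuming the benchmark commands are click-based (the option wiring and help text here are illustrative, not the final code):

import click

# Shared between low_latency.py and throughput.py (illustrative layout).
ALL_SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]

backend_option = click.option(
    "--backend",
    type=click.Choice(ALL_SUPPORTED_BACKENDS),
    default="pytorch",
    help="Backend used to run the benchmark.",
)

Applying backend_option as a decorator on both commands keeps the choices and the default in one place.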
# Runtime Options
kv_cache_percent = params.pop("kv_cache_free_gpu_mem_fraction")
medusa_choices = params.pop("medusa_choices")

# Reporting Options
# # Reporting Options
extra #
PR_Github #2943 [ run ] completed with state
@danielafrimi The other issue you mentioned, making the tokenizer optional for multimodal: can you add that change to both the low_latency and throughput cases? That should be a simple change to add. If it turns out to be complicated, then you can certainly create a new PR.
@rakib-hasan about the issue of making the tokenizer optional for multimodal: it can only be optional in the first phase, when we prepare the dataset. Looking at this now, for real_dataset we don't use the tokenizer in the multimodal case, but for the synthetic ones the tokenizer is still used (e.g. to fetch the vocab size), so I'm not sure whether making the tokenizer optional is a good change. WDYT? We could add some new fields like vocab_size, but it won't make the API nicer.
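For illustration only, one way the optional-tokenizer idea could look; the flags and helper name here are hypothetical, not the current trtllm-bench API:

from typing import Optional

from transformers import AutoTokenizer, PreTrainedTokenizerBase

def maybe_load_tokenizer(model: str, is_multimodal: bool,
                         synthetic_data: bool) -> Optional[PreTrainedTokenizerBase]:
    # Hypothetical helper: skip the tokenizer only when it is truly unused,
    # i.e. a multimodal model running on a real (non-synthetic) dataset.
    if is_multimodal and not synthetic_data:
        return None
    # Synthetic datasets still need the tokenizer (e.g. for the vocab size).
    return AutoTokenizer.from_pretrained(model)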
Force-pushed from 35295d8 to 0fd76e1
/bot run
PR_Github #3055 [ run ] triggered by Bot
PR_Github #3055 [ run ] completed with state
SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]
PYTHON_SUPPORTED_BACKENDS = ["pytorch", "autodeploy"]
A couple of things:
- SUPPORTED_BACKENDS: can we rename this to something like ALL_SUPPORTED_BACKENDS?
- PYTHON_SUPPORTED_BACKENDS: do we need a separate one for this? This might cause some confusion. The conditional check can be changed to maybe backend != "cpp" (see the sketch after this list)?
- Can you add the same backend change in throughput.py?
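A small sketch of that simplification (illustrative only; names follow the diff above, and the body of the branch is elided):

ALL_SUPPORTED_BACKENDS = ["pytorch", "autodeploy", "cpp"]

kwargs = {}
# Instead of keeping a second PYTHON_SUPPORTED_BACKENDS list, special-case
# the cpp backend directly.
if backend and backend.lower() != "cpp":
    # pytorch/autodeploy path: populate backend-specific kwargs here.
    ...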
done
@rakib-hasan Can we merge it?
Hi @danielafrimi. As Frank mentioned: @FrankD412, do you want this PR to be merged or closed? If merging, @danielafrimi, we need to resync and rerun the CI, as it has been 2 weeks since the last run.
@rakib-hasan -- Sorry, I meant to schedule something with you all offline, but last week was a bit crazy with travel back from the East Coast. I'm okay with this being merged as it doesn't break the benchmark and updates it to utilize PyTorch. My only concern here is that a lot of folks are using the ...
Let me also get something on the books for a sync; I think it would be useful to talk to you all about where I'd like to go with it, and maybe we can join forces to help move that forward.
Sounds good. I am not sure whether @danielafrimi is planning to use the latency path or it was just for feature parity with the throughput path. If it is the latter, then this PR should be closed, and maybe a new PR for deprecating the latency path would reduce future confusion.
Hi @rakib-hasan @FrankD412, regarding the latency part—I needed to benchmark a VLM for both throughput and latency. I'm currently working on supporting quantization of the NVIDIA model using TensorRT-LLM with the PyTorch backend. My main goal here is to compare the results reported in the paper with those of the model running inside TensorRT-LLM. I created this PR as part of my modifications, building on top of the existing throughput script. That said, I'm not sure how we'll be able to obtain metrics like TTFT, ITL, and other detailed latency numbers without proper latency support for VLMs—at least for now. What do you think?
Hey @danielafrimi. Apologies, I wrote a response but forgot to send. If so, we can close this one as ... If not, we can certainly merge this and note what is missing so that we can add that as part of the refactor that Frank is going to work on.
@rakib-hasan -- just following up on this one. What's the current status looking like?
PR_Github #9019 [ run ] completed with state
/bot run
PR_Github #9026 [ run ] triggered by Bot
/bot kill
/bot run
PR_Github #9027 [ kill ] triggered by Bot
PR_Github #9028 [ run ] triggered by Bot
PR_Github #9027 [ kill ] completed with state
PR_Github #9026 [ run ] completed with state
/bot run
PR_Github #9031 [ run ] triggered by Bot
PR_Github #9031 [ run ] completed with state
/bot run
/bot kill
PR_Github #9164 [ run ] triggered by Bot
Force-pushed from 03a0522 to 5c0634c
/bot run
PR_Github #9169 [ run ] triggered by Bot
PR_Github #9169 [ run ] completed with state
Signed-off-by: Daniel Afrimi <[email protected]>
refactor
Signed-off-by: Daniel Afrimi <[email protected]>
fix review
Signed-off-by: Ubuntu <[email protected]>
refactor review
Signed-off-by: Ubuntu <[email protected]>
wip
Signed-off-by: Ubuntu <[email protected]>
wip
Signed-off-by: Ubuntu <[email protected]>
refactor
Signed-off-by: Ubuntu <[email protected]>
refactor
Signed-off-by: Ubuntu <[email protected]>
Signed-off-by: Ubuntu <[email protected]>
Force-pushed from 87805c3 to 56766b2
/bot run
PR_Github #9686 [ run ] triggered by Bot
PR_Github #9686 [ run ] completed with state
# If we're dealing with a model name, perform a snapshot download to
# make sure we have a local copy of the model.
if checkpoint_path is None:
if bench_env.checkpoint_path is None:
Re-iterating the comment from the other PR:
@danielafrimi I ran into an issue with the previous code where the model is already downloaded and the code still tries to re-download, since bench_env.checkpoint_path is None even though bench_env.model contains the local path.
Can you please elaborate on why you think the current code is wrong?
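To make the concern concrete, a hedged sketch of how the check could avoid re-downloading when bench_env.model already points at a local directory (attribute names follow the snippet above; snapshot_download is the huggingface_hub helper):

from pathlib import Path

from huggingface_hub import snapshot_download

# Prefer an explicit checkpoint path; otherwise fall back to the model field,
# which may already be a local directory rather than a hub model name.
checkpoint_path = bench_env.checkpoint_path or bench_env.model
if not Path(checkpoint_path).is_dir():
    # Only hit the hub when there is no local copy yet.
    checkpoint_path = snapshot_download(bench_env.model)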
/bot run
PR_Github #9842 [ run ] triggered by Bot
PR_Github #9842 [ run ] completed with state
@rakib-hasan I'd like to get a review from you :)
Most of the changes are taken from your PR (I found only throughput support there).