[Perf] Conditionally enable SWAP AB for speculative decoding #5403

Open
@zoheth

Description

In speculative decoding, the draft model commonly handles only a small number of tokens per step. When q_seq_len * head_group_size is small, enabling SWAP AB provides a considerable performance improvement.

The performance results on H20 are shown in the chart below.
[Image: SWAP AB performance results on H20]
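
To make the condition above concrete, here is a minimal sketch of how such a gate might look. It is not the TensorRT-LLM implementation. It assumes that the query rows (q_seq_len * head_group_size) sit on the M dimension of the matrix multiply, that M is padded to 64 (as with Hopper WGMMA instructions), and that after swapping A and B the same rows sit on the N dimension, which is padded only to a multiple of 8. The names `should_swap_ab`, `row_utilization`, and the default threshold are hypothetical and would need tuning against real measurements.

```python
import math

# Assumed tile granularities (illustrative, modeled on Hopper WGMMA shapes):
# M of one MMA instruction is fixed at 64, N can be any multiple of 8.
MMA_M = 64
MMA_N_STEP = 8


def padded(size: int, step: int) -> int:
    """Round size up to the next multiple of step."""
    return math.ceil(size / step) * step


def row_utilization(q_seq_len: int, head_group_size: int, swap_ab: bool) -> float:
    """Fraction of the padded query-row dimension that holds real data."""
    rows = q_seq_len * head_group_size
    # With SWAP AB the query rows move from the M dimension to the N dimension.
    step = MMA_N_STEP if swap_ab else MMA_M
    return rows / padded(rows, step)


def should_swap_ab(q_seq_len: int, head_group_size: int,
                   threshold: int = MMA_M) -> bool:
    """Hypothetical gate: swap when the query rows do not fill one M tile."""
    return q_seq_len * head_group_size < threshold


if __name__ == "__main__":
    # Example: 4 draft tokens, 4 query heads per KV head -> 16 rows.
    print(should_swap_ab(4, 4))
    print(row_utilization(4, 4, swap_ab=False))
    print(row_utilization(4, 4, swap_ab=True))
```

Under these assumptions, 16 query rows fill only a quarter of a 64-row M tile, whereas after the swap they fit the N dimension without padding. That is one plausible reason the gain shows up only when q_seq_len * head_group_size is small.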

Labels

Community want to contribute: PRs initiated from Community
Investigating
Performance: TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.
triaged: Issue has been triaged by maintainers
