Custom allreduce performance improvement #2696

Open
@yizhang2077

Description

@yizhang2077

When benchmarking the custom allreduce kernel against vLLM, I found that these lines in the custom allreduce kernel can cause observable latency, especially when CUDA graph is enabled and the batch size is small. I think the cost of the index calculation can easily be saved.
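
As a rough sketch of the kind of change I mean (the kernel and variable names below are hypothetical, not the actual kernel code): hoist the index arithmetic out of the per-element loop so it is computed once per thread rather than on every iteration.

```cuda
#include <cuda_runtime.h>

// Hypothetical one-shot allreduce loop; not the actual TRT-LLM/vLLM kernel.
__global__ void one_shot_all_reduce_sketch(float* __restrict__ out,
                                           float* const* __restrict__ peer_ptrs,
                                           int world_size, int num_elems) {
  // Index arithmetic computed once per thread, outside the element loop.
  const int stride = gridDim.x * blockDim.x;
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  for (; idx < num_elems; idx += stride) {
    float acc = 0.0f;
    // The same precomputed idx is reused for every rank's buffer, so no
    // per-element offset recomputation is needed.
    for (int r = 0; r < world_size; ++r) {
      acc += peer_ptrs[r][idx];
    }
    out[idx] = acc;
  }
}
```

With CUDA graph capture and small batch sizes the kernel body is very short, so even a few extra arithmetic instructions per element show up in the end-to-end latency.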

In addition, I think adding __launch_bounds__(512, 1) to the one-shot allreduce kernel can also improve performance.
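
For reference, __launch_bounds__(512, 1) is placed on the kernel definition: the first argument is the maximum number of threads per block the kernel will be launched with, and the second is the minimum number of blocks the compiler should assume can reside on an SM, which lets it budget registers more aggressively. A minimal sketch with a placeholder kernel name and signature:

```cuda
// Placeholder kernel, not the actual TRT-LLM/vLLM symbol; only the placement
// of the __launch_bounds__ annotation matters here.
__global__ void __launch_bounds__(512, 1)  // <= 512 threads/block, >= 1 block/SM
one_shot_all_reduce_kernel(float* __restrict__ out, const float* __restrict__ in,
                           int num_elems) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_elems) out[i] = in[i];
}
```

The annotation only helps if the launch configuration actually matches it, i.e. the kernel is launched with at most 512 threads per block.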

Labels: Customized Kernels (Specialized/modified CUDA kernels in TRT-LLM for LLM ops, beyond standard TRT; dev & perf), Investigating, triaged (Issue has been triaged by maintainers)