Custom allreduce performance improvement

While comparing the performance of the custom allreduce kernel against vLLM, I found that these two lines in the custom allreduce kernel ([L1415](https://github.com/NVIDIA/TensorRT-LLM/blob/0d0583a639cb120f09ae4af50dd0722bdd60a5df/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu#L1415), [L1559](https://github.com/NVIDIA/TensorRT-LLM/blob/0d0583a639cb120f09ae4af50dd0722bdd60a5df/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu#L1559)) may cause observable latency, especially when CUDA graphs are enabled and the batch size is small. I think the cost of this index calculation is easy to eliminate.

Besides, I think adding `__launch_bounds__(512, 1)` to the [oneshot allreduce](https://github.com/NVIDIA/TensorRT-LLM/blob/d93a2dde84eada06ae2339b4fb4e6432167a1cfd/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu#L1306) kernel could also improve performance somewhat.
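To illustrate the `__launch_bounds__(512, 1)` suggestion, here is a minimal sketch of how the qualifier is attached to a kernel. The kernel `sumRankBuffersSketch`, the helper `runSketch`, and the flat buffer layout are hypothetical stand-ins for illustration only, not the actual TensorRT-LLM oneshot allreduce kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: 512 is the largest block size this kernel is ever launched with.
constexpr int kMaxThreadsPerBlock = 512;

// Hypothetical kernel: sums numRanks contiguous buffers of n floats each.
// __launch_bounds__(512, 1) tells the compiler that blocks never exceed
// 512 threads and that one resident block per SM is enough, so it can
// budget registers accordingly.
__global__ void __launch_bounds__(kMaxThreadsPerBlock, 1)
sumRankBuffersSketch(float const* in, float* out, int numRanks, int n)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)
    {
        return;
    }
    float acc = 0.f;
    for (int r = 0; r < numRanks; ++r)
    {
        acc += in[r * n + idx];
    }
    out[idx] = acc;
}

// Hypothetical host helper: fills the inputs with 1.0f, launches the kernel,
// and checks that every output element equals numRanks.
bool runSketch(int numRanks, int n)
{
    float* in = nullptr;
    float* out = nullptr;
    if (cudaMallocManaged(&in, sizeof(float) * numRanks * n) != cudaSuccess)
    {
        return false;
    }
    if (cudaMallocManaged(&out, sizeof(float) * n) != cudaSuccess)
    {
        cudaFree(in);
        return false;
    }
    for (int i = 0; i < numRanks * n; ++i)
    {
        in[i] = 1.0f;
    }

    int const blocks = (n + kMaxThreadsPerBlock - 1) / kMaxThreadsPerBlock;
    sumRankBuffersSketch<<<blocks, kMaxThreadsPerBlock>>>(in, out, numRanks, n);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
    {
        ok = (out[i] == static_cast<float>(numRanks));
    }
    cudaFree(in);
    cudaFree(out);
    return ok;
}

int main()
{
    bool const ok = runSketch(4, 1 << 16);
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Note that with this qualifier in place, a launch with more than 512 threads per block fails, so the bound has to match the largest block size the launcher actually uses. Whether it helps in practice depends on the register usage of the real kernel and would need to be measured.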

Custom allreduce performance improvement #2696
