Description
While benchmarking the custom allreduce kernel against vLLM, I found that these lines in the custom allreduce kernel can add observable latency, especially when CUDA graph is enabled and the batch size is small. I think the cost of this index calculation is easy to save.
Besides, I think adding __launch_bounds__(512, 1) to the one-shot allreduce kernel could also improve performance somewhat.
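To make the second suggestion concrete, here is a minimal sketch of how the qualifier would be attached. It is not the project's actual kernel: `oneshot_allreduce_sketch`, `RankPtrs`, `kMaxRanks`, and `run_oneshot_sketch` are hypothetical names, and the "ranks" are just two buffers on a single device. The loop stride is also computed once outside the loop, as a generic illustration of keeping index arithmetic off the hot path.

```cuda
#include <cuda_runtime.h>

constexpr int kMaxRanks = 8;    // hypothetical upper bound on ranks
constexpr int kThreads  = 512;  // block size promised to the compiler

// Hypothetical container for each rank's input buffer pointer.
struct RankPtrs { const float* ptrs[kMaxRanks]; };

// __launch_bounds__(512, 1): at most 512 threads per block, and at least
// 1 resident block per SM, so the compiler can budget registers for that.
__global__ void __launch_bounds__(kThreads, 1)
oneshot_allreduce_sketch(RankPtrs in, float* out, int world_size, int n) {
  // Stride is computed once, outside the loop.
  const int stride = gridDim.x * blockDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    float acc = 0.f;
    for (int r = 0; r < world_size; ++r) acc += in.ptrs[r][i];
    out[i] = acc;
  }
}

// Host-side check: sums two buffers (all 1s and all 2s) and expects 3s.
bool run_oneshot_sketch() {
  const int n = 1 << 12;
  const int world_size = 2;
  float *a = nullptr, *b = nullptr, *out = nullptr;
  if (cudaMallocManaged(&a, n * sizeof(float)) != cudaSuccess) return false;
  if (cudaMallocManaged(&b, n * sizeof(float)) != cudaSuccess) return false;
  if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) return false;
  for (int i = 0; i < n; ++i) { a[i] = 1.f; b[i] = 2.f; out[i] = 0.f; }

  RankPtrs in{};
  in.ptrs[0] = a;
  in.ptrs[1] = b;

  // The block size must not exceed the launch bound, or the launch fails.
  oneshot_allreduce_sketch<<<4, kThreads>>>(in, out, world_size, n);
  bool ok = (cudaDeviceSynchronize() == cudaSuccess);
  for (int i = 0; i < n && ok; ++i) ok = (out[i] == 3.f);

  cudaFree(a);
  cudaFree(b);
  cudaFree(out);
  return ok;
}
```

Calling `run_oneshot_sketch()` from `main` exercises the kernel. Whether the bound actually helps depends on the real kernel's register usage, so it would need to be measured.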