Feature/sm free normal kernel #347

ZhiyiHu1999 · 2025-07-31T07:04:51Z

1. Motivation

In MOE training and the prefilling phase of inference, the current ring-based RDMA buffer implementation for normal kernels significantly wastes SM resources. Due to the small size of the ring buffer and its frequent reuse for token transmission, SMs are often stalled, continuously polling for RDMA buffer availability instead of performing useful computation. This inefficient resource usage severely limits overall system throughput.

To address this issue, this PR implements a SM-friendly buffer design that frees SMs from buffer polling duties. This design is inspired by the large RDMA buffer approach discussed in #39. By allocating a larger RDMA buffer in HBM, tokens can be moved from SMs to the RDMA buffer in one go. While the NIC handles transmission asynchronously in the background, SMs can immediately resume computation tasks. Once data transfer completes, SMs can process the received tokens without blocking.

2. Design

2.1. Feature

The SM Free mode decouples the execution phases of the native mode: when the user first launches internode dispatch/combine, only the send phase is executed and a recv hook is returned. The user can wait for the network transmission to complete and then launch the receive phase of the internode dispatch/combine using the recv hook.
Implementation

2.2.1. Principles

Full Compatibility: Seamlessly integrates with existing native communication modes, requiring minimal changes to existing codebases

2.2.2. Highlights

Mode Control via return_recv_hook: Introduces a user-controllable argument return_recv_hook to switch between native mode and hook mode.
RDMA Buffer Management:
- Retains the native buffer structure for ease of integration.
- Enlarges token capacity per RDMA recv chunk to fit all tokens in one batch.
- Provides a utility function get_normal_hook_rdma_size_hint() to help users estimate the minimum required RDMA buffer size.
Improved Compute Stream Utilization:
- In hook mode, the kernel runs on the compute stream, enabling more efficient offloading of data transfer tasks to the NIC and the network while freeing up SMs.
- Compared to native mode (which allocates 2 SMs per channel), hook mode maps one channel per SM, improving scalability and SM utilization.

3. Performance Evaluation

3.1. Experiment Setup

2/4 nodes, with 8 × H20 GPUs per node.
4096 tokens per batch, 7168 hidden, top-8 experts.

3.2. Effect

3.2.1. Estimated Performance (native mode → hook mode)

A GEMM operation of shape [4096, 4096] is used to overlap the network phase. It typically takes 4-5 μs in my system, sufficient for completing the token transmission.
For hook mode, the kernel execution time is the sum of the send kernel time and the receive kernel time, and the RDMA bandwidth is calculated as: RDMA Bandwidth = RDMA Bytes / Kernel Execution Time

Dispatch Kernel Execution Time & Bandwidth

#EP	FP8 Dispatch Kernel Execution Time	FP8 Dispatch RDMA Bandwidth	BF16 Dispatch Kernel Execution Time	BF16 Dispatch RDMA Bandwidth
16	1358 → 765 μs	44 → 79 GB/s	2530 → 1381 μs	46 → 85 GB/s
32	3535 → 884 μs	30 → 124 GB/s	6780 → 2468 μs	32 → 86 GB/s

Combine Kernel Execution Time & Bandwidth

#EP	BF16 Combine Kernel Execution Time	BF16 Combine RDMA Bandwidth
16	2590 → 1425 μs	45 → 82 GB/s
32	6850 → 2250 μs	31 → 81 GB/s

3.3. Cost

3.3.1. HBM Cost

The main HBM cost in hook mode comes from two sides:

RDMA Buffer:
rdma_buffer_size = num_max_dispatch_tokens_per_rank × hidden_size × size_of(element) × num_nodes × 2
The HBM cost from using the large RDMA buffer increases with num_max_dispatch_tokens_per_rank and num_nodes; for our experiment setup, the HBM cost for the RDMA buffer on each rank is about 270 MB.
NVLink Buffer:
nvl_buffer_size = num_max_nvl_chunked_recv_tokens × hidden_size × size_of(element) × num_nvl_peers × num_channels
HBM used for NVLink buffer increased with the number of SMs (channels) in the hook mode, but the number of tokens assigned to each channel is smaller, so the nvl_recv_chunk could be shrunk and the overall HBM cost for this part changed slightly.

3.3.2. Bandwidth Cost

For SM Free mode, the data movement in the recv phase of dispatch and the send phase of combine are inter-GPU through NVLink. So the respective kernel execution time is also limited by the NVLink bandwidth.

4. RoadMap

Code Refactor
Optimize HBM Usage
Implement Dynamic Resizing in RDMA Buffer
More TMA Optimization in Normal Hook mode

polarstormx · 2025-08-04T06:43:03Z

nvl_buffer_size = num_max_net_channel_recv_tokens × hidden_size × size_of(element) × num_gpus_per_rank × num_channels

Maybe you mean "num_gpus_per_node"?

sphish · 2025-08-13T06:52:06Z

The original implementation allows NVLink and RDMA transfers to be pipelined, enabling us to utilize both NVLink and RDMA bandwidth simultaneously. I think it is worthwhile to dedicate some SMs for this purpose.
In the next refactoring, we will try to minimize the usage of SMs.

…be on compute stream in hook mode

ZhiyiHu1999 marked this pull request as draft July 31, 2025 07:05

ZhiyiHu1999 marked this pull request as ready for review August 1, 2025 12:30

ZhiyiHu1999 force-pushed the feature/sm_free_normal_kernel branch from afcd1aa to 3676f96 Compare August 29, 2025 02:10

ZhiyiHu1999 force-pushed the feature/sm_free_normal_kernel branch from 3676f96 to a35f83b Compare September 11, 2025 09:42

ZhiyiHu1999 changed the base branch from main to antgroup-opt November 3, 2025 23:57

ZhiyiHu1999 force-pushed the feature/sm_free_normal_kernel branch from f3df4f1 to 7ab53e3 Compare November 8, 2025 16:28

zhiyi Hu and others added 9 commits November 8, 2025 17:32

First version of SM Free Normal Kernel for Internode Dispatch/Combine

971ee33

address conflicts due to TMA in Combine kNVLAndRDMAForwarder

26f06ac

remove duplicate codes due to faults in addressing conflicts

2f93b35

minor modifications and add an annotation

dc79091

minor modification

83e27a2

minor modifications

a5782dc

minor modifications

afa77d3

minor modifications

5e4093b

add arg return_recv_hook for get_dispatch_layout, so the kernel will …

0b1ccde

…be on compute stream in hook mode

ZhiyiHu1999 force-pushed the feature/sm_free_normal_kernel branch from 7ab53e3 to 0b1ccde Compare November 8, 2025 16:33

ZhiyiHu1999 marked this pull request as draft November 10, 2025 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature/sm free normal kernel #347

Feature/sm free normal kernel #347

ZhiyiHu1999 commented Jul 31, 2025 •

edited by sphish

Loading

Uh oh!

polarstormx commented Aug 4, 2025 •

edited

Loading

Uh oh!

sphish commented Aug 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Feature/sm free normal kernel #347

Are you sure you want to change the base?

Feature/sm free normal kernel #347

Conversation

ZhiyiHu1999 commented Jul 31, 2025 • edited by sphish Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Motivation

2. Design

2.1. Feature

2.2.1. Principles

2.2.2. Highlights

3. Performance Evaluation

3.1. Experiment Setup

3.2. Effect

3.2.1. Estimated Performance (native mode → hook mode)

Dispatch Kernel Execution Time & Bandwidth

Combine Kernel Execution Time & Bandwidth

3.3. Cost

3.3.1. HBM Cost

3.3.2. Bandwidth Cost

4. RoadMap

Uh oh!

polarstormx commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sphish commented Aug 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ZhiyiHu1999 commented Jul 31, 2025 •

edited by sphish

Loading

polarstormx commented Aug 4, 2025 •

edited

Loading