Higher Performance with Lower SM Occupancy through Zero-Copy and TMA Offloading #453

monethuang1 · 2025-10-14T06:52:28Z

The original Internode Normal Kernel suffers from high GPU SM utilization and underutilized interconnect bandwidth, which constrains prefill performance.
In our optimized version, we apply buffer fusion and TMA offloading to enable true zero-copy communication and maximize NVLink bandwidth usage.

Evaluation on H20 clusters shows significant gains:

- With EP=16, performance (dispatch(FP8) / dispatch(BF16) / combine) improved from 76.50 / 84.05 / 62.50 to 89.46 / 91.82 / 82.27.
- With EP=32, performance (dispatch(FP8) / dispatch(BF16) / combine) improved from 59.95 / 61.33 / 61.24 to 62.53 / 63.24 / 62.55.
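To put the before/after figures on a common scale, the relative gain of each operation can be computed as below. This is a small sketch for readers, not part of the PR: the names `RESULTS` and `relative_gain_pct` are hypothetical, and because the description does not state a unit, only unitless ratios are derived.

```python
# Figures copied from the evaluation above: (before, after) per operation.
# The PR description does not state a unit, so only relative gains are computed.
RESULTS = {
    "EP=16": {
        "dispatch(FP8)": (76.50, 89.46),
        "dispatch(BF16)": (84.05, 91.82),
        "combine": (62.50, 82.27),
    },
    "EP=32": {
        "dispatch(FP8)": (59.95, 62.53),
        "dispatch(BF16)": (61.33, 63.24),
        "combine": (61.24, 62.55),
    },
}


def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for config, ops in RESULTS.items():
        for op, (before, after) in ops.items():
            print(f"{config} {op}: +{relative_gain_pct(before, after):.1f}%")
```

By this measure the largest gain is in combine at EP=16 (about 31.6%), while the EP=32 gains fall between roughly 2% and 4%.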

Additionally, SM occupancy was reduced by up to 66.7%. The optimized kernel uses only 12 SMs for EP=16 and 8 SMs for EP=32, compared to 24 SMs in the original version.
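The "up to 66.7%" figure follows directly from the SM counts quoted above. A minimal check, using the hypothetical helper name `sm_reduction_pct` (not from the PR):

```python
# SM counts as stated in the description above.
BASELINE_SMS = 24
OPTIMIZED_SMS = {"EP=16": 12, "EP=32": 8}


def sm_reduction_pct(baseline: int, optimized: int) -> float:
    """Percentage of SMs freed relative to the baseline kernel."""
    return (baseline - optimized) / baseline * 100.0


if __name__ == "__main__":
    for config, sms in OPTIMIZED_SMS.items():
        print(f"{config}: {sms} SMs, {sm_reduction_pct(BASELINE_SMS, sms):.1f}% fewer")
```

The EP=16 configuration frees half of the SMs, and the EP=32 configuration accounts for the 66.7% maximum.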

Co-authored-by: Xingyi Li <[email protected]> Co-authored-by: Xiaojie Huang <[email protected]>


Zhehao Lin and others added 26 commits October 10, 2025 14:02

Init zero-copy internode dispatch/combine

be77c85

Co-authored-by: Xingyi Li <[email protected]> Co-authored-by: Xiaojie Huang <[email protected]>

Use 1 SM per channel for cached_notify

8ffa6f6

add combine tma alignment and fix rdma receivers num

1b38673

Reduce SMEM usage

6dd446b

Support multiple buffers for zero-copy

79d6ef5

Add assertion against too large recv token count

6c37dfa

Internal commit ac9360c465a4074dd913b885e394b43e1135d986.

Use separate sizes for Dispatch/Combine input buffers

5e5bd79

Based on internal commit 6943948bd2474b3f36e03de6d1cfed839f199831.

Ensure NVLink fused buffers are properly aligned by reordering NVL buffer segments

875a5a6

Cleanup TMA-related code (1)

9e52f9b

Cleanup TMA-related code (2)

c4eef60

Don't change NVSHMEM_CUMEM_GRANULARITY

70ebc82

Cleanup TMA-related code (3)

6361959

Use elect_one_sync()

924dd73

Add assertion against unsupported cases

a938d18

Fix src_meta size

dd3b606

Increase combine SMEM size for 8K hidden_size

a98155d

Error when hidden size is too large

ef3db42

Improve comments

2681c4b

Fix NVL buffer size hint

b34e7c1

Fix NVL buffer size hint

12a05a3

Restore .gitignore

0c3e509

Minor format

9a19cc2

More minor

bc398dc

Minor (comment)

f434762

Fix build for 32-bit topk idx

4483ae5

Fix calc_diff for combine

6db603e

LyricZhao mentioned this pull request Oct 21, 2025

Add Tencent's zero-copy branch into README #463

Merged
