monethuang1

The original internode normal kernel occupies a large number of GPU SMs while leaving interconnect bandwidth underutilized, which constrains prefill performance.
In our optimized version, we apply buffer fusion and TMA offloading to enable true zero-copy communication and make fuller use of NVLink bandwidth.

Evaluation on H20 clusters shows significant gains:

  • With EP=16, dispatch (FP8) / dispatch (BF16) / combine performance improved from 76.50 / 84.05 / 62.50 to 89.46 / 91.82 / 82.27.
  • With EP=32, dispatch (FP8) / dispatch (BF16) / combine performance improved from 59.95 / 61.33 / 61.24 to 62.53 / 63.24 / 62.55.

Additionally, the number of SMs occupied by the kernel was reduced by up to 66.7%. The optimized kernel uses only 12 SMs for EP=16 (a 50% reduction) and 8 SMs for EP=32 (a 66.7% reduction), compared to 24 SMs in the original version.
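
To make the relative gains easier to compare, the figures above can be turned into percentages with a short script. This is only an illustrative sketch: the numbers are copied from this description, the names (`RESULTS`, `pct_change`, `summarize`) are hypothetical and not part of the kernel code, and the benchmark unit is left as reported.

```python
# Relative change derived from the figures quoted in this description.
# All names here are illustrative; nothing below is part of the kernel itself.

def pct_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0

# EP size -> operation -> (original, optimized), as reported above.
RESULTS = {
    16: {
        "dispatch_fp8": (76.50, 89.46),
        "dispatch_bf16": (84.05, 91.82),
        "combine": (62.50, 82.27),
    },
    32: {
        "dispatch_fp8": (59.95, 62.53),
        "dispatch_bf16": (61.33, 63.24),
        "combine": (61.24, 62.55),
    },
}

SMS_ORIGINAL = 24                  # SMs used by the original kernel
SMS_OPTIMIZED = {16: 12, 32: 8}    # SMs used by the optimized kernel, per EP size

def summarize():
    """Compute per-operation gains and per-EP SM savings, both in percent."""
    gains = {
        ep: {op: pct_change(old, new) for op, (old, new) in ops.items()}
        for ep, ops in RESULTS.items()
    }
    sm_savings = {
        ep: -pct_change(SMS_ORIGINAL, sms) for ep, sms in SMS_OPTIMIZED.items()
    }
    return gains, sm_savings

if __name__ == "__main__":
    gains, sm_savings = summarize()
    for ep in sorted(gains):
        for op, gain in gains[ep].items():
            print(f"EP={ep} {op}: {gain:+.2f}%")
        print(f"EP={ep} SMs saved: {sm_savings[ep]:.1f}%")
```

Read this way, the largest relative gain is on combine at EP=16 (roughly 32%), while the EP=32 gains are in the low single digits even though that configuration frees the most SMs.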
