Add HSTU in fbgemm_gpu/experimental/ #4090


Open

jiayus-nvidia wants to merge 12 commits into main

Conversation

jiayus-nvidia

Summary: A CUTLASS-based HSTU implementation, with both forward and backward kernels for both Ampere and Hopper, is added to fbgemm_gpu/experimental/.

jiayus-nvidia and others added 3 commits April 30, 2025 02:04
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1133

- Remove sm_100 and sm_120 from the architectures list and keep just sm_100a and sm_120a instead, to enable compilation of the FP4 CUTLASS quantization kernels (pytorch#4004), since we are running into the following error:

```
Instruction 'cvt with .e2m1x2' not supported on .target 'sm_100'
```
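
The "a" suffix selects the architecture-specific target, which unlocks instructions (such as the FP4 cvt above) that the plain sm_100 target does not guarantee. A minimal sketch of the compile-time guard this implies, assuming nvcc exposes a feature macro named __CUDA_ARCH_FEAT_SM100_ALL for sm_100a (mirroring the documented __CUDA_ARCH_FEAT_SM90_ALL for sm_90a):

```
// Sketch of the arch-specific guard pattern. Assumption: nvcc defines
// __CUDA_ARCH_FEAT_SM100_ALL only when compiling for sm_100a, analogous
// to __CUDA_ARCH_FEAT_SM90_ALL for sm_90a. Instructions like the FP4
// 'cvt with .e2m1x2' are only legal under the "a" target, so plain
// sm_100 builds must be excluded at compile time.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ == 1000) && \
    defined(__CUDA_ARCH_FEAT_SM100_ALL)
// sm_100a: the e2m1x2 (FP4) conversion path can be compiled here.
#else
// Plain sm_100 (or other targets): fall back to a non-FP4 path.
#endif
```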

Pull Request resolved: pytorch#4024

Reviewed By: spcyppt

Differential Revision: D73901832

Pulled By: q10

fbshipit-source-id: 690c58b214aee80374e43a93bf39fe70e430da9a
@facebook-github-bot
Contributor

Hi @jiayus-nvidia!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


netlify bot commented May 7, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 072c347
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/682c5efacc7bff0008d78283
😎 Deploy Preview: https://deploy-preview-4090--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

@ionuthristodorescu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jwfromm
Contributor

jwfromm commented May 19, 2025

This is fantastic @jiayus-nvidia! We really appreciate your contribution. Just for my understanding, do you have any benchmarking information on these new kernels you can share?

@jiayus-nvidia
Author

Hi @jwfromm, below is some benchmarking information. The first four figures show the fwd and bwd kernels with rab (and drab), compared against a commit of generative-recommenders from around last November, so the Triton kernel's performance has likely improved considerably since then. The last two figures show the fwd and bwd kernels without rab. I haven't had a chance to measure the latest version of the Triton kernel yet, so for now I'm providing our kernel's performance for reference. If you need performance data for more dimensions or sequence lengths, feel free to let me know. (A minimal timing-harness sketch follows the figures below.)

[Figures: A100_rab_bwd, A100_rab_fwd, H100_rab_bwd, H100_rab_fwd, A100_no_rab_fwd_bwd, H100_no_rab_fwd_bwd]
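
For reference, kernel timings like those in the figures are typically collected by wrapping repeated launches in CUDA events. Below is a minimal, self-contained sketch of such a harness; hstu_fwd_stub is a hypothetical stand-in for a kernel launch, not the actual HSTU kernel added in this PR:

```
// Minimal CUDA event-timing sketch. hstu_fwd_stub is a hypothetical
// placeholder; a real measurement would launch the HSTU forward (or
// backward) kernel with its actual arguments here.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hstu_fwd_stub() { /* placeholder kernel body */ }

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm-up launch so one-time initialization is excluded from timing.
  hstu_fwd_stub<<<1, 128>>>();
  cudaDeviceSynchronize();

  const int iters = 100;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    hstu_fwd_stub<<<1, 128>>>();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg kernel time: %.4f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
```

Compile with nvcc for the target architecture and run on the target GPU; backward-pass timing would follow the same pattern.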
