Add HSTU in fbgemm_gpu/experimental/ #4090


Open

jiayus-nvidia wants to merge 12 commits into main

Conversation

jiayus-nvidia

Summary: A CUTLASS-based HSTU implementation, with both forward and backward kernels for both Ampere and Hopper, is added to fbgemm_gpu/experimental/.

jiayus-nvidia and others added 3 commits April 30, 2025 02:04
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1133

- Remove sm_100 and sm_120 from the architectures list and keep just sm_100a and sm_120a instead, to enable compilation of the FP4 CUTLASS quantization kernels (pytorch#4004), since we are running into the following error:

```
Instruction 'cvt with .e2m1x2' not supported on .target 'sm_100'
```
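
The "a" suffix selects the architecture-specific target, which unlocks instructions (such as the FP4 cvt above) that the plain sm_100 target does not guarantee. A minimal sketch of the compile-time guard this implies, assuming nvcc exposes a feature macro named __CUDA_ARCH_FEAT_SM100_ALL for sm_100a (mirroring the documented __CUDA_ARCH_FEAT_SM90_ALL for sm_90a):

```
// Sketch of the arch-specific guard pattern. Assumption: nvcc defines
// __CUDA_ARCH_FEAT_SM100_ALL only when compiling for sm_100a, analogous
// to __CUDA_ARCH_FEAT_SM90_ALL for sm_90a. Instructions like the FP4
// 'cvt with .e2m1x2' are only legal under the "a" target, so plain
// sm_100 builds must be excluded at compile time.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ == 1000) && \
    defined(__CUDA_ARCH_FEAT_SM100_ALL)
// sm_100a: the e2m1x2 (FP4) conversion path can be compiled here.
#else
// Plain sm_100 (or other targets): fall back to a non-FP4 path.
#endif
```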

Pull Request resolved: pytorch#4024

Reviewed By: spcyppt

Differential Revision: D73901832

Pulled By: q10

fbshipit-source-id: 690c58b214aee80374e43a93bf39fe70e430da9a
@facebook-github-bot
Contributor

Hi @jiayus-nvidia!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


netlify bot commented May 7, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 072c347
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/682c5efacc7bff0008d78283
😎 Deploy Preview: https://deploy-preview-4090--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

@ionuthristodorescu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jwfromm
Contributor

jwfromm commented May 19, 2025

This is fantastic @jiayus-nvidia! We really appreciate your contribution. Just for my understanding, do you have any benchmarking information on these new kernels you can share?

@jiayus-nvidia
Author

Hi @jwfromm, below is some benchmarking information. The first four figures show the fwd and bwd kernels with rab (and drab), compared against a commit of generative-recommenders from around last November, so the Triton kernel's performance has likely improved considerably since then. The last two figures show the fwd and bwd kernels without rab. I haven't had a chance to measure the latest version of the Triton kernel yet, so for now I'm providing our kernel's performance for reference. If you need performance data for more dimensions or sequence lengths, feel free to let me know. (A minimal timing-harness sketch follows the figures below.)

[Figures: A100_rab_bwd, A100_rab_fwd, H100_rab_bwd, H100_rab_fwd, A100_no_rab_fwd_bwd, H100_no_rab_fwd_bwd]
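
For reference, kernel timings like those in the figures are typically collected by wrapping repeated launches in CUDA events. Below is a minimal, self-contained sketch of such a harness; hstu_fwd_stub is a hypothetical stand-in for a kernel launch, not the actual HSTU kernel added in this PR:

```
// Minimal CUDA event-timing sketch. hstu_fwd_stub is a hypothetical
// placeholder; a real measurement would launch the HSTU forward (or
// backward) kernel with its actual arguments here.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hstu_fwd_stub() { /* placeholder kernel body */ }

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm-up launch so one-time initialization is excluded from timing.
  hstu_fwd_stub<<<1, 128>>>();
  cudaDeviceSynchronize();

  const int iters = 100;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    hstu_fwd_stub<<<1, 128>>>();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg kernel time: %.4f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
```

Compile with nvcc for the target architecture and run on the target GPU; backward-pass timing would follow the same pattern.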
