Add HSTU in fbgemm_gpu/experimental/ #4090
Conversation
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/1133

Remove `sm_100` and `sm_120` from the architectures list and keep just `sm_100a` and `sm_120a` instead, to enable compilation of the FP4 CUTLASS quantization kernels (pytorch#4004), since with the plain architectures we run into the following error:

```
Instruction 'cvt with .e2m1x2' not supported on .target 'sm_100'
```

Pull Request resolved: pytorch#4024

Reviewed By: spcyppt

Differential Revision: D73901832

Pulled By: q10

fbshipit-source-id: 690c58b214aee80374e43a93bf39fe70e430da9a
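As a rough illustration of what this architecture-list change looks like on the build side (a sketch only; the actual flag plumbing in FBGEMM's build scripts is not shown here), the family-specific `a` variants are selected instead of the base architectures:

```cmake
# Illustrative sketch, not the actual FBGEMM build file.
# Instructions such as 'cvt with .e2m1x2' used by the FP4 CUTLASS
# quantization kernels are only available on the family-specific
# architecture variants, so target sm_100a / sm_120a rather than
# the plain sm_100 / sm_120.
set(CMAKE_CUDA_ARCHITECTURES "100a;120a")  # instead of "100;120"
```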
Hi @jiayus-nvidia! Thank you for your pull request and welcome to our community.

**Action Required:** In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process:** In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations, and afterwards the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
✅ Deploy Preview for pytorch-fbgemm-docs ready!
@ionuthristodorescu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This is fantastic @jiayus-nvidia! We really appreciate your contribution. Just for my understanding, do you have any benchmarking information on these new kernels that you can share?
Hi @jwfromm, below is some benchmarking information. The first four figures show the fwd and bwd kernels with rab (and drab), compared against a generative-recommenders commit from around last November, so the Triton kernel's performance has likely improved considerably since then. The last two figures show the fwd and bwd kernels without rab. I haven't had a chance to measure the latest version of the Triton kernel yet, so for now I am providing our kernel's performance for reference. If you need performance data for more dimensions or sequence lengths, feel free to let me know.
Summary: A CUTLASS-based HSTU, covering both the forward and backward passes on both Ampere and Hopper, is added to fbgemm_gpu/experimental/.
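For readers unfamiliar with HSTU, a schematic NumPy reference of the pointwise-SiLU attention that these kernels accelerate (with an optional relative attention bias, `rab`) might look like the sketch below. This illustrates the math only, under the usual HSTU formulation from the generative-recommenders work; it is not the fbgemm_gpu API, and the real kernels' masking and normalization details may differ:

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def hstu_attention_reference(q, k, v, rab=None, causal=True):
    """Schematic HSTU-style attention: pointwise SiLU instead of softmax.

    q, k, v: (seq_len, dim) arrays; rab: optional (seq_len, seq_len)
    relative attention bias added to the raw scores before the SiLU.
    """
    n, _ = q.shape
    scores = q @ k.T                 # (n, n) raw attention scores
    if rab is not None:
        scores = scores + rab        # add relative attention bias
    attn = silu(scores) / n          # pointwise SiLU, length-normalized
    if causal:
        attn = np.tril(attn)         # mask out future positions
    return attn @ v                  # (n, dim) output

# Tiny smoke test with random inputs.
rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
rab = rng.standard_normal((n, n))
out = hstu_attention_reference(q, k, v, rab)
print(out.shape)  # (8, 4)
```

The backward kernels mentioned in the summary additionally produce the gradient with respect to `rab` (drab) when the bias is used, which is why the benchmarks above report the with-rab and without-rab cases separately.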