Enable certain CUDA kernels to accept specified cuda stream #1330
Dear @jeejeelee, Really cool, we weren't aware vLLM uses cudagraph. Just looked over this with Tim and overall, especially given the performance benefits this may have, this is a very strong contribution, thanks! I checked out your branch and tried running the tests, but do get the below segfault, which doesn't happen on
Please also be sure to install the pre-commit hooks 🤗
@Titus-von-Koeller, thank you for the feedback. I've corrected the error mentioned above, and I'm now verifying whether all the unit tests pass.
On my machine with a 3090 GPU, my test results are as follows:
All tests in
3617b6e to 49ffcdc
@Titus-von-Koeller please review again, thanks~
@danielhanchen I believe you're directly calling some of these C-API functions in Unsloth, so I want to make sure you've got a heads up here, since this changes their signatures.
@jeejeelee Thank you for the contribution! The only nit I have is the one that I noted about using … A few test failures in test_kbit_backprop and test_gemv_4bit are OK and not related to this PR. I see similar results on my 4090. The generation tests passed for me. Looks nice!
Super thanks for the heads up!! Yep, we use the C API directly!
I'll be off until Monday; @matthewdouglas will be taking the lead. Thanks both!
a685654 into bitsandbytes-foundation:main
…ytes-foundation#1330) * Done * fix format * fix format * fix format * fix format * Address format error and fix default arg bug * Refine stream argument passing mechanism * Fix bug * Delete unused code
FIX #1308

By passing a specified `stream` to certain kernel functions, `cudagraph` can correctly capture these kernels. This enables the downstream repo `vLLM` to run inference in cudagraph mode, resulting in significant speed improvements for BNB models.

ping @matthewdouglas @Titus-von-Koeller @TimDettmers
cc @chenqianfzh
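
For readers unfamiliar with why the stream matters, here is a minimal sketch of the general idea, not the code from this PR. CUDA stream capture only records work submitted to the capturing stream, so a wrapper that always launches on the default stream cannot be captured into a graph. The kernel `scale` and the wrapper `scale_on_stream` are hypothetical names made up for illustration; only the CUDA runtime calls are real (`cudaGraphInstantiateWithFlags` assumes CUDA 11.4 or newer).

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for a library kernel.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical C-API style wrapper: the caller supplies the stream
// instead of the wrapper implicitly using the default stream.
extern "C" void scale_on_stream(float* x, float a, int n, cudaStream_t stream) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, stream>>>(x, a, n);
}

int main() {
    const int n = 1024;
    std::vector<float> h_x(n, 1.0f);
    float* d_x = nullptr;
    CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK(cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Record the launch into a graph; nothing executes during capture.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    scale_on_stream(d_x, 2.0f, n, stream);
    CHECK(cudaStreamEndCapture(stream, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Replay the captured work three times; each replay doubles the data.
    for (int i = 0; i < 3; ++i) CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaMemcpy(h_x.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("x[0] = %.1f\n", h_x[0]);  // 1.0 doubled three times is 8.0

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_x));
    return 0;
}
```

If `scale_on_stream` ignored its `stream` argument and launched on the default stream, the launch would not be recorded into the graph being captured on `stream` and could invalidate the capture. That is the general reason a caller such as a framework running in cudagraph mode needs to hand its own stream to the kernel wrappers.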