IFU master 2024 03 28 #17
Conversation
* Fix `max_memory` example on README - The new `max_memory` syntax expects a dictionary - This change also accounts for multiple devices * Fix model name in `from_pretrained` on README
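For context, a hedged sketch of the dictionary-style `max_memory` usage that commit refers to (the model name and memory limits here are illustrative, not taken from the README):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",          # illustrative model name
    device_map="auto",
    load_in_8bit=True,
    # `max_memory` takes a dict mapping each device (GPU index or "cpu") to a limit,
    # which also covers multi-GPU setups.
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"},
)
```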
* fix library loading Signed-off-by: Won-Kyu Park <[email protected]> * fixed library loading * use os.pathsep * use glob(), search CUDA_PATH * call find_file_recursive() without ext --------- Signed-off-by: Won-Kyu Park <[email protected]> Co-authored-by: James Wyatt <[email protected]>
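As a rough illustration of the search strategy that commit describes (a hypothetical helper; the real `find_file_recursive()` in bitsandbytes may behave differently):

```python
import glob
import os

def find_cuda_libraries(pattern: str = "*cudart*") -> list:
    """Hypothetical sketch: split path-style env vars on os.pathsep and glob each entry."""
    matches = []
    for var in ("CUDA_PATH", "LD_LIBRARY_PATH"):
        for entry in os.environ.get(var, "").split(os.pathsep):
            if entry:
                matches.extend(glob.glob(os.path.join(entry, "**", pattern), recursive=True))
    return matches
```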
* Fix erroneous type aliasing * Fix `Optional` typings (see PEP 484) * Add Mypy ignores * Fix Mypy complaints for method tables * Fix type for get_ptr * Fix various Mypy errors * Fix missed call to is_triton_available
* Adjust Ruff configuration
* do not autofix always
* be less strict around tests and benchmarks
* adjust ignores for now
* Ruff: autofix I and F401
* Apply ruff autofixes
* Fix RUF013 complaint
* Fix mutable default in replace_linear
* Don't use bare except
* Wrap bitsandbytes.__main__ entrypoint in function; fix "sensible" typo
* Fix ruff B008 (function call in arguments)
* Add ruff noqas as suitable
* Fix RUF005 (splat instead of concatenating)
* Fix B018 (useless expression)
* Add pre-commit configuration + GitHub Actions lint workflow
* Fix unused `e` in bitsandbytes/__main__.py
* fix merge conflict resolution error
* run pre-commit hook

---------

Co-authored-by: Titus <[email protected]>
based on @Jamezo97 and @acpopescu work manually cherry-picked from PR bitsandbytes-foundation#788 and PR bitsandbytes-foundation#229 and cleanup by wkpark Signed-off-by: Won-Kyu Park <[email protected]>
…n#1000) `out_order` is the global parametrization list, not the test fixture argument
* test_nvidia_transform: fix variable reference `out_order` is the global parametrization list, not the test fixture argument * Make `parametrize` use more idiomatic * Use a more deterministic helper for `dim*` determination * Convert NO_CUBLASLT errors into skips too * Mark slow and benchmark tests as such (allows `-k "not benchmark"`)
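A minimal pytest sketch of the distinction that fix is about (names are illustrative, not the actual test code):

```python
import pytest

# Module-level list of parametrization values (illustrative).
out_orders = ["row", "col", "col32"]

@pytest.mark.parametrize("out_order", out_orders)
def test_transform(out_order):
    # Inside the test body, use the per-test argument `out_order`,
    # not the module-level list it was parametrized from.
    assert out_order in out_orders
```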
…esbelkada-patch-2 Create upload_pr_documentation.yml
…n-fix minimal patch to fix Windows compilation issues
* fix project name and add lib prefix for win32 (2024/01/31) * set LIBRARY_OUTPUT_DIRECTORY property Co-authored-by: Won-Kyu Park <[email protected]>
* build matrix for ubuntu + python 3.10, 3.11 + cuda 11.8 + 12.1 (windows is disabled for now) * add environment-bnb.yml for building * more fixes suggested by @akx (2024/01/30) * use python -m build --wheel suggested by @akx Co-authored-by: Aarni Koskela <[email protected]>
* add a comment suggested by @akx (2024/01/30) Co-authored-by: Aarni Koskela <[email protected]>
Cmake + workflows
* Add CUDA 12.4 download to utility script, docs * (ci) Add CUDA 12.4.0 build to workflow * Apply ruff format to install_cuda.py
Updates the requirements on [einops](https://github.com/arogozhnikov/einops), [wheel](https://github.com/pypa/wheel), [lion-pytorch](https://github.com/lucidrains/lion-pytorch) and [scipy](https://github.com/scipy/scipy) to permit the latest version.

Updates `einops` from 0.6.0 to 0.7.0
- [Release notes](https://github.com/arogozhnikov/einops/releases)
- [Commits](arogozhnikov/einops@v0.6.0...v0.7.0)

Updates `wheel` to 0.43.0
- [Release notes](https://github.com/pypa/wheel/releases)
- [Changelog](https://github.com/pypa/wheel/blob/main/docs/news.rst)
- [Commits](pypa/wheel@0.40.0...0.43.0)

Updates `lion-pytorch` from 0.0.6 to 0.1.2
- [Release notes](https://github.com/lucidrains/lion-pytorch/releases)
- [Commits](lucidrains/lion-pytorch@0.0.6...0.1.2)

Updates `scipy` from 1.11.4 to 1.12.0
- [Release notes](https://github.com/scipy/scipy/releases)
- [Commits](scipy/scipy@v1.11.4...v1.12.0)

---
updated-dependencies:
- dependency-name: einops
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: minor-patch
- dependency-name: wheel
  dependency-type: direct:development
  dependency-group: minor-patch
- dependency-name: lion-pytorch
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: minor-patch
- dependency-name: scipy
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: minor-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@pnunna93 Looks like some linter issues and some README conflicts are present. Could you check those?
LGTM.
CMakeLists.txt (Outdated)
if(NO_CUBLASLT)
    target_compile_definitions(bitsandbytes PUBLIC NO_HIPBLASLT)
else()
    find_package(hipblaslt)
    target_link_libraries(bitsandbytes PUBLIC roc::hipblaslt)
endif()
It would be preferable if we just always linked hipblaslt, if possible. I see we already have supports_igemmlt() always returning True when torch.version.hip is set.
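Roughly, the behavior being referred to (a simplified sketch, not the exact bitsandbytes code):

```python
import torch

def supports_igemmlt() -> bool:
    # Simplified sketch: on ROCm builds, torch.version.hip is a version string rather
    # than None, and the check currently returns True for every HIP device; on CUDA it
    # is gated by compute capability (Turing, i.e. 7.5, and newer).
    if torch.version.hip:
        return True
    return torch.cuda.get_device_capability() >= (7, 5)
```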
Sure, I can change that. I added that part to support the NO_CUBLASLT option, but haven't implemented it yet; I will take it out to avoid confusion.
RDNA 2 / (3) and CDNA 1 don't support hipblasLt though; only very recent devices do. CUDA has the option as well, even though support is far broader there, so it should absolutely stay. The Python part will need adjustment regarding which shared library to load; we could probably take a look at how CUDA does it, but the simplest solution would be to just check something like

gfx_arch = torch.cuda.get_device_capability()
if gfx_arch[0] == 9 and gfx_arch[1] in ["0a", "4x"]:  # no idea if the second part of the tuple will be exactly that, can't test; for me gfx1030 is (10, 3)

and only enable hipblasLt / igemm for that. A slightly fuller sketch follows below.
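A minimal sketch of that gating, assuming the ROCm build of PyTorch exposes the gfx architecture name on the device properties (the attribute and the exact list of supported architectures below are assumptions that would need to be verified):

```python
import torch

# Assumed (not verified here) to be the architectures with hipblasLt support: CDNA 2 / CDNA 3.
HIPBLASLT_ARCHS = ("gfx90a", "gfx940", "gfx941", "gfx942")

def rocm_supports_blaslt(device: int = 0) -> bool:
    if torch.version.hip is None:
        return False
    # Assumption: ROCm PyTorch reports something like "gfx90a:sramecc+:xnack-" here.
    arch = torch.cuda.get_device_properties(device).gcnArchName
    return arch.split(":")[0] in HIPBLASLT_ARCHS
```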
Edit: Just to make it clear: I understand you won't be able to test/validate all targets, but the changes needed to at least make it possible to run bitsandbytes on any ROCm-supported GPU are minimal and worth it, imo.
Is this related? ROCm/hipBLASLt#516
Does this absolutely need to be implemented with hipBLASLt or could another library like rocBLAS be used?
If not, can the build/distribution still be simplified to always link this library, and then decide whether to call it at runtime based on device support? See bitsandbytes-foundation#1103
For the CUDA version, it's just looking at torch.cuda.get_device_capability() and only supporting igemmlt when >= (7, 5), so it's a very similar idea.
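A rough sketch of that "always link, decide at runtime" idea at the Python library-loading level (the filenames below are illustrative, not the real binary names):

```python
import torch

def select_native_library() -> str:
    # Hypothetical simplification: instead of shipping a separate "nocublaslt" build,
    # always load the one library for the platform and make the *blasLt code path a
    # runtime decision based on the capability checks discussed above.
    if torch.version.hip:
        return "libbitsandbytes_rocm.so"  # illustrative filename
    return "libbitsandbytes_cuda.so"      # illustrative filename
```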
> Is this related? ROCm/hipBLASLt#516

I can use the latest torch just fine on gfx1030, so I guess even if we linked it, it would work. But I have no idea how it works internally.

> Does this absolutely need to be implemented with hipBLASLt or could another library like rocBLAS be used?

What do you mean? blasLt is for matrix cores; yes, the functions could be emulated with rocBLAS, but bitsandbytes works just fine without it. I can use the optimizers and the 4-bit stuff, and I'm pretty sure there would be no speed benefit to emulating blasLt (if there were, we could just improve the current fallback code).

> If not, can the build/distribution still be simplified to always link this library, and then decide whether to call it at runtime based on device support? See TimDettmers#1103

I'm biased here. hipblasLt is a very recent library; it's the reason I named my fork 5.6, since it required the headers (this repo moved the defines, though, so you wouldn't even need the headers, which would be nice for pre-6.0 support). Also, e.g. Arch still ships hipblaslt separately; I don't know about other distros.

If the unified build gets merged for CUDA, we can still change it for ROCm later. Imo ROCm should always mirror CUDA unless that's not possible.
That's fine. I think the part that mainly got me was supports_igemmlt() always returning True. I mentioned rocBLAS because my understanding was that it would use the matrix cores / MFMA instructions where supported, or at the very least on gfx908.

In general, my understanding on the CUDA side is that the int8 matmul without tensor cores is going to be deprecated in the future too, but we'll want to fail in a friendly way, so some kind of runtime test before calling igemmlt would be ideal. IMO a fallback is not super critical to have.
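One possible shape of that "fail in a friendly way" runtime guard, building on the supports_igemmlt() sketch shown earlier (function names here are hypothetical):

```python
def int8_matmul(A, B, igemmlt_impl, fallback_impl=None):
    # Hypothetical guard: probe device support up front and either fall back or raise
    # a clear error, rather than failing deep inside the *blasLt call.
    if supports_igemmlt():
        return igemmlt_impl(A, B)
    if fallback_impl is not None:
        return fallback_impl(A, B)
    raise RuntimeError(
        "int8 matmul via igemmlt is not supported on this GPU "
        "(requires tensor-core / matrix-core capable hardware)."
    )
```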
@pnunna93 A few minor changes are needed, but everything else looks good. Could you please take a look at these?
LGTM. Merging it.
This PR pulls in the latest upstream changes and is updated to enable the CMake build on ROCm. This is the conflicts_diff.txt file.
It's tested on ROCm 6.0 and 6.2. UT logs are below:
bnb_0.44.0.dev0_rocm6.0.log
bnb_0.44.0.dev0_rocm6.2.log