-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update diagnostic functions for ROCm #1333
Update diagnostic functions for ROCm #1333
Conversation
Enable rocm build for CI
This reverts commit 1413c5f.
This reverts commit cdb209a.
This reverts commit 7e9a65c.
Enable ROCm whls
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Titus <[email protected]>
Add messages for Heuristics error
Update diagnostics
@akx Would you be available to review, at your discretion, since you worked really closely with the diagnostics code before? @matthewdouglas could you also give this a thorough review, do you have an AMD machine available to test? I'm hoping to get this merged this week and finally announce the alpha release (after being sick all of last week..) and would appreciate a hand; so much going on right now (gotta prepare the poster for PyTorch conf next week, etc etc...) Thanks <3 |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
sudo rm -rf \ | ||
/usr/share/dotnet \ | ||
/opt/ghc \ | ||
"/usr/local/share/boost" \ | ||
"$AGENT_TOOLSDIRECTORY" \ | ||
/opt/hostedtoolcache \ | ||
/opt/google/chrome \ | ||
/opt/microsoft/msedge \ | ||
/opt/microsoft/powershell \ | ||
/opt/pipx \ | ||
/usr/lib/mono \ | ||
/usr/local/julia* \ | ||
/usr/local/lib/android \ | ||
/usr/local/lib/node_modules \ | ||
/usr/local/share/chromium \ | ||
/usr/local/share/powershell \ | ||
/usr/share/swift |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this cleaning needed in the first place? I've never seen anything like it in any GitHub Actions workflow I've come across 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github runner runs into disk space issues during docker pull. Those applications are not used, so I deleted them to clear some space.
bitsandbytes/diagnostics/cuda.py
Outdated
@@ -56,8 +59,8 @@ def find_cuda_libraries_in_path_list(paths_list_candidate: str) -> Iterable[Path | |||
except OSError: # Assume an esoteric error trying to poke at the directory | |||
pass | |||
for lib_pattern in CUDA_RUNTIME_LIB_PATTERNS: | |||
for pth in dir.glob(lib_pattern): | |||
if pth.is_file(): | |||
for pth in dir.rglob(lib_pattern): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think changing to a recursive glob is a great idea. Some of the paths in path_list_candidate
could be e.g. /home/foo
for whatever reason, and then this will end up recursing through all of an user's files (or an entire filesystem).
bitsandbytes/diagnostics/cuda.py
Outdated
if HIP_ENVIRONMENT: | ||
CUDA_RUNTIME_LIB_PATTERNS = ("libamdhip64.so*",) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, I'd make this
if HIP_ENVIRONMENT:
CUDA_RUNTIME_LIB_PATTERNS = (...)
else:
CUDA_RUNTIME_LIB_PATTERNS = (...)
so it's clear that this could get overridden.
Better yet, it could be moved to a function def get_cuda_runtime_lib_patterns():
now that it's no longer constant.
bitsandbytes/diagnostics/cuda.py
Outdated
if (cuda_major, cuda_minor) < (6, 1): | ||
print_dedented( | ||
""" | ||
WARNING: bitandbytes is fully supported only from ROCm 6.1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WARNING: bitandbytes is fully supported only from ROCm 6.1. | |
WARNING: bitsandbytes is fully supported only from ROCm 6.1. |
bitsandbytes/diagnostics/cuda.py
Outdated
f""" | ||
Library not found: {binary_path}. | ||
Maybe you need to compile it from source? If you compiled from source, check that ROCM_VERSION | ||
in PyTorch Settings matches your ROCM install. If not, reinstall PyTorch for your ROCm version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ROCm or ROCM?
in PyTorch Settings matches your ROCM install. If not, reinstall PyTorch for your ROCm version | |
in PyTorch Settings matches your ROCm install. If not, reinstall PyTorch for your ROCm version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch, it is ROCm, in some variables I have used ROCM to keep it consistent with other variable names, it may have been copied from them. I have fixed it.
bitsandbytes/diagnostics/cuda.py
Outdated
if (cuda_major, cuda_minor) < (6, 1): | ||
print_dedented( | ||
""" | ||
WARNING: bitandbytes is fully supported only from ROCm 6.1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is confusing... If cuda_version_tuple
is actually the ROCm version and not at all related to CUDA versioning, then we really should not call it cuda_version_tuple
at all.
bitsandbytes/diagnostics/cuda.py
Outdated
@@ -152,25 +176,41 @@ def print_cuda_diagnostics(cuda_specs: CUDASpecs) -> None: | |||
def print_cuda_runtime_diagnostics() -> None: | |||
cudart_paths = list(find_cudart_libraries()) | |||
if not cudart_paths: | |||
print("CUDA SETUP: WARNING! CUDA runtime files not found in any environmental path.") | |||
print(f"{BNB_BACKEND} SETUP: WARNING! {BNB_BACKEND} runtime files not found in any environmental path.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well simplify the messaging a bit.
print(f"{BNB_BACKEND} SETUP: WARNING! {BNB_BACKEND} runtime files not found in any environmental path.") | |
print(f"WARNING! {BNB_BACKEND} runtime files not found in any environmental path.") |
if HIP_ENVIRONMENT: | ||
rocm_specs = f" rocm_version_string='{cuda_specs.cuda_version_string}'," | ||
rocm_specs += f" rocm_version_tuple={cuda_specs.cuda_version_tuple}" | ||
print(f"{BNB_BACKEND} specs:{rocm_specs}") | ||
else: | ||
print(f"{BNB_BACKEND} specs:{cuda_specs}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this smells even more like CudaSpecs
and RocmSpecs
should be separate dataclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires updates to cextensions and additional testing. Could we use this workaround for now and make the change after alpha release?
Thanks @akx for the review, really appreciated <3 ! |
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Aarni Koskela <[email protected]>
Fix diagnostic feedback
@akx, thanks for the feedback! I have made changes based on your comments and commented on the rest. Please take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better now 👍
A follow-up PR should probably separate CudaSpecs
and RocmSpecs
for clarity – and after all, other platforms will require their own diagnostics too...
When merging this, please make sure to squash-merge with a suitably edited commit message :)
I don't have an AMD machine readily available to test with, but that said, I'm comfortable with merging this. @akx I agree that we'll want to follow up later and separate out concerns between the backends, improve some of the naming, etc. I'm sure we'll see some more to come in future PRs. |
45b7d14
into
bitsandbytes-foundation:multi-backend-refactor
This PR updates debug messages for ROCm, and updates diagnostic functions to check for ROCm libraries.
It also includes a fix to remove unused applications on github runners so that CI jobs don't run out of disk space.
CC: @Titus-von-Koeller @amathews-amd @sunway513 @matthewdouglas