Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update diagnostic functions for ROCm #1333

Conversation

pnunna93
Copy link
Contributor

This PR updates debug messages for ROCm, and updates diagnostic functions to check for ROCm libraries.

It also includes a fix to remove unused applications on github runners so that CI jobs don't run out of disk space.

CC: @Titus-von-Koeller @amathews-amd @sunway513 @matthewdouglas

pnunna93 and others added 30 commits June 19, 2024 22:22
This reverts commit 7e9a65c.
Co-authored-by: Aarni Koskela <[email protected]>
@pnunna93 pnunna93 mentioned this pull request Aug 25, 2024
7 tasks
@Titus-von-Koeller
Copy link
Collaborator

@akx Would you be available to review, at your discretion, since you worked really closely with the diagnostics code before?

@matthewdouglas could you also give this a thorough review, do you have an AMD machine available to test?

I'm hoping to get this merged this week and finally announce the alpha release (after being sick all of last week..) and would appreciate a hand; so much going on right now (gotta prepare the poster for PyTorch conf next week, etc etc...) Thanks <3

Copy link

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +119 to +135
sudo rm -rf \
/usr/share/dotnet \
/opt/ghc \
"/usr/local/share/boost" \
"$AGENT_TOOLSDIRECTORY" \
/opt/hostedtoolcache \
/opt/google/chrome \
/opt/microsoft/msedge \
/opt/microsoft/powershell \
/opt/pipx \
/usr/lib/mono \
/usr/local/julia* \
/usr/local/lib/android \
/usr/local/lib/node_modules \
/usr/local/share/chromium \
/usr/local/share/powershell \
/usr/share/swift
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this cleaning needed in the first place? I've never seen anything like it in any GitHub Actions workflow I've come across 🤔

Copy link
Contributor Author

@pnunna93 pnunna93 Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github runner runs into disk space issues during docker pull. Those applications are not used, so I deleted them to clear some space.

bitsandbytes/cextension.py Outdated Show resolved Hide resolved
@@ -56,8 +59,8 @@ def find_cuda_libraries_in_path_list(paths_list_candidate: str) -> Iterable[Path
except OSError: # Assume an esoteric error trying to poke at the directory
pass
for lib_pattern in CUDA_RUNTIME_LIB_PATTERNS:
for pth in dir.glob(lib_pattern):
if pth.is_file():
for pth in dir.rglob(lib_pattern):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think changing to a recursive glob is a great idea. Some of the paths in path_list_candidate could be e.g. /home/foo for whatever reason, and then this will end up recursing through all of an user's files (or an entire filesystem).

Comment on lines 41 to 42
if HIP_ENVIRONMENT:
CUDA_RUNTIME_LIB_PATTERNS = ("libamdhip64.so*",)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, I'd make this

if HIP_ENVIRONMENT:
    CUDA_RUNTIME_LIB_PATTERNS = (...)
else:
    CUDA_RUNTIME_LIB_PATTERNS = (...)

so it's clear that this could get overridden.

Better yet, it could be moved to a function def get_cuda_runtime_lib_patterns(): now that it's no longer constant.

csrc/ops.hip Outdated Show resolved Hide resolved
if (cuda_major, cuda_minor) < (6, 1):
print_dedented(
"""
WARNING: bitandbytes is fully supported only from ROCm 6.1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WARNING: bitandbytes is fully supported only from ROCm 6.1.
WARNING: bitsandbytes is fully supported only from ROCm 6.1.

f"""
Library not found: {binary_path}.
Maybe you need to compile it from source? If you compiled from source, check that ROCM_VERSION
in PyTorch Settings matches your ROCM install. If not, reinstall PyTorch for your ROCm version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ROCm or ROCM?

Suggested change
in PyTorch Settings matches your ROCM install. If not, reinstall PyTorch for your ROCm version
in PyTorch Settings matches your ROCm install. If not, reinstall PyTorch for your ROCm version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch, it is ROCm, in some variables I have used ROCM to keep it consistent with other variable names, it may have been copied from them. I have fixed it.

Comment on lines 154 to 157
if (cuda_major, cuda_minor) < (6, 1):
print_dedented(
"""
WARNING: bitandbytes is fully supported only from ROCm 6.1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is confusing... If cuda_version_tuple is actually the ROCm version and not at all related to CUDA versioning, then we really should not call it cuda_version_tuple at all.

@@ -152,25 +176,41 @@ def print_cuda_diagnostics(cuda_specs: CUDASpecs) -> None:
def print_cuda_runtime_diagnostics() -> None:
cudart_paths = list(find_cudart_libraries())
if not cudart_paths:
print("CUDA SETUP: WARNING! CUDA runtime files not found in any environmental path.")
print(f"{BNB_BACKEND} SETUP: WARNING! {BNB_BACKEND} runtime files not found in any environmental path.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well simplify the messaging a bit.

Suggested change
print(f"{BNB_BACKEND} SETUP: WARNING! {BNB_BACKEND} runtime files not found in any environmental path.")
print(f"WARNING! {BNB_BACKEND} runtime files not found in any environmental path.")

Comment on lines +54 to +59
if HIP_ENVIRONMENT:
rocm_specs = f" rocm_version_string='{cuda_specs.cuda_version_string}',"
rocm_specs += f" rocm_version_tuple={cuda_specs.cuda_version_tuple}"
print(f"{BNB_BACKEND} specs:{rocm_specs}")
else:
print(f"{BNB_BACKEND} specs:{cuda_specs}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this smells even more like CudaSpecs and RocmSpecs should be separate dataclasses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires updates to cextensions and additional testing. Could we use this workaround for now and make the change after alpha release?

@Titus-von-Koeller
Copy link
Collaborator

Thanks @akx for the review, really appreciated <3 !

@pnunna93
Copy link
Contributor Author

@akx, thanks for the feedback! I have made changes based on your comments and commented on the rest. Please take another look.

Copy link
Contributor

@akx akx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better now 👍

A follow-up PR should probably separate CudaSpecs and RocmSpecs for clarity – and after all, other platforms will require their own diagnostics too...

When merging this, please make sure to squash-merge with a suitably edited commit message :)

@matthewdouglas
Copy link
Member

I don't have an AMD machine readily available to test with, but that said, I'm comfortable with merging this.

@akx I agree that we'll want to follow up later and separate out concerns between the backends, improve some of the naming, etc. I'm sure we'll see some more to come in future PRs.

@matthewdouglas matthewdouglas merged commit 45b7d14 into bitsandbytes-foundation:multi-backend-refactor Sep 16, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants