
Merge pull request #39 from ROCm/enable_rocm_whls
Enable ROCm whls
pnunna93 authored Jul 31, 2024
2 parents 578b2f4 + 3bde1b7 commit b123125
Showing 29 changed files with 329 additions and 112 deletions.
12 changes: 7 additions & 5 deletions .github/scripts/build-rocm.sh
@@ -1,19 +1,21 @@
#!/bin/bash
declare build_arch
declare build_os
declare rocm_version

set -xeuo pipefail
bnb_rocm_arch="gfx90a;gfx942;gfx1100"
if [ "${build_os:0:6}" == ubuntu ]; then
image=rocm/dev-ubuntu-22.04:6.1-complete
image=rocm/dev-ubuntu-22.04:${rocm_version}-complete
echo "Using image $image"
docker run --rm --platform "linux/$build_arch" -i \
-w /src -v "$PWD:/src" "$image" sh -c \
"apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends cmake \
&& cmake -DCOMPUTE_BACKEND=hip . \
&& cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH=\"${bnb_rocm_arch}\" . \
&& cmake --build ."
fi

#output_dir="output/${build_os}/${build_arch}"
#mkdir -p "${output_dir}"
#(shopt -s nullglob && cp bitsandbytes/*.{so,dylib,dll} "${output_dir}")
output_dir="output/${build_os}/${build_arch}"
mkdir -p "${output_dir}"
(shopt -s nullglob && cp bitsandbytes/*.{so,dylib,dll} "${output_dir}")
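
For reference, the script is driven entirely by environment variables, so it can also be exercised outside CI. A minimal sketch of a local run, assuming Docker is available and using values that mirror the workflow matrix further below:

build_os=ubuntu-latest build_arch=x86_64 rocm_version=6.1.2 bash .github/scripts/build-rocm.sh
# the compiled library is copied to output/ubuntu-latest/x86_64/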
4 changes: 3 additions & 1 deletion .github/workflows/build_documentation.yml
@@ -13,7 +13,9 @@ jobs:
with:
commit_sha: ${{ github.sha }}
package: bitsandbytes
repo_owner: TimDettmers
repo_owner: bitsandbytes-foundation
# avoid /src suffix leading to wrong links, like bitsandbytes/blob/main/src/bitsandbytes/nn/
version_tag_suffix: '' # defaults to '/src'
custom_container: huggingface/transformers-doc-builder
secrets:
hf_token: ${{ secrets.HUGGINGFACE_PUSH }}
6 changes: 4 additions & 2 deletions .github/workflows/build_pr_documentation.yml
@@ -9,11 +9,13 @@ concurrency:

jobs:
build:
if: github.repository == 'TimDettmers/bitsandbytes'
if: github.repository == 'bitsandbytes-foundation/bitsandbytes'
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: bitsandbytes
repo_owner: TimDettmers
repo_owner: bitsandbytes-foundation
# avoid /src suffix leading to wrong links, like bitsandbytes/blob/main/src/bitsandbytes/nn/
version_tag_suffix: '' # defaults to '/src'
custom_container: huggingface/transformers-doc-builder
16 changes: 12 additions & 4 deletions .github/workflows/python-package.yml
@@ -63,12 +63,10 @@ jobs:
os: [ubuntu-latest, windows-latest]
arch: [x86_64, aarch64]
cuda_version:
["11.7.1", "11.8.0", "12.0.1", "12.1.1", "12.2.2", "12.3.2", "12.4.0"]
["11.7.1", "11.8.0", "12.0.1", "12.1.1", "12.2.2", "12.3.2", "12.4.1", "12.5.0"]
exclude:
- os: windows-latest # This probably requires arm64 Windows agents
arch: aarch64
- os: windows-latest # The Jimver/cuda-toolkit is action used for Windows builds is not updated for 12.4 yet.
cuda_version: "12.4.0"
- os: ubuntu-latest # Temporary. Takes too long, not ready yet.
arch: aarch64
runs-on: ${{ matrix.os }} # One day, we could run them on native agents. Azure supports this now but it's planned only for Q3 2023 for hosted agents
@@ -79,7 +77,7 @@
if: startsWith(matrix.os, 'ubuntu')
uses: docker/setup-qemu-action@v2
# Windows: We install Cuda on the agent (slow)
- uses: Jimver/[email protected]
- uses: Jimver/[email protected]
if: startsWith(matrix.os, 'windows')
id: cuda-toolkit
with:
@@ -108,6 +106,8 @@ jobs:
matrix:
os: [ubuntu-latest]
arch: [x86_64]
rocm_version:
["6.1.2"]
runs-on: ${{ matrix.os }} # One day, we could run them on native agents. Azure supports this now but it's planned only for Q3 2023 for hosted agents
steps:
- uses: actions/checkout@v4
@@ -125,10 +125,18 @@
env:
build_os: ${{ matrix.os }}
build_arch: ${{ matrix.arch }}
rocm_version: ${{ matrix.rocm_version }}
- name: Upload build artifact
uses: actions/upload-artifact@v4
with:
name: shared_library_rocm_${{ matrix.os }}_${{ matrix.arch }}_${{ matrix.rocm_version }}
path: output/*
retention-days: 7
build-wheels:
needs:
- build-shared-libs
- build-shared-libs-cuda
- build-shared-libs-rocm
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,29 @@
### 0.43.2

This release is quite significant, as the QLoRA bug fix has big implications for higher `seqlen` and batch sizes.

For each additional sequence (i.e. a batch-size increase of one) we expect memory savings of:
- 405B: 39GB for `seqlen=1024`, and 4888GB for `seqlen=128,000`
- 70B: 10.1GB for `seqlen=1024`, and 1258GB for `seqlen=128,000`

The savings come from activations that are unnecessary for frozen parameters: the now-fixed bug caused memory for them to be allocated anyway.
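(As a rough sanity check, assuming the savings scale roughly linearly with sequence length: 39GB × 128,000/1,024 ≈ 4,875GB and 10.1GB × 128,000/1,024 ≈ 1,262GB, in line with the figures above.)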

#### Improvements:

- docs: FSDP+QLoRA and CPU install guide (#1211 #1227, thanks @stevhliu)
- Add CUDA 12.5 and update 12.4 builds (#1284)

#### Bug Fixes

- 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
- missing optimizers in `str2optimizer32bit` (#1222, thanks @EtienneDosSantos)
- CUDA 12.5 build issue (#1273, thanks @HennerM)
- fix for min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
- QLoRA mem bug (#1270, thanks @Ther-nullptr)
- tests for cpu only platforms (#1259, thanks @galqiwi)
- restoration of quant_storage for CPU offloading (#1279)
- optim update error with non-contiguous grads/params (deepspeed) (#1187)

### 0.43.1

#### Improvements:
14 changes: 11 additions & 3 deletions CMakeLists.txt
@@ -77,6 +77,13 @@ endif()


if(BUILD_CUDA)
# NVCC normally will only work with MSVC up to 1939. VS2022 17.10+ starts using versions 1940+.
# Workaround: use --allow-unsupported-compiler
# This needs to be added *before* we try to enable the CUDA language so CMake's compiler check passes.
if(MSVC AND MSVC_VERSION VERSION_GREATER_EQUAL 1940)
string(APPEND CMAKE_CUDA_FLAGS " --allow-unsupported-compiler")
endif()

enable_language(CUDA) # This will fail if CUDA is not found
find_package(CUDAToolkit REQUIRED)
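
If patching CMakeLists.txt were not an option, the same flag can usually be injected at configure time instead, since cache variables take effect before the language check runs; a sketch, assuming the project's CUDA backend option:

cmake -DCOMPUTE_BACKEND=cuda -DCMAKE_CUDA_FLAGS="--allow-unsupported-compiler" -B build -S .
cmake --build build --config Release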

@@ -178,7 +185,7 @@ elseif(BUILD_HIP)
set(CMAKE_HIP_ARCHITECTURES ${BNB_ROCM_ARCH})
else()
if (NOT AMDGPU_TARGETS AND NOT CMAKE_HIP_ARCHITECTURES)
set(CMAKE_HIP_ARCHITECTURES "gfx908;gfx90a;gfx940;gfx941;gfx942")
set(CMAKE_HIP_ARCHITECTURES "gfx90a;gfx942;gfx1100")
elseif (AMDGPU_TARGETS AND NOT CMAKE_HIP_ARCHITECTURES)
set(CMAKE_HIP_ARCHITECTURES ${AMDGPU_TARGETS})
endif()
@@ -187,12 +194,14 @@ elseif(BUILD_HIP)

list(APPEND SRC_FILES ${HIP_FILES})

string(APPEND BNB_OUTPUT_NAME "_hip")
string(APPEND BNB_OUTPUT_NAME "_rocm")

# get hip version
execute_process(COMMAND hipconfig --version OUTPUT_VARIABLE HIP_CONFIG_VERSION)
string(REGEX MATCH "[0-9]+\\.[0-9]+" HIP_VERSION "${HIP_CONFIG_VERSION}")
string(REPLACE "." "" HIP_VERSION_SHORT "${HIP_VERSION}")

string(APPEND BNB_OUTPUT_NAME "${HIP_VERSION_SHORT}")
if(NO_CUBLASLT OR HIP_VERSION VERSION_LESS "6.1")
string(APPEND BNB_OUTPUT_NAME "_nohipblaslt")
endif()
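
Putting the naming pieces together, a ROCm 6.1 build produces libbitsandbytes_rocm61.so (Linux suffix shown for illustration), while HIP versions below 6.1 or NO_CUBLASLT builds produce names like libbitsandbytes_rocm60_nohipblaslt.so; these match the lookup in bitsandbytes/cextension.py further down.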
@@ -229,7 +238,6 @@ if(WIN32)
set(CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ON)
endif()

# Weird MSVC hacks
if(MSVC)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2 /fp:fast")
endif()
5 changes: 5 additions & 0 deletions _typos.toml
@@ -1,5 +1,10 @@
[files]

[default]
extend-ignore-re = [
"@Ther-nul", # valid Github user
]

[default.extend-identifiers]

[type.py.extend-words]
9 changes: 8 additions & 1 deletion bitsandbytes/__init__.py
@@ -20,6 +20,13 @@
from .cextension import lib
from .nn import modules

# NOTE: this is a temporary flag to allow outside libraries to employ conditional logic while the refactor is still in
# alpha/beta: sth like `if getattr(bitsandbytes, "is_multi_backend_refactor_preview", False): do sth`
# the getattr() call above would default to False and any string evaluates to True. This way we have temporary thing
# that we can remove in Transformers with the next release after the official BNB multi-platform release; then
# eventually making it the new default (e.g. just remove if statement and dedent in Transformers)
is_multi_backend_refactor_preview = "TO BE REMOVED ONCE MERGED TO `main`" # bool evals to True for str

# Always register the CPU backend.
register_backend("cpu", CPUBackend())
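
A minimal sketch of the conditional pattern described in the comment near the top of this hunk, as a downstream library (e.g. Transformers) might write it; the branch bodies are placeholders:

import bitsandbytes

if getattr(bitsandbytes, "is_multi_backend_refactor_preview", False):
    # preview builds: multi-backend support (ROCm/CPU/XPU) may be available
    ...
else:
    # older releases without the flag: assume the CUDA-only code path
    ...

Once the refactor lands in a regular release, the attribute (and the conditional around it) is meant to be removed again.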

@@ -67,4 +74,4 @@
"optim.optimizer.MockArgs": False,
}

__version__ = "0.43.2.dev"
__version__ = "0.43.3.dev"
4 changes: 2 additions & 2 deletions bitsandbytes/autograd/_functions.py
@@ -524,7 +524,7 @@ def forward(ctx, A, B, out=None, bias=None, quant_state: Optional[F.QuantState]
ctx.dtype_A, ctx.dtype_B, ctx.dtype_bias = A.dtype, B.dtype, None if bias is None else bias.dtype

if any(ctx.needs_input_grad[:2]):
ctx.tensors = (A, B)
ctx.tensors = (None, B)
else:
ctx.tensors = (None, None)

@@ -537,7 +537,7 @@ def backward(ctx, grad_output):
return torch.zeros_like(ctx.A), torch.zeros_like(ctx.B), None, bias_grad, None

req_gradA, _, _, req_gradBias, _ = ctx.needs_input_grad
A, B = ctx.tensors
_, B = ctx.tensors

grad_A, grad_B, grad_bias = None, None, None

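This change is the QLoRA memory fix mentioned in the changelog: because the 4-bit weight B is frozen, no gradient with respect to B is ever computed, so the activation A does not need to be saved for backward; only B is needed to form the input gradient, roughly grad_A = grad_output @ dequantize_4bit(B, quant_state).T (a sketch of the idea, not the exact library code).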
15 changes: 12 additions & 3 deletions bitsandbytes/backends/cpu_xpu_common.py
@@ -1,3 +1,4 @@
import subprocess
from typing import Optional
import warnings

@@ -19,6 +20,14 @@
ipex_xpu = None


gxx_available = False
try:
subprocess.run(["g++", "--version"])
gxx_available = True
except BaseException:
warnings.warn("g++ not found, torch.compile disabled for CPU/XPU.")


Tensor = torch.Tensor


@@ -45,8 +54,8 @@ def _ipex_xpu_version_prereq(major, minor):


def _maybe_torch_compile(func):
# torch.compile requires pytorch >= 2.0
if _torch_version_prereq(2, 0):
# torch.compile requires g++ and pytorch >= 2.0
if gxx_available and _torch_version_prereq(2, 0):
options = {}
# fx_graph_cache requires pytorch >= 2.2
if _torch_version_prereq(2, 2):
@@ -515,7 +524,7 @@ def gemm_4bit_impl(
output = torch.ops.torch_ipex.ipex_woq_linear(A, state.op_context.get_data_handle())
else:
dqB = dequantize_4bit_impl(B, state, blocksize=state.blocksize)
output = torch.matmul(A, dqB)
output = torch.matmul(A, dqB.to(A.dtype))
if out is not None:
out.copy_(output)
else:
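The added .to(A.dtype) matters because torch.matmul requires both operands to share a dtype; mixing, say, bfloat16 activations with a float32 dequantized weight raises a dtype-mismatch RuntimeError. A small standalone illustration, not library code:

import torch

A = torch.randn(2, 4, dtype=torch.bfloat16)
dqB = torch.randn(4, 8, dtype=torch.float32)

# torch.matmul(A, dqB)                  # raises: operand dtypes must match
out = torch.matmul(A, dqB.to(A.dtype))  # cast first, as the fix does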
6 changes: 4 additions & 2 deletions bitsandbytes/cextension.py
@@ -38,9 +38,9 @@ def get_cuda_bnb_library_path(cuda_specs: CUDASpecs) -> Path:
"""
if torch.version.hip:
if BNB_HIP_VERSION < 601:
return PACKAGE_DIR / f"libbitsandbytes_hip_nohipblaslt{DYNAMIC_LIBRARY_SUFFIX}"
return PACKAGE_DIR / f"libbitsandbytes_rocm{BNB_HIP_VERSION_SHORT}_nohipblaslt{DYNAMIC_LIBRARY_SUFFIX}"
else:
return PACKAGE_DIR / f"libbitsandbytes_hip{DYNAMIC_LIBRARY_SUFFIX}"
return PACKAGE_DIR / f"libbitsandbytes_rocm{BNB_HIP_VERSION_SHORT}{DYNAMIC_LIBRARY_SUFFIX}"
library_name = f"libbitsandbytes_cuda{cuda_specs.cuda_version_string}"
if not cuda_specs.has_cublaslt:
# if not has_cublaslt (CC < 7.5), then we have to choose _nocublaslt
@@ -119,8 +119,10 @@ def get_native_library() -> BNBNativeLibrary:
if torch.version.hip:
hip_major, hip_minor = map(int, torch.version.hip.split(".")[0:2])
HIP_ENVIRONMENT, BNB_HIP_VERSION = True, hip_major * 100 + hip_minor
BNB_HIP_VERSION_SHORT = str(hip_major) + str(hip_minor)
else:
HIP_ENVIRONMENT, BNB_HIP_VERSION = False, 0
BNB_HIP_VERSION_SHORT = ""
lib = get_native_library()
except Exception as e:
lib = None
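A quick check of the version arithmetic used here, with ROCm 6.1 as an example (values are illustrative):

hip_major, hip_minor = 6, 1
BNB_HIP_VERSION = hip_major * 100 + hip_minor             # 601 -> hipBLASLt-enabled library
BNB_HIP_VERSION_SHORT = str(hip_major) + str(hip_minor)   # "61" -> libbitsandbytes_rocm61.so
# anything below 601 (e.g. ROCm 6.0 -> 600) falls back to the _nohipblaslt variant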
29 changes: 29 additions & 0 deletions bitsandbytes/functional.py
@@ -27,6 +27,35 @@ def prod(iterable):

if lib and lib.compiled_with_cuda:
"""C FUNCTIONS FOR OPTIMIZERS"""
str2optimizer32bit = {
"adam": (
lib.cadam32bit_grad_fp32,
lib.cadam32bit_grad_fp16,
lib.cadam32bit_grad_bf16,
),
"momentum": (
lib.cmomentum32bit_grad_32,
lib.cmomentum32bit_grad_16,
),
"rmsprop": (
lib.crmsprop32bit_grad_32,
lib.crmsprop32bit_grad_16,
),
"lion": (
lib.clion32bit_grad_fp32,
lib.clion32bit_grad_fp16,
lib.clion32bit_grad_bf16,
),
"adagrad": (
lib.cadagrad32bit_grad_32,
lib.cadagrad32bit_grad_16,
),
"lamb": (
lib.cadam32bit_grad_fp32,
lib.cadam32bit_grad_fp16,
),
}

str2optimizer8bit = {
"adam": (
lib.cadam_static_8bit_grad_32,
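Each tuple in str2optimizer32bit above groups the 32-bit optimizer kernels by gradient dtype. A sketch of how a caller might select the right entry (illustrative names, not the exact dispatch code in functional.py):

import torch

def pick_optimizer_kernel(name, dtype):
    fns = str2optimizer32bit[name]
    if dtype == torch.float32:
        return fns[0]
    if dtype == torch.float16:
        return fns[1]
    if dtype == torch.bfloat16 and len(fns) == 3:
        return fns[2]
    raise ValueError(f"gradient dtype {dtype} is not supported for optimizer {name}")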
