From 8cd7793bae488e45cc3b0b25d86168858b1c655d Mon Sep 17 00:00:00 2001
From: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
Date: Thu, 23 Jan 2025 11:17:29 -0500
Subject: [PATCH] Release v0.45.1

---
 CHANGELOG.md             | 75 ++++++++++++++++++++++++++++++++++++++--
 bitsandbytes/__init__.py |  2 +-
 setup.py                 |  2 +-
 3 files changed, 75 insertions(+), 4 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 693a2655d..410282f39 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,8 +1,79 @@
-### 0.45.1
+### v0.45.1
 
 #### Improvements:
 
-- Initial Support Blackwell B100 GPUs, RTX 50 Blackwell series GPUs and Jetson Thor Blackwell
+* Compatibility for `triton>=3.2.0`
+* Moved package configuration to `pyproject.toml`
+* Build system: initial support for NVIDIA Blackwell B100 GPUs, RTX 50 Blackwell series GPUs and Jetson Thor Blackwell.
+  * Note: Binaries built for these platforms are not included in this release. They will be included in future releases upon the availability of the upcoming CUDA Toolkit 12.7 and 12.8.
+
+#### Bug Fixes:
+* Packaging: wheels will no longer include unit tests. (#1478)
+
+#### Dependencies:
+* Sets the minimum PyTorch version to 2.0.0.
+
+### 0.45.0
+
+This is a significant release, bringing support for LLM.int8() to NVIDIA Hopper GPUs such as the H100.
+
+As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future maintenance and compatibility. We no longer use the col32 or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We additionally bring performance improvements targeted at inference scenarios.
+
+#### Performance Improvements
+This release includes broad performance improvements for a wide variety of inference scenarios. See [this X thread](https://x.com/Tim_Dettmers/status/1864706051171287069) for a detailed explanation.
+
+#### Breaking Changes
+🤗[PEFT](https://github.com/huggingface/peft) users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.
+
+#### Packaging Improvements
+* The size of our wheel has been reduced by ~43.5%, from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.
+* Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
+* The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.
+
+#### Deprecations
+* A number of public API functions have been marked for deprecation and will emit `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
+* The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended and support will be removed in a future release.
+* As part of the refactoring process, we've implemented many new 8-bit operations. These operations no longer use specialized data layouts.
+
+#### Full Changelog
+
+* refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
+* README: Replace special Unicode text symbols with regular characters by @akx in #1385
+* Update CI tools & fix typos by @akx in #1386
+* Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
+* [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
+* LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401
+
+### 0.44.1
+
+#### Bug Fixes:
+* Fix optimizer support for Python <= 3.9 by @matthewdouglas in #1379
+
+### 0.44.0
+
+#### New: AdEMAMix Optimizer
+The [AdEMAMix](https://hf.co/papers/2409.03137) optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.
+
+We've implemented 8-bit and paged variants: `AdEMAMix`, `AdEMAMix8bit`, `PagedAdEMAMix`, and `PagedAdEMAMix8bit`. These can be used with a similar API to existing optimizers.
+
+#### Improvements:
+* **8-bit Optimizers**: The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This change from the original implementation proposed in [the paper](https://hf.co/papers/2110.02861) improves accuracy.
+* **CUDA Graphs support**: A fix to enable [CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee!
+
+#### Full Changelog:
+* Embedding4bit and Embedding8bit implementation by @galqiwi in #1292
+* Bugfix: Load correct nocublaslt library variant when BNB_CUDA_VERSION override is set by @matthewdouglas in #1318
+* Enable certain CUDA kernels to accept specified cuda stream by @jeejeelee in #1330
+* Initial support for ppc64le by @mgiessing in #1316
+* Cuda source cleanup, refactor and fixes by @abhilash1910 in #1328
+* Update for VS2022 17.11 compatibility with CUDA < 12.4 by @matthewdouglas in #1341
+* Bump the minor-patch group with 3 updates by @dependabot in #1362
+* Update matplotlib requirement from ~=3.9.1 to ~=3.9.2 in the major group by @dependabot in #1361
+* docs: add internal reference to multi-backend guide by @Titus-von-Koeller in #1352
+* Add move_to_device kwarg to the optimizer's load_state_dict by @koute in #1344
+* Add AdEMAMix optimizer by @matthewdouglas in #1360
+* Change 8bit optimizer blocksize 2048->256; additional bf16 support by @matthewdouglas in #1365
 
 ### 0.43.3
diff --git a/bitsandbytes/__init__.py b/bitsandbytes/__init__.py
index 98c5f869a..034f6391d 100644
--- a/bitsandbytes/__init__.py
+++ b/bitsandbytes/__init__.py
@@ -21,4 +21,4 @@
     "optim.optimizer.MockArgs": False,
 }
 
-__version__ = "0.45.1.dev0"
+__version__ = "0.45.1"
diff --git a/setup.py b/setup.py
index 7d70bbc17..c4658b08f 100644
--- a/setup.py
+++ b/setup.py
@@ -12,4 +12,4 @@ def has_ext_modules(self):
         return True
 
 
-setup(version="0.45.1.dev0", packages=find_packages(), distclass=BinaryDistribution)
+setup(version="0.45.1", packages=find_packages(), distclass=BinaryDistribution)