Release v0.45.1
matthewdouglas committed Jan 23, 2025
Parent: d6781bc · Commit: 8cd7793
Showing 3 changed files with 75 additions and 4 deletions.
CHANGELOG.md: 73 additions & 2 deletions
@@ -1,8 +1,79 @@
### v0.45.1

#### Improvements:

* Compatibility with `triton>=3.2.0`
* Moved package configuration to `pyproject.toml`
* Build system: initial support for NVIDIA Blackwell B100 GPUs, RTX 50-series (Blackwell) GPUs, and Jetson Thor (Blackwell).
  * Note: binaries built for these platforms are not included in this release. They will be included in future releases once the upcoming CUDA Toolkit 12.7 and 12.8 become available.

#### Bug Fixes:
* Packaging: wheels will no longer include unit tests. (#1478)

#### Dependencies:
* Sets the minimum PyTorch version to 2.0.0.
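
A quick way to confirm the upgrade took effect (a minimal sketch; only the package version and the PyTorch floor are stated by this release, the rest is plain Python):

```python
import bitsandbytes as bnb
import torch

print(bnb.__version__)    # expected: 0.45.1
print(torch.__version__)  # this release requires torch >= 2.0.0
```

Running `python -m bitsandbytes` should also print a fuller environment diagnostic.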

### 0.45.0

This is a significant release, bringing support for LLM.int8() to NVIDIA Hopper GPUs such as the H100.

As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future maintenance and compatibility. The col32 and other architecture-specific tensor layout formats are no longer used, while backwards compatibility is maintained. We additionally bring performance improvements targeted at inference scenarios.
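
As a rough illustration of the user-facing API the refactor preserves (a hedged sketch: `bnb.nn.Linear8bitLt` is the existing public layer; the sizes and threshold below are arbitrary):

```python
import torch
import bitsandbytes as bnb

# An LLM.int8() linear layer: weights are stored in int8, while activation
# outliers above `threshold` are handled in fp16 (mixed-precision decomposition).
linear = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False, threshold=6.0)
linear = linear.to("cuda")  # quantization happens when moving to the GPU

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = linear(x)  # int8 matmul with fp16 outlier decomposition
```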

#### Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See [this X thread](https://x.com/Tim_Dettmers/status/1864706051171287069) for a detailed explanation.

#### Breaking Changes
🤗[PEFT](https://github.com/huggingface/peft) users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.
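
For instance, a typical 8-bit merge looks like the following (a sketch under assumptions: the model and adapter ids are placeholders; `merge_and_unload` is PEFT's usual merge entry point):

```python
from peft import PeftModel  # requires peft>=0.14.0 to merge into 8-bit weights
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = AutoModelForCausalLM.from_pretrained(
    "some/base-model",  # placeholder model id
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
model = PeftModel.from_pretrained(base, "some/lora-adapter")  # placeholder adapter id
merged = model.merge_and_unload()  # folds the LoRA deltas into the 8-bit base weights
```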

#### Packaging Improvements
* The size of our wheel has been reduced by ~43.5%, from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.
* Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
* The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.


#### Deprecations
* A number of public API functions have been marked for deprecation and will emit `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
* The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended and support will be removed in a future release (see the sketch after this list).
* As part of the refactoring process, we've implemented many new 8-bit operations. These operations no longer use specialized data layouts.
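
A minimal sketch of the optimizer-facing part (assuming `bnb.optim.Adam8bit`, whose constructor accepts `block_wise`; treating `FutureWarning` as an error is just one way to surface deprecated call sites in tests):

```python
import warnings

import torch
import bitsandbytes as bnb

params = [torch.nn.Parameter(torch.randn(4096, 64, device="cuda"))]

# Blockwise quantization is the default and the supported path going forward;
# passing block_wise=False is deprecated.
opt = bnb.optim.Adam8bit(params, lr=1e-3, block_wise=True)

# Surface any use of deprecated APIs as hard errors during testing.
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    ...  # exercise code paths that may touch deprecated functions
```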

#### Full Changelog

* refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
* README: Replace special Unicode text symbols with regular characters by @akx in #1385
* Update CI tools & fix typos by @akx in #1386
* Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
* [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
* LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401

### 0.44.1

#### Bug fixes:
* Fix optimizer support for Python <= 3.9 by @matthewdouglas in #1379

### 0.44.0

#### New: AdEMAMix Optimizer
The [AdEMAMix](https://hf.co/papers/2409.03137) optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.

We've implemented 8bit and paged variations: `AdEMAMix`, `AdEMAMix8bit`, `PagedAdEMAMix`, and `PagedAdEMAMix8bit`. These can be used with a similar API to existing optimizers.
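
A minimal usage sketch (the hyperparameter values are illustrative only; the paged and 8-bit variants are drop-in substitutions on the same line):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(128, 128).cuda()

# Drop-in AdamW-style usage; swap in AdEMAMix8bit, PagedAdEMAMix, or
# PagedAdEMAMix8bit here for the other variants.
optimizer = bnb.optim.AdEMAMix(model.parameters(), lr=1e-4)

loss = model(torch.randn(4, 128, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```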

#### Improvements:
* **8-bit Optimizers**: The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This is a change from the original implementation proposed in [the paper](https://hf.co/papers/2110.02861) and improves accuracy.
* **CUDA Graphs support**: A fix to enable [CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM (see the sketch below). Thanks @jeejeelee!
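
To illustrate what this enables (a hedged sketch using PyTorch's standard capture API; whether capture succeeds still depends on the exact kernels a model hits):

```python
import torch
import bitsandbytes as bnb

model = bnb.nn.Linear8bitLt(256, 256, has_fp16_weights=False).cuda().eval()
static_x = torch.randn(8, 256, device="cuda", dtype=torch.float16)

# Warm up on a side stream before capture, per the usual capture recipe.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass, then replay it with new data in the static buffer.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

static_x.copy_(torch.randn_like(static_x))
g.replay()  # relaunches the captured kernels with minimal CPU overhead
```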

#### Full Changelog:
* Embedding4bit and Embedding8bit implementation by @galqiwi in #1292
* Bugfix: Load correct nocublaslt library variant when BNB_CUDA_VERSION override is set by @matthewdouglas in #1318
* Enable certain CUDA kernels to accept specified cuda stream by @jeejeelee in #1330
* Initial support for ppc64le by @mgiessing in #1316
* CUDA source cleanup, refactor and fixes by @abhilash1910 in #1328
* Update for VS2022 17.11 compatibility with CUDA < 12.4 by @matthewdouglas in #1341
* Bump the minor-patch group with 3 updates by @dependabot in #1362
* Update matplotlib requirement from ~=3.9.1 to ~=3.9.2 in the major group by @dependabot in #1361
* docs: add internal reference to multi-backend guide by @Titus-von-Koeller in #1352
* Add move_to_device kwarg to the optimizer's load_state_dict by @koute in #1344
* Add AdEMAMix optimizer by @matthewdouglas in #1360
* Change 8bit optimizer blocksize 2048->256; additional bf16 support by @matthewdouglas in #1365

### 0.43.3

bitsandbytes/__init__.py: 1 addition & 1 deletion
@@ -21,4 +21,4 @@
"optim.optimizer.MockArgs": False,
}

-__version__ = "0.45.1.dev0"
+__version__ = "0.45.1"
setup.py: 1 addition & 1 deletion
@@ -12,4 +12,4 @@ def has_ext_modules(self):
return True


-setup(version="0.45.1.dev0", packages=find_packages(), distclass=BinaryDistribution)
+setup(version="0.45.1", packages=find_packages(), distclass=BinaryDistribution)
