Commit

some changes
younesbelkada committed Feb 2, 2024
1 parent 725d29a commit 58566e2
Showing 11 changed files with 38 additions and 36 deletions.
2 changes: 0 additions & 2 deletions docs/source/_toctree.yml
@@ -16,8 +16,6 @@
title: Optimizers
- local: integrations
title: Integrations
- local: qlora
title: QLoRA
- title: Support & Learning
sections:
- local: resources
2 changes: 1 addition & 1 deletion docs/source/faqs.mdx
@@ -4,4 +4,4 @@ Please submit your questions in [this Github Discussion thread](https://github.c

We'll pick the most generally applicable ones and post the QAs here or integrate them into the general documentation (also feel free to submit doc PRs, please).

# ... under construction ...
# ... under construction ...
8 changes: 4 additions & 4 deletions docs/source/installation.mdx
@@ -4,7 +4,6 @@ Note currently `bitsandbytes` is only supported on CUDA GPU hardwares, support f

<hfoptions id="OS system">
<hfoption id="Linux">
<hfoption id="MacOS">

## Linux

@@ -22,7 +21,7 @@ CUDA_VERSION=XXX make cuda12x
python setup.py install
```

with `XXX` being your CUDA version, for <12.0 call `make cuda 11x`
with `XXX` being your CUDA version; for versions below 12.0, call `make cuda11x`. Note that support for non-CUDA GPUs (e.g. AMD, Intel) is also coming soon.

</hfoption>
<hfoption id="Windows">
@@ -41,11 +40,12 @@ python -m build --wheel
Big thanks to [wkpark](https://github.com/wkpark), [Jamezo97](https://github.com/Jamezo97), [rickardp](https://github.com/rickardp), [akx](https://github.com/akx) for their amazing contributions to make bitsandbytes compatible with Windows.

</hfoption>
<hfoption id="Windows">
<hfoption id="MacOS">

## MacOS

Mac support is still a work in progress.
Mac support is still a work in progress. Please check the latest bitsandbytes issues to stay up to date on the progress of MacOS integration.

</hfoption>

</hfoptions>
3 changes: 3 additions & 0 deletions docs/source/integrations.mdx
@@ -1,8 +1,11 @@
# Transformers

... TODO: to be filled out ...

# PEFT

... TODO: to be filled out ...

# Trainer for the optimizers

... TODO: to be filled out ...
15 changes: 3 additions & 12 deletions docs/source/introduction.mdx
@@ -5,20 +5,10 @@ TODO: Many parts of this doc will still be redistributed among the new doc struc
The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
The library includes quantization primitives for 8-bit & 4-bit operations through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit`, and 8-bit optimizers through the `bitsandbytes.optim` module.
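As a quick illustration (a minimal sketch with made-up layer sizes, following the usual pattern of quantizing when the module is moved to the GPU; `bnb.nn.Linear4bit` can be swapped in the same way):

```py
import torch
import bitsandbytes as bnb

# Toy dimensions, purely illustrative.
fp16_linear = torch.nn.Linear(64, 64, bias=False)

# 8-bit linear layer (LLM.int8()); weights are quantized when the module is moved to the GPU.
int8_linear = bnb.nn.Linear8bitLt(64, 64, bias=False, has_fp16_weights=False)
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.cuda()  # quantization happens here

x = torch.randn(8, 64, dtype=torch.float16, device="cuda")
out = int8_linear(x)
```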

**Using 8-bit optimizers**:

```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
'decapoda-research/llama-7b-hf',
device_map='auto',
load_in_8bit=True,
max_memory=f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB')
```

A more detailed example, can be found in [examples/int8_inference_huggingface.py](examples/int8_inference_huggingface.py).

**Using 8-bit optimizers** (see the sketch below):
1. Comment out your existing optimizer: ``#torch.optim.Adam(....)``
2. Add the 8-bit optimizer of your choice: ``bnb.optim.Adam8bit(....)`` (arguments stay the same)
3. Replace the embedding layer if necessary: ``torch.nn.Embedding(..) -> bnb.nn.Embedding(..)``
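A minimal sketch of these three steps on a toy model (the model, data, and hyperparameters below are placeholders, not from the original example):

```py
import torch
import bitsandbytes as bnb

# Toy classifier over token ids; sizes are arbitrary.
model = torch.nn.Sequential(
    bnb.nn.Embedding(1000, 64),   # step 3: bnb embedding instead of torch.nn.Embedding
    torch.nn.Flatten(),
    torch.nn.Linear(64 * 16, 2),
).cuda()

# steps 1 + 2: comment out the torch optimizer and add the 8-bit one (same arguments)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

x = torch.randint(0, 1000, (8, 16), device="cuda")
y = torch.randint(0, 2, (8,), device="cuda")

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```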
@@ -40,6 +30,7 @@ out = linear(x.to(torch.float16))


## Features

- 8-bit Matrix multiplication with mixed precision decomposition
- LLM.int8() inference
- 8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saves 75% memory)
4 changes: 2 additions & 2 deletions docs/source/moduletree.mdx
@@ -1,5 +1,5 @@
# Module tree overview

- **bitsandbytes.functional**: Contains quantization functions and stateless 8-bit optimizer update functions.
- **bitsandbytes.functional**: Contains quantization functions (4-bit & 8-bit) and stateless 8-bit optimizer update functions.
- **bitsandbytes.nn.modules**: Contains stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability)
- **bitsandbytes.optim**: Contains 8-bit optimizers.
- **bitsandbytes.optim**: Contains 8-bit optimizers.
32 changes: 21 additions & 11 deletions docs/source/optimizers.mdx
@@ -1,4 +1,5 @@
# Introduction: 8-bit optimizers

With 8-bit optimizers, larger models can be finetuned within the same GPU memory budget as standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:

- Faster (e.g. 4x faster than regular Adam)
@@ -12,7 +13,7 @@ See here the biggest models
We feature 8-bit Adam/AdamW, SGD momentum, LARS, LAMB, and RMSProp.

It only requires a two-line code change to get started.
```
```py
import bitsandbytes as bnb

# before: adam = torch.optim.Adam(...)
@@ -25,20 +26,30 @@ bnb.nn.StableEmbedding(...)

The arguments passed are the same as for standard Adam. For NLP models we also recommend using the StableEmbedding layer, which improves results and helps with stable 8-bit optimization, as in the sketch below.
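A minimal sketch of the swap (the vocabulary and embedding sizes here are made up):

```py
import torch
import bitsandbytes as bnb

# before: emb = torch.nn.Embedding(num_embeddings=50000, embedding_dim=512)
emb = bnb.nn.StableEmbedding(num_embeddings=50000, embedding_dim=512)

token_ids = torch.randint(0, 50000, (4, 128))
vectors = emb(token_ids)  # same call signature and output shape as torch.nn.Embedding
```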

## Overview of supported 8-bit optimizers

TODO: List here all optimizers in `bitsandbytes/optim/__init__.py`
TODO (future): have automated API docs through doc-builder

## Overview of expected gradients

TODO: add pics here, no idea how to do that
<div style="text-align: center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_comparison.png" width="50%">
</div>

Want to add both pics in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/bitsandbytes
<div style="text-align: center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_largest_model.png" width="50%">
</div>

# Research Background

Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. `bitsandbytes` optimizers use 8-bit statistics, while maintaining the performance levels of using 32-bit optimizer states.
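For example, Adam keeps two 32-bit state values per parameter, i.e. about 8 bytes of optimizer state per parameter, so a 1-billion-parameter model carries roughly 8 GB of optimizer state in 32-bit, versus roughly 2 GB with 8-bit states.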

To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
1) **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, therein isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization.
2) **dynamic quantization**, which quantizes both small and large values with high precision,
3) a **stable embedding layer** improves stability during optimization for models with word embeddings.

1. **Block-wise quantization** divides input tensors into smaller blocks that are quantized independently, isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization.
2. **Dynamic quantization** quantizes both small and large values with high precision.
3. A **stable embedding layer** improves stability during optimization for models with word embeddings.

With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.
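As an illustration of that update path, here is simplified pseudocode (not the actual bitsandbytes CUDA kernels, which fuse these steps and use dynamic rather than plain absmax quantization; a single momentum state is used for brevity):

```py
import torch

def optimizer_step_8bit_sketch(p, grad, state8, scale, lr=1e-3, beta=0.9, block=256):
    """Toy sketch of one 8-bit optimizer update (single momentum state, block-wise absmax).

    Assumes p.numel() is a multiple of `block`; state8 is int8, scale holds one float per block.
    """
    # 1) dequantize the 8-bit state back to 32-bit, block by block
    state32 = (state8.float() / 127.0).view(-1, block) * scale.unsqueeze(1)
    state32 = state32.view(-1)

    # 2) perform the regular 32-bit update
    state32 = beta * state32 + (1 - beta) * grad.view(-1)
    p.data.view(-1).sub_(lr * state32)

    # 3) re-quantize the state to 8-bit for storage, one scale per block
    blocks = state32.view(-1, block)
    scale = blocks.abs().max(dim=1).values.clamp(min=1e-8)
    state8 = torch.round(blocks / scale.unsqueeze(1) * 127.0).to(torch.int8).view(-1)
    return state8, scale
```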

@@ -65,15 +76,14 @@ The Stable Embedding Layer enhances the standard word embedding layer for improv

Some more examples of how you can replace your old optimizer with the 8-bit optimizer:

```
```diff
import bitsandbytes as bnb

# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent
- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
+ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer

# use 32-bit Adam with 5th percentile clipping
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
+ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
optim_bits=32, percentile_clipping=5)
```

1 change: 0 additions & 1 deletion docs/source/qlora.mdx

This file was deleted.

4 changes: 2 additions & 2 deletions docs/source/quantization.mdx
@@ -1,5 +1,5 @@
# Linear8bitLt
# Linear8bitLt (LLM.int8)
... TODO: to be filled out ...

# Linear4bit
# Linear4bit (QLoRA)
... TODO: to be filled out ...
2 changes: 1 addition & 1 deletion docs/source/quickstart.mdx
@@ -8,5 +8,5 @@

The following code illustrates the steps above.

```python
```py
```
1 change: 1 addition & 0 deletions docs/source/resources.mdx
@@ -3,6 +3,7 @@
The academic work below is listed in reverse chronological order.

## [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Jun 2023)](https://arxiv.org/abs/2306.03078)

Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1666076553665744896)
