4-bit Inference
Efficient 4-bit Inference (NF4, FP4)
This release adds efficient inference routines for batch size 1. Expected speedups vs 16-bit precision (fp16/bf16) for matrix multiplications with an inner product dimension of at least 4096 (LLaMA 7B) are:
- 2.2x for Turing (T4, RTX 2080, etc.)
- 3.4x for Ampere (A100, A40, RTX 3090, etc.)
- 4.0x for Ada/Hopper (H100, L40, RTX 4090, etc.)
The inference kernels for batch size 1 are about 8x faster than the 4-bit training kernels used for QLoRA. This means you can take advantage of the new kernels by splitting a multi-batch 4-bit query into multiple requests with batch size 1.
No code changes are needed to take advantage of the new kernels as long as a batch size of 1 is used.
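As a minimal sketch, the snippet below loads a model in 4-bit NF4 through the Hugging Face transformers integration and runs generation one prompt at a time, so each forward pass uses batch size 1. The model id, prompts, and generation settings are placeholders, and transformers is just one common way to use the bitsandbytes 4-bit layers, not the only one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # placeholder model id

# 4-bit quantization config: NF4 data type with bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompts = ["Hello, my name is", "The capital of France is"]

# The fast inference kernels are used for batch size 1, so run each prompt
# as its own request instead of batching the prompts together.
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Setting `bnb_4bit_quant_type="fp4"` selects the FP4 data type instead of NF4; everything else stays the same.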
Big thanks to @crowsonkb, @Birch-san, and @sekstini for some beta testing and helping to debug some early errors.
Changelog
Features:
- Added 4-bit inference kernels for batch size 1. The NF4 and FP4 data types are currently supported.
- Added support for quantization of bfloat16 input data (see the sketch below).
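To illustrate both items, here is a small sketch using the `quantize_4bit`/`dequantize_4bit` functions in `bitsandbytes.functional` (exact signatures may differ between versions): it quantizes a bfloat16 tensor to NF4 and dequantizes it again. The tensor shape is arbitrary and the round-trip is only for inspection; during inference the packed 4-bit weights are consumed directly by the 4-bit layers.

```python
import torch
import bitsandbytes.functional as F

# Placeholder bfloat16 weight matrix on the GPU (shape chosen arbitrarily).
weight = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

# Block-wise 4-bit quantization to the NF4 data type; pass quant_type="fp4" for FP4.
# Returns the packed 4-bit tensor plus the quantization state (absmax, blocksize, ...).
weight_4bit, quant_state = F.quantize_4bit(weight, quant_type="nf4")

# Dequantize back to a dense tensor to inspect the round-trip error.
weight_restored = F.dequantize_4bit(weight_4bit, quant_state)
print((weight.float() - weight_restored.float()).abs().mean())
```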
Bug fixes:
- Added `device` variable for bitsandbytes layers to be compatible with PyTorch layers.
Deprecated:
- Binaries for CUDA 11.2, 11.6 no longer ship with `pip install bitsandbytes` and need to be compiled from source.