Add info on when to use which algo
horheynm committed Feb 15, 2025
1 parent c2db397 commit fc761ff
Showing 1 changed file with 33 additions and 7 deletions: README.md
* SmoothQuant
* SparseGPT

### When to Use Which Optimization

#### PTQ
PTQ (post-training quantization) reduces the precision of quantizable weights (e.g., linear layers) to a lower bit-width. Supported formats are:

##### [W4A16](./examples/quantization_w4a16/README.md)
- Uses GPTQ to compress weights to 4 bits.
- Useful for speed ups in low QPS regimes, with more weight compression.
- Recommended for any GPU type. A minimal recipe sketch is shown below.
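
A minimal sketch of applying W4A16 with `GPTQModifier` and `oneshot`, loosely following the linked example; the model ID, calibration dataset, and calibration settings are illustrative placeholders and may differ from the example README.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ compresses linear-layer weights to 4 bits; activations stay at 16 bits (W4A16).
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",          # calibration data used by GPTQ (assumed dataset)
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)

# Save the compressed checkpoint for use with vLLM.
model.save_pretrained("TinyLlama-1.1B-Chat-W4A16", save_compressed=True)
tokenizer.save_pretrained("TinyLlama-1.1B-Chat-W4A16")
```
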
##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md)
- Uses channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits.
- Useful for speed ups in high QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). See the sketch below.
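
A rough sketch of the INT8 path, pairing `SmoothQuantModifier` with `GPTQModifier(scheme="W8A8")`; the model ID, smoothing strength, and calibration settings are assumptions, not values prescribed by the linked example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant migrates activation outliers into the weights, then GPTQ quantizes
# weights to INT8 (channel-wise); activations use dynamic per-token INT8.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # strength value is an assumption
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",          # assumed calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)

model.save_pretrained("TinyLlama-1.1B-Chat-W8A8-INT8", save_compressed=True)
tokenizer.save_pretrained("TinyLlama-1.1B-Chat-W8A8-INT8")
```
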
##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md)
- Uses channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits.
- Useful for speed ups in high QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace). See the sketch below.
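
A minimal FP8 sketch using `QuantizationModifier` with the `FP8_DYNAMIC` scheme; because activations are quantized dynamically, no calibration data is needed. The model ID and output directory are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 channel-wise weights + dynamic per-token FP8 activations; no calibration pass required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic")
```
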

#### Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:

##### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](./examples/sparse_2of4_quantization_fp8/README.md)
- Combines (1) semi-structured sparsity via SparseGPT, where two of every four contiguous weights in a tensor are set to zero, with (2) channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits.
- Useful for faster inference than W8A8-FP8, with almost no drop in accuracy ([blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/)). Note: small models may experience accuracy drops when the remaining non-zero weights are insufficient to capture the original distribution.
- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace). See the sketch below.
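
A rough sketch that chains 2:4 pruning with FP8 quantization. The linked example drives this through a staged recipe, so the exact modifier arguments and ordering shown here are assumptions, and the model ID, dataset, and calibration settings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# (1) SparseGPT prunes two of every four contiguous weights (2:4 semi-structured sparsity),
# (2) then weights and activations are quantized to FP8 with dynamic per-token activations.
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4", targets=["Linear"], ignore=["lm_head"]),
    QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",          # assumed calibration dataset for SparseGPT
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-2of4-FP8", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-2of4-FP8")
```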


## Installation

pip install llmcompressor
### End-to-End Examples

Applying quantization with `llmcompressor`:
* [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
* [Weight only quantization to `int4`](examples/quantization_w4a16/README.md)
* [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
* [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
* [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)

### User Guides
Deep dives into advanced usage of `llmcompressor`:
* [Quantizing large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)


## Quick Tour
