From 76c08e51527769ea524e51c441630f57ec546bdd Mon Sep 17 00:00:00 2001
From: George
Date: Thu, 6 Mar 2025 17:59:43 -0500
Subject: [PATCH] [Docs] Add info on when to use which PTQ/Sparsification
 (#1157)

SUMMARY:
The current README shows which algorithms we support and how to run them.
However, it is still hard for a user to understand when to use which. Add
more info on which optimization to apply based on the user's use case and
hardware.

TEST PLAN:
N/A

Signed-off-by: Brian Dellabetta
---
 README.md | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index e61e2a49e..3ae778835 100644
--- a/README.md
+++ b/README.md
@@ -23,6 +23,32 @@
 * SmoothQuant
 * SparseGPT
 
+### When to Use Which Optimization
+
+#### PTQ
+PTQ (post-training quantization) is performed to reduce the precision of quantizable weights (e.g., linear layers) to a lower bit-width. Supported formats are:
+
+##### [W4A16](./examples/quantization_w4a16/README.md)
+- Uses GPTQ to compress weights to 4 bits. Requires a calibration dataset.
+- Useful for speedups in low-QPS regimes, offering the most weight compression.
+- Recommended for any GPU type.
+##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md)
+- Uses channel-wise quantization to compress weights to 8 bits with GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
+- Useful for speedups in high-QPS regimes or offline serving on vLLM.
+- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
+##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md)
+- Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM. See the example sketch at the end of this section.
+- Useful for speedups in high-QPS regimes or offline serving on vLLM.
+- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
+
+#### Sparsification
+Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
+
+##### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](./examples/sparse_2of4_quantization_fp8/README.md)
+- Uses (1) semi-structured sparsity (SparseGPT), where, for every four contiguous weights in a tensor, two are set to zero, and (2) channel-wise quantization to compress weights to 8 bits plus dynamic per-token quantization to compress activations to 8 bits.
+- Useful for faster inference than W8A8-FP8, with almost no drop in evaluation score ([blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/)). Note: small models may experience accuracy drops when the remaining non-zero weights are insufficient to recapitulate the original distribution.
+- Recommended for compute capability >=8.9 (Hopper and Ada Lovelace).
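+
+As an illustration, a minimal W8A8-FP8 run might look like the sketch below. This is only a sketch assuming the `oneshot`/`QuantizationModifier` workflow and an example model ID; see the [W8A8-FP8 example](./examples/quantization_w8a8_fp8/README.md) for the complete, maintained recipe.
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor.transformers import oneshot
+
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+# FP8 dynamic scheme: channel-wise FP8 weights, dynamic per-token FP8
+# activations, with the lm_head left unquantized.
+recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+
+# Apply the recipe in one shot; no calibration dataset is needed for FP8.
+oneshot(model=model, recipe=recipe)
+
+# Save the quantized model and tokenizer for serving with vLLM.
+SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+model.save_pretrained(SAVE_DIR)
+tokenizer.save_pretrained(SAVE_DIR)
+```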
+
 
 ## Installation
 
@@ -35,16 +61,16 @@ pip install llmcompressor
 
 ### End-to-End Examples
 Applying quantization with `llmcompressor`:
-* [Activation quantization to `int8`](examples/quantization_w8a8_int8)
-* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8)
-* [Weight only quantization to `int4`](examples/quantization_w4a16)
-* [Quantizing MoE LLMs](examples/quantizing_moe)
-* [Quantizing Vision-Language Models](examples/multimodal_vision)
-* [Quantizing Audio-Language Models](examples/multimodal_audio)
+* [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
+* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
+* [Weight only quantization to `int4`](examples/quantization_w4a16/README.md)
+* [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
+* [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
+* [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)
 
 ### User Guides
 Deep dives into advanced usage of `llmcompressor`:
-* [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate)
+* [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)
 
 ## Quick Tour