Commit 23587db

[ Docs ] Update FP8 example to use dynamic per token (#75)
* update for fp8 dynamic
* cleanup
* format
* fp8 example
* updated per michael's comments
* update example
* update
* tweak further
* updated
1 parent 2ab6ae5 commit 23587db

2 files changed: +59 lines, -136 lines

examples/quantization_w8a8_fp8/README.md (+44, -75)

@@ -1,6 +1,6 @@
  # `fp8` Weight and Activation Quantization
  
- `llm-compressor` supports quantizing weights and activations to `fp8` for memory savings and inference acceleration with `vLLM`
+ `llmcompressor` supports quantizing weights and activations to `fp8` for memory savings and inference acceleration with `vllm`
  
  > `fp8` computation is supported on Nvidia GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
  
@@ -9,9 +9,7 @@
  To get started, install:
  
  ```bash
- git clone https://github.com/vllm-project/llm-compressor.git
- cd llm-compressor
- pip install -e .
+ pip install llmcompressor==0.1.0
  ```
  
  ## Quickstart
@@ -22,122 +20,93 @@ The example includes an end-to-end script for applying the quantization algorithm.
  python3 llama3_example.py
  ```
  
- The resulting model `Meta-Llama-3-8B-Instruct-W8A8-FP8` is ready to be loaded into vLLM.
+ The resulting model `Meta-Llama-3-8B-Instruct-FP8-Dynamic` is ready to be loaded into vLLM.
  
  ## Code Walkthrough
  
- Now, we will step through the code in the example. There are four steps:
+ Now, we will step through the code in the example. There are three steps:
  1) Load model
- 2) Prepare calibration data
- 3) Apply quantization
- 4) Evaluate accuracy in vLLM
+ 2) Apply quantization
+ 3) Evaluate accuracy in vLLM
  
  ### 1) Load Model
  
- Load the model using `SparseAutoModelForCausalLM`, which is a wrapper around `AutoModel` for handling quantized saving and loading. Note that `SparseAutoModel` is compatible with `accelerate` so you can load your model onto multiple GPUs if needed.
+ Load the model using `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM` for saving and loading quantized models.
  
  ```python
  from llmcompressor.transformers import SparseAutoModelForCausalLM
  from transformers import AutoTokenizer
  
  MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+ 
  model = SparseAutoModelForCausalLM.from_pretrained(
-     MODEL_ID, device_map="auto", torch_dtype="auto",
- )
+     MODEL_ID, device_map="auto", torch_dtype="auto")
  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
  ```
  
- ### 2) Prepare Calibration Data
- 
- Prepare the calibration data. When quantizing activations of a model to `fp8`, we need some sample data to estimate the activation scales. As a result, it is very useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea.
- 
- In our case, we are quantizing an Instruction tuned generic model, so we will use the `ultrachat` dataset. Some best practices include:
- * 512 samples is a good place to start (increase if accuracy drops)
- * 2048 sequence length is a good place to start
- * Use the chat template or instrucion template that the model is trained with
- 
- ```python
- from datasets import load_dataset
+ ### 2) Apply Quantization
  
- NUM_CALIBRATION_SAMPLES=512
- MAX_SEQUENCE_LENGTH=2048
- 
- # Load dataset.
- ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
- ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
- 
- # Preprocess the data into the format the model is trained with.
- def preprocess(example):
-     return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False,)}
- ds = ds.map(preprocess)
- 
- # Tokenize the data (be careful with bos tokens - we need add_special_tokens=False since the chat_template already added it).
- def tokenize(sample):
-     return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
- ds = ds.map(tokenize, remove_columns=ds.column_names)
- ```
+ For `fp8` quantization, we can recover accuracy with simple PTQ quantization.
  
- ### 3) Apply Quantization
+ We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:
+ - Static, per-channel quantization on the weights
+ - Dynamic, per-token quantization on the activations
  
- With the dataset ready, we will now apply quantization.
- 
- We first select the quantization algorithm. In our case, we will apply the default recipe for `fp8` (which uses static-per-tensor weights and static-per-tensor activations) to all linear layers.
- > See the `Recipes` documentation for more information on making complex recipes
+ Since simple PTQ does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
  
  ```python
  from llmcompressor.transformers import oneshot
  from llmcompressor.modifiers.quantization import QuantizationModifier
  
- # Configure the quantization algorithm to run.
- recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])
- 
- # Apply quantization.
- oneshot(
-     model=model,
-     dataset=ds,
-     recipe=recipe,
-     max_seq_length=MAX_SEQUENCE_LENGTH,
-     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
- )
- 
- # Save to disk compressed.
- SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-FP8"
- model.save_pretrained(SAVE_DIR, save_compressed=True)
+ # Configure the simple PTQ quantization
+ recipe = QuantizationModifier(
+     targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+ 
+ # Apply the quantization algorithm.
+ oneshot(model=model, recipe=recipe)
+ 
+ # Save the model.
+ SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+ model.save_pretrained(SAVE_DIR)
  tokenizer.save_pretrained(SAVE_DIR)
  ```
  
  We have successfully created an `fp8` model!
  
- ### 4) Evaluate Accuracy
+ ### 3) Evaluate Accuracy
+ 
+ Install `vllm` and `lm-evaluation-harness`:
  
- With the model created, we can now load and run in vLLM (after installing).
+ ```bash
+ pip install vllm lm_eval==0.4.3
+ ```
+ 
+ Load and run the model in `vllm`:
  
  ```python
  from vllm import LLM
- model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-FP8")
+ model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
+ model.generate("Hello my name is")
  ```
  
- We can evaluate accuracy with `lm_eval` (`pip install lm_eval==v0.4.3`):
+ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
  > Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
  
- Run the following to test accuracy on GSM-8K:
- 
  ```bash
- lm_eval --model vllm \
-   --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-FP8",add_bos_token=true \
-   --tasks gsm8k \
-   --num_fewshot 5 \
-   --limit 250 \
-   --batch_size 'auto'
+ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
+ lm_eval \
+   --model vllm \
+   --model_args pretrained=$MODEL,add_bos_token=True \
+   --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
  ```
  
- We can see the resulting scores look good!
+ We can see the resulting scores look good:
  
  ```bash
  |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
  |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
- |gsm8k| 3|flexible-extract| 5|exact_match||0.776|± |0.0264|
- | | |strict-match | 5|exact_match||0.776|± |0.0264|
+ |gsm8k| 3|flexible-extract| 5|exact_match||0.768|± |0.0268|
+ | | |strict-match | 5|exact_match||0.768|± |0.0268|
  ```
  
  ### Questions or Feature Request?
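
As background for the `FP8_DYNAMIC` scheme the updated README describes (static, per-channel scales on the weights; dynamic, per-token scales on the activations), here is a minimal sketch of the scaling math. It is not `llmcompressor`'s implementation: the helper names are made up for illustration, no real `float8` cast is performed, and the fp8 E4M3 maximum of 448 is an assumption of the sketch.

```python
import torch

FP8_E4M3_MAX = 448.0  # assumed maximum representable value of the fp8 E4M3 format

def quantize_weight_per_channel(w: torch.Tensor):
    # Static scales: computed once per output channel (row) from the weights alone.
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX
    return torch.clamp(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX), scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic scales: recomputed per token at runtime, which is why this
    # quantization flow needs no calibration data.
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    return torch.clamp(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX), scale

w = torch.randn(4096, 4096)    # a Linear weight: (out_features, in_features)
x = torch.randn(2, 16, 4096)   # activations: (batch, tokens, hidden)
w_q, w_scale = quantize_weight_per_channel(w)
x_q, x_scale = quantize_activation_per_token(x)
print(w_scale.shape, x_scale.shape)  # torch.Size([4096, 1]) torch.Size([2, 16, 1])
```
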
examples/quantization_w8a8_fp8/llama3_example.py (+15, -61)

@@ -1,81 +1,35 @@
- from datasets import load_dataset
  from transformers import AutoTokenizer
  
  from llmcompressor.modifiers.quantization import QuantizationModifier
  from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
  
- # Select model and load it.
  MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+ 
+ # Load model.
  model = SparseAutoModelForCausalLM.from_pretrained(
-     MODEL_ID,
-     device_map="auto",
-     torch_dtype="auto",
+     MODEL_ID, device_map="auto", torch_dtype="auto"
  )
  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
  
- # Select calibration dataset.
- DATASET_ID = "HuggingFaceH4/ultrachat_200k"
- DATASET_SPLIT = "train_sft"
- 
- # Select number of samples. 512 samples is a good place to start.
- # Increasing the number of samples can improve accuracy.
- NUM_CALIBRATION_SAMPLES = 512
- MAX_SEQUENCE_LENGTH = 2048
- 
- # Load dataset and preprocess.
- ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
- ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
- 
- 
- def preprocess(example):
-     return {
-         "text": tokenizer.apply_chat_template(
-             example["messages"],
-             tokenize=False,
-         )
-     }
- 
- 
- ds = ds.map(preprocess)
- 
- 
- # Tokenize inputs.
- def tokenize(sample):
-     return tokenizer(
-         sample["text"],
-         padding=False,
-         max_length=MAX_SEQUENCE_LENGTH,
-         truncation=True,
-         add_special_tokens=False,
-     )
- 
- 
- ds = ds.map(tokenize, remove_columns=ds.column_names)
- 
- # Configure the quantization algorithm to run.
+ # Configure the quantization algorithm and scheme.
  # In this case, we:
- # * quantize the weights to fp8 with simple PTQ (static per tensor)
- # * quantize the activations to fp8 with simple PTQ (static per tensor)
- recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])
+ # * quantize the weights to fp8 with per channel via ptq
+ # * quantize the activations to fp8 with dynamic per token
+ recipe = QuantizationModifier(
+     targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
+ )
  
  # Apply quantization.
- oneshot(
-     model=model,
-     dataset=ds,
-     recipe=recipe,
-     max_seq_length=MAX_SEQUENCE_LENGTH,
-     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
- )
+ oneshot(model=model, recipe=recipe)
  
  # Confirm generations of the quantized model look sane.
- print("\n\n")
  print("========== SAMPLE GENERATION ==============")
  input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
- output = model.generate(input_ids, max_new_tokens=100)
+ output = model.generate(input_ids, max_new_tokens=20)
  print(tokenizer.decode(output[0]))
- print("==========================================\n\n")
+ print("==========================================")
  
- # Save to disk compressed.
- SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-FP8"
- model.save_pretrained(SAVE_DIR, save_compressed=True)
+ # Save to disk in compressed-tensors format.
+ SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+ model.save_pretrained(SAVE_DIR)
  tokenizer.save_pretrained(SAVE_DIR)
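
After the updated `llama3_example.py` above has written the `Meta-Llama-3-8B-Instruct-FP8-Dynamic` directory, the README loads it with vLLM. A minimal end-to-end sketch of that step (the prompt and sampling settings here are illustrative, not part of the commit):

```python
from vllm import LLM, SamplingParams

# Load the fp8-quantized checkpoint saved by the example script.
llm = LLM(model="./Meta-Llama-3-8B-Instruct-FP8-Dynamic")

# Any short generation confirms the quantized model loads and runs.
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello my name is"], params)
print(outputs[0].outputs[0].text)
```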
