The example includes an end-to-end script for applying the quantization algorithm.

```bash
python3 llama3_example.py
```

The resulting model `Meta-Llama-3-8B-Instruct-FP8-Dynamic` is ready to be loaded into vLLM.

## Code Walkthrough

Now, we will step through the code in the example. There are three steps:
1) Load model
2) Apply quantization
3) Evaluate accuracy in vLLM

### 1) Load Model
Load the model using `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM` for saving and loading quantized models.

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2) Apply Quantization

For `fp8` quantization, we can recover accuracy with simple PTQ quantization.
We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:
- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations
Since simple PTQ does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
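
To make the scheme concrete, here is a minimal PyTorch sketch of the two kinds of scales. This is an illustration of the idea only, not `llmcompressor`'s actual implementation:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# Static, per-channel weight quantization: one scale per output channel,
# computed once from the weights themselves, so no calibration data is needed.
weight = torch.randn(4096, 4096)
weight_scale = weight.abs().amax(dim=1, keepdim=True) / FP8_MAX
weight_fp8 = (weight / weight_scale).to(torch.float8_e4m3fn)

# Dynamic, per-token activation quantization: one scale per token,
# recomputed on the fly from each incoming batch at inference time.
acts = torch.randn(16, 4096)  # (num_tokens, hidden_size)
act_scale = acts.abs().amax(dim=1, keepdim=True) / FP8_MAX
acts_fp8 = (acts / act_scale).to(torch.float8_e4m3fn)
```

In this setup the static weight scales are stored with the checkpoint, while the per-token activation scales are computed at runtime by the inference engine.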
```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure simple PTQ: FP8_DYNAMIC on all Linear layers, skipping lm_head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm and save the compressed model.
oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

### 3) Evaluate Accuracy

With the model created, we can now load and run it in vLLM after installing the dependencies:

```bash
pip install vllm lm_eval==0.4.3
```

Load and run the model in `vllm`:
```python
from vllm import LLM

model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
model.generate("Hello my name is")
```

Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):

> Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
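
A sketch of one such invocation is below; the fewshot count and batch size are illustrative defaults rather than values prescribed by this example:

```bash
# Path to the quantized model saved above.
MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic

# Run gsm8k on 250 samples, passing add_bos_token=True as noted above.
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size auto
```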