40 changes: 27 additions & 13 deletions examples/GPTQ/README.md
@@ -7,6 +7,7 @@ For generative LLMs, very often the bottleneck of inference is no longer the com

- [FMS Model Optimizer requirements](../../README.md#requirements)
- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
- It is advised to install from source if you plan to use GPTQv2
- Optionally, for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
```
pip install lm-eval
@@ -41,7 +42,10 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
--quant_method gptq \
--output_dir Meta-Llama-3-8B-GPTQ \
--bits 4 \
--group_size 128 \
--use_version2 False \
--v2_mem_device cpu
```
The quantized model written to the specified output directory (`Meta-Llama-3-8B-GPTQ` in our case) can be deployed and served for inference via `vLLM`, as sketched below.
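
As a minimal sketch (assuming a recent vLLM build with GPTQ kernel support; the prompt and sampling settings are illustrative), the checkpoint can be loaded with vLLM's offline `LLM` API:

```python
from vllm import LLM, SamplingParams

# Load the GPTQ-quantized checkpoint produced above. vLLM reads the
# quantization config saved in the output directory; the explicit
# quantization="gptq" hint is usually optional.
llm = LLM(model="Meta-Llama-3-8B-GPTQ", quantization="gptq")

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The key benefit of 4-bit quantization is"], sampling_params)
print(outputs[0].outputs[0].text)
```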

@@ -89,26 +93,34 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|

- Quantized model with the settings shown above (`desc_act` defaults to `False`).
- GPTQv1

|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
| | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|

- GPTQv2

|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6817 |± |0.0065|
| | | |none | 5|perplexity|↓ |4.3994 |± |0.0995|

- Quantized model with `desc_act` set to `True` (may improve model quality, but at the cost of inference speed).
- GPTQv1

|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|

> [!NOTE]
> There is some randomness in generating the model and data, so the resulting accuracy may vary by ~$\pm$ 0.05.
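
The lambada_openai numbers in the tables above can be reproduced with the lm-evaluation-harness. A minimal sketch using its Python API follows (the checkpoint path and batch size are illustrative; the `lm_eval` CLI with equivalent flags works as well):

```python
import lm_eval

# Evaluate the GPTQ-quantized checkpoint on lambada_openai with 5-shot
# prompting, mirroring the settings used for the tables above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Meta-Llama-3-8B-GPTQ",
    tasks=["lambada_openai"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["lambada_openai"])
```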


## Code Walk-through

1. Command line arguments are used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py). GPTQv1 is used by default. To use GPTQv2, set `use_version2` to `True` (passed to `QuantizeConfig` as `v2`) and, if needed, `v2_mem_device` (passed as `v2_memory_device`; choices are `auto`, `cpu`, `cuda`, default `cpu`).

```python
from gptqmodel import GPTQModel, QuantizeConfig
@@ -118,6 +130,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
group_size=gptq_args.group_size,
desc_act=gptq_args.desc_act,
damp_percent=gptq_args.damp_percent,
v2=gptq_args.use_version2,
v2_memory_device=gptq_args.v2_mem_device,
)

```
@@ -158,4 +172,4 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
tokenizer.save_pretrained(output_dir) # optional
```
> [!NOTE]
> 1. GPTQ of a 70B model usually takes ~4-10 hours on an A100 with GPTQv1.
2 changes: 2 additions & 0 deletions fms_mo/run_quant.py
@@ -140,6 +140,8 @@ def run_gptq(model_args, data_args, opt_args, gptq_args):
group_size=gptq_args.group_size,
desc_act=gptq_args.desc_act,
damp_percent=gptq_args.damp_percent,
v2=gptq_args.use_version2,
v2_memory_device=gptq_args.v2_mem_device,
)

# Add custom model_type mapping to gptqmodel LUT so GPTQModel can recognize them.
2 changes: 2 additions & 0 deletions fms_mo/training_args.py
@@ -206,6 +206,8 @@
use_cuda_fp16: bool = True
autotune_warmup_after_quantized: bool = False
cache_examples_on_gpu: bool = True
use_version2: bool = False
v2_mem_device: Optional[str] = field(default="cpu", metadata={"choices": ["auto", "cpu", "cuda"]})

Check warning on line 210 in fms_mo/training_args.py (GitHub Actions / lint: pylint): C0301: Line too long (102/100) (line-too-long)

@dataclass