Merge remote-tracking branch 'origin' into kylesayrs/remove-double-init
kylesayrs committed Mar 10, 2025
2 parents 606b93b + 2a59554 commit a6d45ba
Showing 50 changed files with 928 additions and 836 deletions.
40 changes: 33 additions & 7 deletions README.md
@@ -23,6 +23,32 @@
* SmoothQuant
* SparseGPT

### When to Use Which Optimization

#### PTQ
Post-training quantization (PTQ) reduces the precision of quantizable weights (e.g., linear layers) to a lower bit-width. Supported formats are:

##### [W4A16](./examples/quantization_w4a16/README.md)
- Uses GPTQ to compress weights to 4 bits. Requires a calibration dataset.
- Useful for speed ups in low-QPS regimes due to the greater weight compression.
- Recommended for any GPU type.
##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md)
- Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and uses dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
- Useful for speed ups in high QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md)
- Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM.
- Useful for speed ups in high QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace); a minimal sketch of this flow is shown below.
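
As a minimal sketch of the data-free W8A8-FP8 flow (the model id and output path are placeholders; see the linked W8A8-FP8 example for the full script):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 weights plus dynamic per-token FP8 activations; no calibration data required
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",        # placeholder model id
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic",  # placeholder output path
)
```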

#### Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:

##### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](./examples/sparse_2of4_quantization_fp8/README.md)
- Combines (1) semi-structured sparsity (SparseGPT), in which two out of every four contiguous weights in a tensor are set to zero, with (2) channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits.
- Delivers faster inference than W8A8-FP8 with almost no drop in evaluation accuracy ([blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/)). Note: small models may experience accuracy drops when the remaining non-zero weights are insufficient to recapitulate the original distribution.
- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace); see the sketch below.
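
A rough sketch of this flow, assuming `SparseGPTModifier` is importable from `llmcompressor.modifiers.obcq` and that a list of modifiers is accepted as a recipe (model id, dataset, split, and output path are placeholders):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier  # assumed import path
from llmcompressor.modifiers.quantization import QuantizationModifier

# stage 1: prune to 2:4 semi-structured sparsity with SparseGPT (needs calibration data);
# stage 2: FP8 weights plus dynamic per-token FP8 activations
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",    # placeholder model id
    dataset="gsm8k",                                # placeholder calibration dataset
    splits={"calibration": "train[:512]"},          # placeholder split selection
    max_seq_length=512,
    num_calibration_samples=512,
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-2of4-FP8", # placeholder output path
)
```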


## Installation

@@ -35,16 +35,16 @@ pip install llmcompressor
### End-to-End Examples

Applying quantization with `llmcompressor`:
-* [Activation quantization to `int8`](examples/quantization_w8a8_int8)
-* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8)
-* [Weight only quantization to `int4`](examples/quantization_w4a16)
-* [Quantizing MoE LLMs](examples/quantizing_moe)
-* [Quantizing Vision-Language Models](examples/multimodal_vision)
-* [Quantizing Audio-Language Models](examples/multimodal_audio)
+* [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
+* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
+* [Weight only quantization to `int4`](examples/quantization_w4a16/README.md)
+* [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
+* [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
+* [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)

### User Guides
Deep dives into advanced usage of `llmcompressor`:
-* [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate)
+* [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)


## Quick Tour
8 changes: 4 additions & 4 deletions examples/trl_mixin/ex_trl_distillation.py
@@ -19,12 +19,12 @@
max_seq_length = 512

# Load gsm8k using SparseML dataset tools
-data_args = DatasetArguments(
+dataset_args = DatasetArguments(
    dataset="gsm8k", dataset_config_name="main", max_seq_length=max_seq_length
)
dataset_manager = TextGenerationDataset.load_from_registry(
-    data_args.dataset,
-    data_args=data_args,
+    dataset_args.dataset,
+    dataset_args=dataset_args,
    split="train",
    processor=tokenizer,
)
@@ -69,7 +69,7 @@
    train_dataset=train_dataset,
    data_collator=data_collator,
    trl_sft_config_args=trl_sft_config_args,
-    data_args=data_args,
+    dataset_args=dataset_args,
    model_args=model_args,
)
trainer.train()
1 change: 1 addition & 0 deletions src/llmcompressor/args/__init__.py
@@ -4,3 +4,4 @@
from .model_arguments import ModelArguments
from .recipe_arguments import RecipeArguments
from .training_arguments import TrainingArguments
+from .utils import parse_args
73 changes: 73 additions & 0 deletions src/llmcompressor/args/utils.py
@@ -0,0 +1,73 @@
from typing import Tuple

from loguru import logger
from transformers import HfArgumentParser

from llmcompressor.args import (
    DatasetArguments,
    ModelArguments,
    RecipeArguments,
    TrainingArguments,
)
from llmcompressor.transformers.utils.helpers import resolve_processor_from_model_args


def parse_args(
    include_training_args: bool = False, **kwargs
) -> Tuple[ModelArguments, DatasetArguments, RecipeArguments, TrainingArguments, str]:
    """
    Parses the keyword arguments passed in from `oneshot` or `train` and
    separates them into the following dataclasses:
        * ModelArguments in
            src/llmcompressor/args/model_arguments.py
        * DatasetArguments in
            src/llmcompressor/args/dataset_arguments.py
        * RecipeArguments in
            src/llmcompressor/args/recipe_arguments.py
        * TrainingArguments in
            src/llmcompressor/args/training_arguments.py

    ModelArguments, DatasetArguments, and RecipeArguments are used for both
    `oneshot` and `train`. TrainingArguments is only used for `train`.
    """

    # pop output_dir; it is an attribute of TrainingArguments and is not used by oneshot
    output_dir = kwargs.pop("output_dir", None)

    parser_args = (ModelArguments, DatasetArguments, RecipeArguments)
    if include_training_args:
        parser_args += (TrainingArguments,)

    parser = HfArgumentParser(parser_args)
    parsed_args = parser.parse_dict(kwargs)

    training_args = None
    if include_training_args:
        model_args, dataset_args, recipe_args, training_args = parsed_args
        if output_dir is not None:
            training_args.output_dir = output_dir
    else:
        model_args, dataset_args, recipe_args = parsed_args

    if recipe_args.recipe_args is not None:
        if not isinstance(recipe_args.recipe_args, dict):
            arg_dict = {}
            for recipe_arg in recipe_args.recipe_args:
                key, value = recipe_arg.split("=", 1)
                arg_dict[key] = value
            recipe_args.recipe_args = arg_dict

    # raise deprecation warnings
    if dataset_args.remove_columns is not None:
        logger.warning(
            "`remove_columns` argument is deprecated. When tokenizing datasets, all "
            "columns which are invalid inputs to the tokenizer will be removed"
        )

    # silently assign tokenizer to processor
    resolve_processor_from_model_args(model_args)

    return model_args, dataset_args, recipe_args, training_args, output_dir
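
For orientation, a hypothetical call mirroring how `oneshot` forwards its keyword arguments might look like the following sketch; the `model` field is assumed to exist on `ModelArguments`, while `dataset` and `num_calibration_samples` appear on `DatasetArguments` elsewhere in this diff:

```python
from llmcompressor.args import parse_args

# kwargs as they would be forwarded from an `oneshot(...)` call
model_args, dataset_args, recipe_args, training_args, output_dir = parse_args(
    include_training_args=False,
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed ModelArguments field
    dataset="gsm8k",                              # DatasetArguments field
    num_calibration_samples=512,                  # DatasetArguments field
)

assert training_args is None  # only populated when include_training_args=True
```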
8 changes: 8 additions & 0 deletions src/llmcompressor/datasets/__init__.py
@@ -0,0 +1,8 @@
# flake8: noqa

from .utils import (
    format_calibration_data,
    get_calibration_dataloader,
    get_processed_dataset,
    make_dataset_splits,
)
191 changes: 191 additions & 0 deletions src/llmcompressor/datasets/utils.py
@@ -0,0 +1,191 @@
import re
from typing import Any, Callable, Dict, List, Optional

import torch
from datasets import Dataset
from loguru import logger
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers.data import default_data_collator

from llmcompressor.args import DatasetArguments
from llmcompressor.transformers.finetune.data import TextGenerationDataset
from llmcompressor.typing import Processor


def get_processed_dataset(
    dataset_args: DatasetArguments,
    processor: Processor,
    do_oneshot: bool = False,
    do_train: bool = True,
) -> Optional[Dict[str, Dataset]]:
    """
    Loads and tokenizes a dataset for each enabled flow (train and/or oneshot)
    based on dataset_args.

    :param dataset_args: DatasetArguments that contain dataset loading and
        processing params
    :param processor: processor or tokenizer to use for dataset tokenization
    :param do_oneshot: True for the oneshot pathway
    :param do_train: True for the train pathway
    :return: A dictionary of datasets keyed by "train" and/or "calibration"
        (oneshot), or None if no dataset was provided
    """
    if dataset_args.dataset is None:
        logger.warning(
            "Running oneshot without calibration data. This is expected for "
            "weight-only and dynamic quantization"
        )
        return None

    splits = dataset_args.splits
    tokenized_datasets = {}

    def _get_split_name(inp_str):
        # strip out split name, for ex train[60%:] -> train
        match = re.match(r"(\w*)\[.*\]", inp_str)
        if match is not None:
            return match.group(1)
        return inp_str

    if splits is None:
        splits = {"all": None}
    elif isinstance(splits, str):
        splits = {_get_split_name(splits): splits}
    elif isinstance(splits, List):
        splits = {_get_split_name(s): s for s in splits}

    # default to custom dataset if dataset provided isn't a string
    registry_id = (
        dataset_args.dataset if isinstance(dataset_args.dataset, str) else "custom"
    )
    for split_name, split_str in splits.items():
        dataset = dataset_args.dataset
        if hasattr(dataset, "column_names") and "input_ids" in dataset.column_names:
            # dataset is already tokenized
            tokenized_datasets[split_name] = dataset
        else:
            # dataset needs to be tokenized
            dataset_manager = TextGenerationDataset.load_from_registry(
                registry_id,
                dataset_args=dataset_args,
                split=split_str,
                processor=processor,
            )
            tokenized_datasets[split_name] = dataset_manager(add_labels=do_train)

    return make_dataset_splits(
        tokenized_datasets,
        do_oneshot=do_oneshot,
        do_train=do_train,
    )


def get_calibration_dataloader(
    dataset_args: DatasetArguments,
    processor: Processor,
) -> Optional[torch.utils.data.DataLoader]:
    """
    Get the dataloader used for oneshot calibration.

    :param dataset_args: DatasetArguments that contains the dataset parameters.
    :param processor: Processor or the tokenizer of the model.
    :return: PyTorch dataloader object that contains the calibration dataset,
        or None if no dataset was provided.
    """
    if dataset_args.dataset is None:
        # weight-only quantization or dynamic quantization
        return None

    datasets = get_processed_dataset(
        dataset_args=dataset_args,
        processor=processor,
        do_oneshot=True,
        do_train=False,
    )

    calibration_dataset = datasets.get("calibration")

    return format_calibration_data(
        tokenized_dataset=calibration_dataset,
        num_calibration_samples=dataset_args.num_calibration_samples,
        do_shuffle=dataset_args.shuffle_calibration_samples,
        collate_fn=dataset_args.data_collator,
    )


def format_calibration_data(
    tokenized_dataset: Dataset,
    num_calibration_samples: Optional[int] = None,
    do_shuffle: bool = True,
    collate_fn: Callable = default_data_collator,
) -> DataLoader:
    """
    Creates a dataloader out of the calibration dataset split, trimming it to
    the desired number of calibration samples.

    :param tokenized_dataset: dataset to convert to a dataloader
    :param num_calibration_samples: number of data samples to convert
    :param do_shuffle: whether to shuffle the dataset before selecting calibration
        samples, true by default
    :param collate_fn: optional custom collate function, or use default
    :return: PyTorch dataloader over the trimmed calibration dataset
    """
    safe_calibration_samples = len(tokenized_dataset)
    if num_calibration_samples is not None:
        safe_calibration_samples = min(len(tokenized_dataset), num_calibration_samples)
        if safe_calibration_samples != num_calibration_samples:
            logger.warning(
                f"Requested {num_calibration_samples} calibration samples but "
                f"the provided dataset only has {safe_calibration_samples}."
            )

    if do_shuffle:
        tokenized_dataset = tokenized_dataset.shuffle()
    tokenized_calibration = tokenized_dataset.select(range(safe_calibration_samples))

    dataloader_params = {
        "batch_size": 1,
        "sampler": RandomSampler(tokenized_calibration)
        if do_shuffle
        else SequentialSampler(tokenized_calibration),
        "collate_fn": collate_fn,
        "pin_memory": True,
    }

    calibration_dataloader = DataLoader(tokenized_calibration, **dataloader_params)

    return calibration_dataloader


def make_dataset_splits(
    tokenized_datasets: Dict[str, Any],
    do_oneshot: bool = True,
    do_train: bool = False,
) -> Dict[str, Dataset]:
    """
    Restructures the datasets dictionary based on which tasks will be run
    (train and/or oneshot calibration).

    :param tokenized_datasets: dictionary of processed datasets
    :param do_oneshot: whether to store the calibration dataset
    :param do_train: whether to store the train dataset
    :return: A dictionary mapping "train" and "calibration" to their datasets
    """

    # handles case where all splits are contained in a single dataset
    if "all" in tokenized_datasets and len(tokenized_datasets) == 1:
        tokenized_datasets = tokenized_datasets.get("all")
        if isinstance(tokenized_datasets, Dataset):
            tokenized_datasets = {"train": tokenized_datasets}

    train_split = calib_split = None

    if do_train:
        if "train" not in tokenized_datasets:
            raise ValueError("--do_train requires a train dataset")
        train_split = tokenized_datasets["train"]
    if do_oneshot:
        calib_split = tokenized_datasets.get("calibration")
        if calib_split is None:
            if "train" not in tokenized_datasets:
                raise ValueError("--do_oneshot requires a calibration dataset")
            calib_split = tokenized_datasets["train"]

    split_datasets = {
        "train": train_split,
        "calibration": calib_split,
    }
    return split_datasets
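
As a rough usage sketch of the new helpers (the tokenizer id and sample count below are placeholders; the `DatasetArguments` fields used here all appear in this diff):

```python
from transformers import AutoTokenizer

from llmcompressor.args import DatasetArguments
from llmcompressor.datasets import get_calibration_dataloader

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder id

dataset_args = DatasetArguments(
    dataset="gsm8k",
    dataset_config_name="main",
    max_seq_length=512,
    num_calibration_samples=256,
)

# tokenizes the calibration split and wraps it in a batch-size-1 DataLoader
dataloader = get_calibration_dataloader(dataset_args, tokenizer)
batch = next(iter(dataloader))
```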
1 change: 1 addition & 0 deletions src/llmcompressor/entrypoints/__init__.py
@@ -1,2 +1,3 @@
# flake8: noqa
from .oneshot import Oneshot, oneshot
+from .utils import post_process, pre_process