| Pre-Requisites | Required & optional packages to use this technique |\[[Link](#pre-requisites)\]||
| Getting Started | Learn how to use the pruning API |\[[Link](#getting-started)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html)\]|
| Support Matrix | View the support matrix to see available pruning algorithms and their compatibility with different models and frameworks |\[[Link](#support-matrix)\]||
| Examples | Examples of different pruning methods |\[[Link](#examples)\]||
| Pruning Guidelines | Guidelines for choosing how and how much to prune for best results |\[[Link](#pruning-guidelines)\]||
| Resources | Extra links to relevant resources |\[[Link](#resources)\]||

</div>

## Pre-Requisites

For Minitron pruning for Megatron-LM / NeMo models, use the NeMo container (e.g., `nvcr.io/nvidia/nemo:25.11`) which has all the dependencies installed. Make sure to upgrade Model Optimizer to the latest version using `pip`.

For FastNAS pruning for PyTorch Computer Vision models, no additional dependencies are required.

For GradNAS pruning for Hugging Face BERT / GPT-J, no additional dependencies are required.

## Getting Started
As part of the pruning process, you will need to set up the training and/or validation data loaders, optionally define a validation score function (Minitron, FastNAS) or loss function (GradNAS), and specify the desired pruning constraints (see the [Support Matrix](#support-matrix) for available pruning constraints).

To prune your model, simply call the `mtp.prune` API and save the pruned model. If the model is pruned with Minitron, you can use your standard saving and loading functions since the pruning is homogeneous; for FastNAS or GradNAS, you need to use `mto.save` and `mto.restore` to save and restore the heterogeneously pruned model.
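
For FastNAS / GradNAS, a minimal save-and-restore sketch (assuming `model` is your pruned model and `build_model()` is a hypothetical helper that re-creates the original, unpruned architecture) might look like this:

```python
import modelopt.torch.opt as mto

# Persist the pruned architecture and weights together
mto.save(model, "pruned_model.pth")

# Later: rebuild the original (unpruned) model and restore the pruned architecture + weights into it
model = build_model()  # hypothetical helper
model = mto.restore(model, "pruned_model.pth")
```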
### Minitron
Minitron pruning supports two modes:
1. **Manual Pruning**: Manually specify the target dimensions for each pruning axis (e.g., `constraints = {"export_config": {"hidden_size": 3072, "ffn_hidden_size": 9216}}`)
2. **NAS-based Auto Pruning (New)**: Specify a target parameter count (e.g., `constraints = {"params": 6e9}`) and let the algorithm automatically search for the best architecture that maximizes a user-defined score function (e.g., MMLU, negative validation loss, etc.)

Please see example snippets of both modes for Minitron pruning on a Megatron-Core GPT model below. For end-to-end example scripts (Megatron-LM / NeMo framework), please refer to the examples below.
#### Common Setup
```python
import modelopt.torch.prune as mtp
from megatron.core.models.gpt import GPTModel
from megatron.core.post_training.modelopt.gpt.model_specs import get_gpt_modelopt_spec
from megatron.core.transformer.transformer_config import TransformerConfig

# Load the Megatron-Core GPTModel / MambaModel with the ModelOpt transformer layer spec
config = TransformerConfig(...)
model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_modelopt_spec(...),
    vocab_size=...,
    max_sequence_length=...,
)

# Define a forward_loop that runs a few batches of calibration data through the model
# (used to estimate the importance of each prunable neuron/head/layer)
def forward_loop(model):
    ...
```
If your model parameters are already sorted, you can skip the sorting step by setting `"skip_sorting": True` in `config` instead of passing `forward_loop`.
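
For example, a minimal sketch of that configuration change (the rest of the `mtp.prune` call stays the same as shown below):

```python
# Weights are already sorted by importance, so skip sorting instead of passing a forward_loop
config = {"skip_sorting": True}  # instead of config = {"forward_loop": forward_loop}
```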
> [!Note]
> Fine-tuning / distillation is required after pruning to recover accuracy. Please refer to the [end-to-end pruning and distillation tutorial](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/qwen/pruning-distillation) for more details.
#### 1. Manual Pruning
This mode can be useful when you know the exact dimensions you want to prune to (e.g. fitting a specific latency / memory budget).
```python
# Specify the pruning constraints (check Support Matrix for available pruning dimensions)
constraints = {"export_config": {"hidden_size": 3072, "ffn_hidden_size": 9216}}

# Prune the model; forward_loop (from the Common Setup) provides calibration data for importance scoring
mtp.prune(...)
```

**Under the Hood:**

1. **Importance Scoring**: Runs forward passes on calibration data (512-1024 samples) to compute activation magnitudes for each neuron/head/layer (takes ~5 minutes for an 8B model)
2. **Ranking**: Ranks all parameters within each pruning dimension (e.g., all hidden dimensions, all attention heads) by their importance scores
3. **Pruning**: Removes the least important parameters to meet the specified target dimensions in `export_config`
4. **Weight Slicing**: Slices the model weights according to the pruned architecture (homogeneous pruning - all layers pruned uniformly)

> [!TIP]
> Check out the [Pruning Guidelines](#pruning-guidelines) section for more details on how to choose the best pruning strategy and distillation hyperparameters.
#### 2. NAS-based Auto Pruning
This mode can be useful when you don't know the exact dimensions you want to prune to and want the algorithm to search for the best architecture that maximizes a user-defined score function at the cost of longer runtime.
```python
# Define the score function to maximize (e.g., MMLU, negative validation loss, etc.)
# The algorithm will search for the best architecture that maximizes this score
from modelopt.torch.utils.plugins.megatron_mmlu import megatron_mmlu

def score_func(m):
    return megatron_mmlu(m, tokenizer, percentage=0.05)  # 5% sampled data for faster eval

# Specify target parameter count and configure the auto pruning algorithm
constraints = {"params": 6e9}  # Prune to 6B parameters

config = {
    # score_func (and any other required options) are passed here
    # Optional: Configure search space constraints (showing defaults)
    "max_width_pruning": 0.4,  # Maximum 40% per width pruning hparam
    "max_depth_pruning": 0.2,  # Maximum 20% per depth pruning hparam (num_layers)
    "hparams_to_skip": [],  # Disable pruning specific hparams, e.g., ["num_attention_heads"]
    "top_k": 10,  # Number of top architectures to evaluate (use 20 for better results at the cost of 2x time)
}

mtp.prune(...)
```
**Under the Hood:**
1. **Importance Scoring**: Same as manual pruning - computes activation magnitudes for all parameters (takes ~5 minutes for an 8B model)
2. **Search Space Construction**: Generates a search space of possible architectures based on the search space config and other options (`max_width_pruning`, `max_depth_pruning`, `hparams_to_skip`)
3. **Architecture Search**: Finds candidate architectures that meet the parameter constraint and evaluates the `top_k` of them (selected by parameter count) using `score_func`, e.g., MMLU, negative validation loss, etc. (takes ~10 minutes per candidate for an 8B model)
4. **Best Architecture Selection**: Returns the architecture (best `export_config`) with the highest actual score among the top-K evaluated architectures
5. **Weight Slicing**: Slices the model weights according to the best pruned architecture found
> [!Note]
> As per the [original paper](https://arxiv.org/pdf/2407.14679), ideally we would perform a short knowledge distillation on ~2B tokens for all top-K candidate architectures before evaluating the score function. This takes much longer, requires splitting the pruning process into multiple stages, and needs far more compute, but can lead to a better pruned model. If you are interested in doing this, you can take the top-K candidates' `export_config` from the pruning logs, export each model separately, and perform knowledge distillation on each of them before evaluating the score function.
#### Advanced Configuration
For finer control over the search space (e.g., granularity of pruning choices), you can configure the divisors:
```python
# Configure search space granularity (showing defaults), e.g., the divisors used for each pruning dimension
...
```

If your model parameters are already sorted and you just want to prune the weights, you can skip the sorting step by setting `"skip_sorting": True` in `config` instead of passing `forward_loop`.

## Examples

### Minitron Pruning for Megatron-LM / NeMo Framework LLMs

Check out the Minitron pruning example for the [Megatron-LM Framework](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt#-pruning) or [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html), which showcases the usage of the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama-3.1-8B, Qwen3-8B, Nemotron-Nano-9B-v2, Nemotron-3-Nano-30B-A3B, etc.
Both frameworks support importing from a Hugging Face pretrained checkpoint.

You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/qwen/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Qwen3-8B step-by-step in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently as shown in the tutorial.

Some of the models pruned using the Minitron method followed by distillation and post-training are:

- [Minitron Collection on Hugging Face](https://huggingface.co/collections/nvidia/minitron)

### FastNAS Pruning for PyTorch Computer Vision Models
Check out the FastNAS pruning example usage in the [documentation](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html#pruning-and-subnet-search).

You can also take a look at the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory
which showcases the usage of FastNAS for pruning a ResNet 20 model for the CIFAR-10 dataset. The notebook
also shows how to profile the model to understand the search space of possible pruning options and demonstrates
how to save and restore pruned models.
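
As a rough orientation only, a FastNAS pruning call follows the same `mtp.prune` pattern as above; in the sketch below the `"fastnas"` mode string, the `"flops"` constraint, and the `config` keys are assumptions to verify against the linked documentation, and `model`, `train_loader`, and `score_func` are placeholders you would define yourself:

```python
import torch
import modelopt.torch.opt as mto
import modelopt.torch.prune as mtp

# Sketch of FastNAS pruning for a CV model (parameter names are assumptions; see the docs above)
pruned_model, _ = mtp.prune(
    model,
    mode="fastnas",
    constraints={"flops": "60%"},           # target roughly 60% of the original FLOPs
    dummy_input=torch.randn(1, 3, 32, 32),  # example input used to trace/profile the model
    config={
        "data_loader": train_loader,        # calibration data for the subnet search
        "score_func": score_func,           # validation score used to rank candidate subnets
    },
)

# FastNAS pruning is heterogeneous, so persist it with mto.save / mto.restore
mto.save(pruned_model, "fastnas_pruned_model.pth")
```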
### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)
Check out the BERT pruning example in the [chained_optimizations](../chained_optimizations/README.md) directory
which showcases the usage of GradNAS for pruning a BERT model for Question Answering followed by fine-tuning
with distillation and quantization. The example also demonstrates how to save and restore pruned models.
## Pruning Guidelines
### Minitron

After pruning, distillation is required to recover model accuracy.

> [!TIP]
> If you know the maximum learning rate used during the original training, a good rule of thumb for knowledge distillation is to use **1/5th of that maximum LR** when compressing by ~50%.
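> For example, if the original training used a maximum LR of 3e-4, this rule of thumb suggests a distillation LR of about 6e-5.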