
Commit e41b108

Update pruning readme
Signed-off-by: Keval Morabia <[email protected]>
1 parent f2ee949 commit e41b108


examples/pruning/README.md

Lines changed: 136 additions & 56 deletions
@@ -17,39 +17,48 @@ This section focuses on applying Model Optimizer's state-of-the-art complementar
| Pre-Requisites | Required & optional packages to use this technique | \[[Link](#pre-requisites)\] | |
| Getting Started | Learn how to use the pruning API | \[[Link](#getting-started)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html)\] |
| Support Matrix | View the support matrix to see available pruning algorithms and their compatibility with different models and frameworks | \[[Link](#support-matrix)\] | |
| Examples | Examples of different pruning methods | \[[Link](#examples)\] | |
| Pruning Guidelines | Guidelines for choosing how and how much to prune for best results | \[[Link](#pruning-guidelines)\] | |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |

</div>

## Pre-Requisites

For Minitron pruning for Megatron-LM / NeMo models, use the NeMo container (e.g., `nvcr.io/nvidia/nemo:25.11`), which has all the dependencies installed. Make sure to upgrade Model Optimizer to the latest version using `pip`.

For FastNAS pruning for PyTorch Computer Vision models, no additional dependencies are required.

For GradNAS pruning for Hugging Face BERT / GPT-J, no additional dependencies are required.

## Getting Started

As part of the pruning process, you will need to set up the training and/or validation data loaders, optionally define a validation score function (Minitron, FastNAS) or loss function (GradNAS), and specify the desired pruning constraints (see the [Support Matrix](#support-matrix) for available pruning constraints).

To prune your model, simply call the `mtp.prune` API and save the pruned model. If the model is pruned using Minitron, you can use your standard saving and loading functions since the pruning is homogeneous; for FastNAS or GradNAS, you need to use `mto.save` and `mto.restore` to save and restore the heterogeneously pruned model.
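For example, a minimal sketch of the FastNAS / GradNAS save-and-restore flow (the checkpoint path and model constructor below are placeholders):

```python
import modelopt.torch.opt as mto

# Save the heterogeneously pruned model (architecture + weights) returned by mtp.prune
mto.save(model, "pruned_model.pth")

# ... later: re-create the original (unpruned) model and restore the pruned version into it
model = build_original_model()  # placeholder for your model constructor
model = mto.restore(model, "pruned_model.pth")
```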

### Minitron

Minitron pruning supports two modes:

1. **Manual Pruning**: Manually specify the target dimensions for each pruning axis (e.g., `constraints = {"export_config": {"hidden_size": 3072, "ffn_hidden_size": 9216}}`)
2. **NAS-based Auto Pruning (New)**: Specify a target parameter count (e.g., `constraints = {"params": 6e9}`) and let the algorithm automatically search for the best architecture that maximizes a user-defined score function (e.g., MMLU, negative validation loss, etc.)

Please see example snippets of both modes for Minitron pruning on a Megatron-Core GPT model below. For end-to-end example scripts (Megatron-LM / NeMo framework), please refer to the examples below.

#### Common Setup

```python
import modelopt.torch.prune as mtp
from megatron.core.models.gpt import GPTModel
from megatron.core.post_training.modelopt.gpt.model_specs import get_gpt_modelopt_spec
from megatron.core.transformer.transformer_config import TransformerConfig

# Load the Megatron-Core GPTModel or MambaModel with the ModelOpt transformer layer spec
model_config = TransformerConfig(...)
model = GPTModel(
    config=model_config,
    transformer_layer_spec=get_gpt_modelopt_spec(model_config, remap_te_layernorm=True),
    ...
)

@@ -60,41 +69,141 @@ from megatron.training.training import evaluate_and_print_results
def forward_loop(_):
    evaluate_and_print_results(prefix, forward_step, train_iterator, model, ...)

# Run the pruning process (if model is a list then pass model[0] to the prune API)
# Save minitron scores at checkpoint so we can re-run pruning with different constraints without running the forward loop again
# NOTE: Skip checkpoint on re-running if you want to change the dataset and re-calibrate
model, pruning_scores = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints=constraints,
    dummy_input=None,  # Not used
    config=config,
)
```

> [!Note]
> Fine-tuning / distillation is required after pruning to recover the accuracy. Please refer to the [end-to-end pruning and distillation tutorial](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/qwen/pruning-distillation) for more details.

#### 1. Manual Pruning

This mode can be useful when you know the exact dimensions you want to prune to (e.g., fitting a specific latency / memory budget).

```python
# Specify the pruning constraints (Check Support Matrix for available pruning dimensions)
constraints = {"export_config": {"hidden_size": 3072, "ffn_hidden_size": 9216}}
config = {"forward_loop": forward_loop, "checkpoint": "/path/to/cache/pruning/scores.pth"}

mtp.prune(...)
```

**Under the Hood:**

1. **Importance Scoring**: Runs forward passes on calibration data (512-1024 samples) to compute activation magnitudes for each neuron/head/layer (takes ~5 minutes for an 8B model)
2. **Ranking**: Ranks all parameters within each pruning dimension (e.g., all hidden dimensions, all attention heads) by their importance scores
3. **Pruning**: Removes the least important parameters to meet the specified target dimensions in `export_config`
4. **Weight Slicing**: Slices the model weights according to the pruned architecture (homogeneous pruning - all layers pruned uniformly)

> [!TIP]
> Check out the [Pruning Guidelines](#pruning-guidelines) section for more details on how to choose the best pruning strategy and distillation hyperparameters.

#### 2. NAS-based Auto Pruning

This mode can be useful when you don't know the exact dimensions you want to prune to and want the algorithm to search for the best architecture that maximizes a user-defined score function, at the cost of a longer runtime.

```python
# Define the score function to maximize (e.g., MMLU, negative validation loss, etc.)
# The algorithm will search for the best architecture that maximizes this score
from modelopt.torch.utils.plugins.megatron_mmlu import megatron_mmlu

def score_func(m):
    return megatron_mmlu(m, tokenizer, percentage=0.05)  # 5% sampled data for faster eval

# Specify target parameter count and configure the auto pruning algorithm
constraints = {"params": 6e9}  # Prune to 6B parameters
config = {
    "forward_loop": forward_loop,
    "checkpoint": "/path/to/cache/pruning/scores.pth",
    "score_func": score_func,
    # Optional: Configure search space constraints (showing defaults)
    "max_width_pruning": 0.4,  # Maximum 40% per width pruning hparam
    "max_depth_pruning": 0.2,  # Maximum 20% per depth pruning hparam (num_layers)
    "hparams_to_skip": [],  # Disable pruning specific hparams, e.g., ["num_attention_heads"]
    "top_k": 10,  # Number of top architectures to evaluate (use 20 for better results at the cost of 2x time)
}

mtp.prune(...)
```

**Under the Hood:**

1. **Importance Scoring**: Same as manual pruning - computes activation magnitudes for all parameters (takes ~5 minutes for an 8B model)
2. **Search Space Construction**: Generates a search space of possible architectures based on the search space config and other configs (`max_width_pruning`, `max_depth_pruning`, `hparams_to_skip`)
3. **Architecture Search**: Finds candidate architectures that meet the parameter constraint and evaluates the `top_k` of them (ranked by parameter count) using `score_func`, e.g., MMLU or negative validation loss (takes ~10 minutes per candidate for an 8B model)
4. **Best Architecture Selection**: Returns the architecture (best `export_config`) with the highest actual score among the top-k evaluated architectures
5. **Weight Slicing**: Slices the model weights according to the best pruned architecture found

> [!Note]
> As per the [original paper](https://arxiv.org/pdf/2407.14679), ideally a short Knowledge Distillation run on ~2B tokens should be performed for all top-k candidate architectures before evaluating the score function. This takes much longer, requires splitting the pruning process into multiple stages, and needs considerably more compute, but can lead to a better pruned model. If you are interested in doing this, take each top-k candidate's `export_config` from the pruning logs, export the models separately, and perform Knowledge Distillation on each of them before evaluating the score function.

#### Advanced Configuration

For finer control over the search space (e.g., granularity of pruning choices), you can configure the divisors:

```python
# Configure search space granularity (showing defaults)
ss_config = mtp.mcore_minitron.get_mcore_minitron_config(
    hidden_size_divisor=256,
    ffn_hidden_size_divisor=512,
    mamba_head_dim_divisor=8,
    num_moe_experts_divisor=8,
    num_layers_divisor=2,
)

# Use the custom search space config
mtp.prune(model, mode=[("mcore_minitron", ss_config)], ...)
```

If your model parameters are already sorted and you just want to prune the weights, you can skip the sorting step by setting `"skip_sorting": True` in `config` instead of passing `forward_loop`.
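For instance, a minimal sketch reusing the manual-pruning constraints from above (no `forward_loop` is needed when sorting is skipped):

```python
# Sketch: skip the importance sorting pass (assumes model weights are already sorted)
config = {"skip_sorting": True}

model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={"export_config": {"hidden_size": 3072, "ffn_hidden_size": 9216}},
    dummy_input=None,  # Not used
    config=config,
)
```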

## Support Matrix

| **Algorithm** | **Model** | **Pruning Constraints** |
| :---: | :---: | :---: |
| Minitron | Megatron-core / NeMo based GPT / Mamba / MoE / Hybrid LLM Models<sup>1</sup> | **Manual:** `export_config` with width (`hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `mamba_num_heads`, `mamba_head_dim`, `num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size`) and/or depth (`num_layers`) pruned values<br>**Auto:** `params` (requires `score_func` in config) |
| FastNAS | Computer Vision models | `flops`, `params` |
| GradNAS | HuggingFace BERT, GPT-J | `flops`, `params` |

> *<sup>1.</sup>Only Pipeline Parallel models are supported. Hugging Face models can be converted to Megatron-LM/NeMo format and used subsequently.*

## Examples

### Minitron Pruning for Megatron-LM / NeMo Framework LLMs (e.g. Qwen 3, Nemotron Nano)

Check out the Minitron pruning example for the [Megatron-LM Framework](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt#-pruning) or [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html), which showcases the usage of the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama-3.1-8B, Qwen3-8B, Nemotron-Nano-9B-v2, Nemotron-3-Nano-30B-A3B, etc.
Both frameworks support importing from a Hugging Face pretrained checkpoint.

You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/qwen/pruning-distillation), which showcase the usage of Minitron pruning followed by distillation for Qwen3-8B step-by-step in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently, as shown in the tutorial.

Some of the models pruned using the Minitron method followed by distillation and post-training are:

- [Minitron Collection on Hugging Face](https://huggingface.co/collections/nvidia/minitron)
- [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)

### FastNAS Pruning for PyTorch Computer Vision Models

Check out the FastNAS pruning example usage in the [documentation](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html#pruning-and-subnet-search).

You can also take a look at the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory, which showcases the usage of FastNAS for pruning a ResNet 20 model for the CIFAR-10 dataset. The notebook also shows how to profile the model to understand the search space of possible pruning options and demonstrates how to save and restore pruned models.
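For orientation, a minimal FastNAS sketch along the lines of the linked documentation (the model, calibration data loader, and score function below are placeholders):

```python
import torch

import modelopt.torch.opt as mto
import modelopt.torch.prune as mtp

# Placeholders: a CV model, a small calibration data loader, and a score function
# that evaluates a candidate model and returns a scalar (e.g., validation accuracy)
model = build_cv_model()
dummy_input = torch.randn(1, 3, 224, 224)

pruned_model, _ = mtp.prune(
    model,
    mode="fastnas",
    constraints={"flops": "60%"},  # e.g., keep at most ~60% of the original FLOPs
    dummy_input=dummy_input,
    config={"data_loader": calib_loader, "score_func": score_func},
)

# FastNAS pruning is heterogeneous, so use mto.save / mto.restore
mto.save(pruned_model, "fastnas_pruned.pth")
```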

### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)

Check out the BERT pruning example in the [chained_optimizations](../chained_optimizations/README.md) directory, which showcases the usage of GradNAS for pruning a BERT model for Question Answering followed by fine-tuning with distillation and quantization. The example also demonstrates how to save and restore pruned models.
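As a rough sketch of what the GradNAS call can look like (the model, data loader, loss function, and dummy input are placeholders, and the exact config keys are best taken from the linked example):

```python
import modelopt.torch.prune as mtp

# Placeholders: a Hugging Face BERT QA model, a training data loader, and a loss
# function that runs a forward pass on a batch and returns the training loss
pruned_model, _ = mtp.prune(
    bert_model,
    mode="gradnas",
    constraints={"flops": "60%"},  # target ~60% of the original FLOPs
    dummy_input=dummy_input,  # e.g., a tokenized example batch
    config={"data_loader": train_loader, "loss_func": loss_func},
)
```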

## Pruning Guidelines

### Minitron
@@ -173,35 +282,6 @@ After pruning, distillation is required to recover model accuracy. Below are rec
> [!TIP]
> If you know the maximum learning rate used during the original training, a good rule of thumb for knowledge distillation is to use **1/5th of that maximum LR** when compressing by ~50%.

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
