diff --git a/docs/nemotron/super3/README.md b/docs/nemotron/super3/README.md
index ca1206bcf..3ce922220 100644
--- a/docs/nemotron/super3/README.md
+++ b/docs/nemotron/super3/README.md
@@ -64,6 +64,7 @@ $ uv run nemotron super3 sft --run YOUR-CLUSTER
 |-------|------|---------|-------|
 | 0 | [Pretraining](./pretrain.md) | Base model training with MoE and multi-token prediction | [pretrain.md](./pretrain.md) |
 | 1 | [SFT](./sft.md) | Multi-domain instruction tuning | [sft.md](./sft.md) |
+| — | [Quantization](./quantization.md) | Post-training quantization (FP8 / NVFP4) | [quantization.md](./quantization.md) |
 
 ## Model Specifications
 
@@ -164,6 +165,7 @@ wandb login
 
 - [Stage 0: Pretraining](./pretrain.md)
 - [Stage 1: SFT](./sft.md)
+- [Quantization (PTQ)](./quantization.md)
 - [Artifact Lineage](../../nemo_runspec/artifacts.md)
 - [Execution through NeMo-Run](../../nemo_runspec/nemo-run.md)
 - [W&B Integration](../wandb.md)
diff --git a/docs/nemotron/super3/quantization.md b/docs/nemotron/super3/quantization.md
new file mode 100644
index 000000000..3c31e67c6
--- /dev/null
+++ b/docs/nemotron/super3/quantization.md
@@ -0,0 +1,112 @@
+# Quantization (PTQ)
+
+This stage quantizes the pretrained Nemotron 3 Super model using [Megatron-Bridge](../nvidia-stack.md#megatron-bridge)'s post-training quantization (PTQ) pipeline.
+
+---
+
+## Quantization Configurations
+
+Nemotron 3 Super supports four quantization configurations tailored for the Mamba-MoE architecture:
+
+| Config Name | Format | Description |
+|---|---|---|
+| `mamba_moe_fp8_aggressive` | FP8 | Aggressive FP8 quantization for Mamba-MoE |
+| `mamba_moe_fp8_conservative` | FP8 | Conservative FP8 quantization for Mamba-MoE |
+| `mamba_moe_nvfp4_aggressive` | NVFP4 | Aggressive NVFP4 quantization for Mamba-MoE |
+| `mamba_moe_nvfp4_conservative` | NVFP4 | Conservative NVFP4 quantization for Mamba-MoE |
+
+Pass the desired config name via `--export-quant-cfg` to `quantize.py`.
+
+---
+
+## Recipe Execution
+
+### Direct Script Execution (Megatron-Bridge)
+
+For direct execution outside this CLI, use the scripts in the [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) repository:
+
+```bash
+# Clone the repository and check out the super-v3 branch
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout super-v3
+```
+
+### Quantize
+
+```bash
+export HF_MODEL=/path/to/hf/model
+export MEGATRON_SAVE_PATH=/path/to/quantized/megatron/ckpt
+
+torchrun --nproc_per_node=16 examples/quantization/quantize.py \
+    --hf-model-id "$HF_MODEL" \
+    --export-quant-cfg mamba_moe_nvfp4_conservative \
+    --megatron-save-path "$MEGATRON_SAVE_PATH" \
+    --pp 2 \
+    --tp 8 \
+    --ep 8 \
+    --trust-remote-code
+```
+
+### Resume Quantized Megatron Checkpoint and Generate
+
+```bash
+torchrun --nproc_per_node=16 examples/quantization/ptq_generate.py \
+    --hf-model-id "$HF_MODEL" \
+    --megatron-load-path "$MEGATRON_SAVE_PATH" \
+    --pp 2 \
+    --tp 8 \
+    --ep 8 \
+    --trust-remote-code
+```
+
+### Export Quantized Megatron Checkpoint to Hugging Face Checkpoint
+
+After quantization, export the Megatron checkpoint back to Hugging Face format:
+
+```bash
+export EXPORT_DIR=/path/to/output/hf/ckpt
+
+torchrun --nproc_per_node=16 examples/quantization/export.py \
+    --hf-model-id "$HF_MODEL" \
+    --megatron-load-path "$MEGATRON_SAVE_PATH" \
+    --export-dir "$EXPORT_DIR" \
+    --pp 8 \
+    --dtype bfloat16 \
+    --trust-remote-code
+```
+
+Notes:
+- For multi-node setups (e.g. 2 nodes with 8× H100), increase `--pp` accordingly (e.g. `--pp 2`) and use a job scheduler like SLURM to launch across nodes.
+
+---
+
+## Infrastructure
+
+This stage uses the following components from the [NVIDIA AI Stack](../nvidia-stack.md):
+
+| Component | Role | Documentation |
+|-----------|------|---------------|
+| [Megatron-Core](../nvidia-stack.md#megatron-core) | Distributed training primitives (TP, PP, EP) | [GitHub](https://github.com/NVIDIA/Megatron-LM) |
+| [Megatron-Bridge](../nvidia-stack.md#megatron-bridge) | Post-training quantization (PTQ), checkpoint export | [Docs](https://docs.nvidia.com/nemo/megatron-bridge/latest/) |
+| [Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) | Quantization algorithms (FP8, NVFP4) | [GitHub](https://github.com/NVIDIA/TensorRT-Model-Optimizer) |
+
+### Parallelism Configuration
+
+| Parallelism | Default | Flag |
+|-------------|---------|------|
+| Tensor (TP) | 8 | `--tp` |
+| Pipeline (PP) | 2 | `--pp` |
+| Expert (EP) | 8 | `--ep` |
+
+**Minimum resources:** 2 nodes with 8× H100 GPUs.
+
+---
+
+## Reference
+
+- [Megatron-Bridge Nemotron 3 Super](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md) — Megatron-Bridge documentation and examples
+- [NVIDIA AI Stack](../nvidia-stack.md) — Megatron-Core, Megatron-Bridge documentation
+- [Stage 0: Pretraining](./pretrain.md) — Pretrain the base model
+- [Stage 1: SFT](./sft.md) — Supervised fine-tuning
+- [Back to Overview](./README.md)