10 changes: 8 additions & 2 deletions docs/train-models/how-to/convert-checkpoints.md
@@ -37,6 +37,8 @@ Before conversion:
- Keep output paths separate from input paths. A failed conversion should never overwrite the source checkpoint.
- Keep tokenizer and chat-template provenance with the checkpoint. If the converter needs `hf_model_id`, use the original model or config source used by training.
- For LoRA merge, use the exact base checkpoint the adapter was trained against.
- For large Megatron checkpoints, use the default distributed conversion path. The default config runs `nvcr.io/nvidia/nemo:26.04`, which ships the multi-GPU Megatron-Bridge conversion script.
- Keep `tp`, `pp`, `ep`, and `etp` aligned with the model or checkpoint layout. The default distributed conversion path uses `tp=1 pp=1 ep=8 etp=1` for Nemotron MoE checkpoints; dense models usually need an override such as `tp=8 pp=1 ep=1 etp=1`.
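
The first bullet above (keep output paths separate from input paths) can be checked before a conversion is launched. The following is a minimal sketch, not part of the `nemotron` CLI: the helper name `assert_paths_separate` is hypothetical, and it only applies when the source is a local directory rather than a Hugging Face model id.

```python
from pathlib import Path


def assert_paths_separate(input_path: str, output_path: str) -> None:
    """Raise ValueError if the output path equals, contains, or sits inside the input path.

    Hypothetical pre-flight helper; the conversion step itself may or may not
    perform an equivalent check.
    """
    src = Path(input_path).resolve()
    dst = Path(output_path).resolve()
    if src == dst or src in dst.parents or dst in src.parents:
        raise ValueError(
            f"output path {dst} overlaps input path {src}; choose a separate directory"
        )


if __name__ == "__main__":
    # Sibling directories are fine.
    assert_paths_separate("/data/ckpt/hf", "/data/ckpt/megatron")
    # Writing into a subdirectory of the source is rejected.
    try:
        assert_paths_separate("/data/ckpt", "/data/ckpt/out")
    except ValueError as err:
        print(f"rejected: {err}")
```

Rejecting nested paths as well as identical ones is a deliberate choice in this sketch: a failed run that writes inside the source tree can still leave partial files mixed in with the original checkpoint.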

## Convert Hugging Face to Megatron

@@ -45,10 +47,12 @@ Use this path when a Megatron-Bridge consumer needs a Megatron distributed check
```console
$ nemotron steps run convert/hf_to_megatron -c default \
hf_model_id=/path/to/hf_checkpoint_or_model_id \
-megatron_path=/path/to/output_megatron_checkpoint
+megatron_path=/path/to/output_megatron_checkpoint \
+tp=1 pp=1 ep=8
```

For NVIDIA Nemotron checkpoints, keep `dtype=bfloat16` unless the source checkpoint requires another dtype.
The step fails early if multiple ranks are launched but all model-parallel values are left at `1`, because that would not reduce per-GPU model memory.
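
The fail-early rule described above can be illustrated with a short sketch. This is not the step's actual implementation: `ParallelLayout` and `check_layout` are hypothetical names, and only the defaults (`tp=1 pp=1 ep=8 etp=1`) and the rejected case (several ranks with every size left at `1`) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Model-parallel sizes, using the documented defaults for Nemotron MoE."""

    tp: int = 1   # tensor parallel
    pp: int = 1   # pipeline parallel
    ep: int = 8   # expert parallel
    etp: int = 1  # expert-tensor parallel


def check_layout(world_size: int, layout: ParallelLayout) -> None:
    """Reject launches where extra ranks would not shard the model at all."""
    sizes = (layout.tp, layout.pp, layout.ep, layout.etp)
    if any(size < 1 for size in sizes):
        raise ValueError(f"parallel sizes must be >= 1, got {sizes}")
    if world_size > 1 and all(size == 1 for size in sizes):
        raise ValueError(
            f"{world_size} ranks launched with tp=pp=ep=etp=1: every rank would hold "
            "the full model, so per-GPU memory is not reduced"
        )


if __name__ == "__main__":
    check_layout(8, ParallelLayout())            # default MoE layout: accepted
    check_layout(8, ParallelLayout(tp=8, ep=1))  # dense-style override: accepted
    try:
        check_layout(8, ParallelLayout(ep=1))    # all sizes 1 with 8 ranks: rejected
    except ValueError as err:
        print(f"rejected: {err}")
```

A single-rank launch with all sizes at `1` is allowed in this sketch, since nothing is being sharded there by design.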

## Convert Megatron to Hugging Face

@@ -58,10 +62,12 @@ Use this path when the next consumer is Hugging Face-native evaluation, deployme
$ nemotron steps run convert/megatron_to_hf -c default \
megatron_path=/path/to/megatron/iter_0000100 \
hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
-hf_path=/path/to/output_hf_checkpoint
+hf_path=/path/to/output_hf_checkpoint \
+tp=1 pp=1 ep=8
```

The `hf_model_id` value supplies the model configuration and tokenizer expectations used to reconstruct the Hugging Face layout.
Keep `tp`, `pp`, `ep`, and `etp` aligned with the source Megatron checkpoint for export.

## Merge LoRA Into a Hugging Face Base

35 changes: 34 additions & 1 deletion docs/train-models/reference/convert/hf-to-megatron.md
@@ -79,14 +79,46 @@ Whether to trust Hugging Face custom model code when AutoBridge loads the source
Default: `true`.
```

```{option} distributed=<true-or-false-or-auto>

Use the mounted multi-GPU converter instead of the single-process AutoBridge helper.
Keep this enabled for large models that cannot be materialized on one GPU.

Default: `true`.
```

```{option} tp=<int> pp=<int> ep=<int> etp=<int>

Tensor, pipeline, expert, and expert-tensor parallel sizes for the Megatron checkpoint written by the converter.
The defaults are `tp=1 pp=1 ep=8 etp=1`, matching the common Nemotron MoE conversion path.
Override these for dense models or a different target layout.

Defaults: `tp=1`, `pp=1`, `ep=8`, `etp=1`.
```

```{option} torchrun.nproc_per_node=<int>

Number of local conversion ranks when the step has to launch `torchrun` itself.
When a backend already launches the step with `torchrun`, the existing distributed world is reused.

Default: `NEMOTRON_CONVERT_NPROC_PER_NODE` or `8`.
```
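
The default resolution and the reuse of an existing distributed world described for `torchrun.nproc_per_node` might look roughly like the sketch below. The function names are hypothetical and the detection logic is an assumption; what is known is that `torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each worker, and that the documented default is `NEMOTRON_CONVERT_NPROC_PER_NODE` or `8`.

```python
import os
from typing import Mapping, Optional


def resolve_nproc_per_node(configured: Optional[int], env: Mapping[str, str]) -> int:
    """Pick the local rank count: explicit option, then environment variable, then 8."""
    if configured is not None:
        return configured
    # An unset or empty variable falls back to the documented default of 8.
    return int(env.get("NEMOTRON_CONVERT_NPROC_PER_NODE") or "8")


def launched_by_torchrun(env: Mapping[str, str]) -> bool:
    """Assumed detection: torchrun sets these variables for every worker process."""
    return all(name in env for name in ("RANK", "LOCAL_RANK", "WORLD_SIZE"))


if __name__ == "__main__":
    if launched_by_torchrun(os.environ):
        print("existing distributed world detected; reuse it")
    else:
        nproc = resolve_nproc_per_node(None, os.environ)
        print(f"would launch torchrun with --nproc_per_node={nproc}")
```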

```{option} script.path=<path>

Path to Megatron-Bridge's `convert_checkpoints_multi_gpu.py`.
Defaults to the path shipped in `nvcr.io/nvidia/nemo:26.04`: `/opt/Megatron-Bridge/examples/conversion/convert_checkpoints_multi_gpu.py`.
```

## Command Examples

Convert the default NVIDIA Nemotron base model into a local Megatron output directory:

```console
$ nemotron steps run convert/hf_to_megatron -c default \
hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
-megatron_path=./output/convert/nano3-megatron
+megatron_path=./output/convert/nano3-megatron \
+tp=1 pp=1 ep=8
```

Submit the conversion through a generated Lepton profile:
@@ -102,6 +134,7 @@ $ nemotron steps run convert/hf_to_megatron -c default --batch lepton_convert_mo
- If the source came from LoRA training, merge the adapter into the original base first with `convert/merge_lora`.
- If tokenizer or model config files are missing, use the original Hugging Face model id as `hf_model_id`.
- If conversion fails, retry into a fresh `megatron_path` instead of reusing a partially written directory.
- If `distributed=true` launches multiple ranks with `tp=pp=ep=etp=1`, the step fails early because that would not shard the model. Set the real target parallelism, such as `tp=1 pp=1 ep=8 etp=1` for Nemotron MoE or `tp=8 pp=1 ep=1 etp=1` for a dense model.
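
The advice above to retry into a fresh `megatron_path` can be enforced with a small guard before relaunching. This is a sketch under assumptions: `ensure_fresh_output` is a hypothetical helper, and the conversion step may handle existing directories differently.

```python
from pathlib import Path


def ensure_fresh_output(path: str) -> Path:
    """Create the output directory, refusing to reuse one that already has content."""
    out = Path(path)
    if out.exists() and (not out.is_dir() or any(out.iterdir())):
        raise FileExistsError(
            f"{out} already exists and is not an empty directory; "
            "pick a new output path instead of reusing a partial conversion"
        )
    out.mkdir(parents=True, exist_ok=True)
    return out
```

An existing but empty directory is accepted, so pre-created mount points still work.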

## Related Documentation

41 changes: 40 additions & 1 deletion docs/train-models/reference/convert/megatron-to-hf.md
@@ -84,6 +84,43 @@ Whether Megatron-Bridge should require source and target checkpoint keys to matc
Default: `true`.
```

```{option} distributed=<true-or-false-or-auto>

Use the mounted multi-GPU converter instead of the single-process AutoBridge helper.
Keep this enabled for large checkpoints that cannot be loaded on one GPU.

Default: `true`.
```

```{option} tp=<int> pp=<int> ep=<int> etp=<int>

Tensor, pipeline, expert, and expert-tensor parallel sizes used by the source Megatron checkpoint.
These values must match the checkpoint layout.

Defaults: `tp=1`, `pp=1`, `ep=8`, `etp=1`.
```

```{option} distributed_save=<true-or-false>

Let ranks write assigned Hugging Face shards independently, reducing rank-0 memory pressure during export.

Default: `true`.
```
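
To illustrate why `distributed_save` reduces rank-0 memory pressure, the sketch below splits output shards across ranks round-robin so that no single rank has to hold every shard. This is only an illustration of the idea: the assignment policy and the name `assign_shards` are assumptions, not the Megatron-Bridge implementation. The shard file names follow the common Hugging Face `model-XXXXX-of-YYYYY.safetensors` pattern.

```python
from typing import Dict, List


def assign_shards(shard_names: List[str], world_size: int) -> Dict[int, List[str]]:
    """Assign each shard to exactly one rank, round-robin over sorted names."""
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    plan: Dict[int, List[str]] = {rank: [] for rank in range(world_size)}
    for index, name in enumerate(sorted(shard_names)):
        plan[index % world_size].append(name)
    return plan


if __name__ == "__main__":
    shards = [f"model-{i:05d}-of-00004.safetensors" for i in range(1, 5)]
    for rank, names in assign_shards(shards, world_size=2).items():
        print(rank, names)
```

Sorting before assignment keeps the plan deterministic, so every rank can compute the same mapping without communication.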

```{option} torchrun.nproc_per_node=<int>

Number of local conversion ranks when the step has to launch `torchrun` itself.
When a backend already launches the step with `torchrun`, the existing distributed world is reused.

Default: `NEMOTRON_CONVERT_NPROC_PER_NODE` or `8`.
```

```{option} script.path=<path>

Path to Megatron-Bridge's `convert_checkpoints_multi_gpu.py`.
Defaults to the path shipped in `nvcr.io/nvidia/nemo:26.04`: `/opt/Megatron-Bridge/examples/conversion/convert_checkpoints_multi_gpu.py`.
```

## Command Examples

Export a validated Megatron checkpoint iteration to Hugging Face layout:
@@ -92,7 +129,8 @@ Export a validated Megatron checkpoint iteration to Hugging Face layout:
$ nemotron steps run convert/megatron_to_hf -c default \
megatron_path=/path/to/megatron/iter_0000100 \
hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
-hf_path=./output/convert/sft-hf
+hf_path=./output/convert/sft-hf \
+tp=1 pp=1 ep=8
```

Submit the export through a generated Lepton profile:
@@ -108,6 +146,7 @@ $ nemotron steps run convert/megatron_to_hf -c default --batch lepton_convert_mo

- If export fails because the checkpoint is incomplete, wait for async checkpoint save to finish and retry from a complete `iter_*` directory.
- If tokenizer or config reconstruction fails, set `hf_model_id` to the original base model or config source.
- If `distributed=true` launches multiple ranks with `tp=pp=ep=etp=1`, the step fails early because that would not shard the model. Set the real source checkpoint parallelism, such as `tp=1 pp=1 ep=8 etp=1` for Nemotron MoE or `tp=8 pp=1 ep=1 etp=1` for a dense checkpoint.
- Validate the exported Hugging Face checkpoint with a small generation or evaluation job before deployment.
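
Before spending GPU time on the generation or evaluation job recommended above, a cheap file-level check can catch an obviously incomplete export. The sketch below assumes the usual Hugging Face layout (a `config.json` plus `*.safetensors` or `*.bin` weight files); the exact files written by the converter are not specified here, and `missing_hf_files` is a hypothetical helper. It complements the generation check rather than replacing it.

```python
import json
from pathlib import Path
from typing import List


def missing_hf_files(hf_path: str) -> List[str]:
    """Return a list of problems found in an exported checkpoint directory."""
    root = Path(hf_path)
    problems: List[str] = []

    config = root / "config.json"
    if not config.is_file():
        problems.append("config.json is missing")
    else:
        try:
            json.loads(config.read_text())
        except json.JSONDecodeError:
            problems.append("config.json is not valid JSON")

    # Assumed weight formats: safetensors shards or legacy .bin files.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight files found")

    return problems
```

An empty list means only that the expected files are present; it says nothing about weight correctness.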

## Related Documentation
9 changes: 5 additions & 4 deletions src/nemo_runspec/config/loader.py
@@ -247,10 +247,11 @@ def build_job_config(
 profile_mounts = profile_env.get("mounts") or []
 if existing_mounts or profile_mounts:
     merged_env["mounts"] = list(existing_mounts) + list(profile_mounts)
-# Re-apply YAML resource keys so recipe requirements win over profile defaults.
-# The recipe knows how many nodes/GPUs it needs; env.toml provides cluster
-# logistics (account, partition, tunnel, mounts) the recipe doesn't know about.
-resource_keys = ("nodes", "gpus_per_node", "ntasks_per_node", "nproc_per_node")
+# Re-apply YAML-owned execution keys so recipe requirements win over
+# inherited profile defaults. The recipe/config knows which image and
+# resource shape it requires; env.toml provides cluster logistics
+# (account, partition, tunnel, site mounts) the recipe doesn't know about.
+resource_keys = ("container_image", "nodes", "gpus_per_node", "ntasks_per_node", "nproc_per_node")
 for key in resource_keys:
     if key in existing_env:
         merged_env[key] = existing_env[key]
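
The precedence rule in this hunk can be sketched in isolation. The function name `merge_env` and the initial overlay step (profile values applied on top of YAML values) are assumptions about the surrounding `build_job_config` code, which the diff does not show; the mount concatenation and the re-application of YAML-owned keys mirror the lines above.

```python
from typing import Any, Dict

# Keys the recipe YAML owns after this change (taken from the diff).
YAML_OWNED_KEYS = ("container_image", "nodes", "gpus_per_node", "ntasks_per_node", "nproc_per_node")


def merge_env(existing_env: Dict[str, Any], profile_env: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a recipe's YAML env with an env.toml profile."""
    # Assumption: the profile is overlaid first, so it wins by default.
    merged: Dict[str, Any] = {**existing_env, **profile_env}

    # Mounts from both sources are kept, YAML mounts first.
    mounts = list(existing_env.get("mounts") or []) + list(profile_env.get("mounts") or [])
    if mounts:
        merged["mounts"] = mounts

    # YAML-owned execution keys are re-applied so recipe requirements win.
    for key in YAML_OWNED_KEYS:
        if key in existing_env:
            merged[key] = existing_env[key]
    return merged


if __name__ == "__main__":
    recipe = {"container_image": "nvcr.io/nvidia/nemo:26.04", "nodes": 1, "mounts": ["/a:/a"]}
    profile = {"container_image": "site/default:latest", "account": "acct", "mounts": ["/b:/b"]}
    print(merge_env(recipe, profile))
```

With `container_image` in the YAML-owned set, a conversion recipe that pins `nvcr.io/nvidia/nemo:26.04` keeps that image even when the profile declares a different default, while cluster logistics such as `account` still come from the profile.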