NVIDIA-NeMo · rapaul-nv · May 30, 2026
diff --git a/docs/train-models/how-to/convert-checkpoints.md b/docs/train-models/how-to/convert-checkpoints.md
@@ -37,6 +37,8 @@ Before conversion:
 - Keep output paths separate from input paths. A failed conversion should never overwrite the source checkpoint.
 - Keep tokenizer and chat-template provenance with the checkpoint. If the converter needs `hf_model_id`, use the original model or config source used by training.
 - For LoRA merge, use the exact base checkpoint the adapter was trained against.
+- For large Megatron checkpoints, use the default distributed conversion path. The default config runs in the `nvcr.io/nvidia/nemo:26.04` container image, which ships the multi-GPU Megatron-Bridge conversion script.
+- Keep `tp`, `pp`, `ep`, and `etp` aligned with the model or checkpoint layout. The default distributed conversion path uses `tp=1 pp=1 ep=8 etp=1` for Nemotron MoE checkpoints; dense models usually need an override such as `tp=8 pp=1 ep=1 etp=1`.
 
 ## Convert Hugging Face to Megatron
 
@@ -45,10 +47,12 @@ Use this path when a Megatron-Bridge consumer needs a Megatron distributed check
 ```console
 $ nemotron steps run convert/hf_to_megatron -c default \
     hf_model_id=/path/to/hf_checkpoint_or_model_id \
-    megatron_path=/path/to/output_megatron_checkpoint
+    megatron_path=/path/to/output_megatron_checkpoint \
+    tp=1 pp=1 ep=8
 ```
 
 For NVIDIA Nemotron checkpoints, keep `dtype=bfloat16` unless the source checkpoint requires another dtype.
+The step fails early if multiple ranks are launched while `tp`, `pp`, `ep`, and `etp` are all left at `1`, because that layout does not shard the model and would not reduce per-GPU model memory.
 
 ## Convert Megatron to Hugging Face
 
@@ -58,10 +62,12 @@ Use this path when the next consumer is Hugging Face-native evaluation, deployme
 $ nemotron steps run convert/megatron_to_hf -c default \
     megatron_path=/path/to/megatron/iter_0000100 \
     hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
-    hf_path=/path/to/output_hf_checkpoint
+    hf_path=/path/to/output_hf_checkpoint \
+    tp=1 pp=1 ep=8
 ```
 
 The `hf_model_id` value supplies the model configuration and tokenizer expectations used to reconstruct the Hugging Face layout.
+For export, set `tp`, `pp`, `ep`, and `etp` to match the layout of the source Megatron checkpoint.
 
 ## Merge LoRA Into a Hugging Face Base
 

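Review note: the fail-early rule added to `convert-checkpoints.md` above can be sketched as a small guard. This is a minimal sketch, not the step's actual implementation. The function name `check_parallelism` and its arguments are hypothetical, and it assumes the check looks only at the launched world size and the four parallel sizes, using the documented defaults `tp=1 pp=1 ep=8 etp=1`.

```python
def check_parallelism(world_size: int, tp: int = 1, pp: int = 1,
                      ep: int = 8, etp: int = 1) -> None:
    """Raise ValueError when several ranks are launched but nothing is sharded.

    Hypothetical helper: names and error text are illustrative only.
    """
    sizes = {"tp": tp, "pp": pp, "ep": ep, "etp": etp}

    # Every parallel size must be a positive integer.
    for name, value in sizes.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")

    # With more than one rank and every size at 1, each rank would hold the
    # full model, so launching extra ranks would not reduce per-GPU memory.
    if world_size > 1 and all(value == 1 for value in sizes.values()):
        raise ValueError(
            f"{world_size} ranks launched with tp=1 pp=1 ep=1 etp=1; "
            "set the real parallelism, for example ep=8 or tp=8"
        )
```

Under these assumptions, `check_parallelism(8)` passes because the default `ep=8` shards the experts, while `check_parallelism(8, ep=1)` raises before any checkpoint is loaded.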
diff --git a/docs/train-models/reference/convert/hf-to-megatron.md b/docs/train-models/reference/convert/hf-to-megatron.md
@@ -79,14 +79,46 @@ Whether to trust Hugging Face custom model code when AutoBridge loads the source
 Default: `true`.
 ```
 
+```{option} distributed=<true-or-false-or-auto>
+
+Use the mounted multi-GPU converter instead of the single-process AutoBridge helper.
+Keep this enabled for large models that cannot be materialized on one GPU.
+
+Default: `true`.
+```
+
+```{option} tp=<int> pp=<int> ep=<int> etp=<int>
+
+Tensor, pipeline, expert, and expert-tensor parallel sizes for the Megatron checkpoint written by the converter.
+The defaults match the common Nemotron MoE conversion path.
+Override these for dense models or a different target layout.
+
+Defaults: `tp=1`, `pp=1`, `ep=8`, `etp=1`.
+```
+
+```{option} torchrun.nproc_per_node=<int>
+
+Number of local conversion ranks when the step has to launch `torchrun` itself.
+When a backend already launches the step with `torchrun`, the existing distributed world is reused.
+
+Default: the value of `NEMOTRON_CONVERT_NPROC_PER_NODE` if set, otherwise `8`.
+```
+
+```{option} script.path=<path>
+
+Path to Megatron-Bridge's `convert_checkpoints_multi_gpu.py`.
+Defaults to the path shipped in `nvcr.io/nvidia/nemo:26.04`: `/opt/Megatron-Bridge/examples/conversion/convert_checkpoints_multi_gpu.py`.
+```
+
 ## Command Examples
 
 Convert the default NVIDIA Nemotron base model into a local Megatron output directory:
 
 ```console
 $ nemotron steps run convert/hf_to_megatron -c default \
     hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
-    megatron_path=./output/convert/nano3-megatron
+    megatron_path=./output/convert/nano3-megatron \
+    tp=1 pp=1 ep=8
 ```
 
 Submit the conversion through a generated Lepton profile:
@@ -102,6 +134,7 @@ $ nemotron steps run convert/hf_to_megatron -c default --batch lepton_convert_mo
 - If the source came from LoRA training, merge the adapter into the original base first with `convert/merge_lora`.
 - If tokenizer or model config files are missing, use the original Hugging Face model id as `hf_model_id`.
 - If conversion fails, retry into a fresh `megatron_path` instead of reusing a partially written directory.
+- If `distributed=true` launches multiple ranks with `tp=1 pp=1 ep=1 etp=1`, the step fails early because that layout would not shard the model. Set the real target parallelism, such as `tp=1 pp=1 ep=8 etp=1` for Nemotron MoE or `tp=8 pp=1 ep=1 etp=1` for a dense model.
 
 ## Related Documentation
 

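Review note: the `torchrun.nproc_per_node` default described above could resolve roughly as follows. This is a sketch under stated assumptions, not the step's code. It assumes `NEMOTRON_CONVERT_NPROC_PER_NODE` is read as an environment variable and that an existing distributed world is detected through the `WORLD_SIZE` variable that `torchrun` exports to its workers. Both helper names are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_nproc_per_node(override: Optional[int] = None,
                           env: Optional[Mapping[str, str]] = None,
                           fallback: int = 8) -> int:
    """Pick the local rank count: explicit override, then env var, then 8."""
    env = os.environ if env is None else env
    if override is not None:
        return int(override)
    raw = env.get("NEMOTRON_CONVERT_NPROC_PER_NODE")
    return int(raw) if raw else fallback


def should_launch_torchrun(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when no torchrun world exists yet (assumed detection rule)."""
    env = os.environ if env is None else env
    # torchrun sets WORLD_SIZE for each worker it starts; if it is present,
    # the step is assumed to be running inside an existing distributed world.
    return "WORLD_SIZE" not in env
```

Passing `env` explicitly keeps the helpers easy to test without touching the real process environment.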
diff --git a/docs/train-models/reference/convert/megatron-to-hf.md b/docs/train-models/reference/convert/megatron-to-hf.md
@@ -84,6 +84,43 @@ Whether Megatron-Bridge should require source and target checkpoint keys to matc
 Default: `true`.
 ```
 
+```{option} distributed=<true-or-false-or-auto>
+
+Use the mounted multi-GPU converter instead of the single-process AutoBridge helper.
+Keep this enabled for large checkpoints that cannot be loaded on one GPU.
+
+Default: `true`.
+```
+
+```{option} tp=<int> pp=<int> ep=<int> etp=<int>
+
+Tensor, pipeline, expert, and expert-tensor parallel sizes used by the source Megatron checkpoint.
+These values must match the checkpoint layout.
+
+Defaults: `tp=1`, `pp=1`, `ep=8`, `etp=1`.
+```
+
+```{option} distributed_save=<true-or-false>
+
+Let ranks write assigned Hugging Face shards independently, reducing rank-0 memory pressure during export.
+
+Default: `true`.
+```
+
+```{option} torchrun.nproc_per_node=<int>
+
+Number of local conversion ranks when the step has to launch `torchrun` itself.
+When a backend already launches the step with `torchrun`, the existing distributed world is reused.
+
+Default: the value of `NEMOTRON_CONVERT_NPROC_PER_NODE` if set, otherwise `8`.
+```
+
+```{option} script.path=<path>
+
+Path to Megatron-Bridge's `convert_checkpoints_multi_gpu.py`.
+Defaults to the path shipped in `nvcr.io/nvidia/nemo:26.04`: `/opt/Megatron-Bridge/examples/conversion/convert_checkpoints_multi_gpu.py`.
+```
+
 ## Command Examples
 
 Export a validated Megatron checkpoint iteration to Hugging Face layout:
@@ -92,7 +129,8 @@ Export a validated Megatron checkpoint iteration to Hugging Face layout:
 $ nemotron steps run convert/megatron_to_hf -c default \
     megatron_path=/path/to/megatron/iter_0000100 \
     hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
-    hf_path=./output/convert/sft-hf
+    hf_path=./output/convert/sft-hf \
+    tp=1 pp=1 ep=8
 ```
 
 Submit the export through a generated Lepton profile:
@@ -108,6 +146,7 @@ $ nemotron steps run convert/megatron_to_hf -c default --batch lepton_convert_mo
 
 - If export fails because the checkpoint is incomplete, wait for async checkpoint save to finish and retry from a complete `iter_*` directory.
 - If tokenizer or config reconstruction fails, set `hf_model_id` to the original base model or config source.
+- If `distributed=true` launches multiple ranks with `tp=1 pp=1 ep=1 etp=1`, the step fails early because that layout would not shard the model. Set the real source checkpoint parallelism, such as `tp=1 pp=1 ep=8 etp=1` for Nemotron MoE or `tp=8 pp=1 ep=1 etp=1` for a dense checkpoint.
 - Validate the exported Hugging Face checkpoint with a small generation or evaluation job before deployment.
 
 ## Related Documentation

diff --git a/src/nemo_runspec/config/loader.py b/src/nemo_runspec/config/loader.py
@@ -247,10 +247,11 @@ def build_job_config(
         profile_mounts = profile_env.get("mounts") or []
         if existing_mounts or profile_mounts:
             merged_env["mounts"] = list(existing_mounts) + list(profile_mounts)
-        # Re-apply YAML resource keys so recipe requirements win over profile defaults.
-        # The recipe knows how many nodes/GPUs it needs; env.toml provides cluster
-        # logistics (account, partition, tunnel, mounts) the recipe doesn't know about.
-        resource_keys = ("nodes", "gpus_per_node", "ntasks_per_node", "nproc_per_node")
+        # Re-apply YAML-owned execution keys so recipe requirements win over
+        # inherited profile defaults. The recipe/config knows which image and
+        # resource shape it requires; env.toml provides cluster logistics
+        # (account, partition, tunnel, site mounts) the recipe doesn't know about.
+        resource_keys = ("container_image", "nodes", "gpus_per_node", "ntasks_per_node", "nproc_per_node")
         for key in resource_keys:
             if key in existing_env:
                 merged_env[key] = existing_env[key]
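
Review note: the precedence rule in the `loader.py` hunk above can be illustrated in isolation. This is a simplified sketch, not the body of `build_job_config`. The function `merge_env` is hypothetical, and it assumes that profile values are overlaid on the recipe's values first, after which the YAML-owned keys are re-applied. Mount merging and other details of the real loader are omitted.

```python
# Keys the recipe/config owns; copied from the tuple in the hunk above.
YAML_OWNED_KEYS = (
    "container_image",
    "nodes",
    "gpus_per_node",
    "ntasks_per_node",
    "nproc_per_node",
)


def merge_env(existing_env: dict, profile_env: dict) -> dict:
    """Overlay profile defaults, then let recipe-owned keys win.

    Hypothetical helper: the overlay order is an assumption for illustration.
    """
    # Assumed first step: profile values override whatever the recipe set.
    merged_env = {**existing_env, **profile_env}

    # Re-apply the keys the recipe owns so its image and resource shape win.
    for key in YAML_OWNED_KEYS:
        if key in existing_env:
            merged_env[key] = existing_env[key]
    return merged_env


recipe = {"container_image": "nvcr.io/nvidia/nemo:26.04", "nodes": 1}
profile = {"container_image": "site/default:latest", "nodes": 4, "account": "team"}
merged = merge_env(recipe, profile)
```

In this sketch, `merged` keeps the recipe's `container_image` and `nodes` while still picking up `account` from the profile, which is the behavior the updated comment describes.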