
Fix incorrectly set decoupled_grad in training.py for MFSDP.#4133

Open
cspades wants to merge 1 commit into NVIDIA:main from cspades:cye/decgrad-argfix

Conversation


@cspades cspades commented Apr 3, 2026

What does this PR do?

Details

  • Megatron-FSDP does not use FusedAdam's master weights, but Megatron-LM hard-codes master_weights=True whenever OptimizerConfig.use_precision_aware_optimizer_no_fp8_or_ds_fp8 or use_decoupled_grad is True. This PR turns off FusedAdam master weights when using Megatron-FSDP, since FusedAdam should only provide optimizer.step() for Megatron-FSDP's DTensor (FP32/BF16) main weights; see the sketch below.
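
A minimal sketch of the intended behavior, not the actual diff: the helper name and the `use_megatron_fsdp` flag below are illustrative stand-ins for the real OptimizerConfig plumbing in training.py.

```python
def resolve_master_weights(config) -> bool:
    """Decide whether FusedAdam should keep its own FP32 master weights.

    Hypothetical helper for illustration; `config.use_megatron_fsdp` stands
    in for however the real code detects that Megatron-FSDP is enabled.
    """
    if config.use_megatron_fsdp:
        # Megatron-FSDP owns the FP32/BF16 main weights as DTensors, so
        # FusedAdam only needs to provide optimizer.step(). A second FP32
        # copy inside FusedAdam would just duplicate memory.
        return False
    # Outside Megatron-FSDP, the precision-aware / decoupled-grad paths
    # still rely on FusedAdam's master weights.
    return (
        config.use_precision_aware_optimizer_no_fp8_or_ds_fp8
        or config.use_decoupled_grad
    )
```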

Testing

  • Added an E2E --use-precision-aware-optimizer unit test to temporarily guarantee functionality.
  • Using HFSDP + FP8 delayed scaling, we get a substantial amount of memory back (~3.8 GB allocated per rank, per the logs below) by turning off master_weights:
```
# No FusedAdam Master Weights (HFSDP + FP8 Delayed Scaling)
[Rank 0] (after 2 iterations) memory (MB) | allocated: 19595.55 | max allocated: 27022.17 | reserved: 22504.00 | max reserved: 30774.00
[2026-04-03 12:37:38.220805] iteration      100/15258789 | consumed samples:        12800 | elapsed time per iteration (ms): 3263.1 | throughput per GPU (TFLOP/s/GPU): 230.1 | learning rate: 4.915198E-07 | global batch size:   128 | lm loss: 5.473558E+00 | loss scale: 1.0 | grad norm: 9.413 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |

# FusedAdam Master Weights (HFSDP + FP8 Delayed Scaling)
[Rank 0] (after 2 iterations) memory (MB) | allocated: 23425.69 | max allocated: 30852.31 | reserved: 26344.00 | max reserved: 34788.00
[2026-04-03 12:44:12.848377] iteration      100/15258789 | consumed samples:        12800 | elapsed time per iteration (ms): 3149.4 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 4.915198E-07 | global batch size:   128 | lm loss: 5.476564E+00 | loss scale: 1.0 | grad norm: 9.424 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
```
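
For reference, a quick back-of-the-envelope check of the per-rank savings implied by the two [Rank 0] logs above (values in MB, copied directly from the logs):

```python
# Per-rank memory stats from the two [Rank 0] logs above (units: MB).
with_master = {"allocated": 23425.69, "max allocated": 30852.31,
               "reserved": 26344.00, "max reserved": 34788.00}
without_master = {"allocated": 19595.55, "max allocated": 27022.17,
                  "reserved": 22504.00, "max reserved": 30774.00}

for key, value in with_master.items():
    print(f"{key}: {value - without_master[key]:.2f} MB saved")
# allocated: 3830.14 MB saved
# max allocated: 3830.14 MB saved
# reserved: 3840.00 MB saved
# max reserved: 4014.00 MB saved
```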

TODO

⚠️ For major changes (either in lines of code or in impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or mention @mcore-oncall in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge conflicts are resolved and the CI is passing.
Final Review may be declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@cspades cspades self-assigned this Apr 3, 2026

copy-pr-bot bot commented Apr 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cspades cspades force-pushed the cye/decgrad-argfix branch 3 times, most recently from 3860daf to a230105 on April 3, 2026 19:35
@cspades cspades marked this pull request as ready for review April 3, 2026 19:52
@cspades cspades requested review from a team as code owners April 3, 2026 19:52
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 3, 2026 19:52
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 3, 2026
@cspades cspades force-pushed the cye/decgrad-argfix branch from a230105 to 349b8ff on April 3, 2026 20:00
Signed-off-by: Cory Ye <cye@nvidia.com>