Conversation
/ok to test 3e8c042
Thank you for your contribution! NVIDIA Megatron-LM is currently transitioning to development on Github. We will aim to review your PR after we complete our transition and stabilize our Github development process. Thank you for your understanding.
megatron/training/arguments.py
Outdated
if is_te_min_version("2.10.0"):
    assert os.getenv("NVTE_CPU_OFFLOAD_V1", "0") == "1", \
        "For fine-grained activation offloading with TE >= 2.10.0, NVTE_CPU_OFFLOAD_V1 should be set to 1 to avoid offloading weights."
    assert not args.moe_paged_stash, "Fine-grained activation offloading and paged stash cannot be enabled at the same time"
Why is this assertion added?
It was there due to historical reasons. Just removed it. Thanks for catching that.
This reverts commit be3eec1.
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Implement the following features: 1. Use boolean flags to mark group tensor data and scale_inv tensors. 2. Handle the scenario when no tensors are stashed.
def paged_stash_group_commit(tensor, name=None):
    """Mark the end of a layer group and prepare for stash/reload."""
    rank = torch.distributed.get_rank()
[IMPORTANT Unused Variable] rank = torch.distributed.get_rank() is computed but never used.
This is called on every expert layer forward pass. While the overhead is small, it's unnecessary dead code.
Suggestion: Remove this line.
count = 0
for item in self.paged_tensors_to_reload:
    if len(self.paged_tensors_to_reload[item]) > 0:
        count += 1
[IMPORTANT Unused Variable] count is computed by iterating over all reload queues but is never read or used.
Suggestion: Remove these 4 lines (720-723).
def get_schedule_layer(self, vp_stage, layer_no, microbatch_no):
    """Get the schedule layer."""
    return vp_stage * 1000000 + layer_no * 1000 + microbatch_no
Suggestion: Add an assertion to validate the ranges:
assert layer_no < 1000 and microbatch_no < 1000, "Schedule encoding overflow"

    mhc_is_last_in_recompute_block[l_no]
)
[SUGGESTION] This line changed from an empty line to a line with trailing whitespace — unintentional diff noise. Consider reverting to keep the diff clean.
def __enter__(self):
    from megatron.core.extensions.transformer_engine import cpu_offload

    if cpu_offload is not None:
        cpu_offload.CPUOffloadEnabled = True
    # Call the underlying context manager's __enter__
    result = self.saved_tensors_context.__enter__()

    # Add more custom logic after entering if needed
    return result

def __exit__(self, *args: Any):
    # Call the underlying context manager's __exit__
    result = self.saved_tensors_context.__exit__(*args)
    from megatron.core.extensions.transformer_engine import cpu_offload

    if cpu_offload is not None:
        cpu_offload.CPUOffloadEnabled = False
[SUGGESTION] PagedStashContext unconditionally sets CPUOffloadEnabled = True on enter and False on exit without saving/restoring the original value.
Given that transformer_config already asserts moe_paged_stash cannot coexist with cpu_offloading, the intent of toggling CPUOffloadEnabled here is unclear. If this is needed for TE internal behavior, a comment explaining why would help. If not, consider removing it to avoid accidentally enabling CPU offload in unexpected contexts.
If it must stay, save and restore the original value:
def __enter__(self):
    if cpu_offload is not None:
        self._prev_offload = cpu_offload.CPUOffloadEnabled
        cpu_offload.CPUOffloadEnabled = True
    ...

def __exit__(self, *args):
    if cpu_offload is not None:
        cpu_offload.CPUOffloadEnabled = self._prev_offload

    int(max_num_tokens // cap_factor) if cap_factor is not None and cap_factor > 0 else None
)
stash_context = get_paged_stash_context(
    name="expert_fc1_fused",
It seems the activation save for fc2 is also included in the stash context?
If so, we could change the context name from expert_fc1_fused to grouped_mlp.
Thanks for the suggestion. Renamed
while len(self.paged_tensors_to_stash) > 0:
    paged_tensor = self.paged_tensors_to_stash.pop(0)

Suggested change:
- while len(self.paged_tensors_to_stash) > 0:
-     paged_tensor = self.paged_tensors_to_stash.pop(0)
+ self.paged_tensors_to_stash.clear()
Does paged stashing support partial CUDA graphs? We have many cases where attention has dynamic shapes (varlen, dynamic cp, KDA, ...), so only the MoE part is capturable.
Main contributors (Equal Contribution, sorted alphabetically): Nan Zheng (@nanz-nv), Vasudevan Rengasamy (@vasunvidia)
Other contributors (sorted alphabetically): Dennis Liu (@Victarry), Hongbin Liu (@lhb8125), Qi Zhang (@QiZhangNV), Robin Zhang (@buptzyb), Tong Liu (@Autumn1998), Zijie Yan (@yanring)
Background
In token-dropless MoE training, the number of tokens received by each expert might vary, resulting in dynamically shaped tensors. Dynamically shaped tensors are naturally supported by PyTorch thanks to its eager-mode nature: a tensor is created lazily once its shape is known at run time. While this works well in eager mode, dynamically shaped tensors pose a challenge for CUDA graphs, because the size of a tensor cannot be adjusted at runtime without intervention from the host. To remove the sync and enable CUDA graphs, one solution is to oversize the buffers in the expert part. This, however, causes significantly higher memory consumption than the eager-mode baseline, largely in the form of memory fragmentation.
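To make the oversizing workaround concrete, the sketch below shows the static-buffer pattern it implies; the buffer sizes, shapes, and names are illustrative only and do not come from this PR.

```python
import torch

# Static, CUDA-graph-friendly workaround: always allocate for the worst case.
max_tokens_per_expert = 16384   # illustrative worst-case bound
hidden_size = 4096
expert_input = torch.empty(max_tokens_per_expert, hidden_size, device="cuda")

# At runtime only the first `num_tokens` rows are valid. The kernel always
# runs over the static buffer, but keeping the whole oversized tensor alive
# for the backward pass is what inflates activation memory versus eager mode.
num_tokens = 1234               # varies per microbatch and per expert
tokens = torch.randn(num_tokens, hidden_size, device="cuda")
expert_input[:num_tokens].copy_(tokens)
```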
Idea overview
To address this problem, paged stashing decouples the need for oversized buffers for compute from the need for a properly sized buffer that stores activations for the backward pass. Paged stashing achieves this by adding one level of indirection: stashing and restoring. The stash operation copies the activation from the oversized static buffer into a pre-allocated stashing buffer after the forward pass of that module is done, and the restore operation does the reverse during the backward pass.
The key to saving memory is that the stash operation packs the variable-size activation into a contiguous stashing buffer, which reduces memory fragmentation. For simple scheduling, where activation allocation and deallocation follow a first-in-last-out pattern, stash and restore can be implemented with a simple bump allocator. To accommodate more complicated schedules, e.g. pipeline parallelism, paging can be used, hence the name paged stashing.
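A minimal sketch of the first-in-last-out bump-allocation variant is given below; the class and method names (BumpStashBuffer, stash, restore) are illustrative and are not the actual Megatron-LM API.

```python
import math
import torch

class BumpStashBuffer:
    """Illustrative FILO stash buffer: packs variable-size activations contiguously."""

    def __init__(self, capacity_elems, dtype=torch.bfloat16, device="cuda"):
        self.buf = torch.empty(capacity_elems, dtype=dtype, device=device)
        self.top = 0        # bump pointer into the packed buffer
        self.records = []   # (offset, shape), pushed in forward order

    def stash(self, activation):
        """Forward: copy the activation out of the oversized buffer into packed storage."""
        n = activation.numel()
        assert self.top + n <= self.buf.numel(), "stash buffer overflow"
        self.buf[self.top:self.top + n].copy_(activation.reshape(-1))
        self.records.append((self.top, activation.shape))
        self.top += n
        return len(self.records) - 1  # handle used by restore

    def restore(self, handle):
        """Backward: materialize the activation again; FILO order lets us pop the top."""
        offset, shape = self.records[handle]
        n = math.prod(shape)
        out = self.buf[offset:offset + n].reshape(shape).clone()
        if handle == len(self.records) - 1:
            self.records.pop()
            self.top = offset
        return out
```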
Page management
To accommodate complex scheduling such as that needed in pipeline parallelism, activations are partitioned into pages, and lightweight GPU memory-management kernels, which can be fused with the stash/restore kernels, allocate and deallocate pages for stashing. The page manager maintains a freelist implemented as a circular buffer, with one freelist per page type.
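The following is a host-side Python model of such a circular-buffer freelist, included only to illustrate the data structure; in the PR the equivalent logic lives in fused GPU kernels, and the names here are hypothetical.

```python
class PageFreelist:
    """Circular-buffer freelist for one page type (host-side illustration)."""

    def __init__(self, num_pages):
        self.ring = list(range(num_pages))  # initially every page id is free
        self.head = 0                       # next slot to allocate from
        self.tail = 0                       # next slot to return freed pages to
        self.num_slots = num_pages
        self.free_count = num_pages

    def alloc_page(self):
        assert self.free_count > 0, "out of stash pages"
        page = self.ring[self.head]
        self.head = (self.head + 1) % self.num_slots
        self.free_count -= 1
        return page

    def free_page(self, page):
        self.ring[self.tail] = page
        self.tail = (self.tail + 1) % self.num_slots
        self.free_count += 1
```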
CPU offloading
Paged stashing naturally supports offloading. When the stashing buffer is a pinned CPU tensor, the activation is offloaded to the host memory during forward and is reloaded to the GPU during backward.
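A minimal sketch of the offloading path described above, assuming a pinned host stash buffer and an illustrative side stream for copies; the function names and the synchronization details are not taken from this PR.

```python
import math
import torch

# Pinned host memory as the stashing buffer enables async device-to-host copies.
stash_buf = torch.empty(1 << 24, dtype=torch.bfloat16, pin_memory=True)
copy_stream = torch.cuda.Stream()

def stash_to_host(activation, offset):
    """Forward: copy the activation into pinned host memory on a side stream.
    Event synchronization with the compute stream is omitted for brevity."""
    n = activation.numel()
    with torch.cuda.stream(copy_stream):
        stash_buf[offset:offset + n].copy_(activation.reshape(-1), non_blocking=True)
    return offset + n

def reload_to_gpu(offset, shape):
    """Backward: copy the stashed activation back to the GPU."""
    n = math.prod(shape)
    with torch.cuda.stream(copy_stream):
        return stash_buf[offset:offset + n].to("cuda", non_blocking=True).reshape(shape)
```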
Furthermore, one can easily extend the paging management system to accommodate partial offloading or on-demand offloading. This feature is currently WIP.
Scheduling
Overlapping the stash and restore operations with compute can be implemented by inserting two autograd functions around the expert compute layer: a pre-scheduler before it and a post-scheduler after it, which schedule the stash and restore operations. Among other roles, these functions wait for the restore operation of the current layer to complete before its backward compute and, in the case of pipeline parallelism, they can be used to record the pipeline schedule during the first iteration.
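As an illustration of this autograd-function pattern, here is a minimal sketch of a post-scheduler; schedule_stash and wait_restore are hypothetical stubs standing in for the actual stash/restore scheduling calls.

```python
import torch

def schedule_stash(layer_id):
    """Hypothetical hook: kick off the async stash for this layer's activations."""
    ...

def wait_restore(layer_id):
    """Hypothetical hook: block until this layer's activations are back on the GPU."""
    ...

class PostScheduler(torch.autograd.Function):
    """Illustrative post-scheduler placed right after the expert compute layer."""

    @staticmethod
    def forward(ctx, x, layer_id):
        ctx.layer_id = layer_id
        # The expert activations now exist, so the async stash can be issued.
        # During the first iteration, the visit order of layer_ids can also be
        # recorded to learn the pipeline schedule.
        schedule_stash(layer_id)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Runs just before the expert's backward: make sure its activations
        # have been restored (or reloaded from the host) to the GPU.
        wait_restore(ctx.layer_id)
        return grad_output, None
```

In this sketch the function would be applied right after the expert computation, e.g. x = PostScheduler.apply(x, layer_id), with a symmetric pre-scheduler applied before it.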