Add capture_time_hooks to make_graphed_callables for non-capturable per-callable hooks#2831

Open
buptzyb wants to merge 2 commits into NVIDIA:main from buptzyb:robinz/capture-time-hooks

Conversation

Contributor

@buptzyb buptzyb commented Apr 3, 2026

Description

Summary

make_graphed_callables previously had no mechanism to run per-callable
hooks during warmup/capture that must execute outside the CUDA graph
capture context (i.e., hooks that are inherently non-capturable, such as
CPU-side state updates or FSDP parameter un-shard/re-shard calls).

This PR adds a new capture_time_hooks parameter to
make_graphed_callables that accepts per-callable hooks invoked at
capture time (warmup iterations and graph capture), but intentionally
executed outside the CUDA graph capture context so they are not
recorded into the graph and will not be replayed.

Changes

  • transformer_engine/pytorch/graph.py:
    • Add capture_time_hooks: Optional[List[Optional[Dict[str, Dict]]]]
      parameter to make_graphed_callables
    • Invoke hooks around forward and backward passes during both warmup
      iterations and graph capture, in both the _order is not None and
      _order is None capture paths
    • Hook dict structure mirrors PyTorch's _forward_pre_hooks /
      _forward_hooks format: Dict[hook_type, Dict[handle_id, hook_fn]]
      where hook types are forward_pre, forward, pre_backward,
      backward
    • Rename parameter from capture_hooks to capture_time_hooks, with an
      updated docstring clarifying the non-capturable semantics
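
The hook structure described above can be sketched as follows. The hook names and signatures follow this PR's description; `unshard_params` and `reshard_params` are hypothetical stand-ins for e.g. FSDP state management, not functions from the PR.

```python
# Hypothetical sketch of the capture_time_hooks structure: one entry per
# callable, each a Dict[hook_type, Dict[handle_id, hook_fn]] (or None).
# unshard_params/reshard_params are illustrative placeholders.

def unshard_params(module, args, kwargs):
    pass  # e.g. FSDP all-gather of full params; must run outside graph capture

def reshard_params(module):
    pass  # e.g. free the full params again after backward

capture_time_hooks = [
    {
        "forward_pre": {0: unshard_params},         # hook(module, args, kwargs)
        "forward": {0: lambda m, args, out: None},  # hook(module, args, outputs)
        "pre_backward": {0: lambda m: None},        # hook(module)
        "backward": {0: reshard_params},            # hook(module)
    },
    None,  # second callable: no capture-time hooks
]
```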

Motivation

Used by Megatron-LM's FSDP integration: during CUDA Graph capture,
PyTorch's memory allocator is frozen, causing FSDP parameter
un-shard/re-shard (which requires allocation) to fail. By routing FSDP
hooks through capture_time_hooks, they execute outside the capture
context and are manually driven at the right points during
warmup/capture, while the graph itself only records pure GPU compute.

Hook Invocation Order

For each callable at each warmup/capture iteration:

  1. forward_pre hooks — before func(*args, **kwargs)
  2. (CUDA graph capture context entered)
  3. Forward pass
  4. (CUDA graph capture context exited)
  5. forward hooks — after forward
  6. pre_backward hooks — before torch.autograd.backward
  7. Backward pass
  8. backward hooks — after backward
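
A minimal sketch of this driver loop, with a dummy context manager standing in for the CUDA graph capture context and zero-argument hooks for brevity (names and signatures are illustrative, not the PR's actual helpers):

```python
from contextlib import contextmanager

trace = []  # records the invocation order for illustration

@contextmanager
def capture_context():  # stand-in for the real CUDA graph capture context
    trace.append("capture enter")
    yield
    trace.append("capture exit")

def run_iteration(hooks, forward, backward):
    for h in hooks.get("forward_pre", {}).values():
        h()                      # 1. outside capture
    with capture_context():      # 2./4. only this region would be recorded
        out = forward()          # 3. forward pass
    for h in hooks.get("forward", {}).values():
        h()                      # 5. outside capture
    for h in hooks.get("pre_backward", {}).values():
        h()                      # 6. outside capture
    backward(out)                # 7. backward pass
    for h in hooks.get("backward", {}).values():
        h()                      # 8. outside capture

hooks = {k: {0: (lambda k=k: trace.append(k))}
         for k in ("forward_pre", "forward", "pre_backward", "backward")}
run_iteration(hooks, lambda: trace.append("fwd"), lambda out: trace.append("bwd"))
```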

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

buptzyb and others added 2 commits April 3, 2026 05:20
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR adds a capture_time_hooks parameter to make_graphed_callables, letting callers inject per-callable hooks that run outside the CUDA graph capture context (e.g., FSDP un-shard/re-shard). It also refactors the monolithic warmup loop into _run_warmup_forward / _run_warmup_backward helpers and unifies the _order is None / _order is not None warmup paths.

  • P1 – hook output inconsistency: _run_warmup_forward passes flattened outputs to forward hooks, while both capture paths pass the raw (unflattened) return value of func(...). Any hook that inspects output tensors will observe different types depending on the phase.
  • P1 – pre_warmup_hook/post_warmup_hook regression: These hooks were previously called once per callable; after the refactor they are called once globally, silently breaking callers that relied on per-callable invocation.

Confidence Score: 3/5

Not safe to merge as-is; two P1 behavioral regressions need fixing before the feature is reliable.

The forward-hook output-type mismatch (flattened vs. raw) and the silently changed pre/post_warmup_hook invocation semantics are both present defects in the changed code that can produce wrong behavior for existing and new callers without any runtime error.

transformer_engine/pytorch/graph.py — specifically the _run_warmup_forward forward-hook call and the pre/post_warmup_hook placement

Important Files Changed

Filename Overview
transformer_engine/pytorch/graph.py Adds capture_time_hooks parameter and refactors warmup into helper functions; two behavioral issues: (1) forward hooks get flattened outputs during warmup but raw outputs during capture, breaking hook API contract; (2) pre/post_warmup_hook are now called once globally instead of once per callable, silently changing existing semantics.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant make_graphed_callables
    participant _run_warmup_forward
    participant _run_warmup_backward
    participant CUDAGraph

    Caller->>make_graphed_callables: capture_time_hooks=[{forward_pre, forward, pre_backward, backward}]
    make_graphed_callables->>make_graphed_callables: pre_warmup_hook() [once globally]

    loop warmup_iter in range(num_warmup_iters)
        make_graphed_callables->>_run_warmup_forward: (func_idx, func)
        _run_warmup_forward->>Caller: forward_pre hooks(func, args, kwargs) [outside capture]
        _run_warmup_forward->>_run_warmup_forward: func(*args, **kwargs)
        _run_warmup_forward->>Caller: forward hooks(func, args, FLATTENED outputs) [outside capture]
        make_graphed_callables->>_run_warmup_backward: (func_idx, func, outputs, warmup_iter)
        _run_warmup_backward->>Caller: pre_backward hooks(func) [outside capture]
        _run_warmup_backward->>_run_warmup_backward: torch.autograd.backward(...)
        _run_warmup_backward->>Caller: backward hooks(func) [outside capture]
    end

    make_graphed_callables->>make_graphed_callables: post_warmup_hook() [once globally]

    note over make_graphed_callables,CUDAGraph: Graph Capture Phase
    make_graphed_callables->>Caller: forward_pre hooks(func, args, kwargs) [outside capture]
    make_graphed_callables->>CUDAGraph: begin capture
    CUDAGraph->>CUDAGraph: func(*args, **kwargs) [recorded]
    make_graphed_callables->>CUDAGraph: end capture
    make_graphed_callables->>Caller: forward hooks(func, args, RAW outputs) [outside capture]
    make_graphed_callables->>Caller: pre_backward hooks(func) [outside capture]
    make_graphed_callables->>CUDAGraph: begin capture
    CUDAGraph->>CUDAGraph: autograd.backward(...) [recorded]
    make_graphed_callables->>CUDAGraph: end capture
    make_graphed_callables->>Caller: backward hooks(func) [outside capture]

Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

Comment on lines +492 to +503
outputs, _ = _tree_flatten(func(*args, **kwargs))
for hook in hooks:
    hook.remove()

if (
    capture_time_hooks is not None
    and func_idx < len(capture_time_hooks)
    and capture_time_hooks[func_idx] is not None
    and "forward" in capture_time_hooks[func_idx]
):
    for hook in capture_time_hooks[func_idx]["forward"].values():
        hook(func, args, outputs)
Contributor
P1 forward hook receives inconsistent output types between warmup and capture

During warmup, _run_warmup_forward calls _tree_flatten on the raw outputs and then passes the already-flattened list to the forward hook:

outputs, _ = _tree_flatten(func(*args, **kwargs))
# ...
hook(func, args, outputs)   # ← outputs is a flat list of tensors

But in both capture paths (_order is None and _order is not None, lines ~690 and ~913), the forward hook is called with the raw, unflattened return value of func(...):

with _graph_context_wrapper(fwd_graph, pool=mempool):
    outputs = func(*args, **kwargs)   # raw module output
# ...
hook(func, args, outputs)   # ← outputs is the original nested structure

Any hook that inspects or stores output will see different types depending on whether it is called during warmup or during capture. For FSDP re-shard hooks that key off tensor identities or shapes, this can silently produce wrong behavior. The fix is to separate the flatten step from the hook call in _run_warmup_forward so the hook always receives the original (unflattened) return value.
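
The suggested fix can be sketched like this. `tree_flatten` below is a toy stand-in for graph.py's `_tree_flatten`, and the dummy `func` and hook dict are illustrative:

```python
# Sketch of the suggested fix: pass the raw return value to the forward
# hook, and flatten only afterwards for internal bookkeeping.

def tree_flatten(obj):
    # Toy flatten of nested dicts/tuples/lists into a leaf list.
    if isinstance(obj, dict):
        return [l for v in obj.values() for l in tree_flatten(v)[0]], None
    if isinstance(obj, (list, tuple)):
        return [l for v in obj for l in tree_flatten(v)[0]], None
    return [obj], None

def func(x):
    return {"y": x * 2, "z": (x + 1,)}  # nested output structure

seen = []
forward_hooks = {0: lambda f, args, out: seen.append(type(out))}

args = (3,)
raw_outputs = func(*args)                       # raw, unflattened
for hook in forward_hooks.values():
    hook(func, args, raw_outputs)               # hook sees the same type as in capture
outputs, _ = tree_flatten(raw_outputs)          # flat list, warmup bookkeeping only
```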

Collaborator
This is a good point.

Comment on lines 583 to +630
with torch.cuda.stream(torch.cuda.Stream()):
for func_idx, func in zip(warmup_func_idx, warmup_func):
args = sample_args[func_idx]
kwargs = sample_kwargs[func_idx]
static_input_surface = per_callable_static_input_surfaces[func_idx]

def hook_fn(
module, inputs, outputs, func_idx=func_idx
): # pylint: disable=unused-argument
modules = set()
if isinstance(module, TransformerEngineBaseModule):
modules.add(module)
# If forward is called on a BasicOperation directly the hook will run
elif isinstance(module, BasicOperation):
modules.add(module)
# If forward is called on a te.ops.Sequential it is not called on its constituent ops
elif isinstance(module, Sequential):
if module._module_groups is None:
raise RuntimeError(
"module._module_groups should have been initialized by warmup"
)
for module_group in module._module_groups:
if isinstance(module_group, OperationFuser):
for basic_op in module_group._basic_ops:
modules.add(basic_op)
if modules:
if func_idx not in visited_te_modules:
visited_te_modules[func_idx] = modules
else:
visited_te_modules[func_idx].update(modules)

if pre_warmup_hook is not None:
pre_warmup_hook()
for warmup_iter in range(num_warmup_iters):
hooks = []
for module in func.modules():
hook = module.register_forward_hook(hook_fn)
hooks.append(hook)
outputs, _ = _tree_flatten(func(*args, **kwargs))
for hook in hooks:
hook.remove()
if is_training:
inputs = tuple(i for i in static_input_surface if i.requires_grad)
with _none_grad_context_wrapper(inputs):
outputs_requiring_grad = tuple(
o for o in outputs if o is not None and o.requires_grad
)
torch.autograd.backward(
outputs_requiring_grad,
grad_tensors=tuple(torch.empty_like(o) for o in outputs_requiring_grad),
)
grad_inputs = tuple(input.grad for input in inputs)
if pre_warmup_hook is not None:
pre_warmup_hook()

# Filter module params that get None grad from grad_inputs and remove them
# from static_input_surface. This is to ensure that the backward hooks
# registered to these params are not wrongly triggered.
num_required_grad_sample_args = sum(
arg.requires_grad for arg in flatten_sample_args[func_idx]
)
required_grad_input_idx = []
for i, arg in enumerate(static_input_surface):
if arg.requires_grad:
required_grad_input_idx.append(i)
module_params_with_grad = []
for grad_inputs_idx, inputs_idx in enumerate(required_grad_input_idx):
if (
grad_inputs[grad_inputs_idx] is None
and grad_inputs_idx < num_required_grad_sample_args
):
if not allow_unused_input:
raise RuntimeError(
"The input tensor requires grad, but the grad is None after"
" backward pass."
for warmup_iter in range(num_warmup_iters):
if _order is None:
# All forwards in order, then all backwards in reverse order.
warmup_outputs = []
for func_idx, func in zip(warmup_func_idx, warmup_func):
outputs = _run_warmup_forward(func_idx, func)
warmup_outputs.append((func_idx, func, outputs))
if is_training:
for func_idx, func, outputs in reversed(warmup_outputs):
_run_warmup_backward(func_idx, func, outputs, warmup_iter)
else:
# Follow _order exactly, mirroring the capture phase.
per_fwd_outputs = {} # per_callable_fwd_idx -> flattened outputs
fwd_idx = [0] * num_model_chunks
bwd_idx = [0] * num_model_chunks
for c_id in _order:
if c_id > 0:
# Forward pass for chunk c_id.
m_chunk = c_id - 1
for l_no in range(_num_layers_per_chunk[m_chunk]):
per_callable_fwd_idx = (
_prefix_num_layers[m_chunk] * num_microbatches
) + (fwd_idx[m_chunk] * _num_layers_per_chunk[m_chunk] + l_no)
func = callables[_prefix_num_layers[m_chunk] + l_no]
outputs = _run_warmup_forward(per_callable_fwd_idx, func)
per_fwd_outputs[per_callable_fwd_idx] = outputs
fwd_idx[m_chunk] += 1
elif ceil(c_id) == c_id:
# Backward pass for chunk -c_id.
if is_training:
m_chunk = -c_id - 1
for l_no in reversed(range(_num_layers_per_chunk[m_chunk])):
per_callable_bwd_idx = (
_prefix_num_layers[m_chunk] * num_microbatches
) + (bwd_idx[m_chunk] * _num_layers_per_chunk[m_chunk] + l_no)
func = callables[_prefix_num_layers[m_chunk] + l_no]
outputs = per_fwd_outputs[per_callable_bwd_idx]
_run_warmup_backward(
per_callable_bwd_idx, func, outputs, warmup_iter
)
elif (
grad_inputs[grad_inputs_idx] is not None
and grad_inputs_idx >= num_required_grad_sample_args
):
module_params_with_grad.append(static_input_surface[inputs_idx])
if len(module_params_with_grad) != len(per_callable_module_params[func_idx]):
if warmup_iter != 0:
raise RuntimeError(
"no-grad params should only be used as inputs in the first warmup"
f" iteration, but found in iteration {warmup_iter}"
)
per_callable_module_params[func_idx] = tuple(module_params_with_grad)
static_input_surface = flatten_sample_args[func_idx] + tuple(
module_params_with_grad
)
per_callable_static_input_surfaces[func_idx] = static_input_surface

# Run wgrad. This is essential for some TE modules when they have
# delay_wgrad_compute enabled.
need_backward_dw = False
for module in visited_te_modules.get(func_idx, set()):
if hasattr(module, "need_backward_dw") and module.need_backward_dw():
need_backward_dw = True
module.backward_dw()
need_bwd_dw_graph[func_idx] = need_backward_dw
else:
grad_inputs = None
del outputs, grad_inputs
if post_warmup_hook is not None:
post_warmup_hook()
bwd_idx[m_chunk] += 1

if post_warmup_hook is not None:
post_warmup_hook()
Contributor
P1 pre_warmup_hook/post_warmup_hook invocation count changed silently

In the original code these hooks were called once per callable, wrapping that callable's num_warmup_iters warmup iterations:

# original
for func_idx, func in zip(warmup_func_idx, warmup_func):
    if pre_warmup_hook is not None:
        pre_warmup_hook()
    for warmup_iter in range(num_warmup_iters):
        ...
    if post_warmup_hook is not None:
        post_warmup_hook()

The refactor moved the hooks outside both loops, so they now fire once globally, regardless of how many callables are registered. If a caller passes a pre_warmup_hook that relies on being invoked per-callable (e.g. to set up per-callable CUDA state before each callable's warmup), it will silently run only once and skip N-1 callables. This is an unintentional behavior regression for existing users of these hooks.
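
One way to restore the old semantics for the `_order is None` path would be to move the hooks back inside the per-callable loop. A rough sketch, with dummies standing in for the PR's actual helpers:

```python
# Sketch only: restores per-callable pre/post_warmup_hook invocation for
# the _order is None path. _run_warmup_forward, warmup_func_idx, etc. are
# dummies standing in for the PR's actual helpers and state.
calls = []

def pre_warmup_hook():
    calls.append("pre")

def post_warmup_hook():
    calls.append("post")

def _run_warmup_forward(func_idx, func):
    calls.append(f"fwd{func_idx}")

warmup_func_idx, warmup_func = [0, 1], [object(), object()]
num_warmup_iters = 2

for func_idx, func in zip(warmup_func_idx, warmup_func):
    if pre_warmup_hook is not None:
        pre_warmup_hook()            # once per callable, as before the refactor
    for warmup_iter in range(num_warmup_iters):
        _run_warmup_forward(func_idx, func)
    if post_warmup_hook is not None:
        post_warmup_hook()           # once per callable
```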

Collaborator
We should confirm that this doesn't cause problems, and probably update the docs for pre_warmup_hook/post_warmup_hook to be more explicit about its behavior. That said, I don't think this function will break the use-cases from https://github.com/NVIDIA/TransformerEngine/pull/2435/changes#r2850937378.

CC @lhb8125

Comment on lines +1377 to +1378
- ``"forward_pre"``: dict of hooks called *before* the forward pass.
Hook signature: ``hook(module, args, kwargs)``.
Collaborator
Doesn't this sound weird? Let's make it "pre_forward" to be consistent with "pre_backward".

Comment on lines +1377 to +1378
- ``"forward_pre"``: dict of hooks called *before* the forward pass.
Hook signature: ``hook(module, args, kwargs)``.
Collaborator
Why is capture_time_hooks[callable_idx]["forward_pre"] a dict? As far as I can tell, we just iterate through the hooks, so it would be more straightforward to make it a list (and similar for "forward", "pre_backward", "backward").
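
The list-based alternative being proposed might look like this (hypothetical API, not what the PR currently implements; zero-argument hooks for brevity):

```python
# Hypothetical list-based variant: hooks run in list order, no handle ids.
ran = []
capture_time_hooks = [
    {
        "forward_pre": [lambda: ran.append("unshard"), lambda: ran.append("log")],
        "backward": [lambda: ran.append("reshard")],
    },
]

# Iteration is then a plain loop over the list, preserving order:
for hook in capture_time_hooks[0].get("forward_pre", []):
    hook()
for hook in capture_time_hooks[0].get("backward", []):
    hook()
```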

Comment on lines 583 to +630
with torch.cuda.stream(torch.cuda.Stream()):
for func_idx, func in zip(warmup_func_idx, warmup_func):
args = sample_args[func_idx]
kwargs = sample_kwargs[func_idx]
static_input_surface = per_callable_static_input_surfaces[func_idx]

def hook_fn(
module, inputs, outputs, func_idx=func_idx
): # pylint: disable=unused-argument
modules = set()
if isinstance(module, TransformerEngineBaseModule):
modules.add(module)
# If forward is called on a BasicOperation directly the hook will run
elif isinstance(module, BasicOperation):
modules.add(module)
# If forward is called on a te.ops.Sequential it is not called on its constituent ops
elif isinstance(module, Sequential):
if module._module_groups is None:
raise RuntimeError(
"module._module_groups should have been initialized by warmup"
)
for module_group in module._module_groups:
if isinstance(module_group, OperationFuser):
for basic_op in module_group._basic_ops:
modules.add(basic_op)
if modules:
if func_idx not in visited_te_modules:
visited_te_modules[func_idx] = modules
else:
visited_te_modules[func_idx].update(modules)

if pre_warmup_hook is not None:
pre_warmup_hook()
for warmup_iter in range(num_warmup_iters):
hooks = []
for module in func.modules():
hook = module.register_forward_hook(hook_fn)
hooks.append(hook)
outputs, _ = _tree_flatten(func(*args, **kwargs))
for hook in hooks:
hook.remove()
if is_training:
inputs = tuple(i for i in static_input_surface if i.requires_grad)
with _none_grad_context_wrapper(inputs):
outputs_requiring_grad = tuple(
o for o in outputs if o is not None and o.requires_grad
)
torch.autograd.backward(
outputs_requiring_grad,
grad_tensors=tuple(torch.empty_like(o) for o in outputs_requiring_grad),
)
grad_inputs = tuple(input.grad for input in inputs)
if pre_warmup_hook is not None:
pre_warmup_hook()

# Filter module params that get None grad from grad_inputs and remove them
# from static_input_surface. This is to ensure that the backward hooks
# registered to these params are not wrongly triggered.
num_required_grad_sample_args = sum(
arg.requires_grad for arg in flatten_sample_args[func_idx]
)
required_grad_input_idx = []
for i, arg in enumerate(static_input_surface):
if arg.requires_grad:
required_grad_input_idx.append(i)
module_params_with_grad = []
for grad_inputs_idx, inputs_idx in enumerate(required_grad_input_idx):
if (
grad_inputs[grad_inputs_idx] is None
and grad_inputs_idx < num_required_grad_sample_args
):
if not allow_unused_input:
raise RuntimeError(
"The input tensor requires grad, but the grad is None after"
" backward pass."
for warmup_iter in range(num_warmup_iters):
if _order is None:
# All forwards in order, then all backwards in reverse order.
warmup_outputs = []
for func_idx, func in zip(warmup_func_idx, warmup_func):
outputs = _run_warmup_forward(func_idx, func)
warmup_outputs.append((func_idx, func, outputs))
if is_training:
for func_idx, func, outputs in reversed(warmup_outputs):
_run_warmup_backward(func_idx, func, outputs, warmup_iter)
else:
# Follow _order exactly, mirroring the capture phase.
per_fwd_outputs = {} # per_callable_fwd_idx -> flattened outputs
fwd_idx = [0] * num_model_chunks
bwd_idx = [0] * num_model_chunks
for c_id in _order:
if c_id > 0:
# Forward pass for chunk c_id.
m_chunk = c_id - 1
for l_no in range(_num_layers_per_chunk[m_chunk]):
per_callable_fwd_idx = (
_prefix_num_layers[m_chunk] * num_microbatches
) + (fwd_idx[m_chunk] * _num_layers_per_chunk[m_chunk] + l_no)
func = callables[_prefix_num_layers[m_chunk] + l_no]
outputs = _run_warmup_forward(per_callable_fwd_idx, func)
per_fwd_outputs[per_callable_fwd_idx] = outputs
fwd_idx[m_chunk] += 1
elif ceil(c_id) == c_id:
# Backward pass for chunk -c_id.
if is_training:
m_chunk = -c_id - 1
for l_no in reversed(range(_num_layers_per_chunk[m_chunk])):
per_callable_bwd_idx = (
_prefix_num_layers[m_chunk] * num_microbatches
) + (bwd_idx[m_chunk] * _num_layers_per_chunk[m_chunk] + l_no)
func = callables[_prefix_num_layers[m_chunk] + l_no]
outputs = per_fwd_outputs[per_callable_bwd_idx]
_run_warmup_backward(
per_callable_bwd_idx, func, outputs, warmup_iter
)
elif (
grad_inputs[grad_inputs_idx] is not None
and grad_inputs_idx >= num_required_grad_sample_args
):
module_params_with_grad.append(static_input_surface[inputs_idx])
if len(module_params_with_grad) != len(per_callable_module_params[func_idx]):
if warmup_iter != 0:
raise RuntimeError(
"no-grad params should only be used as inputs in the first warmup"
f" iteration, but found in iteration {warmup_iter}"
)
per_callable_module_params[func_idx] = tuple(module_params_with_grad)
static_input_surface = flatten_sample_args[func_idx] + tuple(
module_params_with_grad
)
per_callable_static_input_surfaces[func_idx] = static_input_surface

# Run wgrad. This is essential for some TE modules when they have
# delay_wgrad_compute enabled.
need_backward_dw = False
for module in visited_te_modules.get(func_idx, set()):
if hasattr(module, "need_backward_dw") and module.need_backward_dw():
need_backward_dw = True
module.backward_dw()
need_bwd_dw_graph[func_idx] = need_backward_dw
else:
grad_inputs = None
del outputs, grad_inputs
if post_warmup_hook is not None:
post_warmup_hook()
bwd_idx[m_chunk] += 1

if post_warmup_hook is not None:
post_warmup_hook()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should confirm that this doesn't cause problems, and probably update the docs for pre_warmup_hook/post_warmup_hook to be more explicit about its behavior. That said, I don't think this function will break the use-cases from https://github.com/NVIDIA/TransformerEngine/pull/2435/changes#r2850937378.

CC @lhb8125


if (
    capture_time_hooks is not None
    and func_idx < len(capture_time_hooks)
Collaborator
@timmoon10 timmoon10 Apr 3, 2026


I'd prefer if length mismatches triggered errors. This pattern is inviting silent bugs.
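
A fail-fast check along these lines might look like the following (illustrative, not the PR's code; the function name is made up):

```python
def validate_capture_time_hooks(capture_time_hooks, callables):
    """Raise on a length mismatch instead of silently skipping hooks."""
    if capture_time_hooks is None:
        return
    if len(capture_time_hooks) != len(callables):
        raise ValueError(
            f"capture_time_hooks has {len(capture_time_hooks)} entries, "
            f"but {len(callables)} callables were provided"
        )

validate_capture_time_hooks(None, [object()])     # ok: hooks disabled
validate_capture_time_hooks([None], [object()])   # ok: lengths match
try:
    validate_capture_time_hooks([None], [object(), object()])
except ValueError as e:
    mismatch_error = str(e)
```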

Comment on lines 1323 to +1325
pre_warmup_hook: Optional[Callable] = None,
post_warmup_hook: Optional[Callable] = None,
capture_time_hooks: Optional[List[Optional[Dict[str, Dict]]]] = None,
Collaborator
Nit: It's kind of uncomfortable that the APIs for the warmup and capture hooks are inconsistent. Also, the capture_time_hooks is slightly misleading because they are applied during warmup, not just during capture.

Comment on lines +492 to +503
outputs, _ = _tree_flatten(func(*args, **kwargs))
for hook in hooks:
hook.remove()

if (
capture_time_hooks is not None
and func_idx < len(capture_time_hooks)
and capture_time_hooks[func_idx] is not None
and "forward" in capture_time_hooks[func_idx]
):
for hook in capture_time_hooks[func_idx]["forward"].values():
hook(func, args, outputs)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point.
