fix: Remove fail-fast (-x) and guard distributed teardown against deadlock #4139
Open
ko3n1g wants to merge 1 commit into NVIDIA:main
Conversation
Fail-fast (`-x`) was set in two places — `pyproject.toml` `addopts` and `run_ci_test.sh` — making one redundant. More importantly, our investigation showed that `-x` is the wrong fix: the real risk is that distributed fixture teardown (barrier + `destroy_process_group`) deadlocks when a rank is hanging, not that pytest keeps running tests too long.

Fix the root cause instead: wrap the barrier in `cleanup` (conftest.py) and `destroy_model_parallel` (test_utilities.py) with a 30s timeout. If the barrier times out, a rank is unresponsive and we bail without calling `destroy_process_group`, breaking the deadlock. This makes `-x` unnecessary for session safety.

Remove `-x` from `pyproject.toml` `addopts` and `run_ci_test.sh` so that on a real test failure the full suite still runs on the non-failing ranks, giving a complete picture of what broke rather than stopping at the first failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
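The timeout-guarded teardown described above can be sketched as follows. This is a minimal illustration, assuming a thread-join timeout as the guard mechanism; the helper names (`_finished_within`, `guarded_teardown`) and the exact structure are illustrative, not the PR's actual conftest.py/test_utilities.py code:

```python
import threading

# Assumption: the 30s value mirrors the PR description; the real code
# in conftest.py / test_utilities.py is not shown on this page.
BARRIER_TIMEOUT_S = 30

def _finished_within(fn, timeout_s):
    """Run fn in a daemon thread; report whether it finished in time."""
    done = threading.Event()

    def target():
        fn()
        done.set()

    t = threading.Thread(target=target, daemon=True)
    t.start()
    t.join(timeout_s)
    return done.is_set()

def guarded_teardown(barrier_fn, destroy_fn, timeout_s=BARRIER_TIMEOUT_S):
    """Tear down the process group only if all ranks reached the barrier.

    If the barrier times out, some rank is unresponsive; calling
    destroy_process_group() would deadlock this rank too, so skip it.
    The daemon thread is abandoned and dies with the process.
    """
    if _finished_within(barrier_fn, timeout_s):
        destroy_fn()
```

In real use this would be called as, e.g., `guarded_teardown(torch.distributed.barrier, torch.distributed.destroy_process_group)`; the session still exits non-zero because torchrun already recorded the failure.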
Contributor
Author
/ok to test
skyw
approved these changes
Apr 3, 2026
Contributor
skyw
left a comment
LGTM.
A note on the `try:` over `torch.distributed.barrier`: it doesn't do a lot for the NCCL backend.
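One plausible reading of this note: a hung NCCL collective blocks inside native code and raises no Python exception, so a `try`/`except` wrapper around the barrier never gets a chance to fire — which is why the PR guards with a timeout instead. An illustrative stand-in (a silent block substitutes for the stuck collective; this is not PyTorch code):

```python
import time

caught = False
try:
    # stand-in for a hung NCCL collective: it blocks silently and
    # raises nothing, so control simply does not return
    time.sleep(0.5)
except RuntimeError:
    caught = True
# the except branch never runs: a hang produces no exception to catch,
# so try/except cannot un-stick a blocked rank
```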
Problem

`-x` (fail-fast) was set in two places — `pyproject.toml` `addopts` and `run_ci_test.sh` — making one redundant. More importantly, through investigation we found that `-x` was being used as a workaround for the wrong problem. The actual risk is that distributed fixture teardown deadlocks when a rank is hanging: `barrier()` + `destroy_process_group()` both require all-rank coordination. With `-x`, rank 1 exits fast enough that torchrun kills rank 0 before teardown runs, which avoided the symptom but not the cause.

Fix

Guard the barrier before teardown with a 30s timeout in both teardown sites:

- `tests/unit_tests/conftest.py` — session-level `cleanup` fixture
- `tests/unit_tests/test_utilities.py` — `Utils.destroy_model_parallel()`

If the barrier times out, a rank is unresponsive and we bail without calling `destroy_process_group`, breaking the deadlock. The session still exits non-zero (torchrun already recorded the failure). With this in place, `-x` is no longer needed for session safety, so it is removed from both `pyproject.toml` and `run_ci_test.sh`. This means that on a real failure the full suite continues running on the non-failing ranks, giving a complete picture of what broke.

Verification

Tested with a purpose-built scenario: rank 0 stuck in `dist.all_reduce()`, rank 1 fails before the collective. Without this fix, the session hung indefinitely (had to `docker kill`). With this fix, both the `with -x` and `without -x` variants exit cleanly in ~6-7s. For the Python-level hang scenario (Scenario 1), removing `-x` produces the expected behaviour — rank 1 runs the remaining tests after the failure, giving a fuller picture before torchrun kills the hanging rank:

🤖 Generated with Claude Code