
fix: Remove fail-fast (-x) and guard distributed teardown against deadlock#4139

Open
ko3n1g wants to merge 1 commit into NVIDIA:main from ko3n1g:ko3n1g/fix/remove-fail-fast

Conversation

@ko3n1g
Contributor

@ko3n1g ko3n1g commented Apr 3, 2026

Problem

-x (fail-fast) was set in two places — pyproject.toml addopts and run_ci_test.sh — making one redundant. More importantly, investigation showed that -x was being used as a workaround for the wrong problem.

The actual risk is that distributed fixture teardown deadlocks when a rank is hanging:

  1. Rank 0 hangs inside an NCCL collective
  2. Rank 1 fails a test → pytest proceeds to fixture teardown
  3. Teardown calls barrier() + destroy_process_group() — both require all-rank coordination
  4. Rank 0 can't participate → both ranks deadlock indefinitely

With -x, rank 1 exits fast enough that torchrun kills rank 0 before teardown runs, which avoided the symptom but not the cause.
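The deadlock mechanics above can be sketched with a stdlib analogue — `threading.Barrier` standing in for `torch.distributed.barrier`, with two threads playing the two ranks. This is purely illustrative, not the PR's code:

```python
import threading

# Two "ranks" share an all-rank barrier. Rank 0 never arrives because it
# is stuck in a collective, so rank 1's teardown wait can never complete.
barrier = threading.Barrier(2)

def rank1_teardown() -> str:
    try:
        # Without a timeout this wait would block forever (the deadlock);
        # with one, the barrier breaks and rank 1 can bail instead.
        barrier.wait(timeout=0.2)
        return "destroyed"
    except threading.BrokenBarrierError:
        return "bailed"
```

Calling `rank1_teardown()` here returns `"bailed"` after 0.2s, since the second party never shows up — the same shape as a teardown barrier timing out on an unresponsive rank.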

Fix

Guard the barrier before teardown with a 30s timeout in both teardown sites:

  • tests/unit_tests/conftest.py — session-level cleanup fixture
  • tests/unit_tests/test_utilities.py — Utils.destroy_model_parallel()

If the barrier times out, a rank is unresponsive and we bail without calling destroy_process_group, breaking the deadlock. The session still exits non-zero (torchrun already recorded the failure).
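The guard pattern can be sketched as follows. This is a hypothetical helper, not the PR's exact code — `barrier_fn`/`destroy_fn` stand in for `torch.distributed.barrier` and `destroy_process_group`, and the real change may instead pass a timeout to the backend rather than wait on a Python-side thread:

```python
import threading

def guarded_teardown(barrier_fn, destroy_fn, timeout=30.0):
    """Run barrier_fn in a worker thread; call destroy_fn only if every
    rank reached the barrier within `timeout` seconds."""
    done = threading.Event()

    def _wait():
        barrier_fn()  # blocks until all ranks arrive
        done.set()

    threading.Thread(target=_wait, daemon=True).start()
    if done.wait(timeout):
        destroy_fn()  # safe: all ranks are responsive
        return True
    # A rank is unresponsive: skip destroy_process_group, since it is
    # itself an all-rank operation and would deadlock the same way.
    return False
```

With a responsive peer the barrier returns promptly and teardown completes; with a hung peer the guard returns `False` and the process exits without attempting the second collective.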

With this in place, -x is no longer needed for session safety, so it is removed from both pyproject.toml and run_ci_test.sh. This means on a real failure the full suite continues running on the non-failing ranks, giving a complete picture of what broke.

Verification

Tested with a purpose-built scenario: rank 0 stuck in dist.all_reduce(), rank 1 fails before the collective. Without this fix, the session hung indefinitely (had to docker kill). With this fix, both with -x and without -x variants exit cleanly in ~6-7s.

Scenario2  WITH    fail-fast → 7s ✅
Scenario2  WITHOUT fail-fast → 6s ✅  (previously: ∞, deadlock)
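The shape of the repro can be sketched with threads standing in for ranks; a long sleep simulates rank 0 stuck inside dist.all_reduce. Illustrative only — the actual scenario runs real ranks under torchrun:

```python
import threading
import time

events = []

def rank0():
    # "Hung" inside a collective; never returns within the test window.
    time.sleep(60)
    events.append("rank0 done")

def rank1():
    try:
        raise AssertionError("fails before reaching the collective")
    except AssertionError:
        events.append("rank1 failed")
    # Rank 1 now proceeds to fixture teardown, where the 30s-guarded
    # barrier times out instead of waiting on rank 0 forever.

threading.Thread(target=rank0, daemon=True).start()
t = threading.Thread(target=rank1)
t.start()
t.join()
```

After this runs, `events` contains only rank 1's failure; the daemon thread playing rank 0 is abandoned at exit, just as torchrun kills the hung rank.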

For the Python-level hang scenario (Scenario 1), removing -x produces the expected behaviour — rank 1 runs remaining tests after the failure, giving a fuller picture before torchrun kills the hanging rank:

Scenario1  WITH    fail-fast → 13s  (stops at first failure)
Scenario1  WITHOUT fail-fast → 19s  (runs all remaining tests, +6s = 3×2s tests)

🤖 Generated with Claude Code

Fail-fast (-x) was set in two places — pyproject.toml addopts and
run_ci_test.sh — making one redundant. More importantly, our investigation
showed that -x is the wrong fix: the real risk is that distributed fixture
teardown (barrier + destroy_process_group) deadlocks when a rank is hanging,
not that pytest keeps running tests too long.

Fix the root cause instead: wrap the barrier in cleanup (conftest.py) and
destroy_model_parallel (test_utilities.py) with a 30s timeout. If the
barrier times out a rank is unresponsive and we bail without calling
destroy_process_group, breaking the deadlock. This makes -x unnecessary
for session safety.

Remove -x from pyproject.toml addopts and run_ci_test.sh so that on a
real test failure the full suite still runs on the non-failing ranks,
giving a complete picture of what broke rather than stopping at the first
failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@ko3n1g ko3n1g requested a review from skyw April 3, 2026 23:00
@ko3n1g
Contributor Author

ko3n1g commented Apr 3, 2026

/ok to test

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 3, 2026
Contributor

@skyw skyw left a comment


LGTM.

A note on the try/except around torch.distributed.barrier: it doesn't do much for the NCCL backend.

@ko3n1g ko3n1g marked this pull request as ready for review April 3, 2026 23:08
@ko3n1g ko3n1g requested a review from a team as a code owner April 3, 2026 23:08
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 3, 2026 23:08