fix: Remove fail-fast (-x) and guard distributed teardown against deadlock #4139
Open
ko3n1g wants to merge 1 commit into NVIDIA:main
Conversation
Fail-fast (`-x`) was set in two places — `pyproject.toml` `addopts` and `run_ci_test.sh` — making one redundant. More importantly, our investigation showed that `-x` is the wrong fix: the real risk is that distributed fixture teardown (barrier + `destroy_process_group`) deadlocks when a rank is hanging, not that pytest keeps running tests too long.

Fix the root cause instead: wrap the barrier in `cleanup` (conftest.py) and `destroy_model_parallel` (test_utilities.py) with a 30s timeout. If the barrier times out, a rank is unresponsive and we bail without calling `destroy_process_group`, breaking the deadlock. This makes `-x` unnecessary for session safety.

Remove `-x` from `pyproject.toml` `addopts` and `run_ci_test.sh` so that on a real test failure the full suite still runs on the non-failing ranks, giving a complete picture of what broke rather than stopping at the first failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
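The timeout-guarded teardown described above can be sketched as follows. This is a minimal illustration, assuming a thread-join timeout as the guard mechanism; the helper names (`_finished_within`, `guarded_teardown`) and the exact structure are illustrative, not the PR's actual conftest.py/test_utilities.py code:

```python
import threading

# Assumption: the 30s value mirrors the PR description; the real code
# in conftest.py / test_utilities.py is not shown on this page.
BARRIER_TIMEOUT_S = 30

def _finished_within(fn, timeout_s):
    """Run fn in a daemon thread; report whether it finished in time."""
    done = threading.Event()

    def target():
        fn()
        done.set()

    t = threading.Thread(target=target, daemon=True)
    t.start()
    t.join(timeout_s)
    return done.is_set()

def guarded_teardown(barrier_fn, destroy_fn, timeout_s=BARRIER_TIMEOUT_S):
    """Tear down the process group only if all ranks reached the barrier.

    If the barrier times out, some rank is unresponsive; calling
    destroy_process_group() would deadlock this rank too, so skip it.
    The daemon thread is abandoned and dies with the process.
    """
    if _finished_within(barrier_fn, timeout_s):
        destroy_fn()
```

In real use this would be called as, e.g., `guarded_teardown(torch.distributed.barrier, torch.distributed.destroy_process_group)`; the session still exits non-zero because torchrun already recorded the failure.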
Contributor
Author
/ok to test
skyw
approved these changes
Apr 3, 2026
Contributor
skyw
left a comment
LGTM.
A note on the `try:` over `torch.distributed.barrier`: it doesn't do a lot for the NCCL backend.
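One plausible reading of this note: a hung NCCL collective blocks inside native code and raises no Python exception, so a `try`/`except` wrapper around the barrier never gets a chance to fire — which is why the PR guards with a timeout instead. An illustrative stand-in (a silent block substitutes for the stuck collective; this is not PyTorch code):

```python
import time

caught = False
try:
    # stand-in for a hung NCCL collective: it blocks silently and
    # raises nothing, so control simply does not return
    time.sleep(0.5)
except RuntimeError:
    caught = True
# the except branch never runs: a hang produces no exception to catch,
# so try/except cannot un-stick a blocked rank
```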
Problem

`-x` (fail-fast) was set in two places — `pyproject.toml` `addopts` and `run_ci_test.sh` — making one redundant. More importantly, through investigation we found that `-x` was being used as a workaround for the wrong problem. The actual risk is that distributed fixture teardown deadlocks when a rank is hanging: `barrier()` + `destroy_process_group()` both require all-rank coordination. With `-x`, rank 1 exits fast enough that torchrun kills rank 0 before teardown runs, which avoided the symptom but not the cause.

Fix

Guard the barrier before teardown with a 30s timeout in both teardown sites:

- `tests/unit_tests/conftest.py` — session-level `cleanup` fixture
- `tests/unit_tests/test_utilities.py` — `Utils.destroy_model_parallel()`

If the barrier times out, a rank is unresponsive and we bail without calling `destroy_process_group`, breaking the deadlock. The session still exits non-zero (torchrun already recorded the failure). With this in place, `-x` is no longer needed for session safety, so it is removed from both `pyproject.toml` and `run_ci_test.sh`. This means that on a real failure the full suite continues running on the non-failing ranks, giving a complete picture of what broke.

Verification

Tested with a purpose-built scenario: rank 0 stuck in `dist.all_reduce()`, rank 1 fails before the collective. Without this fix, the session hung indefinitely (had to `docker kill`). With this fix, both the `with -x` and `without -x` variants exit cleanly in ~6-7s. For the Python-level hang scenario (Scenario 1), removing `-x` produces the expected behaviour — rank 1 runs the remaining tests after the failure, giving a fuller picture before torchrun kills the hanging rank:

🤖 Generated with Claude Code