Parallel Test Execution to decrease CI run times #2824
Draft
sudhakarsingh27 wants to merge 2 commits into NVIDIA:main from
Detect available GPUs and dispatch pytest invocations in parallel waves, one test per GPU per wave. On single-GPU machines, behavior is identical to the original sequential execution.

- GPU detection from CUDA_VISIBLE_DEVICES or nvidia-smi
- Wave-based round-robin: launch NUM_GPUS background jobs, wait, repeat
- File-based error tracking (shell vars don't propagate from subshells)
- Per-test log files replayed into trace after all waves complete
- Unbuffered output (python -u) for OOM error capture
- Progress markers in trace during execution

With 30 tests on 8 GPUs (B200), expected ~4 waves instead of 30 sequential runs, roughly 4-8x speedup depending on test duration spread.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
# Run L0 pytorch tests in parallel across multiple GPUs
## Problem

`qa/L0_pytorch_unittest/test.sh` runs 30 pytest invocations sequentially. On multi-GPU nodes (B200, 8 GPUs), 7 GPUs sit idle while tests run one at a time on GPU 0.

## Solution
Detect available GPUs and dispatch tests in parallel waves — one test per GPU per wave. On single-GPU machines, behavior is identical to the original sequential execution.
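A minimal, self-contained sketch of the scheme, with dummy test names and a stand-in `run_one` helper; none of these names are claimed to match what `test.sh` actually defines:

```shell
#!/usr/bin/env bash
set -u

# GPU detection: prefer CUDA_VISIBLE_DEVICES, fall back to nvidia-smi,
# default to 1 so the sketch also runs on CPU-only machines.
if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    NUM_GPUS=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)
elif command -v nvidia-smi >/dev/null 2>&1; then
    NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
else
    NUM_GPUS=1
fi

TESTS=(test_a test_b test_c test_d test_e)   # stand-ins for the 30 pytest invocations
FAIL_DIR=$(mktemp -d)

run_one() {  # $1 = test name, $2 = GPU index (hypothetical helper)
    echo ">>> Starting: $1 on GPU $2"
    # The real script would run something like:
    #   CUDA_VISIBLE_DEVICES=$2 python -u -m pytest "$1" > "$1.log" 2>&1
    if ! true; then                            # 'true' stands in for the pytest call
        echo "$1" >> "$FAIL_DIR/failures"      # file-based: vars set here die with the subshell
    fi
}

# Wave-based round-robin: launch up to NUM_GPUS background jobs, then wait.
i=0
while [ "$i" -lt "${#TESTS[@]}" ]; do
    for gpu in $(seq 0 $((NUM_GPUS - 1))); do
        [ "$i" -ge "${#TESTS[@]}" ] && break
        run_one "${TESTS[$i]}" "$gpu" &
        i=$((i + 1))
    done
    wait   # wave boundary: one test per GPU at a time
done

if [ -s "$FAIL_DIR/failures" ]; then echo "some tests failed"; else echo "all passed"; fi
```

On a single-GPU (or GPU-less) machine the inner loop degenerates to one job per wave, matching the original sequential behavior.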
## Design

**Wave-based round-robin:** each wave launches exactly 1 test per GPU as a background job via `CUDA_VISIBLE_DEVICES=N`, then `wait`s. No GPU ever runs 2 tests simultaneously (avoids OOM).

**Key decisions:**

- One `CUDA_VISIBLE_DEVICES` value per background job
- Launch `NUM_GPUS` jobs, then `wait` — 1 test per GPU at a time
- File-based failure tracking (`$FAIL_DIR/failures`) — shell vars don't propagate from subshells
- `NUM_GPUS=1` → synchronous path, identical to original
- `.log` files during execution, replayed sequentially into the trace after
- `python -u` (unbuffered) ensures errors are flushed before process death
- `--junitxml` writes directly to disk regardless of parallelism

## How it works
- Reads `CUDA_VISIBLE_DEVICES` if set, otherwise counts GPUs via `nvidia-smi`
- Prints progress markers to the trace (`>>> Starting: test_X on GPU Y`)
- Each test's output goes to `$XML_LOG_DIR/<test>.log` (unbuffered)
- Failures are recorded in `$FAIL_DIR/failures`; the parent collects them at the end

## Trace output
During execution:
After completion (replayed cleanly):
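The PR's sample traces were not preserved in this page; the following runnable miniature of the capture-then-replay scheme (illustrative names, not the real `test.sh` code) shows the same two phases:

```shell
#!/usr/bin/env bash
set -u

XML_LOG_DIR=$(mktemp -d)
TESTS=(test_a test_b test_c)   # stand-ins for real pytest files

# During execution: only short progress markers reach the trace;
# each job's full output goes to its own log file, so parallel
# output never interleaves.
for i in "${!TESTS[@]}"; do
    t=${TESTS[$i]}
    echo ">>> Starting: $t on GPU $i"
    ( echo "full output of $t" > "$XML_LOG_DIR/$t.log" ) &
done
wait

# After completion: replay every log sequentially, so the trace
# reads as if the tests had run one after another.
for t in "${TESTS[@]}"; do
    echo "=== $t ==="
    cat "$XML_LOG_DIR/$t.log"
done
```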
## Backward compatibility

- `NUM_GPUS=1`: runs synchronously — identical to current behavior
- `CUDA_VISIBLE_DEVICES="0"` (B200_1GPU): detects 1 GPU, synchronous path
- JUnit XML still written to `$XML_LOG_DIR` — CI reporting unaffected
- `.log` files: available as artifacts in `logs/` for direct access

## Expected speedup
With 30 tests on 8 GPUs in ~4 waves:
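The ~4-wave figure is ceiling division; with perfectly uniform test durations the wall time would shrink by roughly 30/4 = 7.5x, and skew in durations (a wave is only as fast as its slowest test) pulls the real number toward the lower end of the stated 4-8x range. A quick check:

```shell
TESTS=30
GPUS=8
WAVES=$(( (TESTS + GPUS - 1) / GPUS ))   # ceiling division
echo "waves=$WAVES"
# Wall time ~= sum over waves of each wave's longest test,
# so equal-length tests give ~TESTS/WAVES x speedup; skew reduces this.
```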
## Testing
## Files changed

`qa/L0_pytorch_unittest/test.sh` — parallel test infrastructure + all 30 test invocations wrapped in `run_test`
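For illustration, wrapping the invocations could be as simple as a queueing helper that the dispatcher later drains in waves; the real `run_test` may instead launch immediately, and the signature and file names below are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: each former "python -m pytest <file>" line becomes
# run_test <file>, which only records the test; wave dispatch happens later.
QUEUE=()
run_test() { QUEUE+=("$1"); }

run_test tests/pytorch/test_a.py   # illustrative file names, not the real list
run_test tests/pytorch/test_b.py

echo "${#QUEUE[@]} tests queued"
```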