
Parallel Test Execution to decrease CI run times #2824

Draft
sudhakarsingh27 wants to merge 2 commits into NVIDIA:main from sudhakarsingh27:sudhakars/parallel-test-execution



@sudhakarsingh27 sudhakarsingh27 commented Apr 2, 2026

Run L0 pytorch tests in parallel across multiple GPUs

Problem

`qa/L0_pytorch_unittest/test.sh` runs 30 pytest invocations sequentially. On multi-GPU nodes (B200, 8 GPUs), 7 GPUs sit idle while tests run one at a time on GPU 0.

Solution

Detect available GPUs and dispatch tests in parallel waves — one test per GPU per wave. On single-GPU machines, behavior is identical to the original sequential execution.

Design

Wave-based round-robin:

```
Wave 1: GPU0=test_sanity  GPU1=test_recipe  GPU2=test_deferred  ... GPU7=test_nvfp4
         (wait for all 8 to finish)
Wave 2: GPU0=test_mxfp8   GPU1=test_quantized  ...
         (wait)
...
```

Each wave launches exactly one test per GPU as a background job via `CUDA_VISIBLE_DEVICES=N`, then `wait`s for the whole wave to finish. No GPU ever runs two tests simultaneously, which avoids OOM.
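The wave loop can be sketched in bash roughly as follows. This is an illustrative sketch, not the PR's exact code: `run_one` is a stub standing in for the real pytest invocation, and the test names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: dispatch tests in waves, one background job per GPU, then wait.
set -u

GPU_IDS=(0 1)                               # stand-in for the detected GPUs
TESTS=(test_a test_b test_c test_d test_e)  # stand-in for the 30 pytest runs
NUM_GPUS=${#GPU_IDS[@]}

run_one() {  # placeholder for the real pytest invocation
  echo ">>> Starting: $1 on GPU $2"
  echo ">>> Finished: $1 on GPU $2"
}

i=0
while [ "$i" -lt "${#TESTS[@]}" ]; do
  # Launch at most one test per GPU as a background job...
  for slot in $(seq 0 $((NUM_GPUS - 1))); do
    idx=$((i + slot))
    [ "$idx" -ge "${#TESTS[@]}" ] && break
    CUDA_VISIBLE_DEVICES="${GPU_IDS[$slot]}" \
      run_one "${TESTS[$idx]}" "${GPU_IDS[$slot]}" &
  done
  wait   # ...then block until the entire wave finishes
  i=$((i + NUM_GPUS))
done
```

Because the loop advances by `NUM_GPUS` per wave and `wait`s between waves, a GPU is never assigned a second test before its first one exits.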

Key decisions:

| Aspect | Approach |
| --- | --- |
| GPU assignment | Round-robin via `CUDA_VISIBLE_DEVICES` per background job |
| Concurrency | Wave of `NUM_GPUS` jobs, then `wait` — 1 test per GPU at a time |
| Error tracking | File-based (`$FAIL_DIR/failures`) — shell vars don't propagate from subshells |
| Single GPU | `NUM_GPUS=1` → synchronous path, identical to original |
| Output | Per-test `.log` files during execution, replayed sequentially into trace after |
| OOM safety | `python -u` (unbuffered) ensures errors are flushed before process death |
| JUnit XML | Unaffected — `--junitxml` writes directly to disk regardless of parallelism |
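The error-tracking row deserves a note: a backgrounded subshell cannot modify the parent shell's variables, so failures are recorded through a shared file instead. A minimal sketch of that pattern, with hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: file-based failure collection across background subshells.
set -u

FAIL_DIR=$(mktemp -d)

run_test() {
  local name=$1; shift
  (
    # If the command fails, record the test name in the shared file.
    # Appending here is visible to the parent; a variable would not be.
    if ! "$@"; then
      echo "$name" >> "$FAIL_DIR/failures"
    fi
  ) &
}

run_test good true    # succeeds, leaves no record
run_test bad  false   # fails, appends "bad"
wait

if [ -s "$FAIL_DIR/failures" ]; then
  echo "Failed tests:"
  cat "$FAIL_DIR/failures"
fi
```

The parent only inspects `$FAIL_DIR/failures` after `wait`, so no test's failure can be missed by reading too early.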

How it works

  1. GPU detection: reads CUDA_VISIBLE_DEVICES if set, otherwise counts via nvidia-smi
  2. During execution: trace shows progress markers (>>> Starting: test_X on GPU Y)
  3. Per-test output: captured to $XML_LOG_DIR/<test>.log (unbuffered)
  4. After all waves: logs replayed sequentially into the trace — reads like the old sequential flow
  5. Failure collection: each subshell writes to $FAIL_DIR/failures; parent collects at the end
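Step 1 might look roughly like this (assumed logic; the actual script may differ, and the single-GPU fallback here is an illustration):

```shell
#!/usr/bin/env bash
# Sketch: detect GPUs from CUDA_VISIBLE_DEVICES, else nvidia-smi, else assume 1.
if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
  # "0,2,3" -> "0 2 3"
  GPU_IDS=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' ' ')
elif command -v nvidia-smi >/dev/null 2>&1; then
  # One line per GPU; number them 0..N-1.
  GPU_IDS=$(seq -s ' ' 0 $(($(nvidia-smi --list-gpus | wc -l) - 1)))
else
  GPU_IDS="0"   # no detection possible: fall back to the sequential path
fi
NUM_GPUS=$(echo "$GPU_IDS" | wc -w)
echo "Detected $NUM_GPUS GPU(s): $GPU_IDS"
```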

Trace output

During execution:

```
Detected 8 GPU(s): 0 1 2 3 4 5 6 7
>>> Starting: test_sanity.py on GPU 0
>>> Starting: test_recipe.py on GPU 1
...
>>> Finished: test_recipe.py on GPU 1
>>> Finished: test_sanity.py on GPU 0
```

After completion (replayed cleanly):

```
=== Per-test output (replayed from parallel execution) ===

────────────────────────────────────────────────────────
>>> pytest_test_sanity
────────────────────────────────────────────────────────
<full pytest output>
...
=== End of per-test output ===
```
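The replay step itself can be sketched as a loop over the per-test logs (paths, file names, and separators here are illustrative, not the PR's exact code):

```shell
#!/usr/bin/env bash
# Sketch: after all waves finish, echo each per-test log back into the
# trace in a fixed order, so the combined trace reads like a sequential run.
set -u

XML_LOG_DIR=$(mktemp -d)
printf 'output A\n' > "$XML_LOG_DIR/test_a.log"   # stand-ins for real logs
printf 'output B\n' > "$XML_LOG_DIR/test_b.log"

echo "=== Per-test output (replayed from parallel execution) ==="
for log in "$XML_LOG_DIR"/*.log; do
  name=$(basename "$log" .log)
  echo "--------------------------------"
  echo ">>> pytest_$name"
  echo "--------------------------------"
  cat "$log"
done
echo "=== End of per-test output ==="
```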

Backward compatibility

  • Single GPU runners (A100, H100, L40): NUM_GPUS=1, runs synchronously — identical to current behavior
  • CUDA_VISIBLE_DEVICES="0" (B200_1GPU): detects 1 GPU, synchronous path
  • JUnit XML: same files, same names, same $XML_LOG_DIR — CI reporting unaffected
  • Job trace: all output present (progress markers + replayed logs) — manual debugging works
  • Per-test .log files: available as artifacts in logs/ for direct access

Expected speedup

With 30 tests on 8 GPUs in ~4 waves:

  • Current: sequential, wall time = sum of all tests
  • Parallel: ~4 waves, wall time = sum of longest test per wave
  • Estimated: 4-8x speedup depending on test duration spread
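The wave count behind that estimate is a ceiling division:

```shell
# ceil(num_tests / num_gpus) waves: 30 tests on 8 GPUs -> 4 waves
TESTS=30; GPUS=8
WAVES=$(( (TESTS + GPUS - 1) / GPUS ))
echo "Waves: $WAVES"   # prints "Waves: 4"
```

With perfectly balanced waves the ceiling on speedup is 30 tests / 4 waves = 7.5x, consistent with the 4-8x range quoted above; imbalanced waves (one long test per wave) pull it toward the low end.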

Testing

```shell
# 8 GPUs (parallel):
TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Single GPU (sequential, same as original):
CUDA_VISIBLE_DEVICES=0 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Custom GPU count:
CUDA_VISIBLE_DEVICES=0,1,2,3 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh
```

Files changed

  • qa/L0_pytorch_unittest/test.sh — parallel test infrastructure + all 30 test invocations wrapped in run_test

Detect available GPUs and dispatch pytest invocations in parallel waves,
one test per GPU per wave. On single-GPU machines, behavior is identical
to the original sequential execution.

- GPU detection from CUDA_VISIBLE_DEVICES or nvidia-smi
- Wave-based round-robin: launch NUM_GPUS background jobs, wait, repeat
- File-based error tracking (shell vars don't propagate from subshells)
- Per-test log files replayed into trace after all waves complete
- Unbuffered output (python -u) for OOM error capture
- Progress markers in trace during execution

With 30 tests on 8 GPUs (B200), expected ~4 waves instead of 30
sequential runs, roughly 4-8x speedup depending on test duration spread.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the sudhakars/parallel-test-execution branch from ec7df29 to 964db19 Compare April 2, 2026 06:22
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>