
Parallel Test Execution to decrease CI run times #2824

Draft
sudhakarsingh27 wants to merge 2 commits into NVIDIA:main from sudhakarsingh27:sudhakars/parallel-test-execution



@sudhakarsingh27 sudhakarsingh27 commented Apr 2, 2026

Run L0 pytorch tests in parallel across multiple GPUs

Problem

`qa/L0_pytorch_unittest/test.sh` runs 30 pytest invocations sequentially. On multi-GPU nodes (B200, 8 GPUs), 7 GPUs sit idle while tests run one at a time on GPU 0.

Solution

Detect available GPUs and dispatch tests in parallel waves — one test per GPU per wave. On single-GPU machines, behavior is identical to the original sequential execution.

Design

Wave-based round-robin:

```
Wave 1: GPU0=test_sanity  GPU1=test_recipe  GPU2=test_deferred  ... GPU7=test_nvfp4
         (wait for all 8 to finish)
Wave 2: GPU0=test_mxfp8   GPU1=test_quantized  ...
         (wait)
...
```

Each wave launches exactly one test per GPU as a background job via `CUDA_VISIBLE_DEVICES=N`, then `wait`s for the whole wave to finish. No GPU ever runs two tests simultaneously, which avoids OOM.
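The wave loop can be sketched in bash roughly as follows. This is an illustrative sketch, not the PR's exact code: `run_one` is a stub standing in for the real pytest invocation, and the test names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: dispatch tests in waves, one background job per GPU, then wait.
set -u

GPU_IDS=(0 1)                               # stand-in for the detected GPUs
TESTS=(test_a test_b test_c test_d test_e)  # stand-in for the 30 pytest runs
NUM_GPUS=${#GPU_IDS[@]}

run_one() {  # placeholder for the real pytest invocation
  echo ">>> Starting: $1 on GPU $2"
  echo ">>> Finished: $1 on GPU $2"
}

i=0
while [ "$i" -lt "${#TESTS[@]}" ]; do
  # Launch at most one test per GPU as a background job...
  for slot in $(seq 0 $((NUM_GPUS - 1))); do
    idx=$((i + slot))
    [ "$idx" -ge "${#TESTS[@]}" ] && break
    CUDA_VISIBLE_DEVICES="${GPU_IDS[$slot]}" \
      run_one "${TESTS[$idx]}" "${GPU_IDS[$slot]}" &
  done
  wait   # ...then block until the entire wave finishes
  i=$((i + NUM_GPUS))
done
```

Because the loop advances by `NUM_GPUS` per wave and `wait`s between waves, a GPU is never assigned a second test before its first one exits.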

Key decisions:

| Aspect | Approach |
| --- | --- |
| GPU assignment | Round-robin via `CUDA_VISIBLE_DEVICES` per background job |
| Concurrency | Wave of `NUM_GPUS` jobs, then `wait` — 1 test per GPU at a time |
| Error tracking | File-based (`$FAIL_DIR/failures`) — shell vars don't propagate from subshells |
| Single GPU | `NUM_GPUS=1` → synchronous path, identical to original |
| Output | Per-test `.log` files during execution, replayed sequentially into trace after |
| OOM safety | `python -u` (unbuffered) ensures errors are flushed before process death |
| JUnit XML | Unaffected — `--junitxml` writes directly to disk regardless of parallelism |
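The error-tracking row deserves a note: a backgrounded subshell cannot modify the parent shell's variables, so failures are recorded through a shared file instead. A minimal sketch of that pattern, with hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: file-based failure collection across background subshells.
set -u

FAIL_DIR=$(mktemp -d)

run_test() {
  local name=$1; shift
  (
    # If the command fails, record the test name in the shared file.
    # Appending here is visible to the parent; a variable would not be.
    if ! "$@"; then
      echo "$name" >> "$FAIL_DIR/failures"
    fi
  ) &
}

run_test good true    # succeeds, leaves no record
run_test bad  false   # fails, appends "bad"
wait

if [ -s "$FAIL_DIR/failures" ]; then
  echo "Failed tests:"
  cat "$FAIL_DIR/failures"
fi
```

The parent only inspects `$FAIL_DIR/failures` after `wait`, so no test's failure can be missed by reading too early.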

How it works

  1. GPU detection: reads CUDA_VISIBLE_DEVICES if set, otherwise counts via nvidia-smi
  2. During execution: trace shows progress markers (>>> Starting: test_X on GPU Y)
  3. Per-test output: captured to $XML_LOG_DIR/<test>.log (unbuffered)
  4. After all waves: logs replayed sequentially into the trace — reads like the old sequential flow
  5. Failure collection: each subshell writes to $FAIL_DIR/failures; parent collects at the end
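Step 1 might look roughly like this (assumed logic; the actual script may differ, and the single-GPU fallback here is an illustration):

```shell
#!/usr/bin/env bash
# Sketch: detect GPUs from CUDA_VISIBLE_DEVICES, else nvidia-smi, else assume 1.
if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
  # "0,2,3" -> "0 2 3"
  GPU_IDS=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' ' ')
elif command -v nvidia-smi >/dev/null 2>&1; then
  # One line per GPU; number them 0..N-1.
  GPU_IDS=$(seq -s ' ' 0 $(($(nvidia-smi --list-gpus | wc -l) - 1)))
else
  GPU_IDS="0"   # no detection possible: fall back to the sequential path
fi
NUM_GPUS=$(echo "$GPU_IDS" | wc -w)
echo "Detected $NUM_GPUS GPU(s): $GPU_IDS"
```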

Trace output

During execution:

```
Detected 8 GPU(s): 0 1 2 3 4 5 6 7
>>> Starting: test_sanity.py on GPU 0
>>> Starting: test_recipe.py on GPU 1
...
>>> Finished: test_recipe.py on GPU 1
>>> Finished: test_sanity.py on GPU 0
```

After completion (replayed cleanly):

```
=== Per-test output (replayed from parallel execution) ===

────────────────────────────────────────────────────────
>>> pytest_test_sanity
────────────────────────────────────────────────────────
<full pytest output>
...
=== End of per-test output ===
```
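The replay step itself can be sketched as a loop over the per-test logs (paths, file names, and separators here are illustrative, not the PR's exact code):

```shell
#!/usr/bin/env bash
# Sketch: after all waves finish, echo each per-test log back into the
# trace in a fixed order, so the combined trace reads like a sequential run.
set -u

XML_LOG_DIR=$(mktemp -d)
printf 'output A\n' > "$XML_LOG_DIR/test_a.log"   # stand-ins for real logs
printf 'output B\n' > "$XML_LOG_DIR/test_b.log"

echo "=== Per-test output (replayed from parallel execution) ==="
for log in "$XML_LOG_DIR"/*.log; do
  name=$(basename "$log" .log)
  echo "--------------------------------"
  echo ">>> pytest_$name"
  echo "--------------------------------"
  cat "$log"
done
echo "=== End of per-test output ==="
```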

Backward compatibility

  • Single GPU runners (A100, H100, L40): NUM_GPUS=1, runs synchronously — identical to current behavior
  • CUDA_VISIBLE_DEVICES="0" (B200_1GPU): detects 1 GPU, synchronous path
  • JUnit XML: same files, same names, same $XML_LOG_DIR — CI reporting unaffected
  • Job trace: all output present (progress markers + replayed logs) — manual debugging works
  • Per-test .log files: available as artifacts in logs/ for direct access

Expected speedup

With 30 tests on 8 GPUs in ~4 waves:

  • Current: sequential, wall time = sum of all tests
  • Parallel: ~4 waves, wall time = sum of longest test per wave
  • Estimated: 4-8x speedup depending on test duration spread
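The wave count behind that estimate is a ceiling division:

```shell
# ceil(num_tests / num_gpus) waves: 30 tests on 8 GPUs -> 4 waves
TESTS=30; GPUS=8
WAVES=$(( (TESTS + GPUS - 1) / GPUS ))
echo "Waves: $WAVES"   # prints "Waves: 4"
```

With perfectly balanced waves the ceiling on speedup is 30 tests / 4 waves = 7.5x, consistent with the 4-8x range quoted above; imbalanced waves (one long test per wave) pull it toward the low end.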

Testing

```shell
# 8 GPUs (parallel):
TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Single GPU (sequential, same as original):
CUDA_VISIBLE_DEVICES=0 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Custom GPU count:
CUDA_VISIBLE_DEVICES=0,1,2,3 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh
```

Files changed

  • qa/L0_pytorch_unittest/test.sh — parallel test infrastructure + all 30 test invocations wrapped in run_test

Detect available GPUs and dispatch pytest invocations in parallel waves,
one test per GPU per wave. On single-GPU machines, behavior is identical
to the original sequential execution.

- GPU detection from CUDA_VISIBLE_DEVICES or nvidia-smi
- Wave-based round-robin: launch NUM_GPUS background jobs, wait, repeat
- File-based error tracking (shell vars don't propagate from subshells)
- Per-test log files replayed into trace after all waves complete
- Unbuffered output (python -u) for OOM error capture
- Progress markers in trace during execution

With 30 tests on 8 GPUs (B200), expected ~4 waves instead of 30
sequential runs, roughly 4-8x speedup depending on test duration spread.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the sudhakars/parallel-test-execution branch from ec7df29 to 964db19 Compare April 2, 2026 06:22
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>