## Run L0 PyTorch tests in parallel across multiple GPUs

### Problem

`qa/L0_pytorch_unittest/test.sh` runs 30 pytest invocations sequentially. On multi-GPU nodes (e.g. an 8-GPU B200), seven GPUs sit idle while tests run one at a time on GPU 0.

### Solution

Detect the available GPUs and dispatch tests in parallel waves, one test per GPU per wave. On single-GPU machines, behavior is identical to the original sequential execution.

### Design

**Wave-based round-robin:**
```
Wave 1: GPU0=test_sanity GPU1=test_recipe GPU2=test_deferred ... GPU7=test_nvfp4
        (wait for all 8 to finish)
Wave 2: GPU0=test_mxfp8 GPU1=test_quantized ...
        (wait)
...
```

Each wave launches exactly one test per GPU as a background job pinned via `CUDA_VISIBLE_DEVICES=N`, then `wait`s for the whole wave to finish. No GPU ever runs two tests simultaneously, which avoids out-of-memory failures.

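The wave dispatch can be sketched roughly as follows. This is a minimal illustration, not the actual `test.sh` code: `TESTS`, `NUM_GPUS`, and the `sleep` stand-in for the pytest call are all hypothetical.

```shell
#!/usr/bin/env bash
# Minimal sketch of wave-based dispatch. NUM_GPUS and TESTS are
# illustrative; the real script runs pytest where sleep appears.
NUM_GPUS=4
TESTS=(test_a test_b test_c test_d test_e test_f)

i=0
while [ "$i" -lt "${#TESTS[@]}" ]; do
    # One wave: launch up to NUM_GPUS background jobs, one per GPU.
    for gpu in $(seq 0 $((NUM_GPUS - 1))); do
        if [ "$i" -ge "${#TESTS[@]}" ]; then break; fi
        name="${TESTS[$i]}"
        (
            export CUDA_VISIBLE_DEVICES="$gpu"   # pin the job to one GPU
            echo ">>> Starting: $name on GPU $gpu"
            sleep 0.1                            # stand-in for the pytest call
            echo ">>> Finished: $name on GPU $gpu"
        ) &
        i=$((i + 1))
    done
    wait   # barrier: the next wave starts only when every job is done
done
```

With 6 tests and 4 GPUs this runs two waves; `wait` is the only synchronization point, so a slow test stalls its whole wave.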
**Key decisions:**

| Aspect | Approach |
|---|---|
| GPU assignment | Round-robin via `CUDA_VISIBLE_DEVICES` per background job |
| Concurrency | Wave of `NUM_GPUS` jobs, then `wait`: one test per GPU at a time |
| Error tracking | File-based (`$FAIL_DIR/failures`), since shell variables don't propagate out of subshells |
| Single GPU | `NUM_GPUS=1` takes the synchronous path, identical to the original |
| Output | Per-test `.log` files during execution, replayed sequentially into the trace afterward |
| OOM safety | `python -u` (unbuffered) ensures errors are flushed before the process dies |
| JUnit XML | Unaffected: `--junitxml` writes directly to disk regardless of parallelism |

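The GPU-assignment and single-GPU rows above rest on the detection step, which might look something like this. The variable names and the `command -v` fallback are illustrative assumptions, not the exact script.

```shell
#!/usr/bin/env bash
# Sketch of GPU detection: honor CUDA_VISIBLE_DEVICES when the CI job
# sets it, otherwise count devices with nvidia-smi. Names illustrative.
if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    GPU_LIST="${CUDA_VISIBLE_DEVICES//,/ }"      # "0,1,2" -> "0 1 2"
elif command -v nvidia-smi >/dev/null 2>&1; then
    GPU_LIST="$(seq 0 $(($(nvidia-smi --list-gpus | wc -l) - 1)) | tr '\n' ' ')"
else
    GPU_LIST="0"                                 # no NVIDIA tooling: assume one GPU
fi
NUM_GPUS="$(echo "$GPU_LIST" | wc -w)"
echo "Detected $NUM_GPUS GPU(s): $GPU_LIST"
```

`NUM_GPUS=1` then naturally selects the synchronous path, so no separate configuration is needed on single-GPU runners.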
### How it works

1. **GPU detection**: reads `CUDA_VISIBLE_DEVICES` if set, otherwise counts devices via `nvidia-smi`
2. **During execution**: the trace shows progress markers (`>>> Starting: test_X on GPU Y`)
3. **Per-test output**: captured unbuffered to `$XML_LOG_DIR/<test>.log`
4. **After all waves**: logs are replayed sequentially into the trace, so it reads like the old sequential flow
5. **Failure collection**: each subshell appends to `$FAIL_DIR/failures`; the parent collects them at the end

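Steps 2, 3, and 5 can be sketched together as a per-test wrapper. This is hypothetical code: the `run_test` signature and the demo commands are illustrative, and the real script invokes `python -u -m pytest` where the placeholder command runs.

```shell
#!/usr/bin/env bash
# Sketch of the per-test wrapper: capture output to a per-test log and
# record failures in a shared file, since variables set inside a
# subshell never reach the parent. Names are illustrative.
XML_LOG_DIR="${XML_LOG_DIR:-$(mktemp -d)}"
mkdir -p "$XML_LOG_DIR"
FAIL_DIR="$(mktemp -d)"
: > "$FAIL_DIR/failures"

run_test() {
    local name="$1" gpu="$2"; shift 2
    echo ">>> Starting: $name on GPU $gpu"
    # The real script runs: python -u -m pytest ... --junitxml=... here.
    if ! CUDA_VISIBLE_DEVICES="$gpu" "$@" > "$XML_LOG_DIR/$name.log" 2>&1; then
        echo "$name" >> "$FAIL_DIR/failures"
    fi
    echo ">>> Finished: $name on GPU $gpu"
}

# Demo with stand-in commands: one passing, one failing.
run_test demo_pass 0 true
run_test demo_fail 1 false

# Parent collects failures at the end.
if [ -s "$FAIL_DIR/failures" ]; then
    echo "Failed: $(cat "$FAIL_DIR/failures")"
fi
```

Appending to a file is the simplest cross-subshell channel here; each writer emits a single short line, so interleaved appends stay intact.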
### Trace output

During execution:
```
Detected 8 GPU(s): 0 1 2 3 4 5 6 7
>>> Starting: test_sanity.py on GPU 0
>>> Starting: test_recipe.py on GPU 1
...
>>> Finished: test_recipe.py on GPU 1
>>> Finished: test_sanity.py on GPU 0
```

After completion (replayed cleanly):
```
=== Per-test output (replayed from parallel execution) ===

────────────────────────────────────────────────────────
>>> pytest_test_sanity
────────────────────────────────────────────────────────
<full pytest output>
...
=== End of per-test output ===
```

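The replay step that produces the second excerpt might look roughly like this. It is a sketch under assumed names: `XML_LOG_DIR`, the sample log, and the `pytest_` prefix are stand-ins.

```shell
#!/usr/bin/env bash
# Sketch of the post-run replay: concatenate per-test logs into the
# trace in a fixed order. The directory and sample log are stand-ins.
XML_LOG_DIR="$(mktemp -d)"
printf '1 passed\n' > "$XML_LOG_DIR/test_sanity.log"

echo "=== Per-test output (replayed from parallel execution) ==="
for log in "$XML_LOG_DIR"/*.log; do
    echo
    echo "────────────────────────────────────────────────────────"
    echo ">>> pytest_$(basename "$log" .log)"
    echo "────────────────────────────────────────────────────────"
    cat "$log"
done
echo "=== End of per-test output ==="
```

Because the glob expands in sorted order, the replayed trace is deterministic even though the tests finished in arbitrary order.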
### Backward compatibility

- **Single-GPU runners** (A100, H100, L40): `NUM_GPUS=1`, runs synchronously, identical to current behavior
- **`CUDA_VISIBLE_DEVICES="0"`** (B200_1GPU): detects 1 GPU and takes the synchronous path
- **JUnit XML**: same files, same names, same `$XML_LOG_DIR`; CI reporting is unaffected
- **Job trace**: all output is present (progress markers plus replayed logs), so manual debugging still works
- **Per-test `.log` files**: available as artifacts in `logs/` for direct access

### Expected speedup

With 30 tests on 8 GPUs in ~4 waves:
- **Current**: sequential; wall time = sum of all test durations
- **Parallel**: ~4 waves; wall time = sum of the longest test in each wave
- **Estimated**: 4-8x speedup, depending on the spread of test durations

### Testing

```bash
# 8 GPUs (parallel):
TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Single GPU (sequential, same as original):
CUDA_VISIBLE_DEVICES=0 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Custom GPU count:
CUDA_VISIBLE_DEVICES=0,1,2,3 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh
```

### Files changed

- `qa/L0_pytorch_unittest/test.sh`: parallel test infrastructure, with all 30 test invocations wrapped in `run_test`