## Run L0 PyTorch tests in parallel across multiple GPUs

### Problem

`qa/L0_pytorch_unittest/test.sh` runs 30 pytest invocations sequentially. On multi-GPU nodes (e.g. an 8-GPU B200), seven GPUs sit idle while tests run one at a time on GPU 0.

### Solution

Detect the available GPUs and dispatch tests in parallel waves, one test per GPU per wave. On single-GPU machines, behavior is identical to the original sequential execution.

### Design

**Wave-based round-robin:**
```
Wave 1: GPU0=test_sanity GPU1=test_recipe GPU2=test_deferred ... GPU7=test_nvfp4
        (wait for all 8 to finish)
Wave 2: GPU0=test_mxfp8 GPU1=test_quantized ...
        (wait)
...
```

Each wave launches exactly one test per GPU as a background job pinned via `CUDA_VISIBLE_DEVICES=N`, then `wait`s for the whole wave to finish. No GPU ever runs two tests simultaneously, which avoids out-of-memory failures.

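The wave dispatch can be sketched roughly as follows. This is a minimal illustration, not the actual `test.sh` code: `TESTS`, `NUM_GPUS`, and the `sleep` stand-in for the pytest call are all hypothetical.

```shell
#!/usr/bin/env bash
# Minimal sketch of wave-based dispatch. NUM_GPUS and TESTS are
# illustrative; the real script runs pytest where sleep appears.
NUM_GPUS=4
TESTS=(test_a test_b test_c test_d test_e test_f)

i=0
while [ "$i" -lt "${#TESTS[@]}" ]; do
    # One wave: launch up to NUM_GPUS background jobs, one per GPU.
    for gpu in $(seq 0 $((NUM_GPUS - 1))); do
        if [ "$i" -ge "${#TESTS[@]}" ]; then break; fi
        name="${TESTS[$i]}"
        (
            export CUDA_VISIBLE_DEVICES="$gpu"   # pin the job to one GPU
            echo ">>> Starting: $name on GPU $gpu"
            sleep 0.1                            # stand-in for the pytest call
            echo ">>> Finished: $name on GPU $gpu"
        ) &
        i=$((i + 1))
    done
    wait   # barrier: the next wave starts only when every job is done
done
```

With 6 tests and 4 GPUs this runs two waves; `wait` is the only synchronization point, so a slow test stalls its whole wave.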
**Key decisions:**

| Aspect | Approach |
|---|---|
| GPU assignment | Round-robin via `CUDA_VISIBLE_DEVICES` per background job |
| Concurrency | Wave of `NUM_GPUS` jobs, then `wait`: one test per GPU at a time |
| Error tracking | File-based (`$FAIL_DIR/failures`), since shell variables don't propagate out of subshells |
| Single GPU | `NUM_GPUS=1` takes the synchronous path, identical to the original |
| Output | Per-test `.log` files during execution, replayed sequentially into the trace afterward |
| OOM safety | `python -u` (unbuffered) ensures errors are flushed before the process dies |
| JUnit XML | Unaffected: `--junitxml` writes directly to disk regardless of parallelism |

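The GPU-assignment and single-GPU rows above rest on the detection step, which might look something like this. The variable names and the `command -v` fallback are illustrative assumptions, not the exact script.

```shell
#!/usr/bin/env bash
# Sketch of GPU detection: honor CUDA_VISIBLE_DEVICES when the CI job
# sets it, otherwise count devices with nvidia-smi. Names illustrative.
if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    GPU_LIST="${CUDA_VISIBLE_DEVICES//,/ }"      # "0,1,2" -> "0 1 2"
elif command -v nvidia-smi >/dev/null 2>&1; then
    GPU_LIST="$(seq 0 $(($(nvidia-smi --list-gpus | wc -l) - 1)) | tr '\n' ' ')"
else
    GPU_LIST="0"                                 # no NVIDIA tooling: assume one GPU
fi
NUM_GPUS="$(echo "$GPU_LIST" | wc -w)"
echo "Detected $NUM_GPUS GPU(s): $GPU_LIST"
```

`NUM_GPUS=1` then naturally selects the synchronous path, so no separate configuration is needed on single-GPU runners.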
### How it works

1. **GPU detection**: reads `CUDA_VISIBLE_DEVICES` if set, otherwise counts devices via `nvidia-smi`
2. **During execution**: the trace shows progress markers (`>>> Starting: test_X on GPU Y`)
3. **Per-test output**: captured unbuffered to `$XML_LOG_DIR/<test>.log`
4. **After all waves**: logs are replayed sequentially into the trace, so it reads like the old sequential flow
5. **Failure collection**: each subshell appends to `$FAIL_DIR/failures`; the parent collects them at the end

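Steps 2, 3, and 5 can be sketched together as a per-test wrapper. This is hypothetical code: the `run_test` signature and the demo commands are illustrative, and the real script invokes `python -u -m pytest` where the placeholder command runs.

```shell
#!/usr/bin/env bash
# Sketch of the per-test wrapper: capture output to a per-test log and
# record failures in a shared file, since variables set inside a
# subshell never reach the parent. Names are illustrative.
XML_LOG_DIR="${XML_LOG_DIR:-$(mktemp -d)}"
mkdir -p "$XML_LOG_DIR"
FAIL_DIR="$(mktemp -d)"
: > "$FAIL_DIR/failures"

run_test() {
    local name="$1" gpu="$2"; shift 2
    echo ">>> Starting: $name on GPU $gpu"
    # The real script runs: python -u -m pytest ... --junitxml=... here.
    if ! CUDA_VISIBLE_DEVICES="$gpu" "$@" > "$XML_LOG_DIR/$name.log" 2>&1; then
        echo "$name" >> "$FAIL_DIR/failures"
    fi
    echo ">>> Finished: $name on GPU $gpu"
}

# Demo with stand-in commands: one passing, one failing.
run_test demo_pass 0 true
run_test demo_fail 1 false

# Parent collects failures at the end.
if [ -s "$FAIL_DIR/failures" ]; then
    echo "Failed: $(cat "$FAIL_DIR/failures")"
fi
```

Appending to a file is the simplest cross-subshell channel here; each writer emits a single short line, so interleaved appends stay intact.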
### Trace output

During execution:
```
Detected 8 GPU(s): 0 1 2 3 4 5 6 7
>>> Starting: test_sanity.py on GPU 0
>>> Starting: test_recipe.py on GPU 1
...
>>> Finished: test_recipe.py on GPU 1
>>> Finished: test_sanity.py on GPU 0
```

After completion (replayed cleanly):
```
=== Per-test output (replayed from parallel execution) ===

────────────────────────────────────────────────────────
>>> pytest_test_sanity
────────────────────────────────────────────────────────
<full pytest output>
...
=== End of per-test output ===
```

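The replay step that produces the second excerpt might look roughly like this. It is a sketch under assumed names: `XML_LOG_DIR`, the sample log, and the `pytest_` prefix are stand-ins.

```shell
#!/usr/bin/env bash
# Sketch of the post-run replay: concatenate per-test logs into the
# trace in a fixed order. The directory and sample log are stand-ins.
XML_LOG_DIR="$(mktemp -d)"
printf '1 passed\n' > "$XML_LOG_DIR/test_sanity.log"

echo "=== Per-test output (replayed from parallel execution) ==="
for log in "$XML_LOG_DIR"/*.log; do
    echo
    echo "────────────────────────────────────────────────────────"
    echo ">>> pytest_$(basename "$log" .log)"
    echo "────────────────────────────────────────────────────────"
    cat "$log"
done
echo "=== End of per-test output ==="
```

Because the glob expands in sorted order, the replayed trace is deterministic even though the tests finished in arbitrary order.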
### Backward compatibility

- **Single-GPU runners** (A100, H100, L40): `NUM_GPUS=1`, runs synchronously, identical to current behavior
- **`CUDA_VISIBLE_DEVICES="0"`** (B200_1GPU): detects 1 GPU and takes the synchronous path
- **JUnit XML**: same files, same names, same `$XML_LOG_DIR`; CI reporting is unaffected
- **Job trace**: all output is present (progress markers plus replayed logs), so manual debugging still works
- **Per-test `.log` files**: available as artifacts in `logs/` for direct access

### Expected speedup

With 30 tests on 8 GPUs in ~4 waves:
- **Current**: sequential; wall time = sum of all test durations
- **Parallel**: ~4 waves; wall time = sum of the longest test in each wave
- **Estimated**: 4-8x speedup, depending on the spread of test durations

### Testing

```bash
# 8 GPUs (parallel):
TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Single GPU (sequential, same as original):
CUDA_VISIBLE_DEVICES=0 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Custom GPU count:
CUDA_VISIBLE_DEVICES=0,1,2,3 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh
```

### Files changed

- `qa/L0_pytorch_unittest/test.sh`: parallel test infrastructure, with all 30 test invocations wrapped in `run_test`