Commit ec7df29: Add PR description (1 parent: 7fcf6e3)

File tree: PR_DESCRIPTION.md (1 file changed, 98 additions, 0 deletions)
## Run L0 pytorch tests in parallel across multiple GPUs

### Problem

`qa/L0_pytorch_unittest/test.sh` runs 30 pytest invocations sequentially. On multi-GPU nodes (B200, 8 GPUs), 7 GPUs sit idle while tests run one at a time on GPU 0.

### Solution

Detect the available GPUs and dispatch tests in parallel waves, one test per GPU per wave. On single-GPU machines, behavior is identical to the original sequential execution.
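A minimal sketch of what the detection step might look like (illustrative only; variable names like `GPU_IDS` are assumptions, not the script's actual identifiers):

```shell
#!/usr/bin/env bash
# Sketch of GPU detection, not the actual test.sh code.
# Honor CUDA_VISIBLE_DEVICES if the caller pinned it; otherwise fall
# back to counting devices with `nvidia-smi -L` (one line per GPU).
set -u

if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
  GPU_IDS=${CUDA_VISIBLE_DEVICES//,/ }          # "0,1,3" -> "0 1 3"
else
  NUM=$(nvidia-smi -L 2>/dev/null | wc -l)
  [ "$NUM" -gt 0 ] || NUM=1                     # no GPUs visible: assume 1
  GPU_IDS=$(seq -s ' ' 0 $((NUM - 1)))
fi
NUM_GPUS=$(echo "$GPU_IDS" | wc -w)
echo "Detected $NUM_GPUS GPU(s): $GPU_IDS"
```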
### Design

**Wave-based round-robin:**

```
Wave 1: GPU0=test_sanity GPU1=test_recipe GPU2=test_deferred ... GPU7=test_nvfp4
        (wait for all 8 to finish)
Wave 2: GPU0=test_mxfp8 GPU1=test_quantized ...
        (wait)
...
```

Each wave launches exactly one test per GPU as a background job via `CUDA_VISIBLE_DEVICES=N`, then `wait`s for the whole wave. No GPU ever runs two tests simultaneously, which avoids OOM.
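In shell terms, the wave loop might look like this sketch. The hypothetical `run_one` stands in for the real `python -u -m pytest` invocation, and the test list is made up; the real script adds logging and failure tracking:

```shell
#!/usr/bin/env bash
# Sketch only: wave-based dispatch, one background job per GPU per wave.
set -u

# Stand-in for the real pytest call; prints instead of running tests.
run_one() {
  local gpu=$1 test_name=$2
  # Real script would do: CUDA_VISIBLE_DEVICES=$gpu python -u -m pytest ...
  echo "ran $test_name on GPU $gpu"
}

run_waves() {
  local -a gpus=("$@")                  # GPU ids detected earlier
  local -a tests=(t1 t2 t3 t4 t5)       # hypothetical test list
  local num_gpus=${#gpus[@]} i g
  for ((i = 0; i < ${#tests[@]}; i += num_gpus)); do
    # One wave: at most one job per GPU, all in the background.
    for ((g = 0; g < num_gpus && i + g < ${#tests[@]}; g++)); do
      run_one "${gpus[g]}" "${tests[i + g]}" &
    done
    wait   # barrier: the next wave starts only when this one is done
  done
}

run_waves 0 1   # pretend two GPUs were detected
```

Output order within a wave is nondeterministic (the jobs race), which is exactly why the real script captures per-test logs and replays them afterwards.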
**Key decisions:**

| Aspect | Approach |
|---|---|
| GPU assignment | Round-robin via `CUDA_VISIBLE_DEVICES` per background job |
| Concurrency | Wave of `NUM_GPUS` jobs, then `wait`; one test per GPU at a time |
| Error tracking | File-based (`$FAIL_DIR/failures`), since shell variables don't propagate out of subshells |
| Single GPU | `NUM_GPUS=1` takes the synchronous path, identical to the original |
| Output | Per-test `.log` files during execution, replayed sequentially into the trace afterwards |
| OOM safety | `python -u` (unbuffered) ensures errors are flushed before the process dies |
| JUnit XML | Unaffected: `--junitxml` writes directly to disk regardless of parallelism |
### How it works

1. **GPU detection**: reads `CUDA_VISIBLE_DEVICES` if set, otherwise counts devices via `nvidia-smi`
2. **During execution**: the trace shows progress markers (`>>> Starting: test_X on GPU Y`)
3. **Per-test output**: captured to `$XML_LOG_DIR/<test>.log` (unbuffered)
4. **After all waves**: logs are replayed sequentially into the trace, so it reads like the old sequential flow
5. **Failure collection**: each subshell writes to `$FAIL_DIR/failures`; the parent collects them at the end
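The file-based failure collection exists because a backgrounded subshell gets a copy of the parent's variables and cannot mutate the originals. A self-contained sketch (the failing test name is hypothetical; `failures` matches the file name in the description):

```shell
#!/usr/bin/env bash
# Sketch: why failures go through a file, not a shell variable.
set -u
FAIL_DIR=$(mktemp -d)

FAILED_VAR=""
(
  # Pretend a test failed inside this backgrounded subshell.
  FAILED_VAR="test_example"                     # lost: subshell-local copy
  echo "test_example" >> "$FAIL_DIR/failures"   # survives: written to disk
) &
wait

echo "via variable: '$FAILED_VAR'"              # empty in the parent
echo "via file:     '$(cat "$FAIL_DIR/failures")'"
```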
### Trace output

During execution:

```
Detected 8 GPU(s): 0 1 2 3 4 5 6 7
>>> Starting: test_sanity.py on GPU 0
>>> Starting: test_recipe.py on GPU 1
...
>>> Finished: test_recipe.py on GPU 1
>>> Finished: test_sanity.py on GPU 0
```

After completion (replayed cleanly):

```
=== Per-test output (replayed from parallel execution) ===

────────────────────────────────────────────────────────
>>> pytest_test_sanity
────────────────────────────────────────────────────────
<full pytest output>
...
=== End of per-test output ===
```
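The replay step can be as simple as concatenating the per-test logs in a fixed order once all waves have finished. A self-contained sketch with fake logs (the directory and log names here are made up; the real script uses `$XML_LOG_DIR/<test>.log`):

```shell
#!/usr/bin/env bash
# Sketch of replaying per-test logs into the trace after all waves.
set -u

replay_logs() {
  local dir=$1 log
  echo "=== Per-test output (replayed from parallel execution) ==="
  for log in "$dir"/*.log; do   # glob order is alphabetical: deterministic
    echo ">>> $(basename "$log" .log)"
    cat "$log"
  done
  echo "=== End of per-test output ==="
}

# Fake logs standing in for $XML_LOG_DIR/<test>.log:
LOG_DIR=$(mktemp -d)
printf 'output of sanity tests\n' > "$LOG_DIR/pytest_test_sanity.log"
printf 'output of recipe tests\n' > "$LOG_DIR/pytest_test_recipe.log"

replay_logs "$LOG_DIR"
```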
### Backward compatibility

- **Single-GPU runners** (A100, H100, L40): `NUM_GPUS=1`, runs synchronously, identical to current behavior
- **`CUDA_VISIBLE_DEVICES="0"`** (B200_1GPU): detects 1 GPU, takes the synchronous path
- **JUnit XML**: same files, same names, same `$XML_LOG_DIR`; CI reporting is unaffected
- **Job trace**: all output is present (progress markers + replayed logs), so manual debugging still works
- **Per-test `.log` files**: available as artifacts in `logs/` for direct access
### Expected speedup

With 30 tests on 8 GPUs, in 4 waves:

- **Current**: sequential; wall time = sum of all test durations
- **Parallel**: 4 waves; wall time = sum of the longest test in each wave
- **Estimated**: 4-8x speedup, depending on the spread of test durations
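The wave count is just a ceiling division of tests by GPUs:

```shell
TESTS=30; GPUS=8
WAVES=$(( (TESTS + GPUS - 1) / GPUS ))   # ceil(30 / 8)
echo "$WAVES waves"
```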
### Testing

```bash
# 8 GPUs (parallel):
TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Single GPU (sequential, same as original):
CUDA_VISIBLE_DEVICES=0 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh

# Custom GPU count:
CUDA_VISIBLE_DEVICES=0,1,2,3 TE_PATH=$(pwd) XML_LOG_DIR=/tmp/test_logs bash qa/L0_pytorch_unittest/test.sh
```
### Files changed

- `qa/L0_pytorch_unittest/test.sh`: parallel test infrastructure, with all 30 test invocations wrapped in `run_test`
