Commit 83eb80f

add end-to-end benchmark suite with feature coverage modes
Split the monolithic bench_e2e.py into a modular benchmark/ package with focused modules (modes, datagen, runner, verify, report, sysinfo). Add 7 feature test modes (merge, correction, dedup-pe, dedup-se, qualcut-pe, qualcut-se, ht-pe) that exercise fastp code paths beyond default filtering, including overlapping PE data generation for merge/correction testing. Support custom FASTQ input via --r1/--r2 with automatic format conversion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b138112 commit 83eb80f

11 files changed: 1348 additions, 0 deletions


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -34,6 +34,9 @@ fastp
 *.out
 *.app
 
+# Python
+__pycache__/
+
 # Editor Config
 .vscode

benchmark/README.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# fastp Benchmark Suite

End-to-end benchmark for fastp: measures throughput and peak memory, and verifies output correctness across I/O format combinations and feature modes.

## Quick Start

```bash
# Run all I/O format modes with generated data (10M pairs)
python3 -m benchmark --opt ./fastp

# Compare optimized build against baseline
python3 -m benchmark --opt ./fastp_opt --orig ./fastp_baseline

# Use your own FASTQ files
python3 -m benchmark --r1 /data/sample_R1.fq.gz --r2 /data/sample_R2.fq.gz --opt ./fastp
```

## Modes

### I/O Format Modes (default: `all`)

Test every input/output compression combination:

| Mode         | Type | Input  | Output |
|--------------|------|--------|--------|
| fq-fq        | PE   | .fq    | .fq    |
| fq-gz        | PE   | .fq    | .fq.gz |
| gz-fq        | PE   | .fq.gz | .fq    |
| gz-gz        | PE   | .fq.gz | .fq.gz |
| se-fq-fq     | SE   | .fq    | .fq    |
| se-fq-gz     | SE   | .fq    | .fq.gz |
| se-gz-fq     | SE   | .fq.gz | .fq    |
| se-gz-gz     | SE   | .fq.gz | .fq.gz |
| stdin-stdout | SE   | stdin  | stdout |

### Feature Modes (`all-feat`)

Exercise specific fastp processing paths:

| Mode       | fastp Flags                         | Data        | Tests                              |
|------------|-------------------------------------|-------------|------------------------------------|
| merge      | `--merge --correction`              | overlapping | PE merge + base correction (#676)  |
| correction | `--correction`                      | overlapping | overlap analysis + base correction |
| dedup-pe   | `--dedup`                           | random      | duplicate detection (PE)           |
| dedup-se   | `--dedup`                           | random      | duplicate detection (SE)           |
| qualcut-pe | `--cut_front --cut_tail -W 4 -M 20` | random      | sliding window quality cut (PE)    |
| qualcut-se | `--cut_front --cut_tail -W 4 -M 20` | random      | sliding window quality cut (SE)    |
| ht-pe      | (default) `-w 32`                   | random      | high-thread concurrency stress     |

### Mode Groups

| Alias      | Expands To                             |
|------------|----------------------------------------|
| all        | all 9 I/O format modes (default)       |
| all-pe     | PE format modes only                   |
| all-se     | SE format modes + stdin-stdout         |
| all-feat   | all 7 feature modes                    |
| everything | all I/O + all feature modes (16 total) |

```bash
python3 -m benchmark --mode all-feat --opt ./fastp
python3 -m benchmark --mode merge,correction --opt ./fastp --orig ./fastp_v110
python3 -m benchmark --mode everything --opt ./fastp
```

## Options

```
--opt PATH     Optimized binary path (default: /tmp/fastp_opt)
--orig PATH    Baseline binary for comparison (optional)
--r1 PATH      Custom R1 input (.fq or .fq.gz), skips data generation
--r2 PATH      Custom R2 input (.fq or .fq.gz), required for PE modes
--mode MODE    Comma-separated mode list (default: all)
--pairs N      Read pairs to generate (default: 10,000,000)
--threads N    Worker threads, 0 = auto (default: 0)
--runs N       Repeat count, reports median (default: 3)
--seed N       RNG seed for data generation (default: 2026)
--json PATH    Load saved JSON results and print summary table
--merge A B    Merge two opt-only result JSONs into comparison table
```

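The `--runs` option reports a median over repeated runs. As a rough illustration of that idea only, the sketch below times an arbitrary command several times and takes the median wall-clock time. The `time_command` helper is hypothetical and is not taken from `runner.py`, which is not shown in this diff; peak-memory measurement is left out.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd: list[str], runs: int = 3) -> dict:
    """Hypothetical helper: run cmd `runs` times and summarize wall-clock time."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        # check=True raises if the command exits with a non-zero status.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - t0)
    return {"runs": times, "median": statistics.median(times)}


if __name__ == "__main__":
    # Stand-in command; a real benchmark would invoke the fastp binary here.
    result = time_command([sys.executable, "-c", "pass"], runs=3)
    print(f"median {result['median']:.3f}s over {len(result['runs'])} runs")
```

A median is less sensitive than a mean to a single slow run, for example one that starts with a cold page cache.
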
## Test Data

When `--r1`/`--r2` are not provided, the benchmark generates synthetic data:

- **Random PE data** (10M pairs x 150bp): independent R1/R2 with realistic quality profiles. Used for I/O format modes, dedup, qualcut, and high-thread tests.
- **Overlapping PE data** (10M pairs x 150bp): R1 and R2 derived from the same fragment (insert ~220bp, overlap ~80bp). R2 has injected mismatches at low quality (Q10-14) to trigger base correction. Used for merge and correction modes.

Generated data is cached in `/tmp/fastp_benchmark/data/` and reused across runs. Delete the directory to regenerate.

When custom files are provided:
- Format conversion is automatic (decompress .gz to .fq or compress .fq to .gz as needed)
- Merge/correction modes use the custom data directly (real sequencing data has natural overlap)

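For two reads of length L taken from opposite ends of a fragment of length I (with L <= I < 2L), the overlap is 2L - I, which is how a 220bp insert yields an 80bp overlap at 150bp reads. The sketch below checks that arithmetic and estimates how often the overlap would drop below 30bp, the threshold named in the generator's `MIN_INSERT` comment. It assumes a plain normal insert-size model with sd=30 (the value in the `generate_merge_data` docstring) and ignores the generator's clamping and integer truncation, so the fraction is only approximate.

```python
from statistics import NormalDist

READ_LEN = 150
INSERT_MEAN = 220
INSERT_SD = 30  # assumed spread, taken from the generator's docstring


def overlap_len(insert: int, read_len: int = READ_LEN) -> int:
    """Bases shared by R1 and R2 when read_len <= insert < 2 * read_len."""
    return 2 * read_len - insert


inserts = NormalDist(INSERT_MEAN, INSERT_SD)
# Overlap < 30 bp  <=>  insert > 2 * READ_LEN - 30 = 270 bp.
frac_short = 1.0 - inserts.cdf(2 * READ_LEN - 30)

print(overlap_len(INSERT_MEAN))  # 80
print(f"{frac_short:.1%} of pairs would overlap by less than 30 bp")
```

Under this model a small share of synthetic pairs has a short overlap, which may be worth keeping in mind when reading merge rates from the benchmark.
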
## Output

- Per-run timing and peak RSS printed live
- Markdown summary table at the end
- JSON results saved to `/tmp/fastp_benchmark/results.json`
- MD5 verification: when a baseline is provided, checks that the optimized output matches the baseline byte-for-byte

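`verify.py` is not part of the files shown here, so the following is only a sketch of what an MD5 comparison of two output files could look like. The names `file_md5` and `outputs_match` are hypothetical; the files are streamed in chunks so that large FASTQ outputs do not have to fit in memory.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: stream a file through MD5 in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def outputs_match(opt_path: Path, orig_path: Path) -> bool:
    """True when both files hash to the same MD5 digest."""
    return file_md5(opt_path) == file_md5(orig_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a, b = Path(d) / "opt.fq", Path(d) / "orig.fq"
        a.write_bytes(b"@r1\nACGT\n+\nIIII\n")
        b.write_bytes(b"@r1\nACGT\n+\nIIII\n")
        print("match" if outputs_match(a, b) else "MISMATCH")
```

Hashing compressed `.fq.gz` files compares the compressed bytes, so two builds that compress differently would differ even if the reads are identical. Hashing the decompressed stream would be the alternative; which approach `verify.py` takes is not visible in this diff.
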
## Module Structure

```
benchmark/
├── __main__.py    # python3 -m benchmark entry point
├── run.py         # CLI argument parsing and orchestration
├── modes.py       # mode definitions, constants, path config
├── datagen.py     # synthetic FASTQ data generation
├── sysinfo.py     # hardware/OS info collection
├── runner.py      # benchmark execution (build_cmd, run, measure)
├── verify.py      # output MD5 verification
└── report.py      # summary table formatting
```

benchmark/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
"""fastp end-to-end benchmark suite."""

benchmark/__main__.py

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
"""Allow running as: python3 -m benchmark [OPTIONS]"""
from .run import main

main()

benchmark/datagen.py

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
"""FASTQ test data generation for benchmarking."""
import gzip
import random
import time
from pathlib import Path


def generate_data(r1_gz: Path, r2_gz: Path, num_pairs: int, seed: int = 2026):
    """Generate random paired-end FASTQ reads (independent R1/R2)."""
    READ_LEN = 150
    BATCH = 100_000
    BASES = b"ACGT"
    QUAL_POOL = 512
    rng = random.Random(seed)

    pool = []
    for _ in range(QUAL_POOL):
        q = bytearray(READ_LEN)
        for j in range(READ_LEN):
            if j < 5 or j > READ_LEN - 10:
                q[j] = rng.randint(20, 35) + 33
            else:
                q[j] = rng.randint(30, 40) + 33
        pool.append(bytes(q))

    t0 = time.time()
    written = 0
    with gzip.open(r1_gz, "wb", compresslevel=1) as f1, \
         gzip.open(r2_gz, "wb", compresslevel=1) as f2:
        while written < num_pairs:
            n = min(BATCH, num_pairs - written)
            b1 = bytearray()
            b2 = bytearray()
            for i in range(n):
                rid = written + i + 1
                name = f"@SIM:BENCH:1:{1101 + rid // 10000000}:{rid % 50000}:{(rid * 7) % 50000}".encode()
                s1 = bytes(rng.choices(BASES, k=READ_LEN))
                s2 = bytes(rng.choices(BASES, k=READ_LEN))
                q1 = pool[rng.randint(0, QUAL_POOL - 1)]
                q2 = pool[rng.randint(0, QUAL_POOL - 1)]
                b1 += name + b" 1:N:0:ATCG\n" + s1 + b"\n+\n" + q1 + b"\n"
                b2 += name + b" 2:N:0:ATCG\n" + s2 + b"\n+\n" + q2 + b"\n"
            f1.write(bytes(b1))
            f2.write(bytes(b2))
            written += n
            if written % 1_000_000 == 0:
                e = time.time() - t0
                eta = (num_pairs - written) / (written / e)
                print(f" {100 * written / num_pairs:5.1f}% {written / 1e6:.0f}M pairs"
                      f" {written / e / 1e6:.2f}M/s ETA {eta:.0f}s", flush=True)

    print(f" Generated in {time.time() - t0:.1f}s")


def generate_merge_data(r1_gz: Path, r2_gz: Path, num_pairs: int, seed: int = 2026):
    """Generate overlapping PE reads for merge/correction testing.

    Each pair derives from a random fragment with insert size ~220bp (sd=30).
    R2 is the reverse complement of the fragment's 3' end. Mismatches are
    injected in the overlap region with low quality on R2 to trigger base
    correction (R1 Q30+ vs R2 Q10-14 at mismatch positions).
    """
    READ_LEN = 150
    BATCH = 100_000
    BASES = b"ACGT"
    COMP = bytes.maketrans(b"ACGT", b"TGCA")
    QUAL_POOL = 512
    INSERT_MEAN = 220
    INSERT_SD = 30
    MIN_INSERT = READ_LEN + 30  # need >= 30bp overlap for fastp
    MAX_INSERT = 2 * READ_LEN - 1
    MISMATCH_RATE = 0.015

    rng = random.Random(seed)

    pool = []
    for _ in range(QUAL_POOL):
        q = bytearray(READ_LEN)
        for j in range(READ_LEN):
            if j < 5 or j > READ_LEN - 10:
                q[j] = rng.randint(25, 35) + 33
            else:
                q[j] = rng.randint(30, 40) + 33
        pool.append(bytes(q))

    def revcomp(seq: bytes) -> bytes:
        return seq.translate(COMP)[::-1]

    t0 = time.time()
    written = 0
    with gzip.open(r1_gz, "wb", compresslevel=1) as f1, \
         gzip.open(r2_gz, "wb", compresslevel=1) as f2:
        while written < num_pairs:
            n = min(BATCH, num_pairs - written)
            b1 = bytearray()
            b2 = bytearray()
            for i in range(n):
                rid = written + i + 1
                name = f"@SIM:MERGE:1:{1101 + rid // 10000000}:{rid % 50000}:{(rid * 7) % 50000}".encode()

                # Sample insert size from truncated normal distribution
                insert = int(rng.gauss(INSERT_MEAN, INSERT_SD))
                insert = max(MIN_INSERT, min(MAX_INSERT, insert))

                # Generate fragment and derive reads
                frag = bytearray(rng.choices(BASES, k=insert))
                s1 = bytes(frag[:READ_LEN])
                s2 = bytearray(revcomp(bytes(frag[insert - READ_LEN:])))

                q1 = bytearray(pool[rng.randint(0, QUAL_POOL - 1)])
                q2 = bytearray(pool[rng.randint(0, QUAL_POOL - 1)])

                # Inject mismatches in R2 overlap with low quality to trigger correction
                overlap_len = 2 * READ_LEN - insert
                n_mm = max(1, int(overlap_len * MISMATCH_RATE))
                n_mm = min(n_mm, min(4, int(overlap_len * 0.19)))
                overlap_start = READ_LEN - overlap_len
                for p in rng.sample(range(overlap_start, READ_LEN), n_mm):
                    alts = [b for b in BASES if b != s2[p]]
                    s2[p] = rng.choice(alts)
                    q2[p] = rng.randint(10 + 33, 14 + 33)  # Q10-14

                b1 += name + b" 1:N:0:ATCG\n" + s1 + b"\n+\n" + bytes(q1) + b"\n"
                b2 += name + b" 2:N:0:ATCG\n" + bytes(s2) + b"\n+\n" + bytes(q2) + b"\n"
            f1.write(bytes(b1))
            f2.write(bytes(b2))
            written += n
            if written % 1_000_000 == 0:
                e = time.time() - t0
                eta = (num_pairs - written) / (written / e)
                print(f" {100 * written / num_pairs:5.1f}% {written / 1e6:.0f}M pairs"
                      f" {written / e / 1e6:.2f}M/s ETA {eta:.0f}s", flush=True)

    print(f" Generated in {time.time() - t0:.1f}s")
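
The coordinate arithmetic in `generate_merge_data` can be checked on a single pair. The sketch below re-implements only the fragment-to-reads step, without mismatch injection, under a hypothetical name (`make_pair`), and confirms that reverse-complementing R2's overlap window reproduces R1's overlap window.

```python
import random

READ_LEN = 150
COMP = bytes.maketrans(b"ACGT", b"TGCA")


def revcomp(seq: bytes) -> bytes:
    """Reverse complement of an uppercase ACGT byte string."""
    return seq.translate(COMP)[::-1]


def make_pair(insert: int, rng: random.Random) -> tuple[bytes, bytes]:
    """Hypothetical helper: derive R1/R2 from one random fragment."""
    frag = bytes(rng.choices(b"ACGT", k=insert))
    r1 = frag[:READ_LEN]
    r2 = revcomp(frag[insert - READ_LEN:])
    return r1, r2


rng = random.Random(2026)
insert = 220
r1, r2 = make_pair(insert, rng)

overlap_len = 2 * READ_LEN - insert     # 80
overlap_start = READ_LEN - overlap_len  # 70, the same index in R1 and R2

# Both windows cover fragment positions [insert - READ_LEN, READ_LEN).
assert r1[overlap_start:] == revcomp(r2[overlap_start:])
```

Because the overlap occupies `[overlap_start, READ_LEN)` in both reads, sampling mismatch positions from that range in R2, as the generator does, keeps every injected mismatch inside the overlap.
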

benchmark/modes.py

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
"""Mode definitions, constants, and path configuration."""
import os
from pathlib import Path

# ── Paths ────────────────────────────────────────────────────────────────────

BENCH_DIR = Path("/tmp/fastp_benchmark")
DATA_DIR = BENCH_DIR / "data"
OUT_DIR = BENCH_DIR / "output"
RESULTS_JSON = BENCH_DIR / "results.json"

CPU_COUNT = os.cpu_count() or 4

# ── Mode definitions ─────────────────────────────────────────────────────────

PE_MODES = ["fq-fq", "fq-gz", "gz-fq", "gz-gz"]
SE_MODES = ["se-fq-fq", "se-fq-gz", "se-gz-fq", "se-gz-gz"]
ALL_MODES = PE_MODES + SE_MODES + ["stdin-stdout"]

# Feature test modes: exercise specific fastp features beyond default filtering
FEAT_MODES = [
    "merge", "correction", "dedup-pe", "dedup-se",
    "qualcut-pe", "qualcut-se", "ht-pe",
]

FEAT_DEFS = {
    "merge":      {"base": "pe", "in_fmt": "gz", "out_fmt": "gz", "data": "merge",
                   "extra": ["--merge", "--correction"]},
    "correction": {"base": "pe", "in_fmt": "gz", "out_fmt": "gz", "data": "merge",
                   "extra": ["--correction"]},
    "dedup-pe":   {"base": "pe", "in_fmt": "gz", "out_fmt": "gz",
                   "extra": ["--dedup"]},
    "dedup-se":   {"base": "se", "in_fmt": "gz", "out_fmt": "gz",
                   "extra": ["--dedup"]},
    "qualcut-pe": {"base": "pe", "in_fmt": "gz", "out_fmt": "gz",
                   "extra": ["--cut_front", "--cut_tail", "-W", "4", "-M", "20"]},
    "qualcut-se": {"base": "se", "in_fmt": "gz", "out_fmt": "gz",
                   "extra": ["--cut_front", "--cut_tail", "-W", "4", "-M", "20"]},
    "ht-pe":      {"base": "pe", "in_fmt": "gz", "out_fmt": "gz",
                   "extra": [], "threads": 32},
}

MODE_ALIASES = {
    "all-pe": PE_MODES,
    "all-se": SE_MODES + ["stdin-stdout"],
    "all-feat": FEAT_MODES,
    "everything": ALL_MODES + FEAT_MODES,
}


def parse_mode(mode: str) -> tuple[str, str, str]:
    """Parse mode string -> (type, in_fmt, out_fmt).

    Returns ("stdin", "", "") for stdin-stdout,
    ("se", "fq"|"gz", "fq"|"gz") for se-* modes,
    ("pe", "fq"|"gz", "fq"|"gz") for bare fq-fq etc.
    Feature modes (merge, correction, etc.) resolve via FEAT_DEFS.
    """
    if mode in FEAT_DEFS:
        d = FEAT_DEFS[mode]
        return (d["base"], d["in_fmt"], d["out_fmt"])
    if mode == "stdin-stdout":
        return ("stdin", "", "")
    if mode.startswith("se-"):
        parts = mode[3:].split("-")
        return ("se", parts[0], parts[1])
    parts = mode.split("-")
    return ("pe", parts[0], parts[1])


def auto_threads(mode: str) -> int:
    """Calculate worker threads, reserving reader/writer for each mode.

    PE: 2 readers (L+R) + 2 writers (L+R) = reserve 4
    SE: 1 reader + 1 writer = reserve 2
    Feature modes may override thread count (e.g. ht-pe forces 32).
    """
    if mode in FEAT_DEFS and "threads" in FEAT_DEFS[mode]:
        return FEAT_DEFS[mode]["threads"]
    mtype = parse_mode(mode)[0]
    reserved = 4 if mtype == "pe" else 2
    return max(1, CPU_COUNT - reserved)
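
How a comma-separated `--mode` value is expanded against these tables is not shown in this diff; it presumably happens in `run.py`. A minimal sketch, assuming a hypothetical `expand_modes` helper, a copy of the lists above, and an `all` alias that maps to `ALL_MODES` as the README describes (that alias is not in the `MODE_ALIASES` table shown here):

```python
PE_MODES = ["fq-fq", "fq-gz", "gz-fq", "gz-gz"]
SE_MODES = ["se-fq-fq", "se-fq-gz", "se-gz-fq", "se-gz-gz"]
ALL_MODES = PE_MODES + SE_MODES + ["stdin-stdout"]
FEAT_MODES = ["merge", "correction", "dedup-pe", "dedup-se",
              "qualcut-pe", "qualcut-se", "ht-pe"]
MODE_ALIASES = {
    "all": ALL_MODES,  # assumption: README default, not in modes.py's table
    "all-pe": PE_MODES,
    "all-se": SE_MODES + ["stdin-stdout"],
    "all-feat": FEAT_MODES,
    "everything": ALL_MODES + FEAT_MODES,
}


def expand_modes(spec: str) -> list[str]:
    """Hypothetical helper: expand aliases, reject unknown names, keep first occurrences."""
    known = set(ALL_MODES) | set(FEAT_MODES)
    out: list[str] = []
    for token in spec.split(","):
        token = token.strip()
        for name in MODE_ALIASES.get(token, [token]):
            if name not in known:
                raise ValueError(f"unknown mode: {name!r}")
            if name not in out:
                out.append(name)
    return out


print(expand_modes("merge,all-pe,fq-fq"))
# ['merge', 'fq-fq', 'fq-gz', 'gz-fq', 'gz-gz']
```

Dropping duplicates while preserving order means a list such as `all-pe,fq-fq` would run each mode once, in the order first requested.
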
