
PromoterAI benchmarks#16

Merged
gonzalobenegas merged 3 commits into main from promoterai-benchmarks on Dec 12, 2025

Conversation

@gonzalobenegas (Member)

[image]

@gonzalobenegas gonzalobenegas merged commit 2fcf03e into main Dec 12, 2025
6 checks passed
@gonzalobenegas gonzalobenegas deleted the promoterai-benchmarks branch December 12, 2025 23:34

claude bot commented Dec 12, 2025

PR Review: PromoterAI Benchmarks

Thank you for this contribution! This PR adds support for PromoterAI benchmark datasets to the evaluation pipeline. Overall, the implementation is well-structured and follows good practices. Below is my detailed review:


✅ Strengths

  1. Excellent Refactoring of Metrics: The unified metrics rule with a metric registry (lines 18-22 in metrics.smk) is a significant improvement over the previous separate rules. This follows DRY principles perfectly.

  2. Smart Batch Inference Optimization: The combined dataset groups concept is clever - it allows running inference once on concatenated datasets and then filtering results, which is much more efficient than processing each benchmark separately.

  3. Consistent Architecture: The new promoterai_benchmarks.smk follows the same patterns as sat_mut_mpra.smk, making the codebase maintainable.

  4. Clean Configuration: The config changes are well-organized with clear structure for promoterai_benchmarks and combined_dataset_groups.


⚠️ Issues & Concerns

1. CRITICAL: Removed get_all_metric_files() from rule all (Snakefile:95)

Issue: The rule all now only includes correlation files, not metric files:

rule all:
    input:
        get_all_correlation_files(),  # Missing get_all_metric_files()

Impact: Individual metric files won't be generated by default. The correlation rules depend on metric files (via aggregate_metrics), so this might still work due to Snakemake's dependency resolution, but it's semantically incorrect and could cause issues.

Fix: Restore the original:

rule all:
    input:
        get_all_metric_files(),
        get_all_correlation_files(),

2. Configuration Change: torch_compile: True (config.yaml:49)

Issue: The comment says "consider if overhead is worth it" but you changed it from False to True.

Concern: For the original small datasets (TraitGym, sat_mut_mpra), torch compile overhead might not be worth it. The PromoterAI combined datasets are larger, so this might make sense now, but:

  • This affects ALL models globally
  • The comment indicates uncertainty
  • No justification provided in PR description

Recommendation: Either:

  1. Keep as False for now and benchmark first
  2. Add a comment explaining why you changed it
  3. Make it dataset-specific if some datasets benefit and others don't
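If option 3 is taken, the resolution logic could look something like the sketch below; both config keys are hypothetical, chosen only to illustrate a per-dataset override with a global fallback:

```python
def use_torch_compile(config: dict, dataset: str) -> bool:
    """Resolve torch_compile per dataset, falling back to the global
    default. Both config keys here are hypothetical."""
    overrides = config.get("torch_compile_overrides", {})
    if dataset in overrides:
        return bool(overrides[dataset])
    return bool(config.get("torch_compile", False))

# Example: keep the conservative global default, opt in only the large
# combined dataset where compilation overhead is likely amortized.
cfg = {
    "torch_compile": False,
    "torch_compile_overrides": {"promoterai_combined": True},
}
```

This keeps the global default conservative while letting the larger PromoterAI runs opt in explicitly.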

3. Configuration Change: per_device_batch_size: 512 (config.yaml:48)

Issue: Increased from 128 to 512 (4x increase).

Concerns:

  • Will this fit in GPU memory for all models?
  • Was this tested across all checkpoints?
  • No justification in PR

Recommendation: Either document the rationale or revert to 128 unless you've confirmed it works.
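If the larger batch size is kept, one defensive option is to halve it and retry on out-of-memory failures rather than aborting the whole run. A framework-agnostic sketch (using MemoryError as a stand-in for whatever OOM exception the inference framework raises):

```python
def run_with_backoff(run_batch, batch_size, min_size=1):
    """Retry run_batch with progressively halved batch sizes on OOM.

    run_batch: callable taking a batch size and returning its result.
    Returns (result, batch_size_that_worked).
    """
    while batch_size >= min_size:
        try:
            return run_batch(batch_size), batch_size
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("Batch size fell below minimum without succeeding")
```

With this in place, starting at 512 degrades gracefully to 256 or 128 on smaller GPUs instead of crashing the pipeline.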


4. Potential Data Integrity Issue: Positional Filtering (metrics.smk:66-69)

Code:

if 'dataset' in df_dataset.columns:
    mask = df_dataset['dataset'] == wildcards.dataset
    df_dataset = df_dataset[mask]
    df_pred = df_pred[mask]  # Apply same positional mask

Issue: This assumes df_pred has the exact same row ordering as df_dataset. If the prediction generation or parquet read/write ever changes ordering, this will silently produce wrong results.

Fix: Add explicit validation:

if 'dataset' in df_dataset.columns:
    # Check row counts BEFORE masking: after applying the same mask to
    # both frames the lengths always match, so the check must come first
    assert len(df_dataset) == len(df_pred), \
        f"Row count mismatch: {len(df_dataset)} vs {len(df_pred)}"
    mask = df_dataset['dataset'] == wildcards.dataset
    df_dataset = df_dataset[mask].reset_index(drop=True)
    # Use the raw array so pandas doesn't align the mask by index labels
    df_pred = df_pred[mask.to_numpy()].reset_index(drop=True)

Even better, add coordinate columns to predictions for explicit matching.
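That coordinate-based matching could replace the positional mask with an explicit join; a sketch, with assumed coordinate column names:

```python
import pandas as pd

def filter_predictions(df_dataset, df_pred, dataset_name,
                       coords=("chrom", "pos", "ref", "alt")):
    """Select one dataset's rows by joining predictions on coordinate
    columns instead of relying on row order (column names assumed)."""
    subset = df_dataset[df_dataset["dataset"] == dataset_name]
    # validate="one_to_one" makes pandas raise on duplicate coordinates
    merged = subset.merge(df_pred, on=list(coords), validate="one_to_one")
    if len(merged) != len(subset):
        raise ValueError("Predictions missing for some variants")
    return merged
```

This is robust to any reordering introduced by parquet round-trips, at the cost of requiring the coordinate columns to be carried through the prediction outputs.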


5. Missing Error Handling: URL Download (promoterai_benchmarks.smk:12)

Code:

url = f"https://raw.githubusercontent.com/Illumina/PromoterAI/master/data/benchmark/{params.filename}"
V = pd.read_csv(url, sep="\t")

Issues:

  • No error handling for network failures
  • No validation that the downloaded file is correct
  • Will fail silently if GitHub is down or URL changes

Recommendation: Add basic error handling:

try:
    V = pd.read_csv(url, sep="\t")
    assert not V.empty, f"Empty dataset from {url}"
    assert 'consequence' in V.columns, f"Missing 'consequence' column in {url}"
except Exception as e:
    raise RuntimeError(f"Failed to download {url}: {e}")

6. Unclear Label Logic (promoterai_benchmarks.smk:17-20)

Code:

V_grouped = V.groupby(COORDINATES, as_index=False).agg({
    "consequence": lambda x: (x != "none").any(),
})
V_grouped = V_grouped.rename(columns={"consequence": "label"})

Questions:

  • Why is grouping needed? Are there duplicate variants?
  • Is the any() strategy correct? (True if ANY gene has a non-"none" consequence)
  • Should this be documented?

Recommendation: Add a comment explaining the logic:

# Group by coordinates since variants can be near multiple genes
# Label is True if the variant has a functional consequence 
# (not "none") for ANY associated gene
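A tiny worked example makes the any() semantics concrete; the coordinate columns and values here are assumed, not taken from the actual benchmark files:

```python
import pandas as pd

COORDINATES = ["chrom", "pos", "ref", "alt"]  # assumed coordinate columns

V = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "pos": [100, 100, 200],
    "ref": ["A", "A", "C"],
    "alt": ["G", "G", "T"],
    # The same 1:100 A>G variant annotated against two genes:
    # one row "none", one functional consequence.
    "consequence": ["none", "missense", "none"],
})

V_grouped = V.groupby(COORDINATES, as_index=False).agg(
    label=("consequence", lambda x: (x != "none").any()),
)
# The duplicated 1:100 A>G variant collapses to one row with label=True;
# the 2:200 C>T variant keeps label=False.
```

Seeing the duplicate rows collapse like this is exactly the behavior the comment should document, so a reader can verify the grouping is intentional.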

🔍 Minor Issues

7. Inconsistent Scoring Configuration (config.yaml:89)

PromoterAI benchmarks use absLLR.plus.score but not LLR.minus.score. This is inconsistent with TraitGym Mendelian (which uses LLR.minus.score).

Question: Are you sure absLLR is the right choice for all PromoterAI benchmarks? Some might be directional (pathogenic vs benign) rather than magnitude-based.

Recommendation: Document why absLLR is appropriate for these datasets.


8. Missing Wildcard Constraints (sat_mut_mpra.smk:33)

The combine_sat_mut_mpra_datasets rule has:

wildcard_constraints:
    combined_group="sat_mut_mpra_combined"

But combine_promoterai_datasets also hardcodes:

wildcard_constraints:
    combined_group="promoterai_combined"

Issue: If you add more combined groups in the future, you'll need to update both rules.

Recommendation: Derive the constraint from the config so new groups are picked up automatically (note: no leading "|", which would add an empty alternative and let the regex match an empty string):

wildcard_constraints:
    combined_group="|".join(config.get("combined_dataset_groups", {}).keys())

📊 Testing & Validation

Missing Tests

Since this is in the experiments/ directory, comprehensive testing isn't required per CLAUDE.md. However, you should manually verify:

  • All 7 PromoterAI datasets download successfully
  • Combined dataset creation works correctly
  • Metrics compute without errors
  • Correlation plots include new datasets
  • Batch size of 512 works without OOM errors
  • Results make sense (no all-zero or all-NaN metrics)
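The last bullet can be checked mechanically; a quick sanity-check helper might look like the sketch below (the column name and the specific degeneracy checks are assumptions):

```python
import pandas as pd

def sanity_check_metrics(df, score_col="score"):
    """Flag degenerate results: an all-NaN or all-zero metric column."""
    s = df[score_col]
    if s.isna().all():
        raise ValueError(f"{score_col} is all NaN")
    if (s.fillna(0) == 0).all():
        raise ValueError(f"{score_col} is all zero")
    return True
```

Running something like this over each metrics parquet after the pipeline completes catches silently broken runs that a green Snakemake exit status would not.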

📝 Documentation

The PR body only contains an image with no description. Please add:

  1. What: Brief description of PromoterAI benchmarks
  2. Why: Why these benchmarks are valuable
  3. How: Summary of implementation approach (combined datasets, etc.)
  4. Testing: What testing was performed
  5. Performance: Any performance measurements with the new batch size/torch compile settings

🎯 Recommendations Summary

Must Fix (Blocking):

  1. ❌ Restore get_all_metric_files() to rule all
  2. ⚠️ Add validation to positional filtering in metrics rule

Should Fix (Strongly Recommended):

  1. Document/justify torch_compile: True change
  2. Document/justify per_device_batch_size: 512 change
  3. Add error handling to URL download
  4. Add comment explaining grouping/labeling logic

Nice to Have:

  1. Document why absLLR is appropriate for PromoterAI
  2. Make wildcard constraints more maintainable
  3. Add PR description

✨ Overall Assessment

This is good work with a clean architecture and smart optimizations. The metric registry refactoring alone is a valuable improvement. However, the configuration changes (torch_compile, batch_size) and the removal of get_all_metric_files() from rule all need attention before merging.

Recommendation: Request changes for items 1-6, then approve after fixes.

Let me know if you have questions about any of these points!

