
PromoterAI benchmarks#16

Merged
gonzalobenegas merged 3 commits into main from promoterai-benchmarks on Dec 12, 2025

Conversation

@gonzalobenegas (Member)

[image]

@gonzalobenegas gonzalobenegas merged commit 2fcf03e into main Dec 12, 2025
6 checks passed
@gonzalobenegas gonzalobenegas deleted the promoterai-benchmarks branch December 12, 2025 23:34

claude bot commented Dec 12, 2025

PR Review: PromoterAI Benchmarks

Thank you for this contribution! This PR adds support for PromoterAI benchmark datasets to the evaluation pipeline. Overall, the implementation is well-structured and follows good practices. Below is my detailed review:


✅ Strengths

  1. Excellent Refactoring of Metrics: The unified metrics rule with a metric registry (lines 18-22 in metrics.smk) is a significant improvement over the previous separate rules. This follows DRY principles perfectly.

  2. Smart Batch Inference Optimization: The combined dataset groups concept is clever - it allows running inference once on concatenated datasets and then filtering results, which is much more efficient than processing each benchmark separately.

  3. Consistent Architecture: The new promoterai_benchmarks.smk follows the same patterns as sat_mut_mpra.smk, making the codebase maintainable.

  4. Clean Configuration: The config changes are well-organized with clear structure for promoterai_benchmarks and combined_dataset_groups.


⚠️ Issues & Concerns

1. CRITICAL: Removed get_all_metric_files() from rule all (Snakefile:95)

Issue: The rule all now only includes correlation files, not metric files:

rule all:
    input:
        get_all_correlation_files(),  # Missing get_all_metric_files()

Impact: Individual metric files won't be generated by default. The correlation rules depend on metric files (via aggregate_metrics), so this might still work due to Snakemake's dependency resolution, but it's semantically incorrect and could cause issues.

Fix: Restore the original:

rule all:
    input:
        get_all_metric_files(),
        get_all_correlation_files(),

2. Configuration Change: torch_compile: True (config.yaml:49)

Issue: The comment says "consider if overhead is worth it" but you changed it from False to True.

Concern: For the original small datasets (TraitGym, sat_mut_mpra), torch compile overhead might not be worth it. The PromoterAI combined datasets are larger, so this might make sense now, but:

  • This affects ALL models globally
  • The comment indicates uncertainty
  • No justification provided in PR description

Recommendation: Either:

  1. Keep as False for now and benchmark first
  2. Add a comment explaining why you changed it
  3. Make it dataset-specific if some datasets benefit and others don't
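If option 3 is taken, the resolution logic could look something like the sketch below; both config keys are hypothetical, chosen only to illustrate a per-dataset override with a global fallback:

```python
def use_torch_compile(config: dict, dataset: str) -> bool:
    """Resolve torch_compile per dataset, falling back to the global
    default. Both config keys here are hypothetical."""
    overrides = config.get("torch_compile_overrides", {})
    if dataset in overrides:
        return bool(overrides[dataset])
    return bool(config.get("torch_compile", False))

# Example: keep the conservative global default, opt in only the large
# combined dataset where compilation overhead is likely amortized.
cfg = {
    "torch_compile": False,
    "torch_compile_overrides": {"promoterai_combined": True},
}
```

This keeps the global default conservative while letting the larger PromoterAI runs opt in explicitly.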

3. Configuration Change: per_device_batch_size: 512 (config.yaml:48)

Issue: Increased from 128 to 512 (4x increase).

Concerns:

  • Will this fit in GPU memory for all models?
  • Was this tested across all checkpoints?
  • No justification in PR

Recommendation: Either document the rationale or revert to 128 unless you've confirmed it works.
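If the larger batch size is kept, one defensive option is to halve it and retry on out-of-memory failures rather than aborting the whole run. A framework-agnostic sketch (using MemoryError as a stand-in for whatever OOM exception the inference framework raises):

```python
def run_with_backoff(run_batch, batch_size, min_size=1):
    """Retry run_batch with progressively halved batch sizes on OOM.

    run_batch: callable taking a batch size and returning its result.
    Returns (result, batch_size_that_worked).
    """
    while batch_size >= min_size:
        try:
            return run_batch(batch_size), batch_size
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("Batch size fell below minimum without succeeding")
```

With this in place, starting at 512 degrades gracefully to 256 or 128 on smaller GPUs instead of crashing the pipeline.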


4. Potential Data Integrity Issue: Positional Filtering (metrics.smk:66-69)

Code:

if 'dataset' in df_dataset.columns:
    mask = df_dataset['dataset'] == wildcards.dataset
    df_dataset = df_dataset[mask]
    df_pred = df_pred[mask]  # Apply same positional mask

Issue: This assumes df_pred has the exact same row ordering as df_dataset. If the prediction generation or parquet read/write ever changes ordering, this will silently produce wrong results.

Fix: Add explicit validation:

if 'dataset' in df_dataset.columns:
    # Check row counts BEFORE masking: after applying the same mask to
    # both frames the lengths always match, so the check must come first
    assert len(df_dataset) == len(df_pred), \
        f"Row count mismatch: {len(df_dataset)} vs {len(df_pred)}"
    mask = df_dataset['dataset'] == wildcards.dataset
    df_dataset = df_dataset[mask].reset_index(drop=True)
    # Use the raw array so pandas doesn't align the mask by index labels
    df_pred = df_pred[mask.to_numpy()].reset_index(drop=True)

Even better, add coordinate columns to predictions for explicit matching.
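That coordinate-based matching could replace the positional mask with an explicit join; a sketch, with assumed coordinate column names:

```python
import pandas as pd

def filter_predictions(df_dataset, df_pred, dataset_name,
                       coords=("chrom", "pos", "ref", "alt")):
    """Select one dataset's rows by joining predictions on coordinate
    columns instead of relying on row order (column names assumed)."""
    subset = df_dataset[df_dataset["dataset"] == dataset_name]
    # validate="one_to_one" makes pandas raise on duplicate coordinates
    merged = subset.merge(df_pred, on=list(coords), validate="one_to_one")
    if len(merged) != len(subset):
        raise ValueError("Predictions missing for some variants")
    return merged
```

This is robust to any reordering introduced by parquet round-trips, at the cost of requiring the coordinate columns to be carried through the prediction outputs.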


5. Missing Error Handling: URL Download (promoterai_benchmarks.smk:12)

Code:

url = f"https://raw.githubusercontent.com/Illumina/PromoterAI/master/data/benchmark/{params.filename}"
V = pd.read_csv(url, sep="\t")

Issues:

  • No error handling for network failures
  • No validation that the downloaded file is correct
  • Will fail silently if GitHub is down or URL changes

Recommendation: Add basic error handling:

try:
    V = pd.read_csv(url, sep="\t")
    assert not V.empty, f"Empty dataset from {url}"
    assert 'consequence' in V.columns, f"Missing 'consequence' column in {url}"
except Exception as e:
    raise RuntimeError(f"Failed to download {url}: {e}")

6. Unclear Label Logic (promoterai_benchmarks.smk:17-20)

Code:

V_grouped = V.groupby(COORDINATES, as_index=False).agg({
    "consequence": lambda x: (x != "none").any(),
})
V_grouped = V_grouped.rename(columns={"consequence": "label"})

Questions:

  • Why is grouping needed? Are there duplicate variants?
  • Is the any() strategy correct? (True if ANY gene has a non-"none" consequence)
  • Should this be documented?

Recommendation: Add a comment explaining the logic:

# Group by coordinates since variants can be near multiple genes
# Label is True if the variant has a functional consequence 
# (not "none") for ANY associated gene
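A tiny worked example makes the any() semantics concrete; the coordinate columns and values here are assumed, not taken from the actual benchmark files:

```python
import pandas as pd

COORDINATES = ["chrom", "pos", "ref", "alt"]  # assumed coordinate columns

V = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "pos": [100, 100, 200],
    "ref": ["A", "A", "C"],
    "alt": ["G", "G", "T"],
    # The same 1:100 A>G variant annotated against two genes:
    # one row "none", one functional consequence.
    "consequence": ["none", "missense", "none"],
})

V_grouped = V.groupby(COORDINATES, as_index=False).agg(
    label=("consequence", lambda x: (x != "none").any()),
)
# The duplicated 1:100 A>G variant collapses to one row with label=True;
# the 2:200 C>T variant keeps label=False.
```

Seeing the duplicate rows collapse like this is exactly the behavior the comment should document, so a reader can verify the grouping is intentional.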

🔍 Minor Issues

7. Inconsistent Scoring Configuration (config.yaml:89)

PromoterAI benchmarks use absLLR.plus.score but not LLR.minus.score. This is inconsistent with TraitGym Mendelian (which uses LLR.minus.score).

Question: Are you sure absLLR is the right choice for all PromoterAI benchmarks? Some might be directional (pathogenic vs benign) rather than magnitude-based.

Recommendation: Document why absLLR is appropriate for these datasets.


8. Missing Wildcard Constraints (sat_mut_mpra.smk:33)

The combine_sat_mut_mpra_datasets rule has:

wildcard_constraints:
    combined_group="sat_mut_mpra_combined"

But combine_promoterai_datasets also hardcodes:

wildcard_constraints:
    combined_group="promoterai_combined"

Issue: If you add more combined groups in the future, you'll need to update both rules.

Recommendation: Derive the constraint from the config so new groups are picked up automatically (note: no leading "|", which would add an empty alternative and let the regex match an empty string):

wildcard_constraints:
    combined_group="|".join(config.get("combined_dataset_groups", {}).keys())

📊 Testing & Validation

Missing Tests

Since this is in the experiments/ directory, comprehensive testing isn't required per CLAUDE.md. However, you should manually verify:

  • All 7 PromoterAI datasets download successfully
  • Combined dataset creation works correctly
  • Metrics compute without errors
  • Correlation plots include new datasets
  • Batch size of 512 works without OOM errors
  • Results make sense (no all-zero or all-NaN metrics)
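The last bullet can be checked mechanically; a quick sanity-check helper might look like the sketch below (the column name and the specific degeneracy checks are assumptions):

```python
import pandas as pd

def sanity_check_metrics(df, score_col="score"):
    """Flag degenerate results: an all-NaN or all-zero metric column."""
    s = df[score_col]
    if s.isna().all():
        raise ValueError(f"{score_col} is all NaN")
    if (s.fillna(0) == 0).all():
        raise ValueError(f"{score_col} is all zero")
    return True
```

Running something like this over each metrics parquet after the pipeline completes catches silently broken runs that a green Snakemake exit status would not.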

📝 Documentation

The PR body only contains an image with no description. Please add:

  1. What: Brief description of PromoterAI benchmarks
  2. Why: Why these benchmarks are valuable
  3. How: Summary of implementation approach (combined datasets, etc.)
  4. Testing: What testing was performed
  5. Performance: Any performance measurements with the new batch size/torch compile settings

🎯 Recommendations Summary

Must Fix (Blocking):

  1. ❌ Restore get_all_metric_files() to rule all
  2. ⚠️ Add validation to positional filtering in metrics rule

Should Fix (Strongly Recommended):

  1. Document/justify torch_compile: True change
  2. Document/justify per_device_batch_size: 512 change
  3. Add error handling to URL download
  4. Add comment explaining grouping/labeling logic

Nice to Have:

  1. Document why absLLR is appropriate for PromoterAI
  2. Make wildcard constraints more maintainable
  3. Add PR description

✨ Overall Assessment

This is good work with a clean architecture and smart optimizations. The metric registry refactoring alone is a valuable improvement. However, the configuration changes (torch_compile, batch_size) and the removal of get_all_metric_files() from rule all need attention before merging.

Recommendation: Request changes for items 1-6, then approve after fixes.

Let me know if you have questions about any of these points!

