Thank you for this contribution! This PR adds support for PromoterAI benchmark datasets to the evaluation pipeline. Overall, the implementation is well-structured and follows good practices. Below is my detailed review:
✅ Strengths
Excellent Refactoring of Metrics: The unified metrics rule with a metric registry (lines 18-22 in metrics.smk) is a significant improvement over the previous separate rules. This follows DRY principles perfectly.
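As an illustration, a registry of that shape might look like the following sketch. The names here are hypothetical, not the PR's actual code, and the real registry in metrics.smk may use scipy.stats instead:

```python
import numpy as np

# Hypothetical metric registry: metric name -> function(y_true, y_pred).
def _pearson(y, p):
    return float(np.corrcoef(y, p)[0, 1])

def _spearman(y, p):
    # Spearman correlation = Pearson correlation on the ranks (no tie handling).
    rank = lambda a: np.argsort(np.argsort(a))
    return _pearson(rank(y), rank(p))

METRICS = {"pearson": _pearson, "spearman": _spearman}

def compute_metric(name, y_true, y_pred):
    # A single rule can dispatch on the metric wildcard
    # instead of needing one rule per metric.
    return METRICS[name](np.asarray(y_true, float), np.asarray(y_pred, float))
```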
Smart Batch Inference Optimization: The combined dataset groups concept is clever - it allows running inference once on concatenated datasets and then filtering results, which is much more efficient than processing each benchmark separately.
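The combined-group idea can be sketched as below. Function and column names (run_group_inference, dataset, score) are assumptions for illustration, not the PR's actual identifiers:

```python
import pandas as pd

# Concatenate member datasets with a 'dataset' tag, run inference once on the
# combined table, then split the scored rows back out per benchmark.
def run_group_inference(datasets: dict, predict_fn):
    combined = pd.concat(
        [df.assign(dataset=name) for name, df in datasets.items()],
        ignore_index=True,
    )
    combined["score"] = predict_fn(combined)  # one batched inference pass
    # Filter results back to the individual benchmarks.
    return {
        name: group.drop(columns="dataset")
        for name, group in combined.groupby("dataset")
    }
```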
Consistent Architecture: The new promoterai_benchmarks.smk follows the same patterns as sat_mut_mpra.smk, making the codebase maintainable.
Clean Configuration: The config changes are well-organized with clear structure for promoterai_benchmarks and combined_dataset_groups.
⚠️ Issues & Concerns
1. CRITICAL: Removed get_all_metric_files() from rule all (Snakefile:95)
Issue: The rule all now only includes correlation files, not metric files.
Impact: Individual metric files won't be generated by default. The correlation rules depend on metric files (via aggregate_metrics), so this might still work due to Snakemake's dependency resolution, but it's semantically incorrect and could cause issues.
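To make the intent explicit, rule all could list both target sets directly. A sketch (get_all_correlation_files is a placeholder for whatever helper the Snakefile actually uses):

```
rule all:
    input:
        get_all_metric_files(),
        get_all_correlation_files(),
```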
2. Questionable Config Change: torch_compile Set to True
Issue: The comment next to the setting says "consider if overhead is worth it", yet the value was changed from False to True.
Concern: For the original small datasets (TraitGym, sat_mut_mpra), torch compile overhead might not be worth it. The PromoterAI combined datasets are larger, so this might make sense now, but:
- This affects ALL models globally.
- The comment itself indicates uncertainty.
- No justification is provided in the PR description.
Recommendation: Either:
- Keep it as False for now and benchmark first.
- Add a comment explaining why you changed it.
- Make it dataset-specific if some datasets benefit and others don't.
3. Unexplained Batch Size Change
Recommendation: Either document the rationale or revert to 128 unless you've confirmed the new value works.
4. Potential Data Integrity Issue: Positional Filtering (metrics.smk:66-69)
Code:
```python
if 'dataset' in df_dataset.columns:
    mask = df_dataset['dataset'] == wildcards.dataset
    df_dataset = df_dataset[mask]
    df_pred = df_pred[mask]  # Apply same positional mask
```
Issue: This assumes df_pred has the exact same row ordering as df_dataset. If the prediction generation or parquet read/write ever changes ordering, this will silently produce wrong results.
Fix: Add explicit validation:
```python
if 'dataset' in df_dataset.columns:
    mask = df_dataset['dataset'] == wildcards.dataset
    df_dataset = df_dataset[mask].reset_index(drop=True)
    df_pred = df_pred[mask].reset_index(drop=True)
    # Sanity check: verify same length
    assert len(df_dataset) == len(df_pred), \
        f"Mismatch: {len(df_dataset)} vs {len(df_pred)}"
```
Even better, add coordinate columns to predictions for explicit matching.
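A sketch of that coordinate-based matching follows. The key and score column names (chrom/pos/ref/alt, score) are assumptions about the schema, not the PR's actual columns:

```python
import pandas as pd

KEYS = ["chrom", "pos", "ref", "alt"]

def align_predictions(df_dataset, df_pred):
    # Join on explicit variant coordinates instead of relying on row order;
    # validate="one_to_one" raises if either side has duplicate keys.
    merged = df_dataset.merge(df_pred, on=KEYS, how="left", validate="one_to_one")
    # Fail loudly if any variant is missing a prediction.
    assert merged["score"].notna().all(), "some variants have no prediction"
    return merged
```

This way a reordered prediction file produces correct results (or a hard error) rather than silently mismatched rows.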
5. Undocumented Grouping Logic
- Why is grouping needed? Are there duplicate variants?
- Is the any() strategy correct? (The label is True if ANY gene has a non-"none" consequence.)
- Should this be documented?
Recommendation: Add a comment explaining the logic:
```python
# Group by coordinates since variants can be near multiple genes.
# Label is True if the variant has a functional consequence
# (not "none") for ANY associated gene.
```
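A runnable sketch of that aggregation, with assumed column names (chrom/pos/ref/alt, consequence), might look like:

```python
import pandas as pd

def aggregate_labels(df):
    # Collapse per-gene rows to one row per variant; the variant is labelled
    # True if ANY gene-level consequence is something other than "none".
    return (
        df.assign(label=df["consequence"] != "none")
          .groupby(["chrom", "pos", "ref", "alt"], as_index=False)["label"]
          .any()
    )
```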
6. Inconsistent Score Column: absLLR vs. LLR
PromoterAI benchmarks use absLLR.plus.score rather than LLR.minus.score, which is inconsistent with TraitGym Mendelian (which uses LLR.minus.score).
Question: Are you sure absLLR is the right choice for all PromoterAI benchmarks? Some might be directional (pathogenic vs benign) rather than magnitude-based.
Recommendation: Document why absLLR is appropriate for these datasets.
This is good work with a clean architecture and smart optimizations. The metric registry refactoring alone is a valuable improvement. However, the configuration changes (torch_compile, batch_size) and the removal of get_all_metric_files() from rule all need attention before merging.
Recommendation: Request changes for items 1-6, then approve after fixes.
Let me know if you have questions about any of these points!