gonzalobenegas commented Dec 15, 2025
Code Review for PR #17: Explore embedding distance metric
Status: ✅ PR has been merged
Summary
This PR adds L2 Euclidean distance as a new scoring method alongside LLR and absLLR for variant effect prediction. The implementation includes:
✅ Strengths
1. Clean Integration with Existing Infrastructure
2. Comprehensive Scoring Coverage
3. Excellent Visualization Addition
4. Appropriate for Experiments Directory
🔍 Observations & Minor Issues
1. Commented-out Code in Snakefile
    rule all:
        input:
            # get_all_metric_files(),  # ← Commented out
            get_all_correlation_files(),
2. Import Additions
    from biofoundation.model.adapters.gpn import GPNMaskedLM, GPNEmbeddingModel
    from biofoundation.inference import run_llr_mlm, run_euclidean_distance
    from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel
3. Removed Top-Level
4. Hardcoded Dataset List in Visualization
    dataset_order = [
        'traitgym_mendelian_promoter',
        'traitgym_complex_promoter',
        # ... 17 more hardcoded names
    ]
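One way to avoid the hardcoded list, sketched here under assumptions: the helper `order_datasets`, its arguments, and the name `new_dataset` are hypothetical and not part of the PR. The idea is to keep a preferred order for known names and append any dataset that is not in that list, so a newly added dataset still gets plotted instead of being silently dropped.

```python
def order_datasets(available, preferred):
    """Return the available dataset names, preferred ones first.

    Names listed in `preferred` keep that order; any other available
    name is appended afterwards in alphabetical order.
    """
    available_set = set(available)
    preferred_set = set(preferred)
    known = [name for name in preferred if name in available_set]
    unknown = sorted(name for name in available_set if name not in preferred_set)
    return known + unknown


# Hypothetical inputs: `preferred` mirrors the hand-written list, while
# `available` would come from the config or from the result files on disk.
preferred = ["traitgym_mendelian_promoter", "traitgym_complex_promoter"]
available = ["new_dataset", "traitgym_complex_promoter", "traitgym_mendelian_promoter"]
dataset_order = order_datasets(available, preferred)
```

With this approach the preferred list only controls ordering; it no longer decides which datasets appear.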
5. Plot Grid Assumptions
    nrows, ncols = 4, 5  # Hardcoded 4x5 grid
    axes[-1].axis('off')  # Assumes exactly 19 datasets (20 - 1)
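The grid assumption could be relaxed by deriving the layout from the number of datasets. The following is a minimal sketch, not the PR's code: `grid_layout` is a hypothetical helper that returns the grid shape together with the indices of the cells left empty, which are the ones a plotting script would switch off.

```python
import math


def grid_layout(n_panels, ncols=5):
    """Return (nrows, ncols, unused) for a grid that fits n_panels plots.

    `unused` lists the flat indices of grid cells that hold no plot.
    """
    if n_panels < 1:
        raise ValueError("n_panels must be at least 1")
    ncols = min(ncols, n_panels)          # do not allocate more columns than panels
    nrows = math.ceil(n_panels / ncols)   # enough rows to hold every panel
    unused = list(range(n_panels, nrows * ncols))
    return nrows, ncols, unused


# For the 19 datasets mentioned above this reproduces the 4x5 grid
# with a single empty cell at flat index 19.
nrows, ncols, unused = grid_layout(19)
```

In a matplotlib script, the flattened axes array could then be indexed with each entry of `unused` and hidden via `axis('off')`, instead of assuming that only the last cell is empty.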
🎯 Technical Correctness
Model Usage: ✅
    model = GPNEmbeddingModel(AutoModel.from_pretrained(model_name), layer="last")
    euclidean_distance = run_euclidean_distance(...)
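For readers unfamiliar with the metric, here is a small illustration of an L2 distance between reference and alternate embeddings. It is an assumption-laden sketch: the review does not say how `run_euclidean_distance` aggregates over sequence positions, so the hypothetical `l2_distance` below simply treats each embedding as one flattened vector (positions × hidden dimensions).

```python
import math


def l2_distance(ref_embedding, alt_embedding):
    """Euclidean distance between two embeddings given as lists of rows.

    Each embedding is a list of per-position vectors. Assumption: all
    positions are pooled into one flattened vector before the norm.
    """
    if len(ref_embedding) != len(alt_embedding):
        raise ValueError("embeddings must cover the same number of positions")
    squared = 0.0
    for ref_row, alt_row in zip(ref_embedding, alt_embedding):
        if len(ref_row) != len(alt_row):
            raise ValueError("embedding rows must have the same dimension")
        squared += sum((r - a) ** 2 for r, a in zip(ref_row, alt_row))
    return math.sqrt(squared)


# Toy example: the two embeddings differ only at the second position.
ref = [[1.0, 2.0], [3.0, 4.0]]
alt = [[1.0, 2.0], [0.0, 0.0]]
score = l2_distance(ref, alt)  # sqrt(3^2 + 4^2)
```

Unlike LLR, this score is non-negative and carries no direction, which makes it more directly comparable to absLLR than to signed LLR.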
Threading: ✅
Data Format: ✅
📊 Scientific Validity
L2 Distance as Variant Effect Metric:
Potential Concern:
🧪 Testing
Missing Tests (acceptable for experiments/):
Reproducibility: ✅
📝 Documentation
Missing Documentation:
For experiments directory: This level of documentation is acceptable ✅
🚀 Performance Considerations
Computational Cost:
Disk Usage:
🔒 Security
No security concerns: all inputs come from config, with no user input and no external API calls ✅
✅ Checklist (Agentic Git Flow)
Since the PR is merged, this is retrospective:
🎓 Recommendations for Future Work
📊 Final Assessment
Overall Quality: ★★★★☆ (4/5)
Recommendation: ✅ Approved (already merged, retroactive review)
This is solid exploratory work that follows project patterns and adds valuable scientific insight. The new visualization will clearly show whether embedding distance (L2) is competitive with likelihood-based scores (LLR, absLLR) for variant effect prediction.
The minor issues (commented code, hardcoded lists) are acceptable for the experiments directory per CLAUDE.md guidelines. If this analysis becomes production code, those would need addressing.
Great work on the visualization design - the 4×5 grid with color-coded scoring methods will make it immediately obvious which scoring method works best for each dataset type! 📈