
Explore embedding distance metric #17

Merged

gonzalobenegas merged 3 commits into main from embed-dist on Dec 15, 2025
Conversation


gonzalobenegas merged commit fdda7be into main on Dec 15, 2025
5 checks passed
gonzalobenegas deleted the embed-dist branch on December 15, 2025 at 17:10

claude bot commented Dec 15, 2025

Code Review for PR #17: Explore embedding distance metric

Status: ✅ PR has been merged

Summary

This PR adds L2 Euclidean distance as a new scoring method alongside LLR and absLLR for variant effect prediction. The implementation includes:

  • New model_euclidean_distance Snakemake rule using GPNEmbeddingModel
  • Updated all dataset configs to compute all three scoring methods (LLR.minus, absLLR.plus, L2.plus)
  • New visualization rule plot_scorings_vs_step for comparing scoring methods across training checkpoints
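
For readers unfamiliar with the repo, here is a rough sketch of what such a rule might look like, modeled on the model_llr pattern the review mentions. The paths, wildcards, and inference call shown are illustrative assumptions, not the merged code:

```python
# Snakemake syntax; all paths and wildcards below are illustrative, not the actual rule
rule model_euclidean_distance:
    input:
        "results/variants/{dataset}.parquet",
    output:
        "results/features/{dataset}/{model}/L2.parquet",
    threads:
        workflow.cores
    run:
        import pandas as pd
        from transformers import AutoModel
        from biofoundation.model.adapters.gpn import GPNEmbeddingModel
        from biofoundation.inference import run_euclidean_distance

        # Wrap the base model to expose last-layer embeddings, as described in the review
        model = GPNEmbeddingModel(AutoModel.from_pretrained(wildcards.model), layer="last")
        scores = run_euclidean_distance(...)  # same bf16/batch-size/worker kwargs as the LLR rule
        pd.DataFrame(scores, columns=["score"]).to_parquet(output[0], index=False)
```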

✅ Strengths

1. Clean Integration with Existing Infrastructure

  • Follows the same pattern as the model_llr rule - consistent with the rest of the codebase
  • Properly uses GPNEmbeddingModel and run_euclidean_distance from biofoundation library
  • Configuration changes are minimal and well-structured

2. Comprehensive Scoring Coverage

  • All datasets now evaluate all three scoring methods, enabling direct comparison
  • Good scientific rigor - testing multiple hypotheses about what makes a good variant effect score

3. Excellent Visualization Addition

  • The plot_scorings_vs_step rule (lines 334-472 in metrics.smk) is well-designed:
    • 4x5 grid layout accommodates all 19 datasets logically grouped (classification tasks first, then regression)
    • Color-coded scoring methods with distinct markers
    • Clean helper functions (get_primary_metric, clean_dataset_name)
    • Professional publication-quality output (PNG + PDF)

4. Appropriate for Experiments Directory

  • This exploratory analysis fits the experiments/ guidelines perfectly
  • Less stringent standards are acceptable here vs main package
  • Snakemake workflow ensures reproducibility

🔍 Observations & Minor Issues

1. Commented-out Code in Snakefile

rule all:
    input:
        # get_all_metric_files(),  # ← Commented out
        get_all_correlation_files(),
  • This changes the default target from computing all metrics to only computing correlations
  • Question: Is this intentional? If so, consider adding a comment explaining why
  • Users expecting snakemake --cores all to compute metrics will be surprised

2. Import Additions

from biofoundation.model.adapters.gpn import GPNMaskedLM, GPNEmbeddingModel
from biofoundation.inference import run_llr_mlm, run_euclidean_distance
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel
  • All imports are used - good!
  • Follows existing import patterns

3. Removed Top-Level scorings Config

  • The PR removes the unused top-level scorings: list from config.yaml
  • Good cleanup - scoring methods are now properly defined per-dataset in dataset_configs

4. Hardcoded Dataset List in Visualization

dataset_order = [
    'traitgym_mendelian_promoter',
    'traitgym_complex_promoter',
    # ... 17 more hardcoded names
]
  • Minor concern: If new datasets are added, this list must be manually updated
  • Suggestion: Could dynamically generate from df_primary['dataset'].unique() and sort by category
  • But: For experiments directory, hardcoding is acceptable for one-off analysis

5. Plot Grid Assumptions

nrows, ncols = 4, 5  # Hardcoded 4x5 grid
axes[-1].axis('off')  # Assumes exactly 19 datasets (20 - 1)
  • Works perfectly for current 19 datasets
  • Will break silently if dataset count changes significantly
  • For experiments code, this is acceptable
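
If the dataset count ever changes, the grid shape and the number of spare axes could be derived rather than hardcoded. A minimal sketch (function and variable names are illustrative, not from the PR):

```python
import math

def grid_shape(n_panels: int, ncols: int = 5) -> tuple[int, int]:
    """Smallest nrows x ncols grid that fits n_panels panels."""
    return math.ceil(n_panels / ncols), ncols

nrows, ncols = grid_shape(19)
n_spare = nrows * ncols - 19  # axes left over after placing every dataset
print(nrows, ncols, n_spare)  # 4 5 1
```

The plotting code could then disable every spare axis with a loop over axes.flat[n_panels:] instead of assuming exactly one leftover panel.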

🎯 Technical Correctness

Model Usage: ✅

model = GPNEmbeddingModel(AutoModel.from_pretrained(model_name), layer="last")
euclidean_distance = run_euclidean_distance(...)
  • Correctly uses AutoModel (not AutoModelForMaskedLM)
  • layer="last" is sensible - uses final layer embeddings
  • Inference kwargs match other rules (bf16, batch size, workers)

Threading: ✅

  • Properly uses dataloader_num_workers=threads to parallelize across all available cores

Data Format: ✅

  • Output format matches existing patterns: pd.DataFrame(euclidean_distance, columns=["score"])

📊 Scientific Validity

L2 Distance as Variant Effect Metric:

  • Hypothesis: Variants with larger embedding distance from reference might have larger functional effects
  • This is a reasonable hypothesis to test
  • The L2.plus.score naming convention is consistent (higher = more effect)

Potential Concern:

  • L2 distance is unsigned by nature (always positive)
  • For pathogenicity prediction (TraitGym Mendelian), LLR sign matters (deleterious vs beneficial)
  • L2 will behave like absLLR - detecting effect magnitude, not directionality
  • Expected: L2 should correlate more strongly with absLLR than with LLR.minus on classification tasks
  • This is fine for exploration - the new visualization will reveal this!
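
A toy illustration of why L2 tracks magnitude rather than direction. The vectors here are made up, standing in for reference and alternate-allele embeddings, not real GPN output:

```python
import numpy as np

# Made-up embedding vectors; only the geometry matters for this point
ref = np.array([0.2, -0.5, 1.0])
delta = np.array([0.3, 0.1, -0.2])
alt_up = ref + delta    # shift in one direction
alt_down = ref - delta  # the same shift, opposite direction

d_up = float(np.linalg.norm(alt_up - ref))
d_down = float(np.linalg.norm(alt_down - ref))
# d_up and d_down agree to numerical precision: L2 keeps the size of the
# change but discards its sign, exactly like absLLR and unlike LLR.minus
print(d_up, d_down)
```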

🧪 Testing

Missing Tests (acceptable for experiments/):

  • No unit tests for model_euclidean_distance rule
  • No validation that L2 scores are always positive
  • Per CLAUDE.md: "less stringent code standards" for experiments/ ✅
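
The missing positivity check would be a one-liner. A minimal sketch, with a made-up DataFrame standing in for a scores file read back from disk:

```python
import pandas as pd

# Stand-in for pd.read_parquet on a written scores file (values are made up)
scores = pd.DataFrame({"score": [0.12, 0.87, 0.05]})

assert (scores["score"] >= 0).all(), "L2 distances must be non-negative"
assert scores["score"].notna().all(), "no missing scores expected"
```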

Reproducibility: ✅

  • Snakemake ensures deterministic execution
  • All parameters in config.yaml

📝 Documentation

Missing Documentation:

  • No docstring for model_euclidean_distance rule
  • README.md might need updating to mention L2 scoring
  • CLAUDE.md experiments section could mention the three scoring methods

For experiments directory: This level of documentation is acceptable ✅


🚀 Performance Considerations

Computational Cost:

  • Adds roughly 50% more compute (3 scoring methods instead of 2)
  • L2 distance computation should be similar cost to LLR
  • Uses same batching/parallelization as existing rules ✅

Disk Usage:

  • Each dataset now has 1.5× the feature files (3 scorings instead of 2)
  • For 19 datasets × 8 checkpoints × 3 scorings = 456 feature files
  • Parquet format is space-efficient ✅
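
The file count is just the product of the sweep dimensions:

```python
n_datasets, n_checkpoints, n_scorings = 19, 8, 3
n_files = n_datasets * n_checkpoints * n_scorings
print(n_files)  # 456
```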

🔒 Security

No security concerns - all inputs from config, no user input, no external API calls ✅


✅ Checklist (Agentic Git Flow)

Since the PR is merged, this is retrospective:

  • ✅ Code follows existing patterns
  • ✅ No fake tests (N/A for experiments/)
  • ✅ Appropriate for experiments directory
  • ✅ Snakemake ensures reproducibility
  • ⚠️ Documentation minimal (acceptable for experiments/)
  • ⚠️ Commented code in Snakefile (minor issue)

🎓 Recommendations for Future Work

  1. Document the rule all change: Add comment explaining why only correlations are computed by default

  2. Dynamic dataset ordering: If this analysis becomes recurring, make the visualization more robust:

    # Get datasets dynamically and sort by category
    datasets = df_primary['dataset'].unique()
    classification = [d for d in datasets if not d.startswith('sat_mut_mpra')]
    regression = [d for d in datasets if d.startswith('sat_mut_mpra')]
    dataset_order = sorted(classification) + sorted(regression)
  3. Add README section: Document the three scoring methods and their expected behaviors

  4. Consider correlation analysis: The new L2 scores will produce interesting correlation data - might reveal which datasets benefit from embedding distance vs likelihood


📊 Final Assessment

Overall Quality: ★★★★☆ (4/5)

Recommendation: ✅ Approved (already merged, retroactive review)

This is solid exploratory work that follows project patterns and adds valuable scientific insight. The new visualization will clearly show whether embedding distance (L2) is competitive with likelihood-based scores (LLR, absLLR) for variant effect prediction.

The minor issues (commented code, hardcoded lists) are acceptable for the experiments directory per CLAUDE.md guidelines. If this analysis becomes production code, those would need addressing.

Great work on the visualization design - the 4×5 grid with color-coded scoring methods will make it immediately obvious which scoring method works best for each dataset type! 📈
