
Explore embedding distance metric #17

Merged

gonzalobenegas merged 3 commits into main from embed-dist on Dec 15, 2025
Conversation


gonzalobenegas merged commit fdda7be into main on Dec 15, 2025
5 checks passed
gonzalobenegas deleted the embed-dist branch on December 15, 2025 at 17:10

claude bot commented Dec 15, 2025

Code Review for PR #17: Explore embedding distance metric

Status: ✅ PR has been merged

Summary

This PR adds L2 Euclidean distance as a new scoring method alongside LLR and absLLR for variant effect prediction. The implementation includes:

  • New model_euclidean_distance Snakemake rule using GPNEmbeddingModel
  • Updated all dataset configs to compute all three scoring methods (LLR.minus, absLLR.plus, L2.plus)
  • New visualization rule plot_scorings_vs_step for comparing scoring methods across training checkpoints
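
For readers unfamiliar with the repo, here is a rough sketch of what such a rule might look like, modeled on the model_llr pattern the review mentions. The paths, wildcards, and inference call shown are illustrative assumptions, not the merged code:

```python
# Snakemake syntax; all paths and wildcards below are illustrative, not the actual rule
rule model_euclidean_distance:
    input:
        "results/variants/{dataset}.parquet",
    output:
        "results/features/{dataset}/{model}/L2.parquet",
    threads:
        workflow.cores
    run:
        import pandas as pd
        from transformers import AutoModel
        from biofoundation.model.adapters.gpn import GPNEmbeddingModel
        from biofoundation.inference import run_euclidean_distance

        # Wrap the base model to expose last-layer embeddings, as described in the review
        model = GPNEmbeddingModel(AutoModel.from_pretrained(wildcards.model), layer="last")
        scores = run_euclidean_distance(...)  # same bf16/batch-size/worker kwargs as the LLR rule
        pd.DataFrame(scores, columns=["score"]).to_parquet(output[0], index=False)
```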

✅ Strengths

1. Clean Integration with Existing Infrastructure

  • Follows the same pattern as the model_llr rule - consistent with the rest of the codebase
  • Properly uses GPNEmbeddingModel and run_euclidean_distance from biofoundation library
  • Configuration changes are minimal and well-structured

2. Comprehensive Scoring Coverage

  • All datasets now evaluate all three scoring methods, enabling direct comparison
  • Good scientific rigor - testing multiple hypotheses about what makes a good variant effect score

3. Excellent Visualization Addition

  • The plot_scorings_vs_step rule (lines 334-472 in metrics.smk) is well-designed:
    • 4x5 grid layout accommodates all 19 datasets logically grouped (classification tasks first, then regression)
    • Color-coded scoring methods with distinct markers
    • Clean helper functions (get_primary_metric, clean_dataset_name)
    • Professional publication-quality output (PNG + PDF)

4. Appropriate for Experiments Directory

  • This exploratory analysis fits the experiments/ guidelines perfectly
  • Less stringent standards are acceptable here vs main package
  • Snakemake workflow ensures reproducibility

🔍 Observations & Minor Issues

1. Commented-out Code in Snakefile

rule all:
    input:
        # get_all_metric_files(),  # ← Commented out
        get_all_correlation_files(),
  • This changes the default target from computing all metrics to only computing correlations
  • Question: Is this intentional? If so, consider adding a comment explaining why
  • Users expecting snakemake --cores all to compute metrics will be surprised

2. Import Additions

from biofoundation.model.adapters.gpn import GPNMaskedLM, GPNEmbeddingModel
from biofoundation.inference import run_llr_mlm, run_euclidean_distance
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel
  • All imports are used - good!
  • Follows existing import patterns

3. Removed Top-Level scorings Config

  • The PR removes the unused top-level scorings: list from config.yaml
  • Good cleanup - scoring methods are now properly defined per-dataset in dataset_configs

4. Hardcoded Dataset List in Visualization

dataset_order = [
    'traitgym_mendelian_promoter',
    'traitgym_complex_promoter',
    # ... 17 more hardcoded names
]
  • Minor concern: If new datasets are added, this list must be manually updated
  • Suggestion: Could dynamically generate from df_primary['dataset'].unique() and sort by category
  • But: For experiments directory, hardcoding is acceptable for one-off analysis

5. Plot Grid Assumptions

nrows, ncols = 4, 5  # Hardcoded 4x5 grid
axes[-1].axis('off')  # Assumes exactly 19 datasets (20 - 1)
  • Works perfectly for current 19 datasets
  • Will break silently if dataset count changes significantly
  • For experiments code, this is acceptable
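
If the dataset count ever changes, the grid shape and the number of spare axes could be derived rather than hardcoded. A minimal sketch (function and variable names are illustrative, not from the PR):

```python
import math

def grid_shape(n_panels: int, ncols: int = 5) -> tuple[int, int]:
    """Smallest nrows x ncols grid that fits n_panels panels."""
    return math.ceil(n_panels / ncols), ncols

nrows, ncols = grid_shape(19)
n_spare = nrows * ncols - 19  # axes left over after placing every dataset
print(nrows, ncols, n_spare)  # 4 5 1
```

The plotting code could then disable every spare axis with a loop over axes.flat[n_panels:] instead of assuming exactly one leftover panel.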

🎯 Technical Correctness

Model Usage: ✅

model = GPNEmbeddingModel(AutoModel.from_pretrained(model_name), layer="last")
euclidean_distance = run_euclidean_distance(...)
  • Correctly uses AutoModel (not AutoModelForMaskedLM)
  • layer="last" is sensible - uses final layer embeddings
  • Inference kwargs match other rules (bf16, batch size, workers)

Threading: ✅

  • Properly uses dataloader_num_workers=threads to parallelize across all available cores

Data Format: ✅

  • Output format matches existing patterns: pd.DataFrame(euclidean_distance, columns=["score"])

📊 Scientific Validity

L2 Distance as Variant Effect Metric:

  • Hypothesis: Variants with larger embedding distance from reference might have larger functional effects
  • This is a reasonable hypothesis to test
  • The L2.plus.score naming convention is consistent (higher = more effect)

Potential Concern:

  • L2 distance is unsigned by nature (always positive)
  • For pathogenicity prediction (TraitGym Mendelian), LLR sign matters (deleterious vs beneficial)
  • L2 will behave like absLLR - detecting effect magnitude, not directionality
  • Expected: L2 should correlate more strongly with absLLR than with LLR.minus on classification tasks
  • This is fine for exploration - the new visualization will reveal this!
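
A toy illustration of why L2 tracks magnitude rather than direction. The vectors here are made up, standing in for reference and alternate-allele embeddings, not real GPN output:

```python
import numpy as np

# Made-up embedding vectors; only the geometry matters for this point
ref = np.array([0.2, -0.5, 1.0])
delta = np.array([0.3, 0.1, -0.2])
alt_up = ref + delta    # shift in one direction
alt_down = ref - delta  # the same shift, opposite direction

d_up = float(np.linalg.norm(alt_up - ref))
d_down = float(np.linalg.norm(alt_down - ref))
# d_up and d_down agree to numerical precision: L2 keeps the size of the
# change but discards its sign, exactly like absLLR and unlike LLR.minus
print(d_up, d_down)
```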

🧪 Testing

Missing Tests (acceptable for experiments/):

  • No unit tests for model_euclidean_distance rule
  • No validation that L2 scores are always positive
  • Per CLAUDE.md: "less stringent code standards" for experiments/ ✅
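
The missing positivity check would be a one-liner. A minimal sketch, with a made-up DataFrame standing in for a scores file read back from disk:

```python
import pandas as pd

# Stand-in for pd.read_parquet on a written scores file (values are made up)
scores = pd.DataFrame({"score": [0.12, 0.87, 0.05]})

assert (scores["score"] >= 0).all(), "L2 distances must be non-negative"
assert scores["score"].notna().all(), "no missing scores expected"
```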

Reproducibility: ✅

  • Snakemake ensures deterministic execution
  • All parameters in config.yaml

📝 Documentation

Missing Documentation:

  • No docstring for model_euclidean_distance rule
  • README.md might need updating to mention L2 scoring
  • CLAUDE.md experiments section could mention the three scoring methods

For experiments directory: This level of documentation is acceptable ✅


🚀 Performance Considerations

Computational Cost:

  • Adds roughly 50% more compute (3 scoring methods instead of 2)
  • L2 distance computation should be similar cost to LLR
  • Uses same batching/parallelization as existing rules ✅

Disk Usage:

  • Each dataset now has 1.5× the feature files (3 scorings instead of 2)
  • For 19 datasets × 8 checkpoints × 3 scorings = 456 feature files
  • Parquet format is space-efficient ✅
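
The file count is just the product of the sweep dimensions:

```python
n_datasets, n_checkpoints, n_scorings = 19, 8, 3
n_files = n_datasets * n_checkpoints * n_scorings
print(n_files)  # 456
```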

🔒 Security

No security concerns - all inputs from config, no user input, no external API calls ✅


✅ Checklist (Agentic Git Flow)

Since the PR is merged, this is retrospective:

  • ✅ Code follows existing patterns
  • ✅ No fake tests (N/A for experiments/)
  • ✅ Appropriate for experiments directory
  • ✅ Snakemake ensures reproducibility
  • ⚠️ Documentation minimal (acceptable for experiments/)
  • ⚠️ Commented code in Snakefile (minor issue)

🎓 Recommendations for Future Work

  1. Document the rule all change: Add comment explaining why only correlations are computed by default

  2. Dynamic dataset ordering: If this analysis becomes recurring, make the visualization more robust:

    # Get datasets dynamically and sort by category
    datasets = df_primary['dataset'].unique()
    classification = [d for d in datasets if not d.startswith('sat_mut_mpra')]
    regression = [d for d in datasets if d.startswith('sat_mut_mpra')]
    dataset_order = sorted(classification) + sorted(regression)
  3. Add README section: Document the three scoring methods and their expected behaviors

  4. Consider correlation analysis: The new L2 scores will produce interesting correlation data - might reveal which datasets benefit from embedding distance vs likelihood


📊 Final Assessment

Overall Quality: ★★★★☆ (4/5)

Recommendation: ✅ Approved (already merged, retroactive review)

This is solid exploratory work that follows project patterns and adds valuable scientific insight. The new visualization will clearly show whether embedding distance (L2) is competitive with likelihood-based scores (LLR, absLLR) for variant effect prediction.

The minor issues (commented code, hardcoded lists) are acceptable for the experiments directory per CLAUDE.md guidelines. If this analysis becomes production code, those would need addressing.

Great work on the visualization design - the 4×5 grid with color-coded scoring methods will make it immediately obvious which scoring method works best for each dataset type! 📈
