Add plants training dataset and eval#10

Merged
gonzalobenegas merged 7 commits into main from plants on Dec 1, 2025
Conversation

@gonzalobenegas
Member

No description provided.

@gonzalobenegas
Member Author

@claude please review for critical bugs. be concise.

@claude

claude bot commented Dec 1, 2025

Claude finished @gonzalobenegas's task

Code Review for Critical Bugs

Critical Issues Found

1. CRITICAL: Race condition in eval metric ordering (lm_lit_module.py:227-236)

At lm_lit_module.py:227-236, the eval dataset index mapping relies on dictionary iteration order:

eval_idx = dataloader_idx - 1

if eval_idx >= len(self._eval_names):
    raise IndexError(...)

eval_name = self._eval_names[eval_idx]

But _eval_names is set from evals.keys() at line 97, which depends on dict insertion order. While Python 3.7+ dicts are ordered, the DataLoader iteration order is determined by the DataModule's val_dataloader() method (lm_datamodule.py:390), which iterates over self.eval_datasets.items().

Problem: If eval_datasets insertion order differs from the evals config order passed to LitModule, indices won't match.

Location:

  • lm_datamodule.py:390 - DataLoader ordering
  • lm_lit_module.py:97 - Metric name ordering
  • lm_lit_module.py:227 - Index mapping

Impact: Wrong metrics logged to wrong eval datasets (e.g., TraitGym AUPRC shown for PlantCAD data).

Fix: Ensure both use the same iteration order, or pass explicit ordering from DataModule to LitModule.
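A minimal sketch of the explicit-ordering fix. The class and attribute names below are simplified stand-ins (the real DataModule/LitModule carry much more machinery); the point is that the DataModule exposes a single source of truth for eval ordering and the LitModule consumes it rather than re-deriving it from config dict keys:

```python
# Hypothetical sketch: DataModule owns the eval ordering; LitModule receives it.
class DataModule:
    def __init__(self, eval_datasets: dict):
        self.eval_datasets = eval_datasets

    @property
    def eval_names(self) -> list[str]:
        # Single source of truth for ordering
        return list(self.eval_datasets.keys())

    def val_dataloaders(self):
        # Dataloaders are built in exactly the eval_names order
        return [self.eval_datasets[name] for name in self.eval_names]


class LitModule:
    def __init__(self, eval_names: list[str]):
        # Ordering comes from the DataModule, not from evals.keys() in config
        self._eval_names = eval_names
```

With this shape, `dataloader_idx - 1` indexes into the same list the dataloaders were built from, so a reordering in either place cannot silently misattribute metrics.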


2. Potential type annotation error (lm_datamodule.py:156)

self.eval_datasets: dict[str, any] = {}

Should be Any (capitalized, imported from typing), not the builtin any. Python won't raise at runtime because the annotation is never evaluated, but it is invalid typing syntax and static checkers such as mypy will reject it.

Location: lm_datamodule.py:156
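The corrected annotation would look like the following (the surrounding class is a minimal stand-in for illustration):

```python
from typing import Any


class EvalHolder:  # hypothetical minimal class, not the real DataModule
    def __init__(self) -> None:
        # `Any` from typing is the correct type; lowercase `any` is a builtin function
        self.eval_datasets: dict[str, Any] = {}
```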


Minor Issues

3. Missing validation for metric/label compatibility (lm_lit_module.py:283-284)

At line 283-284, metrics are computed without checking if the metric type matches the label type:

  • Binary classification metrics (AUPRC, AUROC) require binary labels
  • Regression metrics (Pearson, Spearman) work with continuous labels

If config specifies metrics: [auprc] but label_column: AF (continuous allele frequency), sklearn will error.

Suggestion: Add validation in _setup_eval_metrics or catch exceptions with better error messages.
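One possible shape for such a validation helper, as a sketch. The function name, the `BINARY_METRICS` set, and the metric-name strings are assumptions for illustration, not names from the actual metrics.py registry:

```python
import numpy as np

# Hypothetical: metric names that require binary (0/1) labels
BINARY_METRICS = {"auprc", "auroc"}


def validate_metric_labels(metric_name: str, labels: np.ndarray) -> None:
    """Fail fast with a clear message when a binary metric gets non-binary labels."""
    if metric_name in BINARY_METRICS:
        unique = np.unique(labels)
        if not np.isin(unique, [0, 1]).all():
            raise ValueError(
                f"Metric '{metric_name}' requires binary labels (0/1), "
                f"but label column contains values like {unique[:5].tolist()}"
            )
```

Calling this once per configured (metric, label_column) pair at setup time surfaces the mismatch with a readable error instead of a deep sklearn traceback mid-validation.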


4. Security note acknowledged (evals.py:69, 141)

# nosec B310 and # nosec B615 comments acknowledge urllib and dataset loading security concerns - acceptable for scientific code with trusted sources.


Summary

Critical bug: Eval dataset index mismatch could cause incorrect metric attribution.

Action needed: Verify eval_datasets dict ordering matches evals config ordering, or implement explicit index mapping.


@claude

claude bot commented Dec 1, 2025

Code Review for PR #10: Add plants training dataset and eval

I've conducted a comprehensive review of this PR following the CLAUDE.md standards. Overall, this is well-structured code that successfully generalizes the evaluation system from a TraitGym-specific implementation to a flexible, registry-based architecture. Here's my detailed feedback:


Strengths

1. Excellent Generalization & Architecture

  • Registry pattern: The EVAL_FILTERS and METRICS/TRANSFORMS registries are clean and extensible
  • Separation of concerns: evals.py handles data loading, metrics.py handles scoring, lm_lit_module.py orchestrates
  • Deletion of dead code: Removed traitgym.py and replaced hardcoded logic with configurable system ✅

2. Strong Test Coverage

  • Real tests, not fakes: Tests use actual HuggingFace datasets and models (no mocks!) ✅
  • Comprehensive coverage: Tests for transforms, metrics, filters, dataset loading, and Lightning module integration
  • Good test structure: Clear test names, proper fixtures, appropriate assertions

3. Good Documentation

  • Comprehensive docstrings with Google-style format
  • Type hints throughout
  • Config examples in YAML comments

4. Configuration Design

  • Clean migration from dict-based to list-based config for evals
  • Sensible defaults (label_column: "label", transform: "identity")
  • Good use of Hydra interpolation (evals: ${data.evals})

🔍 Issues Found

CRITICAL: Test Configuration Inconsistency ⚠️

Location: tests/test_evals.py:166-275

The test test_hydra_instantiation_with_evals expects the old dict-based config format:

assert "traitgym_mendelian_promoter" in dm.hparams.evals  # Line 269
assert dm.hparams.evals["traitgym_mendelian_promoter"]["dataset_name"] == "songlab/TraitGym"

But the codebase now uses list-based config (configs/data/default.yaml and configs/data/gpn_animal_promoter.yaml):

evals:
  - name: traitgym_mendelian_promoter
    dataset_name: songlab/TraitGym

Impact: This test will fail when run with the new config format.

Fix Required: Update lines 266-274 in tests/test_evals.py:

# Check evals config is present (list-based format)
assert dm.hparams.evals is not None
assert isinstance(dm.hparams.evals, list)
assert len(dm.hparams.evals) > 0

first_eval = dm.hparams.evals[0]
assert first_eval["name"] == "traitgym_mendelian_promoter"
assert first_eval["dataset_name"] == "songlab/TraitGym"
assert first_eval["window_size"] == 512

MODERATE: Edge Case in Dataloader Mapping

Location: glm_experiments/models/lm_lit_module.py:231-235

if eval_idx >= len(self._eval_names):
    raise IndexError(
        f"dataloader_idx={dataloader_idx} exceeds configured evals. "
        f"Expected indices 0-{len(self._eval_names)} for {len(self._eval_names)} eval(s)."
    )

Issue: The error message says "Expected indices 0-{len}" but should clarify that 0 is LM validation, 1+ are evals.

Suggested Fix:

raise IndexError(
    f"dataloader_idx={dataloader_idx} exceeds configured evals. "
    f"Expected: 0 (LM val) or 1-{len(self._eval_names)} (evals). "
    f"Configured {len(self._eval_names)} eval dataset(s)."
)

MINOR: Missing Default in Config Documentation

Location: configs/data/default.yaml:27-39

Suggestion: Add data_dir default to comment for completeness:

#     genome_url: https://ftp.ensembl.org/...
#     data_dir: data # Directory for genome downloads (default: "data")
#     filter_name: traitgym_promoter # Filter from EVAL_FILTERS registry (default: "none")

📋 CLAUDE.md Compliance Checklist

  • Issue Compliance: No issue linked (feature PR)
  • Code Formatting: Follows Black/isort conventions
  • Documentation: Comprehensive docstrings with type hints
  • Clean Code: Simple, DRY, no unnecessary complexity
  • Antipattern Check: No over-engineering, no premature abstractions
  • Design Compliance: Proper Hydra instantiation, PyTorch Lightning patterns
  • Test Implementation: Real tests (no fakes!) ✓
  • ⚠️ Test Suite: One test needs fixing for new config format


🎯 Recommendations

Required Before Merge:

  1. Fix test_hydra_instantiation_with_evals to use list-based config format

Nice to Have:

  1. Improve error message clarity in lm_lit_module.py:231-235
  2. Add data_dir default comment in configs/data/default.yaml

Future Considerations:

  1. Consider adding integration test that runs load_eval_dataset end-to-end with a small synthetic genome (currently tests only check component functions separately)
  2. The cache_dir for evals defaults to data/evals_cache but is not configurable via YAML - consider adding to config if users need different cache locations
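If the cache location were exposed, a config addition along these lines could work (the key name is an assumption, not the actual schema):

```yaml
# Hypothetical addition to configs/data/default.yaml
evals_cache_dir: data/evals_cache  # current hardcoded default; override for shared or scratch storage
```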

📊 Summary

Overall Assessment: ⭐⭐⭐⭐ (4/5 stars)

This PR successfully generalizes the evaluation system and adds plants dataset support. The architecture is clean, tests are real (not faked!), and documentation is strong. The one critical issue is a test expecting old config format, which must be fixed before merge.

Merge Recommendation: ✅ Approve after fixing test_hydra_instantiation_with_evals

Great work on this refactoring! The registry-based design is much more maintainable than the previous hardcoded approach.

@gonzalobenegas gonzalobenegas merged commit 331add1 into main Dec 1, 2025
6 checks passed
@gonzalobenegas gonzalobenegas deleted the plants branch December 1, 2025 22:19