# PR Review: Evals Correlation Analysis

## Overview

This PR adds correlation analysis capabilities to the evals pipeline, computing and visualizing relationships between different evaluation metrics across training steps. The PR is already merged, so this is a post-merge review for documentation purposes.

## ✅ Strengths
| Guideline | Status | Notes |
|---|---|---|
| Experiments directory standards | ✅ | Correctly uses relaxed standards for exploratory analysis |
| Simplify relentlessly | ⚠️ | Some code duplication could be extracted |
| DRY principle | ⚠️ | Heatmap plotting code duplicated |
| No fake tests | N/A | No tests added (acceptable for experiments/) |
| Documentation | ✅ | Docstrings present on all rules |
| Configuration-driven | ✅ | Good use of Snakemake config |
## 🔒 Security Concerns
Low Risk: This is exploratory analysis code. No security vulnerabilities identified.
- No user input handling
- No external API calls
- File paths constructed from config (trusted source)
## 📊 Performance Considerations

- Sequential correlation computation (lines 108-116): For large metric matrices this could be slow. Pandas `.corr()` is already optimized, but consider parallelization if the metric count grows (see the sketch after this list).
- File I/O in loop (line 74): Reading 100+ small TSV files sequentially. This could benefit from parallel reads, but it is likely not a bottleneck at the current scale.
- Memory usage: The wide pivot (line 95) loads all metrics into memory. This should be fine at the current scale (~10 datasets × 8 models × 2 scorings ≈ 160 rows).
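For reference, a minimal sketch of the pivot-and-correlate step and optional parallel reads. The long-format column names (`step`, `metric`, `value`) and the pivot shape are assumptions about the PR's schema, not its actual code:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd


def load_metrics(paths: list[Path]) -> pd.DataFrame:
    """Read per-run TSV metric files in parallel and concatenate them.

    Parallel reads are optional at the current scale; the sequential
    loop in the PR is likely fine for ~100 files.
    """
    with ThreadPoolExecutor() as pool:
        frames = pool.map(lambda p: pd.read_csv(p, sep="\t"), paths)
    return pd.concat(frames, ignore_index=True)


def metric_correlations(long_df: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format metrics wide, then compute pairwise correlations."""
    wide = long_df.pivot_table(index="step", columns="metric", values="value")
    # Pearson by default; pass method="spearman" if rank correlation fits better.
    return wide.corr()
```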
## 📝 Test Coverage

Note: Per CLAUDE.md, the experiments directory has relaxed testing standards.

Recommendation for the future: Consider adding lightweight validation:

- Smoke test: run the workflow on a tiny synthetic dataset
- Schema test: validate output parquet schemas (see the sketch after this list)
- Visual test: generate expected plots and compare checksums
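As a starting point for the schema test, a self-contained sketch, assuming the output is a square correlation matrix written to parquet (adjust if the PR writes a long-format table instead):

```python
import pandas as pd


def validate_correlation_parquet(path: str) -> None:
    """Schema test for a correlation-matrix parquet file."""
    corr = pd.read_parquet(path)
    # A correlation matrix should be square, with matching labels on both axes.
    assert list(corr.index) == list(corr.columns), "matrix is not square"
    # All defined correlations lie in [-1, 1]; NaNs (constant metrics) are skipped.
    in_range = corr.abs().le(1.0) | corr.isna()
    assert in_range.all().all(), "values outside [-1, 1]"
    # Symmetry check, tolerant of floating-point noise.
    pd.testing.assert_frame_equal(corr, corr.T, check_exact=False)
```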
## 🚀 Suggestions for Future Improvements

- Add missing numpy imports (critical: should be hotfixed)
- Extract duplicated heatmap code (refactoring)
- Add config validation at workflow start (see the sketch after this list)
- Document magic numbers (3 columns, top 9 pairs)
- Add error handling for malformed metric files
- Consider adding `--configfile` support for experimenting with different dataset configs
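One possible shape for the config validation item, as a fail-fast check near the top of the Snakefile. The key names (`datasets`, `models`, `scorings`) are guesses based on the scale numbers above, not the real schema:

```python
# Fail-fast config validation near the top of the Snakefile.
REQUIRED_KEYS = ("datasets", "models", "scorings")


def validate_config(cfg: dict) -> None:
    missing = [k for k in REQUIRED_KEYS if k not in cfg]
    if missing:
        raise ValueError(f"config is missing required keys: {missing}")
    if not cfg["datasets"]:
        raise ValueError("config key 'datasets' must be non-empty")


# `config` is the global Snakemake populates from the Snakefile defaults,
# --config, and any --configfile arguments.
validate_config(config)
```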
## 📋 Pre-Merge Checklist (Retrospective)

Since this is already merged, here's what should have been verified:

- ❌ Tests pass: No tests for experiments (acceptable per CLAUDE.md)
- ❓ Code runs successfully: Needs verification; the numpy import is missing
- ✅ Code formatted: N/A for experiments directory
- ⚠️ Real functionality: Likely works only if numpy was already imported in the shell environment
- ✅ Documentation: Adequate docstrings
- ⚠️ No unnecessary complexity: Some duplication exists
- ❌ No linked issue: No issue reference found (violates agentic git flow)
## 🎓 Agentic Git Flow Compliance

Violation: This PR lacks an associated GitHub issue (CLAUDE.md Phase 2). Per the workflow:

- Should have created an issue draft in Markdown
- Human review and approval
- Create the issue with `gh issue create`
- PR title should reference the issue: "Closes #N: Evals correlation"

Impact: Makes it harder to track requirements and design decisions.
## ✅ Final Verdict

Overall Assessment: A good exploratory feature addition with a clean design, but it contains a critical runtime bug (missing numpy import) and minor code quality issues.

Immediate Action Required:

- Hotfix the missing `import numpy as np` statements (sketch below)
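A sketch of the hotfix; the call site shown is illustrative, not taken from the PR:

```python
# Hotfix: add the missing import at the top of each affected script.
import numpy as np

# Representative call site (illustrative only), e.g. masking the
# redundant upper triangle of a correlation heatmap:
mask = np.triu(np.ones((3, 3), dtype=bool))
```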
Recommended Follow-up:

- Create an issue to track the DRY refactoring (a sketch of the extracted helper follows this list)
- Add a basic smoke test to catch import errors
- Document the correlation analysis in experiments/evals/README.md
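One possible shape for the extracted heatmap helper; the signature and styling defaults here are assumptions, so lift the real colormap and annotation settings from one of the duplicated blocks in the PR:

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes


def plot_corr_heatmap(corr: pd.DataFrame, title: str, ax: Axes | None = None) -> Axes:
    """Single shared helper to replace the duplicated heatmap-plotting code."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr.columns)), corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)), corr.index)
    ax.set_title(title)
    ax.figure.colorbar(im, ax=ax)
    return ax
```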
Code Quality Score: 7/10
- Loses points for missing import (critical bug)
- Loses points for code duplication
- Gains points for good structure and visualization quality
Review conducted following CLAUDE.md standards for the experiments directory (relaxed standards for exploratory analysis).