
Evals correlation #15

Merged
gonzalobenegas merged 2 commits into main from evals-correlation
Dec 12, 2025

Conversation

@gonzalobenegas
Member

Calculate and plot correlations between metrics


@gonzalobenegas gonzalobenegas merged commit eb9d04b into main Dec 12, 2025
5 checks passed
@gonzalobenegas gonzalobenegas deleted the evals-correlation branch December 12, 2025 19:15
@claude

claude bot commented Dec 12, 2025

PR Review: Evals Correlation Analysis

Overview

This PR adds correlation analysis capabilities to the evals pipeline, computing and visualizing relationships between different evaluation metrics across training steps. The PR is already merged, so this is a post-merge review for documentation purposes.


✅ Strengths

  1. Well-structured configuration refactoring: The move from a flat list of datasets to dataset_configs (config.yaml:35-51) is cleaner and more maintainable. Each dataset now explicitly declares its metrics and scorings.

  2. Good separation of concerns: New correlation rules are isolated in metrics.smk, with clear helper functions in Snakefile for managing wildcards and file generation.

  3. Comprehensive visualization: Multiple complementary plots (heatmaps, time series, scatter pairs) provide thorough insight into metric correlations.

  4. Appropriate for experiments directory: This exploratory analysis fits well in experiments/ with relaxed standards per CLAUDE.md guidelines.


⚠️ Issues & Concerns

1. Missing Import Statements (Critical Bug)

Location: metrics.smk:54, 185, 233

The plotting rules use numpy (np.array, np.polyfit, np.poly1d, np.linspace) but never import it. This will cause runtime failures:

```python
# Line 202: axes = np.array([axes])  # NameError: name 'np' is not defined
# Line 272: z = np.polyfit(...)      # NameError: name 'np' is not defined
```

Fix needed: Add `import numpy as np` to the `run:` block of each plotting rule.
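For reference, the numpy calls these rules depend on look like this once the import is in place. The data here is a dummy stand-in for one metric-vs-metric scatter pair; the actual rule bodies read from the evals outputs:

```python
import numpy as np

# Illustrative points standing in for one metric-vs-metric scatter pair.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

z = np.polyfit(x, y, 1)             # least-squares linear fit -> [slope, intercept]
p = np.poly1d(z)                    # callable polynomial for the trend line
xs = np.linspace(x.min(), x.max(), 50)
trend = p(xs)                       # y-values to draw as the fitted line
```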

2. Hardcoded Magic Numbers

Location: metrics.smk:197, 254, 257-258

  • Line 197: ncols = min(3, n_combos) - Why 3 columns?
  • Line 254: top_pairs = corr_pairs[:min(9, len(corr_pairs))] - Why top 9 pairs?
  • Lines 257-258: 3x3 grid hardcoded

Recommendation: Extract to config or add explanatory comments justifying these choices.
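If the values are kept, a named-constant version makes the intent explicit. The names and justifications below are hypothetical; only the values 3 and 9 come from the current code:

```python
import math

# Hypothetical refactor: name the inlined layout values.
MAX_GRID_COLS = 3  # assumption: 3 columns keeps subplots legible on letter/A4 pages
TOP_N_PAIRS = 9    # assumption: 9 pairs exactly fill a 3x3 scatter grid

def grid_shape(n_plots, max_cols=MAX_GRID_COLS):
    """Return (nrows, ncols) for n_plots subplots, capped at max_cols columns."""
    ncols = min(max_cols, n_plots)
    nrows = math.ceil(n_plots / ncols)
    return nrows, ncols
```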

3. Potential Index Errors

Location: metrics.smk:200-203

```python
fig, axes = plt.subplots(nrows, ncols, figsize=(6*ncols, 4*nrows))
if nrows == 1 and ncols == 1:
    axes = np.array([axes])
axes = axes.flatten()
```

Issue: When nrows == 1 but ncols > 1, axes is a 1D array and doesn't need wrapping. The condition should be:

```python
if nrows == 1 or ncols == 1:
    axes = np.atleast_1d(axes).flatten()
```

Or simply always use axes.flatten() after handling the single-plot case.
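Packaged as a helper, the fix handles all three shapes `plt.subplots` can return. Dummy objects stand in for real Axes here so the behavior is easy to check:

```python
import numpy as np

def normalize_axes(axes):
    """Flatten whatever plt.subplots returned into a 1-D array:
    a scalar Axes (1x1 grid), a 1-D array (one row or column), or a 2-D grid."""
    return np.atleast_1d(axes).flatten()
```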

4. Edge Case: Empty Correlation Pairs

Location: metrics.smk:245-254

If metric_cols has fewer than 2 elements, corr_pairs will be empty, leading to empty plots. Consider adding a guard:

```python
if len(metric_cols) < 2:
    # Create empty plots or skip
    return
```
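One way to make the guard explicit is a small helper. This is a hypothetical sketch; the merged code also sorts pairs by correlation strength, which is omitted here:

```python
from itertools import combinations

def metric_pairs(metric_cols, top_n=9):
    """Return up to top_n metric column pairs, or [] when there is
    nothing to correlate, so downstream plotting can skip cleanly."""
    if len(metric_cols) < 2:
        return []
    return list(combinations(metric_cols, 2))[:top_n]
```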

5. Configuration Validation Missing

Location: Snakefile:17-41

The get_all_metric_files() function silently iterates over dataset_configs without validating that:

  • metrics lists valid metric types (AUPRC, AUROC, Spearman)
  • scorings lists valid scoring functions from the config
  • Required rules exist for each metric type

Recommendation: Add validation or document assumptions.
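A fail-fast validation sketch along these lines could run at workflow start. The `dataset_configs` structure and key names below are assumptions based on this review, not the actual config schema:

```python
VALID_METRICS = {"AUPRC", "AUROC", "Spearman"}

def validate_dataset_configs(dataset_configs, valid_scorings):
    """Raise early with a clear message instead of generating bad wildcards."""
    for name, cfg in dataset_configs.items():
        bad_metrics = set(cfg.get("metrics", [])) - VALID_METRICS
        if bad_metrics:
            raise ValueError(f"{name}: unknown metrics {sorted(bad_metrics)}")
        bad_scorings = set(cfg.get("scorings", [])) - set(valid_scorings)
        if bad_scorings:
            raise ValueError(f"{name}: unknown scorings {sorted(bad_scorings)}")
```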


🔍 Code Quality Observations

Positive:

  • ✅ Consistent TSV output format with float formatting
  • ✅ Good use of parquet for intermediate data
  • ✅ Docstrings on all new rules
  • ✅ PDF + PNG outputs for all plots (good for publications)
  • ✅ Correlation-based column ordering in heatmaps (smart UX)

Areas for Improvement:

Duplicate Code (DRY violation):

  • Lines 139-146 and 144-146 nearly identical - reorder logic duplicated for Pearson and Spearman
  • Lines 148-160 and 162-174 nearly identical - heatmap plotting code duplicated

Suggestion: Extract shared heatmap plotting logic:

```python
def plot_correlation_heatmap(corr_df, output_png, output_pdf, title, label):
    # ... shared plotting code ...
```

Type Safety: Parquet column access (line 75: `df.iloc[0, 0]`) assumes a specific column order. Consider using column names for robustness.
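The difference in failure mode, on a toy frame (the real parquet schema is not shown in the diff, so the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"metric": ["AUROC"], "value": [0.91]})

fragile = df.iloc[0, 1]      # silently wrong if the column order ever changes
robust = df.loc[0, "value"]  # raises KeyError instead if the schema drifts
```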

Error Handling: No try-except blocks for file I/O or correlation computation. If a metric file is malformed, the entire workflow fails opaquely.
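A sketch of wrapping the per-file read so failures name the offending path; the one-value-per-TSV layout assumed here is a guess, not taken from the PR:

```python
from pathlib import Path

def read_metric_value(path):
    """Read a single metric value from a small TSV, re-raising I/O and
    parse errors with the file path attached for easier debugging."""
    try:
        text = Path(path).read_text().strip()
        return float(text.split("\t")[-1])
    except (OSError, ValueError) as exc:
        raise RuntimeError(f"Malformed or unreadable metric file: {path}") from exc
```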


🎯 Alignment with CLAUDE.md

| Guideline | Status | Notes |
| --- | --- | --- |
| Experiments directory standards | ✅ | Correctly uses relaxed standards for exploratory analysis |
| Simplify relentlessly | ⚠️ | Some code duplication could be extracted |
| DRY principle | ⚠️ | Heatmap plotting code duplicated |
| No fake tests | N/A | No tests added (acceptable for experiments/) |
| Documentation | ✅ | Docstrings present on all rules |
| Configuration-driven | ✅ | Good use of Snakemake config |

🔒 Security Concerns

Low Risk: This is exploratory analysis code. No security vulnerabilities identified.

  • No user input handling
  • No external API calls
  • File paths constructed from config (trusted source)

📊 Performance Considerations

  1. Sequential correlation computation (lines 108-116): For large metric matrices, this could be slow. Pandas .corr() is already optimized, but consider parallelization if metric count grows.

  2. File I/O in loop (line 74): Reading 100+ small TSV files sequentially. Could benefit from parallel reads, but likely not a bottleneck for current scale.

  3. Memory usage: Wide pivot (line 95) loads all metrics into memory. Should be fine for current scale (~10 datasets × 8 models × 2 scorings = ~160 rows).
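For scale reference, the pivot-then-correlate pattern on a toy long-format table; the column names are assumptions, and pandas `.corr()` also accepts `method="spearman"`:

```python
import pandas as pd

# Toy long-format metrics table; real columns are assumptions.
long_df = pd.DataFrame({
    "step":   [100, 100, 200, 200],
    "metric": ["AUROC", "AUPRC", "AUROC", "AUPRC"],
    "value":  [0.70, 0.50, 0.80, 0.60],
})

wide = long_df.pivot(index="step", columns="metric", values="value")
corr = wide.corr(method="pearson")  # vectorized across all metric columns
```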


📝 Test Coverage

Note: Per CLAUDE.md, experiments directory has relaxed testing standards.

Recommendation for future: Consider adding lightweight validation:

  • Smoke test: Run workflow on tiny synthetic dataset
  • Schema test: Validate output parquet schemas
  • Visual test: Generate expected plots and compare checksums
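A lightweight schema check along those lines might look as follows; `EXPECTED_COLUMNS` is a guess at the output schema, not taken from the PR:

```python
import pandas as pd

EXPECTED_COLUMNS = {"step", "metric", "value"}  # assumed output schema

def check_metrics_schema(df):
    """Minimal validation: required columns present and no NaN metric values."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise AssertionError(f"missing columns: {sorted(missing)}")
    if df["value"].isna().any():
        raise AssertionError("NaN metric values found")
```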

🚀 Suggestions for Future Improvements

  1. Add missing `import numpy as np` statements (critical; should be hotfixed)
  2. Extract duplicate heatmap code (refactoring)
  3. Add config validation at workflow start
  4. Document magic numbers (3 columns, top 9 pairs)
  5. Add error handling for malformed metric files
  6. Consider adding `--configfile` support for experimenting with different dataset configs

📋 Pre-Merge Checklist (Retrospective)

Since this is already merged, here's what should have been verified:

  • ✅ Tests pass: No tests for experiments (acceptable per CLAUDE.md)
  • ⚠️ Code runs successfully: Needs verification; numpy import missing
  • ✅ Code formatted: N/A for experiments directory
  • ⚠️ Real functionality: Likely works if numpy was imported in the shell environment
  • ✅ Documentation: Adequate docstrings
  • ⚠️ No unnecessary complexity: Some duplication exists
  • ❌ Linked issue: No issue reference found (violates agentic git flow)

🎓 Agentic Git Flow Compliance

Violation: This PR lacks an associated GitHub issue (CLAUDE.md Phase 2). Per the workflow:

  1. Should have created issue draft in Markdown
  2. Human review and approval
  3. Create issue with gh issue create
  4. PR title should reference issue: "Closes #N: Evals correlation"

Impact: Makes it harder to track requirements and design decisions.


✅ Final Verdict

Overall Assessment: Good exploratory feature addition with clean design, but contains a critical runtime bug (missing numpy import) and minor code quality issues.

Immediate Action Required:

  • Hotfix the missing `import numpy as np` statements

Recommended Follow-up:

  • Create issue to track DRY refactoring
  • Add basic smoke test to catch import errors
  • Document correlation analysis in experiments/evals/README.md

Code Quality Score: 7/10

  • Loses points for missing import (critical bug)
  • Loses points for code duplication
  • Gains points for good structure and visualization quality

Review conducted following CLAUDE.md standards for the experiments directory (relaxed standards for exploratory analysis).

