
Evals correlation #15

Merged
gonzalobenegas merged 2 commits into main from evals-correlation
Dec 12, 2025

Conversation

@gonzalobenegas
Member

Calculate and plot correlations between metrics


@gonzalobenegas gonzalobenegas merged commit eb9d04b into main Dec 12, 2025
5 checks passed
@gonzalobenegas gonzalobenegas deleted the evals-correlation branch December 12, 2025 19:15
@claude

claude bot commented Dec 12, 2025

PR Review: Evals Correlation Analysis

Overview

This PR adds correlation analysis capabilities to the evals pipeline, computing and visualizing relationships between different evaluation metrics across training steps. The PR is already merged, so this is a post-merge review for documentation purposes.


✅ Strengths

  1. Well-structured configuration refactoring: The move from a flat list of datasets to dataset_configs (config.yaml:35-51) is cleaner and more maintainable. Each dataset now explicitly declares its metrics and scorings.

  2. Good separation of concerns: New correlation rules are isolated in metrics.smk, with clear helper functions in Snakefile for managing wildcards and file generation.

  3. Comprehensive visualization: Multiple complementary plots (heatmaps, time series, scatter pairs) provide thorough insight into metric correlations.

  4. Appropriate for experiments directory: This exploratory analysis fits well in experiments/ with relaxed standards per CLAUDE.md guidelines.


⚠️ Issues & Concerns

1. Missing Import Statements (Critical Bug)

Location: metrics.smk:54, 185, 233

The plotting rules use numpy (np.array, np.polyfit, np.poly1d, np.linspace) but never import it. This will cause runtime failures:

```python
# Line 202: axes = np.array([axes])  # NameError: name 'np' is not defined
# Line 272: z = np.polyfit(...)      # NameError: name 'np' is not defined
```

Fix needed: Add `import numpy as np` to the `run:` block of each plotting rule.
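For reference, the numpy calls these rules depend on look like this once the import is in place. The data here is a dummy stand-in for one metric-vs-metric scatter pair; the actual rule bodies read from the evals outputs:

```python
import numpy as np

# Illustrative points standing in for one metric-vs-metric scatter pair.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

z = np.polyfit(x, y, 1)             # least-squares linear fit -> [slope, intercept]
p = np.poly1d(z)                    # callable polynomial for the trend line
xs = np.linspace(x.min(), x.max(), 50)
trend = p(xs)                       # y-values to draw as the fitted line
```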

2. Hardcoded Magic Numbers

Location: metrics.smk:197, 254, 257-258

  • Line 197: ncols = min(3, n_combos) - Why 3 columns?
  • Line 254: top_pairs = corr_pairs[:min(9, len(corr_pairs))] - Why top 9 pairs?
  • Lines 257-258: 3x3 grid hardcoded

Recommendation: Extract to config or add explanatory comments justifying these choices.
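If the values are kept, a named-constant version makes the intent explicit. The names and justifications below are hypothetical; only the values 3 and 9 come from the current code:

```python
import math

# Hypothetical refactor: name the inlined layout values.
MAX_GRID_COLS = 3  # assumption: 3 columns keeps subplots legible on letter/A4 pages
TOP_N_PAIRS = 9    # assumption: 9 pairs exactly fill a 3x3 scatter grid

def grid_shape(n_plots, max_cols=MAX_GRID_COLS):
    """Return (nrows, ncols) for n_plots subplots, capped at max_cols columns."""
    ncols = min(max_cols, n_plots)
    nrows = math.ceil(n_plots / ncols)
    return nrows, ncols
```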

3. Potential Index Errors

Location: metrics.smk:200-203

```python
fig, axes = plt.subplots(nrows, ncols, figsize=(6*ncols, 4*nrows))
if nrows == 1 and ncols == 1:
    axes = np.array([axes])
axes = axes.flatten()
```

Issue: When nrows == 1 but ncols > 1, axes is a 1D array and doesn't need wrapping. The condition should be:

```python
if nrows == 1 or ncols == 1:
    axes = np.atleast_1d(axes).flatten()
```

Or simply always use axes.flatten() after handling the single-plot case.
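Packaged as a helper, the fix handles all three shapes `plt.subplots` can return. Dummy objects stand in for real Axes here so the behavior is easy to check:

```python
import numpy as np

def normalize_axes(axes):
    """Flatten whatever plt.subplots returned into a 1-D array:
    a scalar Axes (1x1 grid), a 1-D array (one row or column), or a 2-D grid."""
    return np.atleast_1d(axes).flatten()
```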

4. Edge Case: Empty Correlation Pairs

Location: metrics.smk:245-254

If metric_cols has fewer than 2 elements, corr_pairs will be empty, leading to empty plots. Consider adding a guard:

```python
if len(metric_cols) < 2:
    # Create empty plots or skip
    return
```
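One way to make the guard explicit is a small helper. This is a hypothetical sketch; the merged code also sorts pairs by correlation strength, which is omitted here:

```python
from itertools import combinations

def metric_pairs(metric_cols, top_n=9):
    """Return up to top_n metric column pairs, or [] when there is
    nothing to correlate, so downstream plotting can skip cleanly."""
    if len(metric_cols) < 2:
        return []
    return list(combinations(metric_cols, 2))[:top_n]
```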

5. Configuration Validation Missing

Location: Snakefile:17-41

The get_all_metric_files() function silently iterates over dataset_configs without validating that:

  • metrics lists valid metric types (AUPRC, AUROC, Spearman)
  • scorings lists valid scoring functions from the config
  • Required rules exist for each metric type

Recommendation: Add validation or document assumptions.
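A fail-fast validation sketch along these lines could run at workflow start. The `dataset_configs` structure and key names below are assumptions based on this review, not the actual config schema:

```python
VALID_METRICS = {"AUPRC", "AUROC", "Spearman"}

def validate_dataset_configs(dataset_configs, valid_scorings):
    """Raise early with a clear message instead of generating bad wildcards."""
    for name, cfg in dataset_configs.items():
        bad_metrics = set(cfg.get("metrics", [])) - VALID_METRICS
        if bad_metrics:
            raise ValueError(f"{name}: unknown metrics {sorted(bad_metrics)}")
        bad_scorings = set(cfg.get("scorings", [])) - set(valid_scorings)
        if bad_scorings:
            raise ValueError(f"{name}: unknown scorings {sorted(bad_scorings)}")
```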


🔍 Code Quality Observations

Positive:

  • ✅ Consistent TSV output format with float formatting
  • ✅ Good use of parquet for intermediate data
  • ✅ Docstrings on all new rules
  • ✅ PDF + PNG outputs for all plots (good for publications)
  • ✅ Correlation-based column ordering in heatmaps (smart UX)

Areas for Improvement:

Duplicate Code (DRY violation):

  • Lines 139-146 and 144-146 nearly identical - reorder logic duplicated for Pearson and Spearman
  • Lines 148-160 and 162-174 nearly identical - heatmap plotting code duplicated

Suggestion: Extract shared heatmap plotting logic:

```python
def plot_correlation_heatmap(corr_df, output_png, output_pdf, title, label):
    # ... shared plotting code ...
```

Type Safety: Parquet column access (line 75: `df.iloc[0, 0]`) assumes a specific column order. Consider using column names for robustness.
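The difference in failure mode, on a toy frame (the real parquet schema is not shown in the diff, so the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"metric": ["AUROC"], "value": [0.91]})

fragile = df.iloc[0, 1]      # silently wrong if the column order ever changes
robust = df.loc[0, "value"]  # raises KeyError instead if the schema drifts
```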

Error Handling: No try-except blocks for file I/O or correlation computation. If a metric file is malformed, the entire workflow fails opaquely.
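A sketch of wrapping the per-file read so failures name the offending path; the one-value-per-TSV layout assumed here is a guess, not taken from the PR:

```python
from pathlib import Path

def read_metric_value(path):
    """Read a single metric value from a small TSV, re-raising I/O and
    parse errors with the file path attached for easier debugging."""
    try:
        text = Path(path).read_text().strip()
        return float(text.split("\t")[-1])
    except (OSError, ValueError) as exc:
        raise RuntimeError(f"Malformed or unreadable metric file: {path}") from exc
```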


🎯 Alignment with CLAUDE.md

| Guideline | Status | Notes |
| --- | --- | --- |
| Experiments directory standards | ✅ | Correctly uses relaxed standards for exploratory analysis |
| Simplify relentlessly | ⚠️ | Some code duplication could be extracted |
| DRY principle | ⚠️ | Heatmap plotting code duplicated |
| No fake tests | N/A | No tests added (acceptable for experiments/) |
| Documentation | ✅ | Docstrings present on all rules |
| Configuration-driven | ✅ | Good use of Snakemake config |

🔒 Security Concerns

Low Risk: This is exploratory analysis code. No security vulnerabilities identified.

  • No user input handling
  • No external API calls
  • File paths constructed from config (trusted source)

📊 Performance Considerations

  1. Sequential correlation computation (lines 108-116): For large metric matrices, this could be slow. Pandas .corr() is already optimized, but consider parallelization if metric count grows.

  2. File I/O in loop (line 74): Reading 100+ small TSV files sequentially. Could benefit from parallel reads, but likely not a bottleneck for current scale.

  3. Memory usage: Wide pivot (line 95) loads all metrics into memory. Should be fine for current scale (~10 datasets × 8 models × 2 scorings = ~160 rows).
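For scale reference, the pivot-then-correlate pattern on a toy long-format table; the column names are assumptions, and pandas `.corr()` also accepts `method="spearman"`:

```python
import pandas as pd

# Toy long-format metrics table; real columns are assumptions.
long_df = pd.DataFrame({
    "step":   [100, 100, 200, 200],
    "metric": ["AUROC", "AUPRC", "AUROC", "AUPRC"],
    "value":  [0.70, 0.50, 0.80, 0.60],
})

wide = long_df.pivot(index="step", columns="metric", values="value")
corr = wide.corr(method="pearson")  # vectorized across all metric columns
```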


📝 Test Coverage

Note: Per CLAUDE.md, experiments directory has relaxed testing standards.

Recommendation for future: Consider adding lightweight validation:

  • Smoke test: Run workflow on tiny synthetic dataset
  • Schema test: Validate output parquet schemas
  • Visual test: Generate expected plots and compare checksums
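A lightweight schema check along those lines might look as follows; `EXPECTED_COLUMNS` is a guess at the output schema, not taken from the PR:

```python
import pandas as pd

EXPECTED_COLUMNS = {"step", "metric", "value"}  # assumed output schema

def check_metrics_schema(df):
    """Minimal validation: required columns present and no NaN metric values."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise AssertionError(f"missing columns: {sorted(missing)}")
    if df["value"].isna().any():
        raise AssertionError("NaN metric values found")
```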

🚀 Suggestions for Future Improvements

  1. Add missing `import numpy as np` statements (critical; should be hotfixed)
  2. Extract duplicate heatmap code (refactoring)
  3. Add config validation at workflow start
  4. Document magic numbers (3 columns, top 9 pairs)
  5. Add error handling for malformed metric files
  6. Consider adding `--configfile` support for experimenting with different dataset configs

📋 Pre-Merge Checklist (Retrospective)

Since this is already merged, here's what should have been verified:

  • ✅ Tests pass: No tests for experiments (acceptable per CLAUDE.md)
  • ⚠️ Code runs successfully: Needs verification; numpy import missing
  • ✅ Code formatted: N/A for experiments directory
  • ⚠️ Real functionality: Likely works if numpy was imported in the shell environment
  • ✅ Documentation: Adequate docstrings
  • ⚠️ No unnecessary complexity: Some duplication exists
  • ❌ Linked issue: No issue reference found (violates agentic git flow)

🎓 Agentic Git Flow Compliance

Violation: This PR lacks an associated GitHub issue (CLAUDE.md Phase 2). Per the workflow:

  1. Should have created issue draft in Markdown
  2. Human review and approval
  3. Create issue with gh issue create
  4. PR title should reference issue: "Closes #N: Evals correlation"

Impact: Makes it harder to track requirements and design decisions.


✅ Final Verdict

Overall Assessment: Good exploratory feature addition with clean design, but contains a critical runtime bug (missing numpy import) and minor code quality issues.

Immediate Action Required:

  • Hotfix the missing `import numpy as np` statements

Recommended Follow-up:

  • Create issue to track DRY refactoring
  • Add basic smoke test to catch import errors
  • Document correlation analysis in experiments/evals/README.md

Code Quality Score: 7/10

  • Loses points for missing import (critical bug)
  • Loses points for code duplication
  • Gains points for good structure and visualization quality

Review conducted following CLAUDE.md standards for the experiments directory (relaxed standards for exploratory analysis).

