Simplify aggregation scripts (+ more quality of life changes) #41

Open: hrdkbhatnagar wants to merge 1 commit into main from qol_multigpu

@hrdkbhatnagar (Collaborator)

This is a big one

Main things:

  • a new consolidated aggregation pipeline that replaces the 13 old scripts with 3
  • the ability to rerun just the judge (inspired by the new_judge_v2 branch)

Aggregation pipeline refactor (scripts/)

Fixes #24.

Consolidated 13 per-stage scripts into 3 scripts plus 1 data file (verify.py, listed last, is a temporary regression check and not part of the pipeline):

  • scripts/collect.py — single pass per method dir; reads metrics.json + contamination/disallowed-model judgements + time_taken.txt; applies hardcoded baseline-zeroshot fallback for contaminated/errored cells; writes final_{method}.csv, contamination_{method}.csv, time_overview.csv.

  • scripts/aggregate.py - aggregate_per_cell / aggregate_leaderboard / aggregate_time. Replaces aggregate_avg_stddev*.py, aggregate_time_avg_stddev.py, aggregate_summary.py, aggregate_together.py, compute_single_metrics*.py, compute_baseline_metrics*.py.

  • scripts/utils.py — shared helpers (CSV I/O, stats, dir walking, metric/contamination/time loaders) deduplicated from across the old pipeline.

  • scripts/baselines.json — hardcoded zero-shot + few-shot baselines so baselines no longer need to be recomputed every run.

  • scripts/verify.py - regression check used to confirm the new pipeline matches the old one byte-for-byte (93/93 CSVs identical). Let's delete this and the old aggregation scripts after the review.
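
For reviewers who want to repeat the byte-for-byte check by hand, here is a minimal sketch of that kind of comparison. It is not the code in scripts/verify.py; the function name compare_csv_dirs and the assumption that old and new outputs live in two parallel directory trees are mine.

```python
import filecmp
from pathlib import Path


def compare_csv_dirs(old_dir, new_dir):
    """Compare every CSV under old_dir with its counterpart under new_dir.

    Returns three sorted lists of relative paths: identical, differing, missing.
    The two-directory layout is an assumption made for this sketch.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    identical, differing, missing = [], [], []
    for old_csv in sorted(old_dir.rglob("*.csv")):
        rel = old_csv.relative_to(old_dir)
        new_csv = new_dir / rel
        if not new_csv.is_file():
            missing.append(rel.as_posix())
        # shallow=False forces a content comparison instead of trusting os.stat()
        elif filecmp.cmp(old_csv, new_csv, shallow=False):
            identical.append(rel.as_posix())
        else:
            differing.append(rel.as_posix())
    return identical, differing, missing
```

A clean run would report every file as identical and leave the other two lists empty.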

Judge-rerun tool (src/disallowed_usage_judge/)

For recovering runs where the post-eval contamination/disallowed-model judge step crashed mid-run (e.g. on an API quota hit that surfaces as a Quota exceeded error).

  • src/disallowed_usage_judge/run_judge.sh - standalone judge runner; mirrors the inline judge in run_task.sh exactly (single GPT-5.1-Codex via CODEX_API_KEY), but operates on an existing result dir and writes outputs with a _rerun suffix so originals are preserved.
  • src/disallowed_usage_judge/rerun_judge/ - orchestration: list_results.py (with --only-missing-judgement filter for the case this was built for), rerun_single.sh, commit_rerun_judge.sh (HTCondor batch with --method/--benchmark/--skip-existing/--latest-only/--dry-run), rerun_judge.sub, aggregate_rerun_results.py
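
To illustrate what the --only-missing-judgement filter is for, here is a rough sketch of the idea. It is not the code in list_results.py. Treating any directory that contains metrics.json as a result dir is an assumption, and JUDGEMENT_FILE is a hypothetical file name, since this description does not say what the judge writes.

```python
from pathlib import Path

# Assumption: a result dir is any directory that holds a metrics.json.
METRICS_FILE = "metrics.json"
# Hypothetical name: the real judge output file is not named in this PR text.
JUDGEMENT_FILE = "judgement.json"


def list_result_dirs(root, only_missing_judgement=False):
    """Yield result dirs under root, sorted by path.

    With only_missing_judgement=True, dirs that already contain a
    judgement file are skipped, leaving only the runs worth re-judging.
    """
    for metrics in sorted(Path(root).rglob(METRICS_FILE)):
        result_dir = metrics.parent
        if only_missing_judgement and (result_dir / JUDGEMENT_FILE).is_file():
            continue
        yield result_dir
```

The filtered list is what a batch submission would then iterate over, so runs with an intact judgement are never touched.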

New agents (agents/)

  • API line: codex_xhigh, codex_xhigh_reprompt, claude_reprompt.

New containers (containers/)

Definition files for gpt_5_2_codex

dev_utils fixes

  • terminated_finder.py - was reading every error.log in full and regexing the entire content for SIGTERM/SIGKILL markers; now reads a 4KB head slice and uses a tight regex (run_task\.sh: line \d+: \d+ Killed) that won't false-positive on tokenizer vocab tokens that happen to contain those markers. ~10x faster and fixes a false positive.

  • limit_hit_list.py - added two error patterns: Reconnecting... 5/5 (Codex CLI exhausted stream-reconnect retries) and Quota exceeded. Check your plan (the human-readable form of insufficient_quota that Codex emits in turn.failed).
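
The head-slice approach in terminated_finder.py can be sketched roughly as follows. The regex is the one quoted above; the names HEAD_BYTES, KILLED_RE and was_killed are made up for this sketch, and the real script may be structured differently.

```python
import re

# Only the first 4 KB of each log is inspected.
HEAD_BYTES = 4096
# Regex quoted in this PR; compiled on bytes so no decoding is needed.
KILLED_RE = re.compile(rb"run_task\.sh: line \d+: \d+ Killed")


def was_killed(error_log):
    """Return True if the head of error_log contains the shell's Killed line."""
    with open(error_log, "rb") as fh:
        head = fh.read(HEAD_BYTES)
    return KILLED_RE.search(head) is not None
```

Anchoring on the run_task.sh prefix is what keeps a bare "Killed" or "SIGKILL" string elsewhere in a log from matching, and reading a fixed-size slice keeps the cost independent of log length.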

Other scripts

  • scripts/rerun_eval_n_times.sh + aggregate_metrics_runs.py - re-evaluate a finished run N times to capture decoding-noise mean/std/stderr.
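
For reference, the mean/std/stderr summary over N re-evaluations can be sketched like this. It is not the code in aggregate_metrics_runs.py; summarize_runs is a hypothetical name, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the description does not say which estimator is used.

```python
import math
import statistics


def summarize_runs(values):
    """Summarize one metric across repeated evaluations of the same run.

    Assumes the sample standard deviation; with a single value the
    spread is reported as 0.0 rather than raising.
    """
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if n > 1 else 0.0
    stderr = std / math.sqrt(n)
    return {"n": n, "mean": mean, "std": std, "stderr": stderr}
```

The stderr shrinks with the square root of N, so it is the number to look at when deciding whether a difference between two methods exceeds decoding noise.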

Development

Successfully merging this pull request may close these issues.

Simplify experiment aggregation