Simplify aggregation scripts (+ more quality of life changes) #41
Open

hrdkbhatnagar wants to merge 1 commit
This is a big one.
Main things:
### Aggregation pipeline refactor (`scripts/`)

Fixes #24. Consolidated 13 per-stage scripts into 3 scripts + 1 data file:

- `scripts/collect.py` - single pass per method dir; reads `metrics.json` + contamination/disallowed-model judgements + `time_taken.txt`; applies the hardcoded baseline-zeroshot fallback for contaminated/errored cells; writes `final_{method}.csv`, `contamination_{method}.csv`, `time_overview.csv`. (See the sketch after this list.)
- `scripts/aggregate.py` - `aggregate_per_cell` / `aggregate_leaderboard` / `aggregate_time`. Replaces `aggregate_avg_stddev*.py`, `aggregate_time_avg_stddev.py`, `aggregate_summary.py`, `aggregate_together.py`, `compute_single_metrics*.py`, `compute_baseline_metrics*.py`.
- `scripts/utils.py` - shared helpers (CSV I/O, stats, dir walking, metric/contamination/time loaders) deduplicated from across the old pipeline.
- `scripts/baselines.json` - hardcoded zero-shot + few-shot baselines, so baselines no longer need to be recomputed on every run.
- `scripts/verify.py` - regression check used to confirm the new pipeline matches the old one byte-for-byte (93/93 CSVs identical). (Let's delete this and the old aggregation scripts after the review.)
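For reviewers, a minimal sketch of what the single collect pass does, not the actual implementation; the judgement file name (`judgement.json`) and the `baselines.json` layout here are assumptions:

```python
import json
from pathlib import Path

def collect_method(method_dir: Path, baselines: dict) -> list[dict]:
    """One pass over a method dir: merge metrics, judgements, and timing."""
    rows = []
    for run_dir in sorted(p for p in method_dir.iterdir() if p.is_dir()):
        row = {"task": run_dir.name}
        metrics_file = run_dir / "metrics.json"
        judgement_file = run_dir / "judgement.json"  # hypothetical file name
        contaminated = judgement_file.exists() and json.loads(
            judgement_file.read_text()
        ).get("contaminated", False)
        if contaminated or not metrics_file.exists():
            # Contaminated or errored cell: fall back to the hardcoded
            # zero-shot baseline from baselines.json instead of recomputing it.
            row.update(baselines["zeroshot"].get(run_dir.name, {}))
            row["is_fallback"] = True
        else:
            row.update(json.loads(metrics_file.read_text()))
        time_file = run_dir / "time_taken.txt"
        if time_file.exists():
            row["time_taken"] = float(time_file.read_text().strip())
        rows.append(row)
    return rows
```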
### Judge-rerun tool (`src/disallowed_usage_judge/`)

For recovering runs where the post-eval contamination/disallowed-model judge step crashed mid-run (e.g. API quota hit, `Quota exceeded` error).

- `src/disallowed_usage_judge/run_judge.sh` - standalone judge runner; mirrors the inline judge in `run_task.sh` exactly (single GPT-5.1-Codex via `CODEX_API_KEY`), but operates on an existing result dir and writes outputs with a `_rerun` suffix so originals are preserved.
- `src/disallowed_usage_judge/rerun_judge/` - orchestration: `list_results.py` (with an `--only-missing-judgement` filter for the case this was built for; see the sketch after this list), `rerun_single.sh`, `commit_rerun_judge.sh` (HTCondor batch with `--method`/`--benchmark`/`--skip-existing`/`--latest-only`/`--dry-run`), `rerun_judge.sub`, `aggregate_rerun_results.py`.
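Roughly how the `--only-missing-judgement` selection works (a sketch under an assumed file layout; the `*judgement*` glob is a guess):

```python
import argparse
from pathlib import Path

def list_results(results_root: Path, only_missing_judgement: bool) -> list[Path]:
    """Enumerate result dirs, optionally keeping only those the judge never finished."""
    selected = []
    for result_dir in sorted(p for p in results_root.iterdir() if p.is_dir()):
        # A judge that crashed mid-run (e.g. on "Quota exceeded") leaves the
        # result dir without any judgement output; those are the dirs to rerun.
        has_judgement = any(result_dir.glob("*judgement*"))
        if only_missing_judgement and has_judgement:
            continue
        selected.append(result_dir)
    return selected

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("results_root", type=Path)
    parser.add_argument("--only-missing-judgement", action="store_true")
    args = parser.parse_args()
    for d in list_results(args.results_root, args.only_missing_judgement):
        print(d)
```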
### New agents (`agents/`)

`codex_xhigh`, `codex_xhigh_reprompt`, `claude_reprompt`.
### New containers (`containers/`)

Definition files for `gpt_5_2_codex`.
### dev_utils fixes

- `terminated_finder.py` - was reading every `error.log` in full and regexing the entire content for SIGTERM/SIGKILL markers; now reads a 4 KB head slice and uses a tight regex (`run_task\.sh: line \d+: \d+ Killed`) that won't false-positive on tokenizer vocab tokens. ~10x faster, and fixes a false positive. (Sketch below.)
- `limit_hit_list.py` - added two error patterns: `Reconnecting... 5/5` (Codex CLI exhausted its stream-reconnect retries) and `Quota exceeded. Check your plan` (the human-readable form of `insufficient_quota` that Codex emits in `turn.failed`). (Pattern sketch below.)
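The gist of the new `terminated_finder.py` check, as a sketch; the 4 KB slice and the regex come from the description above, the helper name is made up:

```python
import re
from pathlib import Path

# Anchored to the shell's own kill report ("run_task.sh: line 42: 1234 Killed"),
# so raw vocab/token dumps elsewhere in the log can no longer false-positive.
KILLED_RE = re.compile(r"run_task\.sh: line \d+: \d+ Killed")

def was_killed(error_log: Path, head_bytes: int = 4096) -> bool:
    """Check only the first 4 KB of error.log instead of reading the whole file."""
    with error_log.open("rb") as f:
        head = f.read(head_bytes).decode("utf-8", errors="replace")
    return KILLED_RE.search(head) is not None
```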
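And the two new `limit_hit_list.py` patterns, sketched (the scan helper is illustrative):

```python
import re

LIMIT_PATTERNS = [
    # Codex CLI gave up after exhausting its 5 stream-reconnect retries.
    re.compile(r"Reconnecting\.\.\. 5/5"),
    # Human-readable form of insufficient_quota, emitted in turn.failed.
    re.compile(r"Quota exceeded\. Check your plan"),
]

def hit_limit(log_text: str) -> bool:
    """True if the log shows a rate/quota limit was hit."""
    return any(p.search(log_text) for p in LIMIT_PATTERNS)
```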
### Other scripts

- `scripts/rerun_eval_n_times.sh` + `aggregate_metrics_runs.py` - re-evaluate a finished run N times to capture decoding-noise mean/std/stderr. (Aggregation sketch below.)
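A sketch of the mean/std/stderr aggregation; the per-run file names and metric layout are assumptions:

```python
import json
import math
import statistics
from pathlib import Path

def aggregate_runs(run_metrics: list[dict]) -> dict:
    """Per-metric mean/std/stderr across N re-evaluations of one finished run."""
    out = {}
    for key in run_metrics[0]:  # assumes every metric value is numeric
        values = [m[key] for m in run_metrics]
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        out[key] = {
            "mean": statistics.mean(values),
            "std": std,
            "stderr": std / math.sqrt(len(values)),
        }
    return out

# Assumed layout: rerun_eval_n_times.sh leaves metrics_run_1.json ... _N.json.
runs = [json.loads(p.read_text()) for p in sorted(Path(".").glob("metrics_run_*.json"))]
if runs:
    print(json.dumps(aggregate_runs(runs), indent=2))
```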