Simplify aggregation scripts (+ more quality of life changes) #41
Open

hrdkbhatnagar wants to merge 1 commit
This is a big one.
Main things:
### Aggregation pipeline refactor (`scripts/`)

Fixes #24. Consolidated 13 per-stage scripts into 3 scripts + 1 data file:

- `scripts/collect.py` - single pass per method dir; reads `metrics.json` + contamination/disallowed-model judgements + `time_taken.txt`; applies the hardcoded baseline-zeroshot fallback for contaminated/errored cells; writes `final_{method}.csv`, `contamination_{method}.csv`, `time_overview.csv`. (See the sketch after this list.)
- `scripts/aggregate.py` - `aggregate_per_cell` / `aggregate_leaderboard` / `aggregate_time`. Replaces `aggregate_avg_stddev*.py`, `aggregate_time_avg_stddev.py`, `aggregate_summary.py`, `aggregate_together.py`, `compute_single_metrics*.py`, `compute_baseline_metrics*.py`.
- `scripts/utils.py` - shared helpers (CSV I/O, stats, dir walking, metric/contamination/time loaders) deduplicated from across the old pipeline.
- `scripts/baselines.json` - hardcoded zero-shot + few-shot baselines, so baselines no longer need to be recomputed on every run.
- `scripts/verify.py` - regression check used to confirm the new pipeline matches the old one byte-for-byte (93/93 CSVs identical). (Let's delete this and the old aggregation scripts after the review.)
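For reviewers, a minimal sketch of what the single collect pass does, not the actual implementation; the judgement file name (`judgement.json`) and the `baselines.json` layout here are assumptions:

```python
import json
from pathlib import Path

def collect_method(method_dir: Path, baselines: dict) -> list[dict]:
    """One pass over a method dir: merge metrics, judgements, and timing."""
    rows = []
    for run_dir in sorted(p for p in method_dir.iterdir() if p.is_dir()):
        row = {"task": run_dir.name}
        metrics_file = run_dir / "metrics.json"
        judgement_file = run_dir / "judgement.json"  # hypothetical file name
        contaminated = judgement_file.exists() and json.loads(
            judgement_file.read_text()
        ).get("contaminated", False)
        if contaminated or not metrics_file.exists():
            # Contaminated or errored cell: fall back to the hardcoded
            # zero-shot baseline from baselines.json instead of recomputing it.
            row.update(baselines["zeroshot"].get(run_dir.name, {}))
            row["is_fallback"] = True
        else:
            row.update(json.loads(metrics_file.read_text()))
        time_file = run_dir / "time_taken.txt"
        if time_file.exists():
            row["time_taken"] = float(time_file.read_text().strip())
        rows.append(row)
    return rows
```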
### Judge-rerun tool (`src/disallowed_usage_judge/`)

For recovering runs where the post-eval contamination/disallowed-model judge step crashed mid-run (e.g. API quota hit, `Quota exceeded` error).

- `src/disallowed_usage_judge/run_judge.sh` - standalone judge runner; mirrors the inline judge in `run_task.sh` exactly (single GPT-5.1-Codex via `CODEX_API_KEY`), but operates on an existing result dir and writes outputs with a `_rerun` suffix so originals are preserved.
- `src/disallowed_usage_judge/rerun_judge/` - orchestration: `list_results.py` (with an `--only-missing-judgement` filter for the case this was built for; see the sketch after this list), `rerun_single.sh`, `commit_rerun_judge.sh` (HTCondor batch with `--method`/`--benchmark`/`--skip-existing`/`--latest-only`/`--dry-run`), `rerun_judge.sub`, `aggregate_rerun_results.py`.
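Roughly how the `--only-missing-judgement` selection works (a sketch under an assumed file layout; the `*judgement*` glob is a guess):

```python
import argparse
from pathlib import Path

def list_results(results_root: Path, only_missing_judgement: bool) -> list[Path]:
    """Enumerate result dirs, optionally keeping only those the judge never finished."""
    selected = []
    for result_dir in sorted(p for p in results_root.iterdir() if p.is_dir()):
        # A judge that crashed mid-run (e.g. on "Quota exceeded") leaves the
        # result dir without any judgement output; those are the dirs to rerun.
        has_judgement = any(result_dir.glob("*judgement*"))
        if only_missing_judgement and has_judgement:
            continue
        selected.append(result_dir)
    return selected

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("results_root", type=Path)
    parser.add_argument("--only-missing-judgement", action="store_true")
    args = parser.parse_args()
    for d in list_results(args.results_root, args.only_missing_judgement):
        print(d)
```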
### New agents (`agents/`)

`codex_xhigh`, `codex_xhigh_reprompt`, `claude_reprompt`.
### New containers (`containers/`)

Definition files for `gpt_5_2_codex`.
### dev_utils fixes

- `terminated_finder.py` - was reading every `error.log` in full and regexing the entire content for SIGTERM/SIGKILL markers; now reads a 4 KB head slice and uses a tight regex (`run_task\.sh: line \d+: \d+ Killed`) that won't false-positive on tokenizer vocab tokens. ~10x faster, and fixes a false positive. (Sketch below.)
- `limit_hit_list.py` - added two error patterns: `Reconnecting... 5/5` (Codex CLI exhausted its stream-reconnect retries) and `Quota exceeded. Check your plan` (the human-readable form of `insufficient_quota` that Codex emits in `turn.failed`). (Pattern sketch below.)
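The gist of the new `terminated_finder.py` check, as a sketch; the 4 KB slice and the regex come from the description above, the helper name is made up:

```python
import re
from pathlib import Path

# Anchored to the shell's own kill report ("run_task.sh: line 42: 1234 Killed"),
# so raw vocab/token dumps elsewhere in the log can no longer false-positive.
KILLED_RE = re.compile(r"run_task\.sh: line \d+: \d+ Killed")

def was_killed(error_log: Path, head_bytes: int = 4096) -> bool:
    """Check only the first 4 KB of error.log instead of reading the whole file."""
    with error_log.open("rb") as f:
        head = f.read(head_bytes).decode("utf-8", errors="replace")
    return KILLED_RE.search(head) is not None
```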
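And the two new `limit_hit_list.py` patterns, sketched (the scan helper is illustrative):

```python
import re

LIMIT_PATTERNS = [
    # Codex CLI gave up after exhausting its 5 stream-reconnect retries.
    re.compile(r"Reconnecting\.\.\. 5/5"),
    # Human-readable form of insufficient_quota, emitted in turn.failed.
    re.compile(r"Quota exceeded\. Check your plan"),
]

def hit_limit(log_text: str) -> bool:
    """True if the log shows a rate/quota limit was hit."""
    return any(p.search(log_text) for p in LIMIT_PATTERNS)
```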
### Other scripts

- `scripts/rerun_eval_n_times.sh` + `aggregate_metrics_runs.py` - re-evaluate a finished run N times to capture decoding-noise mean/std/stderr. (Aggregation sketch below.)
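A sketch of the mean/std/stderr aggregation; the per-run file names and metric layout are assumptions:

```python
import json
import math
import statistics
from pathlib import Path

def aggregate_runs(run_metrics: list[dict]) -> dict:
    """Per-metric mean/std/stderr across N re-evaluations of one finished run."""
    out = {}
    for key in run_metrics[0]:  # assumes every metric value is numeric
        values = [m[key] for m in run_metrics]
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        out[key] = {
            "mean": statistics.mean(values),
            "std": std,
            "stderr": std / math.sqrt(len(values)),
        }
    return out

# Assumed layout: rerun_eval_n_times.sh leaves metrics_run_1.json ... _N.json.
runs = [json.loads(p.read_text()) for p in sorted(Path(".").glob("metrics_run_*.json"))]
if runs:
    print(json.dumps(aggregate_runs(runs), indent=2))
```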