finish task 4 canary analysis#4
Conversation
yuehu99
commented
Apr 24, 2026
- per_bucket_agree_rates.json
- evidnce_token_overleap.json: overall for all sentences; by_topic is grouped by topic
- summary_ft_canary.md AND results)dataset_canary
There was a problem hiding this comment.
@yuehu99 add code . github is for code
There was a problem hiding this comment.
Pull request overview
This PR adds “task 4” canary analysis artifacts: scripts to compute agreement/token-overlap statistics and a vLLM evaluation runner, plus the generated evaluation summaries/metrics/logs for the dataset canary split.
Changes:
- Add scripts to flatten
dataset_final.jsonl, compute per-topic agree rates, and compute evidence/answer token overlap. - Add a bash runner to generate/score vLLM outputs for 3 models (base vs finetuned) and generate a summary table.
- Add generated result artifacts (markdown summaries, metrics JSON, and run logs).
Reviewed changes
Copilot reviewed 12 out of 26 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| canary_analysis/summary_ft_canary.md | Adds a markdown summary table of evaluation metrics. |
| canary_analysis/run_dataset_final_vllm.sh | Adds a runner script for vLLM generation + scoring + summary creation. |
| canary_analysis/results_dataset_canary/summary.md | Adds a markdown summary table for dataset_final evaluation results. |
| canary_analysis/results_dataset_canary/qwen2_5_14b/ft_chemqa.metrics.json | Stores scored metrics for Qwen finetuned. |
| canary_analysis/results_dataset_canary/qwen2_5_14b/ft_chemqa.log | Stores vLLM run log for Qwen finetuned. |
| canary_analysis/results_dataset_canary/qwen2_5_14b/base_chemqa.metrics.json | Stores scored metrics for Qwen base. |
| canary_analysis/results_dataset_canary/qwen2_5_14b/base_chemqa.log | Stores vLLM run log for Qwen base. |
| canary_analysis/results_dataset_canary/llama3_1_8b/ft_chemqa.metrics.json | Stores scored metrics for Llama finetuned. |
| canary_analysis/results_dataset_canary/llama3_1_8b/ft_chemqa.log | Stores vLLM run log for Llama finetuned. |
| canary_analysis/results_dataset_canary/llama3_1_8b/base_chemqa.metrics.json | Stores scored metrics for Llama base. |
| canary_analysis/results_dataset_canary/llama3_1_8b/base_chemqa.log | Stores vLLM run log for Llama base. |
| canary_analysis/results_dataset_canary/gemma3_12b/ft_chemqa.metrics.json | Stores scored metrics for Gemma finetuned. |
| canary_analysis/results_dataset_canary/gemma3_12b/ft_chemqa.log | Stores vLLM run log for Gemma finetuned. |
| canary_analysis/results_dataset_canary/gemma3_12b/base_chemqa.metrics.json | Stores scored metrics for Gemma base. |
| canary_analysis/results_dataset_canary/gemma3_12b/base_chemqa.log | Stores vLLM run log for Gemma base. |
| canary_analysis/prep_dataset_final.py | Adds a script to flatten dataset_final QA pairs into the eval schema. |
| canary_analysis/per_bucket_agree_rates.py | Adds per-topic “agree” rate computation and JSON output generation. |
| canary_analysis/evidence_token_overlap.py | Adds evidence/answer token overlap analysis and JSON output generation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| os.makedirs("canary_analysiscanary_analysis", exist_ok=True) | ||
|
|
||
| with open("canary_analysiscanary_analysis/per_bucket_agree_rates.json", "w") as f: | ||
| json.dump(sorted_output, f, indent=2) | ||
|
|
||
| print("Saved to canary_analysiscanary_analysis/per_bucket_agree_rates.json") No newline at end of file |
There was a problem hiding this comment.
The os.makedirs/output path is misspelled as canary_analysiscanary_analysis/... (missing a path separator). This will write outputs into an unintended directory. Use canary_analysis/per_bucket_agree_rates.json (or build the path with os.path.join).
| os.makedirs("canary_analysiscanary_analysis", exist_ok=True) | |
| with open("canary_analysiscanary_analysis/per_bucket_agree_rates.json", "w") as f: | |
| json.dump(sorted_output, f, indent=2) | |
| print("Saved to canary_analysiscanary_analysis/per_bucket_agree_rates.json") | |
| output_dir = "canary_analysis" | |
| output_path = os.path.join(output_dir, "per_bucket_agree_rates.json") | |
| os.makedirs(output_dir, exist_ok=True) | |
| with open(output_path, "w") as f: | |
| json.dump(sorted_output, f, indent=2) | |
| print(f"Saved to {output_path}") |
| output[topic] = { | ||
| "main": main, | ||
| "canary": canary, | ||
| "delta": round(main["agree_rate"] - canary["agree_rate"], 4), | ||
| "difference": round(main["agree_rate"] - canary["agree_rate"], 4), | ||
| "delta_pp": round((main["agree_rate"] - canary["agree_rate"]) * 100, 2) |
There was a problem hiding this comment.
delta and difference are computed identically, which is redundant and can confuse downstream consumers. Keep just one field (or define difference differently) and document the intended semantics.
| output[topic] = { | |
| "main": main, | |
| "canary": canary, | |
| "delta": round(main["agree_rate"] - canary["agree_rate"], 4), | |
| "difference": round(main["agree_rate"] - canary["agree_rate"], 4), | |
| "delta_pp": round((main["agree_rate"] - canary["agree_rate"]) * 100, 2) | |
| # Signed difference in agree rate: main - canary. | |
| delta = main["agree_rate"] - canary["agree_rate"] | |
| output[topic] = { | |
| "main": main, | |
| "canary": canary, | |
| "delta": round(delta, 4), | |
| "delta_pp": round(delta * 100, 2) |
| # === paths === | ||
| main_path = "dataset_gold.jsonl" | ||
| canary_path = "dataset_final.jsonl" | ||
|
|
||
| main_stats = compute_agree_rates(main_path) | ||
| canary_stats = compute_agree_rates(canary_path) | ||
|
|
There was a problem hiding this comment.
This script hardcodes input filenames (dataset_gold.jsonl / dataset_final.jsonl) and executes work at import-time. Consider moving the logic under a main() with if __name__ == "__main__" and adding CLI args for the input/output paths, so it’s reproducible from different working directories.
| main_path = "dataset_gold.jsonl" | ||
| canary_path = "dataset_final.jsonl" | ||
| output_path = "canary_analysis/evidence_token_overlap.json" | ||
|
|
There was a problem hiding this comment.
main_path/canary_path are hardcoded and there are no CLI flags, which makes the script dependent on the current working directory and local filenames. Consider adding argparse options for input/output paths to make the analysis reproducible.
| def parse_args(): | ||
| p = argparse.ArgumentParser() | ||
| p.add_argument("--input_file", default="/data/yue/Luis_data/dataset_final.jsonl") | ||
| p.add_argument("--output_file", | ||
| default="/data/yue/Luis_data/ChemQA/data/processed/dataset_final_flat.jsonl") |
There was a problem hiding this comment.
The default --input_file / --output_file point to absolute paths under /data/..., which makes the script non-portable when run from other machines or CI. Prefer defaults relative to the repo (or require these args explicitly).
| def parse_args(): | |
| p = argparse.ArgumentParser() | |
| p.add_argument("--input_file", default="/data/yue/Luis_data/dataset_final.jsonl") | |
| p.add_argument("--output_file", | |
| default="/data/yue/Luis_data/ChemQA/data/processed/dataset_final_flat.jsonl") | |
| def _repo_root() -> Path: | |
| return Path(__file__).resolve().parent.parent | |
| def parse_args(): | |
| repo_root = _repo_root() | |
| p = argparse.ArgumentParser() | |
| p.add_argument("--input_file", default=str(repo_root / "data" / "dataset_final.jsonl")) | |
| p.add_argument("--output_file", | |
| default=str(repo_root / "data" / "processed" / "dataset_final_flat.jsonl")) |
| | model | variant | n | CIDEr | BLEU-1 | BLEU-4 | ROUGE-L | | ||
| |---|---|---|---|---|---|---| | ||
| | gemma3_12b | base | 1509 | 0.0078 | 0.1743 | 0.0195 | 0.1404 | | ||
| | gemma3_12b | finetuned | 1509 | 0.3017 | 0.3930 | 0.1218 | 0.2855 | | ||
| | llama3_1_8b | base | 1509 | 0.0072 | 0.1739 | 0.0244 | 0.1464 | | ||
| | llama3_1_8b | finetuned | 1509 | 0.2753 | 0.3806 | 0.1133 | 0.2771 | | ||
| | qwen2_5_14b | base | 1509 | 0.0021 | 0.1690 | 0.0220 | 0.1555 | | ||
| | qwen2_5_14b | finetuned | 1509 | 0.3143 | 0.3956 | 0.1247 | 0.2863 | No newline at end of file |
There was a problem hiding this comment.
The Markdown table is prefixed with || (double pipe) on every row, which prevents GitHub from rendering it as a table. Use a single leading | per row (and remove the extra leading space) so the table renders correctly.
| | model | variant | n | CIDEr | BLEU-1 | BLEU-4 | ROUGE-L | | ||
| |---|---|---|---|---|---|---| | ||
| | gemma3_12b | base | 1509 | 0.0078 | 0.1743 | 0.0195 | 0.1404 | | ||
| | gemma3_12b | finetuned | 1509 | 0.3017 | 0.3930 | 0.1218 | 0.2855 | | ||
| | llama3_1_8b | base | 1509 | 0.0072 | 0.1739 | 0.0244 | 0.1464 | | ||
| | llama3_1_8b | finetuned | 1509 | 0.2753 | 0.3806 | 0.1133 | 0.2771 | | ||
| | qwen2_5_14b | base | 1509 | 0.0021 | 0.1690 | 0.0220 | 0.1555 | | ||
| | qwen2_5_14b | finetuned | 1509 | 0.3143 | 0.3956 | 0.1247 | 0.2863 | |
There was a problem hiding this comment.
The results table rows start with || (double pipe), so the Markdown table won't render correctly. Use a single leading | for each row.
| # 1) Flatten qa_pairs -> {prompt, response} | ||
| python3 $REPO/eval/scripts/prep_dataset_final.py \ | ||
| --input_file "$RAW" --output_file "$TEST" |
There was a problem hiding this comment.
This script invokes $REPO/eval/scripts/prep_dataset_final.py, but this repository does not have an eval/scripts directory (the closest is finetune/eval/scripts/), and the new prep script is located at canary_analysis/prep_dataset_final.py. Update the path (or compute it relative to this script) so the command can actually run in this repo.
There was a problem hiding this comment.
@luistafoi if results are fine then comment "good to merge"