finish task 4 canary analysis by yuehu99 · Pull Request #4 · vinash85/Chem2TextQA

yuehu99 · 2026-04-24T22:04:31Z

per_bucket_agree_rates.json
evidnce_token_overleap.json: overall for all sentences; by_topic is grouped by topic
summary_ft_canary.md AND results)dataset_canary

kushalviit · 2026-04-24T22:11:08Z

@yuehu99 add code . github is for code

Copilot

Pull request overview

This PR adds “task 4” canary analysis artifacts: scripts to compute agreement/token-overlap statistics and a vLLM evaluation runner, plus the generated evaluation summaries/metrics/logs for the dataset canary split.

Changes:

Add scripts to flatten dataset_final.jsonl, compute per-topic agree rates, and compute evidence/answer token overlap.
Add a bash runner to generate/score vLLM outputs for 3 models (base vs finetuned) and generate a summary table.
Add generated result artifacts (markdown summaries, metrics JSON, and run logs).

Reviewed changes

Copilot reviewed 12 out of 26 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
canary_analysis/summary_ft_canary.md	Adds a markdown summary table of evaluation metrics.
canary_analysis/run_dataset_final_vllm.sh	Adds a runner script for vLLM generation + scoring + summary creation.
canary_analysis/results_dataset_canary/summary.md	Adds a markdown summary table for dataset_final evaluation results.
canary_analysis/results_dataset_canary/qwen2_5_14b/ft_chemqa.metrics.json	Stores scored metrics for Qwen finetuned.
canary_analysis/results_dataset_canary/qwen2_5_14b/ft_chemqa.log	Stores vLLM run log for Qwen finetuned.
canary_analysis/results_dataset_canary/qwen2_5_14b/base_chemqa.metrics.json	Stores scored metrics for Qwen base.
canary_analysis/results_dataset_canary/qwen2_5_14b/base_chemqa.log	Stores vLLM run log for Qwen base.
canary_analysis/results_dataset_canary/llama3_1_8b/ft_chemqa.metrics.json	Stores scored metrics for Llama finetuned.
canary_analysis/results_dataset_canary/llama3_1_8b/ft_chemqa.log	Stores vLLM run log for Llama finetuned.
canary_analysis/results_dataset_canary/llama3_1_8b/base_chemqa.metrics.json	Stores scored metrics for Llama base.
canary_analysis/results_dataset_canary/llama3_1_8b/base_chemqa.log	Stores vLLM run log for Llama base.
canary_analysis/results_dataset_canary/gemma3_12b/ft_chemqa.metrics.json	Stores scored metrics for Gemma finetuned.
canary_analysis/results_dataset_canary/gemma3_12b/ft_chemqa.log	Stores vLLM run log for Gemma finetuned.
canary_analysis/results_dataset_canary/gemma3_12b/base_chemqa.metrics.json	Stores scored metrics for Gemma base.
canary_analysis/results_dataset_canary/gemma3_12b/base_chemqa.log	Stores vLLM run log for Gemma base.
canary_analysis/prep_dataset_final.py	Adds a script to flatten dataset_final QA pairs into the eval schema.
canary_analysis/per_bucket_agree_rates.py	Adds per-topic “agree” rate computation and JSON output generation.
canary_analysis/evidence_token_overlap.py	Adds evidence/answer token overlap analysis and JSON output generation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-24T22:24:50Z

+os.makedirs("canary_analysiscanary_analysis", exist_ok=True)
+
+with open("canary_analysiscanary_analysis/per_bucket_agree_rates.json", "w") as f:
+    json.dump(sorted_output, f, indent=2)
+
+print("Saved to canary_analysiscanary_analysis/per_bucket_agree_rates.json")


The os.makedirs/output path is misspelled as canary_analysiscanary_analysis/... (missing a path separator). This will write outputs into an unintended directory. Use canary_analysis/per_bucket_agree_rates.json (or build the path with os.path.join).

Suggested change

os.makedirs("canary_analysiscanary_analysis", exist_ok=True)

with open("canary_analysiscanary_analysis/per_bucket_agree_rates.json", "w") as f:

json.dump(sorted_output, f, indent=2)

print("Saved to canary_analysiscanary_analysis/per_bucket_agree_rates.json")

output_dir = "canary_analysis"

output_path = os.path.join(output_dir, "per_bucket_agree_rates.json")

os.makedirs(output_dir, exist_ok=True)

with open(output_path, "w") as f:

json.dump(sorted_output, f, indent=2)

print(f"Saved to {output_path}")

Copilot · 2026-04-24T22:24:51Z

+    output[topic] = {
+        "main": main,
+        "canary": canary,
+        "delta": round(main["agree_rate"] - canary["agree_rate"], 4),
+        "difference": round(main["agree_rate"] - canary["agree_rate"], 4),
+        "delta_pp": round((main["agree_rate"] - canary["agree_rate"]) * 100, 2)


delta and difference are computed identically, which is redundant and can confuse downstream consumers. Keep just one field (or define difference differently) and document the intended semantics.

Suggested change

output[topic] = {

"main": main,

"canary": canary,

"delta": round(main["agree_rate"] - canary["agree_rate"], 4),

"difference": round(main["agree_rate"] - canary["agree_rate"], 4),

"delta_pp": round((main["agree_rate"] - canary["agree_rate"]) * 100, 2)

# Signed difference in agree rate: main - canary.

delta = main["agree_rate"] - canary["agree_rate"]

output[topic] = {

"main": main,

"canary": canary,

"delta": round(delta, 4),

"delta_pp": round(delta * 100, 2)

Copilot · 2026-04-24T22:24:51Z

+# === paths ===
+main_path = "dataset_gold.jsonl"
+canary_path = "dataset_final.jsonl"
+
+main_stats = compute_agree_rates(main_path)
+canary_stats = compute_agree_rates(canary_path)
+


This script hardcodes input filenames (dataset_gold.jsonl / dataset_final.jsonl) and executes work at import-time. Consider moving the logic under a main() with if __name__ == "__main__" and adding CLI args for the input/output paths, so it’s reproducible from different working directories.

Copilot · 2026-04-24T22:24:51Z

+    main_path = "dataset_gold.jsonl"
+    canary_path = "dataset_final.jsonl"
+    output_path = "canary_analysis/evidence_token_overlap.json"
+


main_path/canary_path are hardcoded and there are no CLI flags, which makes the script dependent on the current working directory and local filenames. Consider adding argparse options for input/output paths to make the analysis reproducible.

Copilot · 2026-04-24T22:24:52Z

+def parse_args():
+    p = argparse.ArgumentParser()
+    p.add_argument("--input_file", default="/data/yue/Luis_data/dataset_final.jsonl")
+    p.add_argument("--output_file",
+                   default="/data/yue/Luis_data/ChemQA/data/processed/dataset_final_flat.jsonl")


The default --input_file / --output_file point to absolute paths under /data/..., which makes the script non-portable when run from other machines or CI. Prefer defaults relative to the repo (or require these args explicitly).

Suggested change

def parse_args():

p = argparse.ArgumentParser()

p.add_argument("--input_file", default="/data/yue/Luis_data/dataset_final.jsonl")

p.add_argument("--output_file",

default="/data/yue/Luis_data/ChemQA/data/processed/dataset_final_flat.jsonl")

def _repo_root() -> Path:

return Path(__file__).resolve().parent.parent

def parse_args():

repo_root = _repo_root()

p = argparse.ArgumentParser()

p.add_argument("--input_file", default=str(repo_root / "data" / "dataset_final.jsonl"))

p.add_argument("--output_file",

default=str(repo_root / "data" / "processed" / "dataset_final_flat.jsonl"))

Copilot · 2026-04-24T22:24:52Z

+| model | variant | n | CIDEr | BLEU-1 | BLEU-4 | ROUGE-L |
+|---|---|---|---|---|---|---|
+| gemma3_12b | base | 1509 | 0.0078 | 0.1743 | 0.0195 | 0.1404 |
+| gemma3_12b | finetuned | 1509 | 0.3017 | 0.3930 | 0.1218 | 0.2855 |
+| llama3_1_8b | base | 1509 | 0.0072 | 0.1739 | 0.0244 | 0.1464 |
+| llama3_1_8b | finetuned | 1509 | 0.2753 | 0.3806 | 0.1133 | 0.2771 |
+| qwen2_5_14b | base | 1509 | 0.0021 | 0.1690 | 0.0220 | 0.1555 |
+| qwen2_5_14b | finetuned | 1509 | 0.3143 | 0.3956 | 0.1247 | 0.2863 |


The Markdown table is prefixed with || (double pipe) on every row, which prevents GitHub from rendering it as a table. Use a single leading | per row (and remove the extra leading space) so the table renders correctly.

Copilot · 2026-04-24T22:24:52Z

+| model | variant | n | CIDEr | BLEU-1 | BLEU-4 | ROUGE-L |
+|---|---|---|---|---|---|---|
+| gemma3_12b | base | 1509 | 0.0078 | 0.1743 | 0.0195 | 0.1404 |
+| gemma3_12b | finetuned | 1509 | 0.3017 | 0.3930 | 0.1218 | 0.2855 |
+| llama3_1_8b | base | 1509 | 0.0072 | 0.1739 | 0.0244 | 0.1464 |
+| llama3_1_8b | finetuned | 1509 | 0.2753 | 0.3806 | 0.1133 | 0.2771 |
+| qwen2_5_14b | base | 1509 | 0.0021 | 0.1690 | 0.0220 | 0.1555 |
+| qwen2_5_14b | finetuned | 1509 | 0.3143 | 0.3956 | 0.1247 | 0.2863 |


The results table rows start with || (double pipe), so the Markdown table won't render correctly. Use a single leading | for each row.

Copilot · 2026-04-24T22:24:52Z

+# 1) Flatten qa_pairs -> {prompt, response}
+python3 $REPO/eval/scripts/prep_dataset_final.py \
+  --input_file "$RAW" --output_file "$TEST"


This script invokes $REPO/eval/scripts/prep_dataset_final.py, but this repository does not have an eval/scripts directory (the closest is finetune/eval/scripts/), and the new prep script is located at canary_analysis/prep_dataset_final.py. Update the path (or compute it relative to this script) so the command can actually run in this repo.

kushalviit · 2026-04-25T00:44:49Z

@luistafoi if results are fine then comment "good to merge"

finish task 4 canary analysis

0948e82

yuehu99 requested a review from luistafoi April 24, 2026 22:04

kushalviit reviewed Apr 24, 2026

View reviewed changes

code of task 4 canary analysis

e7d70bd

Macaulay001 requested a review from Copilot April 24, 2026 22:20

Copilot started reviewing on behalf of Macaulay001 April 24, 2026 22:21 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

kushalviit reviewed Apr 25, 2026

View reviewed changes

Comment thread canary_analysis/per_bucket_agree_rates.py

kushalviit Apr 25, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luistafoi if results are fine then comment "good to merge"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

finish task 4 canary analysis#4

finish task 4 canary analysis#4
yuehu99 wants to merge 2 commits into
mainfrom
task4_canary_analysis

yuehu99 commented Apr 24, 2026

Uh oh!

kushalviit Apr 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

kushalviit Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yuehu99 commented Apr 24, 2026

Uh oh!

kushalviit Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

kushalviit Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants