Skip to content

finish task 4 canary analysis#4

Open
yuehu99 wants to merge 2 commits into
mainfrom
task4_canary_analysis
Open

finish task 4 canary analysis#4
yuehu99 wants to merge 2 commits into
mainfrom
task4_canary_analysis

Conversation

@yuehu99

@yuehu99 yuehu99 commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator
  1. per_bucket_agree_rates.json
  2. evidnce_token_overleap.json: overall for all sentences; by_topic is grouped by topic
  3. summary_ft_canary.md AND results)dataset_canary

@yuehu99 yuehu99 requested a review from luistafoi April 24, 2026 22:04

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuehu99 add code . github is for code

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds “task 4” canary analysis artifacts: scripts to compute agreement/token-overlap statistics and a vLLM evaluation runner, plus the generated evaluation summaries/metrics/logs for the dataset canary split.

Changes:

  • Add scripts to flatten dataset_final.jsonl, compute per-topic agree rates, and compute evidence/answer token overlap.
  • Add a bash runner to generate/score vLLM outputs for 3 models (base vs finetuned) and generate a summary table.
  • Add generated result artifacts (markdown summaries, metrics JSON, and run logs).

Reviewed changes

Copilot reviewed 12 out of 26 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
canary_analysis/summary_ft_canary.md Adds a markdown summary table of evaluation metrics.
canary_analysis/run_dataset_final_vllm.sh Adds a runner script for vLLM generation + scoring + summary creation.
canary_analysis/results_dataset_canary/summary.md Adds a markdown summary table for dataset_final evaluation results.
canary_analysis/results_dataset_canary/qwen2_5_14b/ft_chemqa.metrics.json Stores scored metrics for Qwen finetuned.
canary_analysis/results_dataset_canary/qwen2_5_14b/ft_chemqa.log Stores vLLM run log for Qwen finetuned.
canary_analysis/results_dataset_canary/qwen2_5_14b/base_chemqa.metrics.json Stores scored metrics for Qwen base.
canary_analysis/results_dataset_canary/qwen2_5_14b/base_chemqa.log Stores vLLM run log for Qwen base.
canary_analysis/results_dataset_canary/llama3_1_8b/ft_chemqa.metrics.json Stores scored metrics for Llama finetuned.
canary_analysis/results_dataset_canary/llama3_1_8b/ft_chemqa.log Stores vLLM run log for Llama finetuned.
canary_analysis/results_dataset_canary/llama3_1_8b/base_chemqa.metrics.json Stores scored metrics for Llama base.
canary_analysis/results_dataset_canary/llama3_1_8b/base_chemqa.log Stores vLLM run log for Llama base.
canary_analysis/results_dataset_canary/gemma3_12b/ft_chemqa.metrics.json Stores scored metrics for Gemma finetuned.
canary_analysis/results_dataset_canary/gemma3_12b/ft_chemqa.log Stores vLLM run log for Gemma finetuned.
canary_analysis/results_dataset_canary/gemma3_12b/base_chemqa.metrics.json Stores scored metrics for Gemma base.
canary_analysis/results_dataset_canary/gemma3_12b/base_chemqa.log Stores vLLM run log for Gemma base.
canary_analysis/prep_dataset_final.py Adds a script to flatten dataset_final QA pairs into the eval schema.
canary_analysis/per_bucket_agree_rates.py Adds per-topic “agree” rate computation and JSON output generation.
canary_analysis/evidence_token_overlap.py Adds evidence/answer token overlap analysis and JSON output generation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +68 to +73
os.makedirs("canary_analysiscanary_analysis", exist_ok=True)

with open("canary_analysiscanary_analysis/per_bucket_agree_rates.json", "w") as f:
json.dump(sorted_output, f, indent=2)

print("Saved to canary_analysiscanary_analysis/per_bucket_agree_rates.json") No newline at end of file

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The os.makedirs/output path is misspelled as canary_analysiscanary_analysis/... (missing a path separator). This will write outputs into an unintended directory. Use canary_analysis/per_bucket_agree_rates.json (or build the path with os.path.join).

Suggested change
os.makedirs("canary_analysiscanary_analysis", exist_ok=True)
with open("canary_analysiscanary_analysis/per_bucket_agree_rates.json", "w") as f:
json.dump(sorted_output, f, indent=2)
print("Saved to canary_analysiscanary_analysis/per_bucket_agree_rates.json")
output_dir = "canary_analysis"
output_path = os.path.join(output_dir, "per_bucket_agree_rates.json")
os.makedirs(output_dir, exist_ok=True)
with open(output_path, "w") as f:
json.dump(sorted_output, f, indent=2)
print(f"Saved to {output_path}")

Copilot uses AI. Check for mistakes.
Comment on lines +55 to +60
output[topic] = {
"main": main,
"canary": canary,
"delta": round(main["agree_rate"] - canary["agree_rate"], 4),
"difference": round(main["agree_rate"] - canary["agree_rate"], 4),
"delta_pp": round((main["agree_rate"] - canary["agree_rate"]) * 100, 2)

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delta and difference are computed identically, which is redundant and can confuse downstream consumers. Keep just one field (or define difference differently) and document the intended semantics.

Suggested change
output[topic] = {
"main": main,
"canary": canary,
"delta": round(main["agree_rate"] - canary["agree_rate"], 4),
"difference": round(main["agree_rate"] - canary["agree_rate"], 4),
"delta_pp": round((main["agree_rate"] - canary["agree_rate"]) * 100, 2)
# Signed difference in agree rate: main - canary.
delta = main["agree_rate"] - canary["agree_rate"]
output[topic] = {
"main": main,
"canary": canary,
"delta": round(delta, 4),
"delta_pp": round(delta * 100, 2)

Copilot uses AI. Check for mistakes.
Comment on lines +39 to +45
# === paths ===
main_path = "dataset_gold.jsonl"
canary_path = "dataset_final.jsonl"

main_stats = compute_agree_rates(main_path)
canary_stats = compute_agree_rates(canary_path)

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script hardcodes input filenames (dataset_gold.jsonl / dataset_final.jsonl) and executes work at import-time. Consider moving the logic under a main() with if __name__ == "__main__" and adding CLI args for the input/output paths, so it’s reproducible from different working directories.

Copilot uses AI. Check for mistakes.
Comment on lines +67 to +70
main_path = "dataset_gold.jsonl"
canary_path = "dataset_final.jsonl"
output_path = "canary_analysis/evidence_token_overlap.json"

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main_path/canary_path are hardcoded and there are no CLI flags, which makes the script dependent on the current working directory and local filenames. Consider adding argparse options for input/output paths to make the analysis reproducible.

Copilot uses AI. Check for mistakes.
Comment on lines +13 to +17
def parse_args():
p = argparse.ArgumentParser()
p.add_argument("--input_file", default="/data/yue/Luis_data/dataset_final.jsonl")
p.add_argument("--output_file",
default="/data/yue/Luis_data/ChemQA/data/processed/dataset_final_flat.jsonl")

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default --input_file / --output_file point to absolute paths under /data/..., which makes the script non-portable when run from other machines or CI. Prefer defaults relative to the repo (or require these args explicitly).

Suggested change
def parse_args():
p = argparse.ArgumentParser()
p.add_argument("--input_file", default="/data/yue/Luis_data/dataset_final.jsonl")
p.add_argument("--output_file",
default="/data/yue/Luis_data/ChemQA/data/processed/dataset_final_flat.jsonl")
def _repo_root() -> Path:
return Path(__file__).resolve().parent.parent
def parse_args():
repo_root = _repo_root()
p = argparse.ArgumentParser()
p.add_argument("--input_file", default=str(repo_root / "data" / "dataset_final.jsonl"))
p.add_argument("--output_file",
default=str(repo_root / "data" / "processed" / "dataset_final_flat.jsonl"))

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +8
| model | variant | n | CIDEr | BLEU-1 | BLEU-4 | ROUGE-L |
|---|---|---|---|---|---|---|
| gemma3_12b | base | 1509 | 0.0078 | 0.1743 | 0.0195 | 0.1404 |
| gemma3_12b | finetuned | 1509 | 0.3017 | 0.3930 | 0.1218 | 0.2855 |
| llama3_1_8b | base | 1509 | 0.0072 | 0.1739 | 0.0244 | 0.1464 |
| llama3_1_8b | finetuned | 1509 | 0.2753 | 0.3806 | 0.1133 | 0.2771 |
| qwen2_5_14b | base | 1509 | 0.0021 | 0.1690 | 0.0220 | 0.1555 |
| qwen2_5_14b | finetuned | 1509 | 0.3143 | 0.3956 | 0.1247 | 0.2863 | No newline at end of file

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Markdown table is prefixed with || (double pipe) on every row, which prevents GitHub from rendering it as a table. Use a single leading | per row (and remove the extra leading space) so the table renders correctly.

Copilot uses AI. Check for mistakes.
Comment on lines +5 to +12
| model | variant | n | CIDEr | BLEU-1 | BLEU-4 | ROUGE-L |
|---|---|---|---|---|---|---|
| gemma3_12b | base | 1509 | 0.0078 | 0.1743 | 0.0195 | 0.1404 |
| gemma3_12b | finetuned | 1509 | 0.3017 | 0.3930 | 0.1218 | 0.2855 |
| llama3_1_8b | base | 1509 | 0.0072 | 0.1739 | 0.0244 | 0.1464 |
| llama3_1_8b | finetuned | 1509 | 0.2753 | 0.3806 | 0.1133 | 0.2771 |
| qwen2_5_14b | base | 1509 | 0.0021 | 0.1690 | 0.0220 | 0.1555 |
| qwen2_5_14b | finetuned | 1509 | 0.3143 | 0.3956 | 0.1247 | 0.2863 |

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results table rows start with || (double pipe), so the Markdown table won't render correctly. Use a single leading | for each row.

Copilot uses AI. Check for mistakes.
Comment on lines +25 to +27
# 1) Flatten qa_pairs -> {prompt, response}
python3 $REPO/eval/scripts/prep_dataset_final.py \
--input_file "$RAW" --output_file "$TEST"

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script invokes $REPO/eval/scripts/prep_dataset_final.py, but this repository does not have an eval/scripts directory (the closest is finetune/eval/scripts/), and the new prep script is located at canary_analysis/prep_dataset_final.py. Update the path (or compute it relative to this script) so the command can actually run in this repo.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luistafoi if results are fine then comment "good to merge"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants