
Task 12: Related Work benchmark critique (C6) #5

Open: spfded wants to merge 1 commit into main from task-12-related-work



spfded (Collaborator) commented Apr 24, 2026

Adds task-12-related-work/RELATED_WORK.md with a 9-dimension comparison table across six related datasets (ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, SciQA) and a 2-page critique positioning Chem2TextQA.

Addresses reviewer comment C6 (track demands explicit positioning).

Compared dimensions: size, Q&A format, label provenance, grounding strategy, SMILES-as-first-class input, cross-validation signal, held-out contamination test, scaffold split, availability.

Positioning summary. No prior dataset combines: SMILES-first input + open-ended QA + literature-evidence grounding + two-model cross-validation (Gemini → Kimi → Gemma judge) + held-out canary compounds for contamination testing. Chem2TextQA sits between ChemBench (small, expert, benchmark) and SMolInstruct (large, closed-form, structural) — training-scale for functional reasoning with source attribution.
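
The cross-validation flow named above (one model generates, a second re-answers blind, a third judges) can be sketched as a filter that keeps only the pairs the judge marks as agreeing. This is a minimal sketch, not the project's pipeline: the `QARecord` fields, the `judge` callable, and the toy string-match judge are hypothetical stand-ins for the real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QARecord:
    # Hypothetical record layout; the real pipeline's schema may differ.
    smiles: str
    question: str
    generated_answer: str  # from the generating model
    blind_answer: str      # from the second model, which never saw the first answer


def keep_agreeing(records: Iterable[QARecord],
                  judge: Callable[[QARecord], bool]) -> List[QARecord]:
    """Return only the records whose two answers the judge considers equivalent."""
    return [r for r in records if judge(r)]


def toy_judge(record: QARecord) -> bool:
    # Stand-in for the judge model: case-insensitive exact match.
    return (record.generated_answer.strip().lower()
            == record.blind_answer.strip().lower())


records = [
    QARecord("CCO", "What is this compound used as?", "Solvent", "solvent"),
    QARecord("CC(=O)O", "What is this compound used as?", "Preservative", "Fuel"),
]
agree_subset = keep_agreeing(records, toy_judge)
```

Keeping the judge as an injected callable means the same filter could run against a real judge model or a cheap heuristic in tests.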

Do not merge — per vinash's instructions, he will handle the merge. Reviewer: @luistafoi.

Add RELATED_WORK.md with a 9-dimension comparison table across six related
datasets (ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA,
SciQA) plus a 2-page critique positioning Chem2TextQA as the evidence-grounded,
two-model-cross-validated training corpus that none of the comparators
provide. Dimensions cover size, Q&A format, label provenance, grounding,
SMILES-as-input, cross-validation signal, contamination test, scaffold split,
and availability. Draft is annotated with reviewer notes for the author pass.

Copilot AI left a comment


Pull request overview

Adds a draft “Related Work” write-up to explicitly position Chem2TextQA against adjacent chemistry/biomedical QA datasets in response to reviewer comment C6.

Changes:

  • Introduces a 9-dimension comparison table covering ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, and SciQA.
  • Adds short critique/positioning subsections for each related dataset plus a synthesis section.
  • Includes source links and a short “open questions” list for an author pass.


Comment on lines +169 to +171
Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json`
from the repo in the paper.
4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's

Copilot AI Apr 24, 2026


Two issues in this reviewer-notes block: (1) the list numbering has two items labeled "4." (the final one should be "5."), and (2) it says to cite scaffold_split_report.json "from the repo", but that report is generated under data/qa_pipeline/... by scripts/compute_scaffold_splits.py and doesn’t appear to be tracked. Consider either committing a stable copy under outputs/ or updating the note to reference the generation command/path.

Suggested change (replaces the three lines quoted above with):
Murcko, 70/15/15, 5,916 scaffolds. `scaffold_split_report.json` is
generated by `scripts/compute_scaffold_splits.py` under
`data/qa_pipeline/...`; in the paper, either cite a stable copy committed
under `outputs/` or reference the generation command/path instead.
5. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's
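
For readers unfamiliar with the 70/15/15 scaffold split mentioned in the note, the idea can be sketched as assigning whole scaffold groups to one split each, largest groups first. This is a hedged illustration, not the logic of `scripts/compute_scaffold_splits.py`, which is not shown in this PR: the function name, the greedy ordering, and the toy scaffold keys `A`..`E` are assumptions. In practice the keys would presumably be Murcko scaffold SMILES (for example from RDKit's `MurckoScaffold.MurckoScaffoldSmiles`).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def scaffold_split(scaffold_of: Dict[str, str],
                   fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15),
                   ) -> Tuple[List[str], List[str], List[str]]:
    """Assign whole scaffold groups to train/valid/test, largest groups first."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for compound_id, scaffold in scaffold_of.items():
        groups[scaffold].append(compound_id)

    # Largest groups first; ties broken by scaffold key for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n = len(scaffold_of)
    train_cap = round(fractions[0] * n)
    valid_cap = round((fractions[0] + fractions[1]) * n)

    train: List[str] = []
    valid: List[str] = []
    test: List[str] = []
    for _, members in ordered:
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(train) + len(valid) + len(members) <= valid_cap:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Toy input: 20 compounds over 5 placeholder scaffolds (sizes 8, 6, 3, 2, 1).
toy = {f"{s}{i}": s
       for s, k in [("A", 8), ("B", 6), ("C", 3), ("D", 2), ("E", 1)]
       for i in range(k)}
train, valid, test = scaffold_split(toy)
```

Because groups are never divided, no scaffold appears in more than one split; the realized split sizes only approximate the requested fractions.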

|---|---|---|---|---|---|---|---|
| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) |
| **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG |
| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |

Copilot AI Apr 24, 2026


Model naming here ("Gemini-3-Flash", "Kimi-K2.5", "Gemma-4-31B") is inconsistent with the rest of the repo docs/code, which refer to these as "Gemini 3 Flash preview", "Kimi K2.5", and "Gemma 4 31B". Aligning the names avoids confusion when cross-referencing pipeline phases and contamination docs.

Suggested change (replaces the **Label provenance** row quoted above with):
| **Label provenance** | Two-model cross-validation: Gemini 3 Flash preview generates, Kimi K2.5 re-answers blind, Gemma 4 31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |
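
A naming mismatch like this one could be caught mechanically before review. The following is a small sketch under stated assumptions: the canonical spellings are the ones quoted in this comment, while the pattern table and the function name `find_name_variants` are hypothetical and cover only the three hyphenated variants seen in the table.

```python
import re
from typing import List, Tuple

# Hyphenated variants seen in the table, mapped to the spellings the
# comment says the rest of the repo uses.
CANONICAL = {
    r"Gemini-3-Flash": "Gemini 3 Flash preview",
    r"Kimi-K2\.5": "Kimi K2.5",
    r"Gemma-4-31B": "Gemma 4 31B",
}


def find_name_variants(text: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, found_variant, canonical_name) for each mismatch."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, canonical in CANONICAL.items():
            for match in re.finditer(pattern, line):
                hits.append((lineno, match.group(0), canonical))
    return hits


sample = ("Gemini-3-Flash generates, Kimi-K2.5 re-answers blind,\n"
          "Gemma 4 31B judges")
variants = find_name_variants(sample)
```

On the sample, only the two hyphenated names on the first line are reported; the already-canonical `Gemma 4 31B` on the second line passes.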

Comment on lines +24 to +26
| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) |
| **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG |
| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |

Copilot AI Apr 24, 2026


Table 1 introduces dataset size/label-provenance figures that conflict with the repo’s existing comparison table in DATASHEET.md (e.g., ChemBench ~2.7K here vs ~7K there; SciQA 2,565 here vs ~200K there; ChemLLMBench described here as inherited labels vs "human curated" in the datasheet). Please reconcile these numbers/descriptions (either update this table or adjust the datasheet in a follow-up) so reviewers don’t see contradictory positioning within the repo.

Suggested change (replaces the three rows quoted above with):
| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | Benchmark size is reported at different granularities across repo docs/paper summaries (subset-level QA counts vs larger benchmark totals); reviewer-facing takeaway: a few-thousand to ~7 K chemistry QA items, depending on counting convention | 8 tasks, with task sizes varying by inherited source dataset/split | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | Reported at multiple granularities: a small curated QA set (2,565 items in one release summary) built over a much larger scholarly/KG-backed resource; use the cited split/count in the corresponding section |
| **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG |
| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Benchmark/task definitions are human curated, but many gold labels are inherited from upstream datasets (e.g., MoleculeNet, USPTO) rather than newly annotated end-to-end for this benchmark | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |
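
One way to surface such conflicts is to parse both comparison tables and list the cells that disagree. The sketch below is illustrative only: the two inline tables are toy stand-ins (the actual layout of `DATASHEET.md` is not shown in this PR), the differing values are the examples quoted in this comment, and the helper names are hypothetical. It compares cells as exact strings, so it flags formatting differences as well as real conflicts, and it does not handle escaped pipes inside cells.

```python
from typing import Dict, List, Tuple

Table = Dict[str, Dict[str, str]]


def parse_md_table(md: str) -> Table:
    """Parse a pipe table into {row_label: {column_header: cell}}."""
    rows: List[List[str]] = []
    for line in md.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the separator row (cells made only of dashes and colons).
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    header, body = rows[0], rows[1:]
    return {r[0]: dict(zip(header[1:], r[1:])) for r in body}


def diff_tables(a: Table, b: Table) -> List[Tuple[str, str, str, str]]:
    """List (row, column, a_value, b_value) for shared cells that disagree."""
    diffs = []
    for row in sorted(a.keys() & b.keys()):
        for col in sorted(a[row].keys() & b[row].keys()):
            if a[row][col] != b[row][col]:
                diffs.append((row, col, a[row][col], b[row][col]))
    return diffs


related_work = """
| Dimension | ChemBench | SciQA |
|---|---|---|
| Size | ~2,700 QA pairs | 2,565 QA pairs |
"""
datasheet = """
| Dimension | ChemBench | SciQA |
|---|---|---|
| Size | ~7K | ~200K |
"""
conflicts = diff_tables(parse_md_table(related_work), parse_md_table(datasheet))
```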


@oma need your code.
@luistafoi make sure oma pushes the code, and if the results are fine, comment "good to merge".

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.



Comment on lines +22 to +24
| Dimension | **Chem2TextQA** | ChemBench (Mirza et al., *Nat. Chem.* 2025) | ChemLLMBench (Guo et al., NeurIPS 2023) | Mol-Instructions (Fang et al., ICLR 2024) | SMolInstruct / LlaSMol (Yu et al., COLM 2024) | PubMedQA (Jin et al., EMNLP 2019) | SciQA (Auer et al., *Sci. Rep.* 2023) |
|---|---|---|---|---|---|---|---|
| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) |

Copilot AI Apr 25, 2026


The table rows start with a double pipe (||), which creates an empty first column in GitHub-flavored Markdown and can misalign the entire table. Use a single leading pipe (|) consistently for the header, separator, and all rows so the table renders with the intended columns.
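
A check along these lines could flag the problem before rendering. This is a rough sketch, not a Markdown parser: the function name `check_table_rows` is hypothetical, it treats any line starting with `|` as a table row, and it ignores escaped pipes inside cells.

```python
from typing import List, Tuple


def check_table_rows(md: str) -> List[Tuple[int, str]]:
    """Flag rows that start with '||' or whose column count differs from the header."""
    problems: List[Tuple[int, str]] = []
    expected = None  # column count of the current table's first row
    for lineno, line in enumerate(md.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # left the table
            continue
        if stripped.startswith("||"):
            problems.append((lineno, "row starts with a double pipe"))
        # Remove exactly one leading and one trailing pipe, then count cells.
        if stripped.endswith("|") and len(stripped) > 1:
            inner = stripped[1:-1]
        else:
            inner = stripped[1:]
        ncols = len(inner.split("|"))
        if expected is None:
            expected = ncols
        elif ncols != expected:
            problems.append((lineno, f"has {ncols} columns, expected {expected}"))
    return problems


sample = "\n".join([
    "| A | B |",
    "|---|---|",
    "|| 1 | 2 |",
    "| 3 | 4 |",
])
problems = check_table_rows(sample)
```

On the sample, line 3 is reported twice: once for the double pipe and once because the empty first cell raises its column count to three.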

Comment on lines +168 to +172
4. Scaffold-split details pulled from `USAGE_FINETUNING.md`: MoleculeNet
Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json`
from the repo in the paper.
4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's
(both are "prior-art on literature-grounded QA, neither uses SMILES").

Copilot AI Apr 25, 2026


The numbered list repeats item 4.; renumber the final item to keep the list unambiguous (e.g., change the second 4. to 5.).
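
Repeated list numbers are easy to detect with a short script. The sketch below assumes top-level items of the form `N. text` at the start of a line and treats everything else as continuation text; the function name `find_numbering_errors` is hypothetical. The sample is the block quoted above.

```python
import re
from typing import List, Tuple

ITEM = re.compile(r"^(\d+)\.\s")


def find_numbering_errors(text: str) -> List[Tuple[int, int, int]]:
    """Return (line_number, found, expected) for ordered-list items out of sequence."""
    errors = []
    expected = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = ITEM.match(line)
        if not m:
            continue  # continuation line or other text
        found = int(m.group(1))
        if expected is None:
            expected = found  # the first item sets the starting number
        elif found != expected:
            errors.append((lineno, found, expected))
        expected += 1
    return errors


notes = "\n".join([
    "4. Scaffold-split details pulled from `USAGE_FINETUNING.md`: MoleculeNet",
    "Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json`",
    "from the repo in the paper.",
    "4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's",
    '(both are "prior-art on literature-grounded QA, neither uses SMILES").',
])
errors = find_numbering_errors(notes)
```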
