Task 12: Related Work benchmark critique (C6) by spfded · Pull Request #5 · vinash85/Chem2TextQA

spfded · 2026-04-24T22:05:39Z

Adds task-12-related-work/RELATED_WORK.md with a 9-dimension comparison table across six related datasets (ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, SciQA) and a 2-page critique positioning Chem2TextQA.

Addresses reviewer comment C6 (track demands explicit positioning).

Compared dimensions: size, Q&A format, label provenance, grounding strategy, SMILES-as-first-class input, cross-validation signal, held-out contamination test, scaffold split, availability.

Positioning summary. No prior dataset combines all of the following: SMILES-first input, open-ended QA, literature-evidence grounding, two-model cross-validation with a third-model judge (Gemini generates → Kimi re-answers blind → Gemma judges), and held-out canary compounds for contamination testing. Chem2TextQA sits between ChemBench (small, expert-curated benchmark) and SMolInstruct (large, closed-form, structural): it is a training-scale corpus for functional reasoning with source attribution.
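The generate, re-answer, judge flow in the positioning summary can be sketched roughly as below. This is an illustration only, not the repository's pipeline: `generate`, `reanswer`, and `judge` are hypothetical callables standing in for the three model calls, and the `"agree"` verdict string is an assumption modeled on the `agree` subset named in RELATED_WORK.md.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def agree_only(
    items: Iterable[Dict[str, str]],
    generate: Callable[[Dict[str, str]], Tuple[str, str]],
    reanswer: Callable[[Dict[str, str], str], str],
    judge: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Keep only the Q&A pairs on which the judge says both answers agree."""
    kept = []
    for item in items:
        question, answer_a = generate(item)             # model 1 writes Q and A
        answer_b = reanswer(item, question)             # model 2 answers blind
        verdict = judge(question, answer_a, answer_b)   # model 3 compares them
        if verdict == "agree":                          # assumed verdict label
            kept.append(
                {"smiles": item["smiles"], "question": question, "answer": answer_a}
            )
    return kept


# Toy stand-ins (hypothetical) so the sketch runs without any model access.
def _gen(item):
    return ("What does the evidence sentence report?", item["evidence"])


def _re(item, question):
    return item["evidence"] if item["smiles"] != "CCO" else "unknown"


def _judge(question, a, b):
    return "agree" if a == b else "disagree"


demo = agree_only(
    [
        {"smiles": "CCO", "evidence": "solvent"},
        {"smiles": "c1ccccc1", "evidence": "aromatic"},
    ],
    _gen,
    _re,
    _judge,
)
```

In this toy run only the second compound survives, because the stand-in second model disagrees on the first one.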

Do not merge — per vinash's instructions, he will handle the merge. Reviewer: @luistafoi.

Add RELATED_WORK.md with a 9-dimension comparison table across six related datasets (ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, SciQA) plus a 2-page critique positioning Chem2TextQA as the evidence-grounded, two-model-cross-validated training corpus that none of the comparators provide. Dimensions cover size, Q&A format, label provenance, grounding, SMILES-as-input, cross-validation signal, contamination test, scaffold split, and availability. Draft is annotated with reviewer notes for the author pass.

Copilot

Pull request overview

Adds a draft “Related Work” write-up to explicitly position Chem2TextQA against adjacent chemistry/biomedical QA datasets in response to reviewer comment C6.

Changes:

Introduces a 9-dimension comparison table covering ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, and SciQA.
Adds short critique/positioning subsections for each related dataset plus a synthesis section.
Includes source links and a short “open questions” list for an author pass.

Copilot · 2026-04-24T22:24:32Z

+   Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json`
+   from the repo in the paper.
+4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's


Two issues in this reviewer-notes block: (1) the list numbering has two items labeled "4." (the final one should be "5."), and (2) it says to cite scaffold_split_report.json "from the repo", but that report is generated under data/qa_pipeline/... by scripts/compute_scaffold_splits.py and doesn’t appear to be tracked. Consider either committing a stable copy under outputs/ or updating the note to reference the generation command/path.

Suggested change

-   Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json`
-   from the repo in the paper.
-4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's
+   Murcko, 70/15/15, 5,916 scaffolds. `scaffold_split_report.json` is
+   generated by `scripts/compute_scaffold_splits.py` under
+   `data/qa_pipeline/...`; in the paper, either cite a stable copy committed
+   under `outputs/` or reference the generation command/path instead.
+5. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's
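A repeated list number like the one flagged here could be caught mechanically before review. The following is a minimal sketch with a hypothetical helper name, `duplicate_list_numbers`; it only flags an ordered-list item whose number equals the previous item's number and makes no attempt to handle nested lists or separate lists in the same file.

```python
import re
from typing import List, Tuple

# An ordered-list item: optional indent, digits, a period, then whitespace.
ITEM = re.compile(r"^\s*(\d+)\.\s")


def duplicate_list_numbers(markdown: str) -> List[Tuple[int, int]]:
    """Return (line_number, item_number) for items that repeat the previous number."""
    duplicates = []
    previous = None
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        match = ITEM.match(line)
        if match is None:
            continue  # continuation lines and prose do not affect numbering
        number = int(match.group(1))
        if number == previous:
            duplicates.append((line_number, number))
        previous = number
    return duplicates


notes = """3. First note.
4. Scaffold-split details pulled from `USAGE_FINETUNING.md`: MoleculeNet
   Murcko, 70/15/15, 5,916 scaffolds.
4. If space is tight, collapse the SciQA paragraph.
"""
print(duplicate_list_numbers(notes))
```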

Copilot · 2026-04-24T22:24:32Z

+|---|---|---|---|---|---|---|---|
+| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) |
+| **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG |
+| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |


Model naming here ("Gemini-3-Flash", "Kimi-K2.5", "Gemma-4-31B") is inconsistent with the rest of the repo docs/code, which refer to these as "Gemini 3 Flash preview", "Kimi K2.5", and "Gemma 4 31B". Aligning the names avoids confusion when cross-referencing pipeline phases and contamination docs.

Suggested change

-| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |
+| **Label provenance** | Two-model cross-validation: Gemini 3 Flash preview generates, Kimi K2.5 re-answers blind, Gemma 4 31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |
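A naming check along these lines could be run over the docs to find spellings that drift from the repo's preferred forms. This is a sketch under assumptions: the canonical names are taken from the review comment above, while the regex variants and the helper name `nonstandard_model_names` are illustrative and not part of the repository.

```python
import re
from typing import List, Tuple

# Regex for each model name (hyphen or whitespace between parts) -> canonical form.
VARIANTS = {
    r"Gemini[-\s]3[-\s]Flash(?:[-\s]preview)?": "Gemini 3 Flash preview",
    r"Kimi[-\s]K2\.5": "Kimi K2.5",
    r"Gemma[-\s]4[-\s]31B": "Gemma 4 31B",
}


def nonstandard_model_names(text: str) -> List[Tuple[str, str]]:
    """Return (found_spelling, canonical_spelling) for every non-canonical mention."""
    hits = []
    for pattern, canonical in VARIANTS.items():
        for match in re.finditer(pattern, text):
            if match.group(0) != canonical:
                hits.append((match.group(0), canonical))
    return hits


before = "Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges"
after = "Gemini 3 Flash preview generates, Kimi K2.5 re-answers blind, Gemma 4 31B judges"
print(nonstandard_model_names(before))
print(nonstandard_model_names(after))
```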

Copilot · 2026-04-24T22:24:32Z

+| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) |
+| **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG |
+| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |


Table 1 introduces dataset size/label-provenance figures that conflict with the repo’s existing comparison table in DATASHEET.md (e.g., ChemBench ~2.7K here vs ~7K there; SciQA 2,565 here vs ~200K there; ChemLLMBench described here as inherited labels vs "human curated" in the datasheet). Please reconcile these numbers/descriptions (either update this table or adjust the datasheet in a follow-up) so reviewers don’t see contradictory positioning within the repo.

Suggested change

-| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) |
-| **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG |
-| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |
+| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | Benchmark size is reported at different granularities across repo docs/paper summaries (subset-level QA counts vs larger benchmark totals); reviewer-facing takeaway: a few-thousand to ~7 K chemistry QA items, depending on counting convention | 8 tasks, with task sizes varying by inherited source dataset/split | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | Reported at multiple granularities: a small curated QA set (2,565 items in one release summary) built over a much larger scholarly/KG-backed resource; use the cited split/count in the corresponding section |
+| **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG |
+| **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Benchmark/task definitions are human curated, but many gold labels are inherited from upstream datasets (e.g., MoleculeNet, USPTO) rather than newly annotated end-to-end for this benchmark | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG |
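Reconciling two comparison tables by eye is error-prone, so a small cell-by-cell diff may help. The sketch below is illustrative: `parse_table` and `conflicts` are hypothetical helpers, the toy datasheet table only assumes the layout of DATASHEET.md (its real structure is not shown in this thread), and the naive split on `|` would break on cells containing escaped pipes.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (row label, column header)


def parse_table(markdown: str) -> Dict[Key, str]:
    """Parse a simple pipe table into {(row_label, column): cell}."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return {
        (row[0].strip("*"), column): cell  # drop ** bold markers from row labels
        for row in body
        for column, cell in zip(header[1:], row[1:])
    }


def conflicts(first: str, second: str) -> List[Tuple[Key, str, str]]:
    """List cells present in both tables whose text differs."""
    a, b = parse_table(first), parse_table(second)
    return sorted((key, a[key], b[key]) for key in a.keys() & b.keys() if a[key] != b[key])


related_work = """| Dimension | ChemBench | SciQA |
|---|---|---|
| **Size** | ~2,700 QA pairs | 2,565 QA pairs |
"""
datasheet = """| Dimension | ChemBench | SciQA |
|---|---|---|
| Size | ~7K | ~200K |
"""
for key, here, there in conflicts(related_work, datasheet):
    print(key, "|", here, "vs", there)
```

An exact string comparison will also flag harmless wording differences, so the output is a list of cells to look at rather than a list of errors.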

kushalviit · 2026-04-25T00:43:30Z

@oma need your code
@luistafoi make sure oma pushes code and if results are fine then comment "good to merge"

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Copilot · 2026-04-25T02:32:23Z

+| Dimension | **Chem2TextQA** | ChemBench (Mirza et al., *Nat. Chem.* 2025) | ChemLLMBench (Guo et al., NeurIPS 2023) | Mol-Instructions (Fang et al., ICLR 2024) | SMolInstruct / LlaSMol (Yu et al., COLM 2024) | PubMedQA (Jin et al., EMNLP 2019) | SciQA (Auer et al., *Sci. Rep.* 2023) |
+|---|---|---|---|---|---|---|---|
+| **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) |


The table rows start with a double pipe (||), which creates an empty first column in GitHub-flavored Markdown and can misalign the entire table. Use a single leading pipe (|) consistently for the header, separator, and all rows so the table renders with the intended columns.
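Whether rows really begin with `||` can be checked against the raw file rather than the diff view. The sketch below uses hypothetical helper names and counts the pipes at the start of each line, optionally skipping a leading `+` or `-` diff marker first.

```python
from typing import List


def leading_pipes(line: str, diff: bool = False) -> int:
    """Count the '|' characters a line starts with, after any indent."""
    if diff and line.startswith(("+", "-")):
        line = line[1:]  # drop the unified-diff marker before counting
    line = line.lstrip()
    return len(line) - len(line.lstrip("|"))


def double_pipe_rows(text: str, diff: bool = False) -> List[int]:
    """Return 1-indexed line numbers of rows that start with two or more pipes."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if leading_pipes(line, diff) >= 2
    ]


good = "| Dimension | A |\n|---|---|\n| **Size** | 1 |"
bad = "|| Dimension | A |\n||---|---|\n|| **Size** | 1 |"
as_diff = "+| Dimension | A |\n+|---|---|"
print(double_pipe_rows(good), double_pipe_rows(bad), double_pipe_rows(as_diff, diff=True))
```

The diff lines quoted in this comment show `+|`, that is, a single pipe after the diff marker, so it may be worth running a check like this on the committed file before changing the table.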

Copilot · 2026-04-25T02:32:23Z

+4. Scaffold-split details pulled from `USAGE_FINETUNING.md`: MoleculeNet
+   Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json`
+   from the repo in the paper.
+4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's
+   (both are "prior-art on literature-grounded QA, neither uses SMILES").


The numbered list repeats item `4.`; renumber the final item to `5.` so the list is unambiguous.
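For readers unfamiliar with the split summarized in the quoted note (MoleculeNet-style Murcko, 70/15/15), here is a rough sketch of a scaffold-grouped split. It is not the logic of `scripts/compute_scaffold_splits.py`, which is not shown in this thread. The function name and the precomputed `scaffold_of` mapping are assumptions; in practice the scaffold strings would typically come from a cheminformatics toolkit such as RDKit. The greedy largest-group-first assignment is a common convention, used here for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_split(
    scaffold_of: Dict[str, str],
    fractions: Sequence[float] = (0.70, 0.15, 0.15),
) -> Tuple[List[str], List[str], List[str]]:
    """Assign whole scaffold groups to train/valid/test so no scaffold is shared."""
    groups = defaultdict(list)
    for compound, scaffold in scaffold_of.items():
        groups[scaffold].append(compound)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    total = len(scaffold_of)
    train_cap = fractions[0] * total
    valid_cap = (fractions[0] + fractions[1]) * total

    train, valid, test = [], [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(train) + len(valid) + len(members) <= valid_cap:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Toy data: 20 compounds over five made-up scaffold labels.
sizes = [("A", 9), ("B", 4), ("C", 3), ("D", 2), ("E", 2)]
scaffold_of = {f"{label}{i}": label for label, count in sizes for i in range(count)}
train, valid, test = scaffold_split(scaffold_of)
print(len(train), len(valid), len(test))
```

Because groups are never split, the realized proportions only approximate 70/15/15; here the toy data lands at 13/3/4.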

spfded requested a review from luistafoi April 24, 2026 22:05

Macaulay001 requested a review from Copilot April 24, 2026 22:21

Copilot started reviewing on behalf of Macaulay001 April 24, 2026 22:21

Copilot AI reviewed Apr 24, 2026

kushalviit reviewed Apr 25, 2026

Macaulay001 requested a review from Copilot April 25, 2026 02:21

Copilot started reviewing on behalf of Macaulay001 April 25, 2026 02:21

Copilot AI reviewed Apr 25, 2026

spfded wants to merge 1 commit into main from task-12-related-work
