Task 12: Related Work benchmark critique (C6) #5
Add RELATED_WORK.md with a 9-dimension comparison table across six related datasets (ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, SciQA) plus a 2-page critique positioning Chem2TextQA as the evidence-grounded, two-model-cross-validated training corpus that none of the comparators provide. Dimensions cover size, Q&A format, label provenance, grounding, SMILES-as-input, cross-validation signal, contamination test, scaffold split, and availability. Draft is annotated with reviewer notes for the author pass.
Pull request overview
Adds a draft “Related Work” write-up to explicitly position Chem2TextQA against adjacent chemistry/biomedical QA datasets in response to reviewer comment C6.
Changes:
- Introduces a 9-dimension comparison table covering ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, and SciQA.
- Adds short critique/positioning subsections for each related dataset plus a synthesis section.
- Includes source links and a short “open questions” list for an author pass.
| Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json` | ||
| from the repo in the paper. | ||
| 4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's |
Two issues in this reviewer-notes block: (1) the list numbering has two items labeled "4." (the final one should be "5."), and (2) it says to cite scaffold_split_report.json "from the repo", but that report is generated under data/qa_pipeline/... by scripts/compute_scaffold_splits.py and doesn’t appear to be tracked. Consider either committing a stable copy under outputs/ or updating the note to reference the generation command/path.
| Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json` | |
| from the repo in the paper. | |
| 4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's | |
| Murcko, 70/15/15, 5,916 scaffolds. `scaffold_split_report.json` is | |
| generated by `scripts/compute_scaffold_splits.py` under | |
| `data/qa_pipeline/...`; in the paper, either cite a stable copy committed | |
| under `outputs/` or reference the generation command/path instead. | |
| 5. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's |
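For readers checking the "Murcko, 70/15/15" note, here is a minimal sketch of a scaffold-grouped split. It assumes the scaffold key for each compound (for example a Murcko scaffold SMILES) has already been computed elsewhere. The function name `scaffold_split` and the largest-group-first ordering are illustrative assumptions, not necessarily what `scripts/compute_scaffold_splits.py` does.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def scaffold_split(
    scaffolds: Sequence[str],
    frac_train: float = 0.70,
    frac_valid: float = 0.15,
) -> Tuple[List[int], List[int], List[int]]:
    """Split compound indices so that no scaffold spans two splits.

    scaffolds[i] is the (precomputed, assumed) scaffold key of compound i.
    Whatever does not fit into train or valid goes to test.
    """
    # Group compound indices by scaffold key.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole scaffold groups are assigned at once, the realized fractions only approximate 70/15/15. A committed report (as the comment above suggests) would record the actual counts.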
| |---|---|---|---|---|---|---|---| | ||
| | **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) | | ||
| | **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG | | ||
| | **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG | |
Model naming here ("Gemini-3-Flash", "Kimi-K2.5", "Gemma-4-31B") is inconsistent with the rest of the repo docs/code, which refer to these as "Gemini 3 Flash preview", "Kimi K2.5", and "Gemma 4 31B". Aligning the names avoids confusion when cross-referencing pipeline phases and contamination docs.
| | **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG | | |
| | **Label provenance** | Two-model cross-validation: Gemini 3 Flash preview generates, Kimi K2.5 re-answers blind, Gemma 4 31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG | |
| | **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) | | ||
| | **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG | | ||
| | **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG | |
Table 1 introduces dataset size/label-provenance figures that conflict with the repo’s existing comparison table in DATASHEET.md (e.g., ChemBench ~2.7K here vs ~7K there; SciQA 2,565 here vs ~200K there; ChemLLMBench described here as inherited labels vs "human curated" in the datasheet). Please reconcile these numbers/descriptions (either update this table or adjust the datasheet in a follow-up) so reviewers don’t see contradictory positioning within the repo.
| | **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) | | |
| | **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG | | |
| | **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Mostly repurposed from existing datasets (MoleculeNet, USPTO, etc.); gold labels inherited | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG | | |
| | **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | Benchmark size is reported at different granularities across repo docs/paper summaries (subset-level QA counts vs larger benchmark totals); reviewer-facing takeaway: a few-thousand to ~7 K chemistry QA items, depending on counting convention | 8 tasks, with task sizes varying by inherited source dataset/split | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | Reported at multiple granularities: a small curated QA set (2,565 items in one release summary) built over a much larger scholarly/KG-backed resource; use the cited split/count in the corresponding section | | |
| | **Q&A format** | Open-ended natural-language QA grounded in one evidence sentence | Mixed: MCQ + numerical + open-ended free-text | Task-specific (classification, generation, SMILES-to-X, reaction prediction) | Free-form instruction/response (template + GPT-paraphrased) | Task-specific input→output strings (SMILES⇄name, reaction, properties) | 3-way yes/no/maybe + long-answer conclusion | SPARQL-backed factoid QA over a scholarly KG | | |
| | **Label provenance** | Two-model cross-validation: Gemini-3-Flash generates, Kimi-K2.5 re-answers blind, Gemma-4-31B judges → `agree` subset only | 35-author human-expert curation, largely drawn from chemistry textbooks and exams | Benchmark/task definitions are human curated, but many gold labels are inherited from upstream datasets (e.g., MoleculeNet, USPTO) rather than newly annotated end-to-end for this benchmark | Template + GPT-3.5 paraphrase; gold from source biochem databases; "stringent quality control" but no second-model check | Gold labels inherited from public sources (PubChem, MoleculeNet, USPTO, ChEBI-20); human review for templates | Original PMC abstract conclusions ("long answer") mapped to yes/no/maybe; 1 K hand-labeled | Knowledge-graph-derived; SPARQL queries verified against ORKG | |
@oma need your code
@luistafoi make sure oma pushes code and if results are fine then comment "good to merge"
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
| | Dimension | **Chem2TextQA** | ChemBench (Mirza et al., *Nat. Chem.* 2025) | ChemLLMBench (Guo et al., NeurIPS 2023) | Mol-Instructions (Fang et al., ICLR 2024) | SMolInstruct / LlaSMol (Yu et al., COLM 2024) | PubMedQA (Jin et al., EMNLP 2019) | SciQA (Auer et al., *Sci. Rep.* 2023) | | ||
| |---|---|---|---|---|---|---|---| | ||
| | **Size** | 15,547 compounds + 120 canaries, ~211 K raw Q&A, ~189 K agree-only | ~2,700 QA pairs | 8 tasks, thousands of items per task (MoleculeNet-derived) | ~706 K instructions (148 K mol + 505 K protein + 53 K text) | ~3.3 M samples, 1.6 M distinct molecules, 14 tasks | 1 K expert + 61 K unlabeled + 211 K auto (~273 K) | 2,565 QA pairs (100 manual + 2,465 template-generated) | |
The table rows start with a double pipe (||), which creates an empty first column in GitHub-flavored Markdown and can misalign the entire table. Use a single leading pipe (|) consistently for the header, separator, and all rows so the table renders with the intended columns.
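One way to apply this fix mechanically is sketched below. The helper names are hypothetical, and the sketch assumes the table never needs an intentionally empty first cell, since such a cell would also be collapsed.

```python
import re

# Matches a row that opens with two pipes separated only by optional spaces,
# e.g. "|| a | b |" or "| | a | b |". Leading indentation is preserved.
_LEADING_DOUBLE_PIPE = re.compile(r"^(\s*)\|\s*\|")


def fix_leading_double_pipe(line: str) -> str:
    """Collapse a doubled leading pipe into a single one; other lines pass through."""
    return _LEADING_DOUBLE_PIPE.sub(r"\1|", line, count=1)


def fix_table(text: str) -> str:
    """Apply the fix to every line of a Markdown document."""
    return "\n".join(fix_leading_double_pipe(line) for line in text.splitlines())
```

Running `fix_table` over `RELATED_WORK.md` and diffing the result would show exactly which rows change before anything is committed.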
| 4. Scaffold-split details pulled from `USAGE_FINETUNING.md`: MoleculeNet | ||
| Murcko, 70/15/15, 5,916 scaffolds. Cite `scaffold_split_report.json` | ||
| from the repo in the paper. | ||
| 4. If space is tight, the SciQA paragraph can be collapsed into PubMedQA's | ||
| (both are "prior-art on literature-grounded QA, neither uses SMILES"). |
The numbered list repeats item `4.`; renumber the final item to keep the list unambiguous (e.g., change the second `4.` to `5.`).
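A small sketch of how such a renumbering could be automated follows. The function name `renumber` is hypothetical, and the sketch assumes the lines passed in form a single unindented ordered list whose continuation lines are indented.

```python
import re
from typing import Iterable, List

# An unindented ordered-list item: digits, a dot, then whitespace and text.
_ITEM = re.compile(r"^(\d+)\.(\s+.*)$")


def renumber(lines: Iterable[str]) -> List[str]:
    """Renumber ordered-list items sequentially, starting from the first item's number.

    Indented continuation lines and any other non-item lines are kept unchanged.
    """
    out: List[str] = []
    current = None
    for line in lines:
        match = _ITEM.match(line)
        if match:
            # Keep the list's original starting number, then count upward.
            current = int(match.group(1)) if current is None else current + 1
            out.append(f"{current}.{match.group(2)}")
        else:
            out.append(line)
    return out
```

Applied to the reviewer-notes block, the second `4.` becomes `5.` while the wrapped continuation lines stay as they are.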
Adds `task-12-related-work/RELATED_WORK.md` with a 9-dimension comparison table across six related datasets (ChemBench, ChemLLMBench, Mol-Instructions, SMolInstruct, PubMedQA, SciQA) and a 2-page critique positioning Chem2TextQA.
Addresses reviewer comment C6 (track demands explicit positioning).
Compared dimensions: size, Q&A format, label provenance, grounding strategy, SMILES-as-first-class input, cross-validation signal, held-out contamination test, scaffold split, availability.
Positioning summary. No prior dataset combines: SMILES-first input + open-ended QA + literature-evidence grounding + two-model cross-validation (Gemini → Kimi → Gemma judge) + held-out canary compounds for contamination testing. Chem2TextQA sits between ChemBench (small, expert, benchmark) and SMolInstruct (large, closed-form, structural) — training-scale for functional reasoning with source attribution.
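To make the cross-validation signal concrete, here is a minimal sketch of an agree-only filter. Everything in it is an assumption for illustration: the `QAItem` fields, the `reanswer` and `judge` callables (stand-ins for the second model and the judge model), and the reading of "blind" as "the second model sees the SMILES and question but not the first answer". It is not the repository's pipeline code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAItem:
    smiles: str    # compound given as SMILES
    question: str  # open-ended question produced by the generator model
    answer: str    # the generator model's answer
    evidence: str  # the single evidence sentence the pair is grounded in


def agree_only(
    items: Iterable[QAItem],
    reanswer: Callable[[str, str], str],
    judge: Callable[[str, str, str], str],
) -> List[QAItem]:
    """Keep only items where the judge labels the two answers as 'agree'.

    reanswer(smiles, question) -> second model's answer, produced without
        seeing the generator's answer (assumed meaning of "blind").
    judge(question, answer_a, answer_b) -> a verdict label; only "agree" is kept.
    """
    kept: List[QAItem] = []
    for item in items:
        second_answer = reanswer(item.smiles, item.question)
        if judge(item.question, item.answer, second_answer) == "agree":
            kept.append(item)
    return kept
```

Passing the models in as plain callables keeps the filter testable with stubs and independent of any particular API client.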
Do not merge — per vinash's instructions, he will handle the merge. Reviewer: @luistafoi.