Awesome-LM-Science-Bench

An open list of benchmarks for LLM reasoning on science problems, focused on LLM evaluation datasets in the natural sciences.

Hi 👋, if you find this repo helpful, please give it a star ⭐️!

As new benchmarks are released, we will update this repo frequently and welcome contributions from the 🏠 community!

(last update: Feb 2025)


General Science & Multidisciplinary Benchmarks

SciEx [2024 June]

  • Description: A multilingual, multimodal benchmark using university computer science exam questions. Includes free-form questions with images and varying difficulty.
  • Purpose: Assesses LLMs' ability to handle scientific tasks in university exams, including algorithm writing, database querying, and mathematical proofs.
  • Relevance: Essential for evaluating LLMs in academic and research settings, with human expert grading provided for performance evaluation.
  • Performance: Even top LLMs struggle with SciEx's free-form exam questions, leaving clear room for improvement.
  • Source: SciEx Benchmark arXiv

SciBench [2023 July]

  • Description: A benchmark suite for evaluating college-level scientific problem-solving abilities, featuring problems from mathematics, chemistry, and physics.
  • Purpose: To rigorously test LLMs' reasoning on complex scientific problems at the university level.
  • Relevance: Vital for pushing the boundaries of LLMs in scientific research and discovery, highlighting areas for improvement in advanced reasoning.
  • Results: Current LLMs show limited performance, indicating substantial room for improvement in collegiate-level scientific problem-solving.
  • Source: arXiv:2307.10635

SciKnowEval [2024 June]

  • Description: A benchmark designed to evaluate LLMs across five levels of scientific knowledge, from memory to reasoning, in chemistry and physics. It includes 70,000 scientific problems.
  • Purpose: Establishes a framework for systematically assessing the depth of scientific knowledge in LLMs.
  • Relevance: Essential for the detailed evaluation of LLMs in scientific domains, aiming to standardize scientific knowledge benchmarking.
  • Source: arXiv:2406.09098

Advanced Reasoning Benchmark (ARB) [2023]

  • Description: Focuses on advanced reasoning problems across disciplines like physics and chemistry.
  • Purpose: Assesses LLMs' logical deduction and complex problem-solving capabilities in scientific contexts.
  • Relevance: Crucial for evaluating the inferential abilities of LLMs in scientific reasoning.
  • Source: OpenReview: ARB

Massive Multitask Language Understanding (MMLU) [2020 September]

  • Description: Measures general knowledge across 57 diverse subjects, spanning STEM, social sciences, and humanities.
  • Purpose: Evaluates LLMs' broad understanding and reasoning capabilities across a wide array of disciplines.
  • Relevance: Suitable for assessing AI systems requiring extensive world knowledge and versatile problem-solving skills; a minimal scoring sketch follows this entry.
  • Source: Measuring Massive Multitask Language Understanding
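
Because MMLU items are four-way multiple-choice, an evaluation harness reduces to formatting each question with its options and comparing the model's chosen letter to the gold index. Below is a minimal sketch assuming the `cais/mmlu` Hugging Face dataset ID and its `question`/`choices`/`answer` fields; `answer_fn` is a placeholder to swap for a real model call.

```python
# Minimal MMLU-style multiple-choice scoring sketch.
# Assumes the Hugging Face dataset ID "cais/mmlu" and its
# question/choices/answer schema; answer_fn is a stand-in for a real model.
from datasets import load_dataset

LETTERS = "ABCD"

def format_prompt(item):
    """Render one MMLU item as a plain-text multiple-choice prompt."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
    return f"{item['question']}\n{options}\nAnswer:"

def answer_fn(prompt):
    """Placeholder model: always answers 'A'. Replace with an LLM call."""
    return "A"

def score(subject="college_chemistry", split="test"):
    ds = load_dataset("cais/mmlu", subject, split=split)
    correct = sum(
        answer_fn(format_prompt(item)).strip().upper().startswith(LETTERS[item["answer"]])
        for item in ds
    )
    return correct / len(ds)

if __name__ == "__main__":
    print(f"accuracy: {score():.3f}")
```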

General Language Understanding Evaluation (GLUE) [2018 April]

  • Description: A collection of nine English sentence- and sentence-pair language understanding tasks, covering inference, similarity, and acceptability.
  • Source: arXiv:1804.07461

AI2 Reasoning Challenge (ARC) [2018 March]

  • Description: 7,787 grade-school-level multiple-choice science questions, partitioned into an Easy Set and a Challenge Set of questions that both retrieval-based and word co-occurrence methods answer incorrectly.
  • Source: arXiv:1803.05457

SciQ [2017 July]

  • Description: 13,679 crowdsourced multiple-choice science exam questions in physics, chemistry, biology, and other domains, most paired with a supporting evidence paragraph.
  • Source: arXiv:1707.06209

Biology Benchmarks

LAB-Bench: Language Agent Biology Benchmark [2024 July]

  • Description: A dataset of over 2,400 multiple-choice questions testing biology research capabilities, covering literature recall, figure interpretation, database navigation, and sequence manipulation.
  • Purpose: Evaluates AI systems on practical biology research tasks, aiming to develop AI assistants for scientific research.
  • Relevance: Crucial for accelerating scientific discovery by enhancing LLMs in biology-related research tasks. Performance is compared against human biology experts.
  • Source: LAB-Bench: Measuring Capabilities of Language Models for Biology

BioLLMBench [2023 December]

  • Description: A benchmarking framework for evaluating LLMs on bioinformatics tasks.

Chemistry Benchmarks

ChemQA [2024]

  • Description: A multimodal question-answering dataset focused on chemistry reasoning, featuring five QA tasks.
  • Purpose: Evaluates LLMs on chemistry-specific tasks like atom counting, molecular weight calculation, and retrosynthesis planning (a toy illustration follows this entry).
  • Relevance: Essential for AI applications in chemistry education, research, and complex chemical problem-solving.
  • Source: GitHub - materials-data-facility/matchem-llm
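
To make the flavor of those tasks concrete, here is a small, self-contained toy example of the atom-counting and molecular-weight style of question; the formula parser and the handful of atomic masses are illustrative and not taken from ChemQA itself.

```python
import re

# A few standard atomic masses (g/mol); extend as needed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

def count_atoms(formula: str) -> dict:
    """Count atoms in a simple molecular formula like 'C6H12O6'
    (no parentheses or charges -- illustrative only)."""
    counts = {}
    for symbol, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(num) if num else 1)
    return counts

def molecular_weight(formula: str) -> float:
    """Sum atomic masses over the atom counts."""
    return sum(ATOMIC_MASS[s] * n for s, n in count_atoms(formula).items())

print(count_atoms("C6H12O6"))                 # {'C': 6, 'H': 12, 'O': 6}
print(round(molecular_weight("C6H12O6"), 2))  # ~180.16 (glucose)
```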

ChemBench [2024]

  • Description: Features over 7,000 questions covering a wide range of chemistry topics.
  • Purpose: Assesses LLMs' chemistry knowledge and reasoning skills across various chemical domains.
  • Relevance: Important for evaluating AI systems designed for chemistry education and research.
  • Source: GitHub - materials-data-facility/matchem-llm

ChemSafetyBench: LLM Safety in Chemistry [2024 November]

  • Description: A benchmark specifically designed to evaluate the safety aspects of LLMs in chemistry-related contexts.
  • Purpose: Assesses the safety and reliability of LLMs for chemistry applications, focusing on preventing harmful outputs.
  • Relevance: Crucial for ensuring the responsible and safe deployment of LLMs in chemistry and related fields.
  • Source: ChemSafetyBench: Benchmarking LLM Safety on Chemistry

ChemLLMBench [NeurIPS 2023 Datasets and Benchmarks Track]

  • Description: A comprehensive benchmark covering eight distinct chemistry tasks.
  • Purpose: Provides a thorough evaluation of LLMs' capabilities across different chemistry-related tasks.
  • Relevance: Useful for advancing AI applications in chemistry research, development, and education.
  • Source: https://github.com/ChemFoundationModels/ChemLLMBench

SMolInstruct: Instruction-tuning dataset for chemistry [2024]

  • Description: An instruction-tuning dataset focused on small molecules, including over 3M samples across 14 tasks like name conversion, property prediction, and reaction prediction.
  • Purpose: Enhances LLMs' ability to follow chemistry-specific instructions and improves their performance in chemical tasks.
  • Relevance: Important for developing instruction-tuned LLMs for assisting in chemical research and development.
  • Source: https://openreview.net/forum?id=lY6XTF9tPv

ChemBench4k [2024]

  • Description: Includes 4,100 high-quality single-choice questions across nine core chemistry tasks.
  • Purpose: Evaluates LLMs' chemistry knowledge and reasoning through a large set of curated questions.
  • Relevance: Crucial for assessing LLMs' competency in chemistry, particularly in education and knowledge evaluation; a loading sketch follows this entry.
  • Source: https://huggingface.co/datasets/AI4Chem/ChemBench4K
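
Since the dataset is hosted on Hugging Face, a first look might resemble the sketch below; the dataset ID comes from the source link above, but the splits, columns, and the generic letter-comparison grader are assumptions to verify against the dataset card.

```python
# Exploratory sketch for ChemBench4K. The dataset ID comes from the source
# link above; split and field names are assumptions -- inspect the dataset
# card on Hugging Face before relying on them.
from datasets import load_dataset

ds = load_dataset("AI4Chem/ChemBench4K")
print(ds)  # inspect available splits and columns first

def grade(predicted: str, gold: str) -> bool:
    """Generic single-choice grading: compare normalized option letters."""
    return predicted.strip().upper()[:1] == gold.strip().upper()[:1]
```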

Fine-tuning Large Language Models for Chemical Text Mining [2024]

  • Description: A study and resources for fine-tuning LLMs on chemical text mining tasks like compound recognition and reaction labeling.
  • Purpose: Demonstrates the effectiveness of fine-tuning LLMs for complex chemical information extraction from text.
  • Relevance: Valuable for chemical research by improving LLMs' ability to extract knowledge from chemical literature.
  • Source: Chem. Sci., 2024

ChemLit-QA [2024]

  • Description: An expert-validated, open-source dataset with over 1,000 entries designed for chemistry Retrieval-Augmented Generation (RAG) and fine-tuning tasks.
  • Purpose: Benchmarks LLMs in chemistry-specific RAG tasks, evaluating their ability to generate context-aware, factual answers from chemistry literature.
  • Relevance: Aids in developing and evaluating LLMs for chemistry research, particularly in tasks requiring information retrieval and synthesis from scientific text; a toy retrieval sketch follows this entry.
  • Resources: ChemLit-QA GitHub
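
As a rough picture of what a RAG evaluation over such a dataset involves, the sketch below retrieves the most similar passage with TF-IDF and hands it to a model alongside the question; the toy passages and the `ask_llm` stub are placeholders, not ChemLit-QA data or field names.

```python
# Toy retrieval step for a chemistry RAG pipeline: TF-IDF over candidate
# passages, top-1 context prepended to the question. The passages and the
# ask_llm stub are placeholders, not ChemLit-QA contents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Grignard reagents add to carbonyl compounds to form alcohols.",
    "Gold nanoparticles exhibit size-dependent surface plasmon resonance.",
]

def retrieve(question: str, docs: list[str]) -> str:
    """Return the passage most similar to the question by TF-IDF cosine."""
    vec = TfidfVectorizer().fit(docs + [question])
    best = cosine_similarity(vec.transform([question]), vec.transform(docs)).argmax()
    return docs[best]

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "..."

question = "What do Grignard reagents form when they react with ketones?"
context = retrieve(question, passages)
print(ask_llm(f"Context: {context}\n\nQuestion: {question}\nAnswer:"))
```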

ScholarChemQA [2024 July]

  • Description: A large-scale question-answering dataset constructed from chemical research papers, featuring multiple-choice questions based on paper titles and abstracts.
  • Purpose: Evaluates LLMs' ability to answer research-level chemical questions, reflecting real-world challenges in chemical information processing.
  • Relevance: Benchmarks LLMs on understanding and reasoning over chemical research literature, highlighting areas for improvement in complex chemical QA.
  • Source: arXiv:2407.16931

Materials Science Benchmarks

Leveraging Large Language Models for Explaining Material Synthesis Mechanisms [2024 - NeurIPS AI4Mat]

  • Description: A benchmark dataset of 775 semi-manually created multiple-choice questions focused on gold nanoparticle (AuNP) synthesis mechanisms.
  • Purpose: Evaluates LLMs' reasoning about material synthesis mechanisms and their understanding of physicochemical principles.
  • Relevance: Highlights the potential of LLMs in understanding scientific mechanisms and provides tools for exploring synthesis methods.
  • Source: https://github.com/amair-lab/Physicochemical-LMs

LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction [2024 November]

  • Description: The largest benchmark for evaluating LLMs in predicting crystalline material properties.
  • Purpose: Assesses LLMs' capabilities in materials science, specifically in predicting material properties.
  • Relevance: Useful for AI-driven materials research and development, focusing on property prediction tasks.
  • Source: LLM4Mat-Bench: Benchmarking Large Language Models for Materials
  • Source (Alternative): arXiv

MaterialBENCH: Evaluating College-Level Materials Science Knowledge [2024 September]

  • Description: A college-level benchmark dataset for materials science, designed to assess knowledge equivalent to that of an undergraduate in the field.
  • Purpose: Evaluates LLMs' understanding of materials science concepts and problem-solving abilities at the college level.
  • Relevance: Useful for assessing LLMs' readiness for materials science education and research tasks.
  • Source: arXiv

MatSci-NLP [2023]

  • Description: A comprehensive benchmark for NLP models in materials science, covering tasks like property prediction and information extraction from literature.
  • Purpose: Evaluates NLP models, including LLMs, in materials science-specific tasks, encouraging generalization across different tasks.
  • Relevance: A cornerstone benchmark for assessing LLM capabilities in the field of materials science and NLP applications.
  • Source: MatSci-NLP

Medical Benchmarks

Large Language Model Benchmarks in Medical Tasks [2024 October]

  • Description: A survey of benchmark datasets for medical LLM tasks, covering text, image, and multimodal data. Includes benchmarks for EHRs, doctor-patient dialogues, medical QA, and image captioning.
  • Purpose: Evaluates LLMs in various medical tasks, contributing to the advancement of medical AI.
  • Relevance: Vital for progressing multimodal medical AI and enhancing healthcare through AI applications.
  • Source: arXiv

Physics Benchmarks

Physics GRE: Testing an LLM’s performance on the Physics GRE [2023 December]

  • Description: Evaluates LLMs' performance on the Physics GRE exam, covering undergraduate physics topics.
  • Purpose: Assesses the capabilities and limitations of LLMs in physics education and their understanding of undergraduate-level physics.
  • Relevance: Important for understanding the potential and risks of using LLMs as educational tools for physics students.
  • Source: arXiv

If you find this list helpful, please give the original repository a star ⭐️! Contributions to expand this benchmark list are highly welcome!
