
U-MATH and μ-MATH evaluation code

This repository contains the official evaluation code for the U-MATH and μ-MATH benchmarks. These datasets are designed to test the mathematical reasoning and meta-evaluation capabilities of Large Language Models (LLMs) on university-level problems.

Overview

U-MATH provides a set of 1,100 university-level mathematical problems, while μ-MATH complements it with a meta-evaluation framework focused on solution judgment, built from 1,084 LLM-generated solutions.

U-MATH Evaluation Results

(Figures: U-MATH evaluation results, shown as a table and a bar chart.)

μ-MATH Evaluation Results

(Figures: μ-MATH evaluation results, shown as a table and a scatter plot.)

Structure and Usage

This repository provides scripts for solving and evaluating the U-MATH and μ-MATH datasets.

File Structure

  • solve_u_math.py: Script to generate solutions for U-MATH problems using an OpenAI-compatible endpoint (e.g. the OpenAI API for gpt-4o, or a local vLLM server).
  • judge_u_math.py: Script to evaluate the correctness of U-MATH solutions.
  • judge_mu_math.py: Script to evaluate the quality of LLM judgments for μ-MATH solutions.
  • README.md: This file.
  • requirements.txt: List of dependencies required for running the scripts.

Download the repository and install the dependencies:

```shell
git clone https://github.com/toloka/u-math.git
cd u-math
pip install -r requirements.txt
```

Solve U-MATH Problems

To generate solutions for U-MATH problems, run the following command:

```shell
python solve_u_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --output_file predictions_u_math.json
```
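The script talks to any OpenAI-compatible chat endpoint, so it works both with hosted APIs and with a local vLLM server. As a hedged illustration of that request pattern (this is not the repository's actual code; the prompt wording and the local URL are assumptions), a minimal stdlib-only client might look like:

```python
import json
import urllib.request

# Assumed local vLLM server exposing the OpenAI-compatible API;
# substitute your own --base_url here.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(model: str, problem_text: str) -> dict:
    """Assemble a chat-completions payload for one problem.

    The system prompt below is illustrative; the official script's
    prompt may differ.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Solve the problem step by step, then state the final answer."},
            {"role": "user", "content": problem_text},
        ],
    }

def solve(model: str, problem_text: str, api_key: str = "EMPTY") -> str:
    """POST one problem to the endpoint and return the model's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(model, problem_text)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same code path covers both cases: point `API_URL` at `https://api.openai.com/v1/chat/completions` with a real key for hosted models, or at a vLLM server (which accepts a dummy key) for local ones.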

Judge U-MATH Solutions

To evaluate the correctness of U-MATH solutions, run the following command:

```shell
python judge_u_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --predictions_file predictions_u_math.json --output_file judgments_u_math.json
```
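The judgments file can then be aggregated into a single accuracy number. A minimal sketch, assuming the output is a JSON list of per-problem records with a boolean verdict field (the field name `correct` here is hypothetical; check the actual output schema):

```python
import json

def accuracy(judgments_path: str) -> float:
    """Fraction of solutions judged correct.

    Assumes a JSON list of records; `correct` is a hypothetical
    field name standing in for the judge's boolean verdict.
    """
    with open(judgments_path) as f:
        records = json.load(f)
    return sum(bool(r.get("correct")) for r in records) / len(records)
```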

Evaluate Judge on μ-MATH

To evaluate the quality of LLM judgments for μ-MATH solutions, run the following command:

```shell
python judge_mu_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --output_file judgments_mu_math.json
```
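Here the judge's binary verdicts are compared against μ-MATH's gold correct/incorrect labels. One common way to score such binary verdicts, robust to class imbalance, is a macro-averaged F1 over the two classes. A sketch of that metric, assuming boolean gold labels and predictions (an illustration, not the repository's implementation):

```python
def macro_f1(gold: list, pred: list) -> float:
    """Macro-averaged F1 over the True and False verdict classes."""
    def f1(label: bool) -> float:
        # Per-class counts, treating `label` as the positive class.
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        if tp == 0:
            return 0.0
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)

    # Unweighted mean of the two per-class F1 scores.
    return (f1(True) + f1(False)) / 2
```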

Licensing Information

  • The contents of μ-MATH's machine-generated model_output column are subject to the underlying LLMs' licensing terms.
  • The contents of all other U-MATH and μ-MATH dataset fields, as well as the code in this repository, are available under the MIT license.

Citation

If you use U-MATH or μ-MATH in your research, please cite the paper:

```bibtex
@inproceedings{umath2024,
  title={U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs},
  author={Konstantin Chernyshev and Vitaliy Polshkov and Ekaterina Artemova and Alex Myasnikov and Vlad Stepanov and Alexei Miasnikov and Sergei Tilga},
  year={2024}
}
```

Contact

For inquiries, please contact kchernyshev@toloka.ai.