An extremely simple, batteries-included benchmarking system for evaluating LLM performance. Define your model call as a Python function (see the sketch after the list below), import it in `main.py`, select a benchmark, and run the evaluation. Supported benchmarks:
- HumanEval
- GPQA
- MATH (requires an OpenAI API key for correctness evaluation)
- Dharma
- AIME (2024 test subset - 10 questions that don't require vision capabilities)
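A model call might look like the following. This is a minimal sketch: the file name, function name, client setup, and model identifier are illustrative assumptions, not the repository's actual code.

```python
# model.py -- hypothetical example; the client and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_model(prompt: str) -> str:
    """Take a prompt string and return the model's text completion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any model identifier your provider accepts
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

You would then import this function in `main.py` and point the evaluation loop at it.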
- Some benchmarks require outputs to follow a specific format (e.g. MATH expects a boxed answer). Since we are now in the era of systems rather than pure models, and prompting strategies vary, you can adjust how the format is requested by editing the prompt modifiers in `main.py` (a rough sketch follows this note).
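As an illustration, format instructions can be kept as plain strings appended to each question. The dictionary name and wording below are hypothetical and do not reflect the exact contents of `main.py`.

```python
# Hypothetical prompt modifiers -- adjust the wording to match how your
# system or prompting strategy asks for formatted answers.
PROMPT_MODIFIERS = {
    "MATH": "\n\nGive your final answer inside \\boxed{}.",
    "GPQA": "\n\nAnswer with a single letter (A, B, C, or D).",
    "HumanEval": "\n\nReturn only the completed Python function.",
}

def build_prompt(question: str, benchmark: str) -> str:
    """Append the benchmark-specific format instruction to the question."""
    return question + PROMPT_MODIFIERS.get(benchmark, "")
```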
Adding new benchmarks is straightforward, especially for multiple-choice formats: upload your benchmark data and register it in `eval.py` (see the sketch below).
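For a multiple-choice benchmark, the addition might look roughly like this. The loader name, file path, data schema, and scoring helper are assumptions for illustration; the real structure of `eval.py` may differ.

```python
import json

def load_my_benchmark(path: str = "data/my_benchmark.jsonl") -> list[dict]:
    """Load multiple-choice items: each line has a question, choices, and answer."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def score_my_benchmark(call_model, items: list[dict]) -> float:
    """Ask the model each question and return the fraction answered correctly."""
    correct = 0
    for item in items:
        prompt = item["question"] + "\n" + "\n".join(item["choices"])
        prediction = call_model(prompt).strip().upper()
        if prediction.startswith(item["answer"].upper()):
            correct += 1
    return correct / len(items)
```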
That's it.