GLM-Simple-Evals is an internal evaluation toolkit for large language models developed by Z.AI, based on OpenAI's simple-evals project. We have open-sourced this framework to help the developer community reproduce the performance of Z.AI's released GLM-4.5 and GLM-4.6 models across various evaluation tasks.
The toolkit currently supports the following evaluation benchmarks, covering reasoning, coding, mathematics, and other domains:
- AIME
- GPQA
- HLE
- LiveCodeBench
- MATH 500
- SciCode
- MMLU Pro
This project supports the following model inference methods:
- Z.AI official `zai-sdk`
- OpenAI-compatible API interface (an example invocation is sketched below)
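For example, the sketch below shows how an OpenAI-compatible endpoint could be targeted instead of the Z.AI SDK; the base URL and API key are placeholders, and the remaining flags mirror the task examples later in this document.

```bash
# Sketch: evaluate through an OpenAI-compatible API instead of the Z.AI SDK.
# The base URL and key are placeholders; substitute your provider's values.
# Add the checker flags shown in the task-specific examples below if the
# chosen task requires a verification model.
python3 evaluate.py \
--model_name "glm-4.6" \
--backbone "openai" \
--openai_base_url "https://your-provider.example/v1" \
--openai_api_key "xxxxxx" \
--save_dir "/temp/eval_results" \
--tasks aime2024 \
--proc_num 60 \
--auto_extract_answer \
--max_new_tokens 128000 \
--stream
```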
We provide an `eval_example.sh` example script. You only need to configure the API key and other necessary parameters (such as model endpoints, verification model endpoints, etc.) and download the required data to begin evaluation.
For detailed evaluation guidelines, please refer to the sections below.
We recommend using Python 3.10. Install the required dependencies with:
pip install -r requirements.txt
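A minimal environment sketch, assuming a `python3.10` interpreter is available on your PATH (the virtual environment is optional but keeps the toolkit's dependencies isolated):

```bash
# Optional: isolate dependencies in a Python 3.10 virtual environment.
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```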
- Download the glm-simple-evals-dataset and place it in the `./data` directory.
- Download the SciCode test cases from the SciCode official repository: fetch the test data from the Google Drive link and place it at `./data/scicode/test_data.h5` (the expected layout is sketched after this list). Note: please comply with the license terms and usage conditions specified in the original repository when using this dataset.
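After both downloads, the data directory should look roughly as follows; this is a sketch based on the paths above, and the exact files inside `./data` depend on which benchmarks you run:

```bash
# Sketch of the expected data layout (illustrative):
#   ./data/                       <- glm-simple-evals-dataset contents
#   ./data/scicode/test_data.h5   <- SciCode test data from the Google Drive link
mkdir -p ./data/scicode
ls ./data/scicode/test_data.h5   # quick sanity check that the SciCode file is in place
```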
In the HLE evaluation task, `gpt-4o` is required for result verification. Execute the following command to run the evaluation:
python3 evaluate.py \
--model_name "glm-4.6" \
--backbone "zai" \
--zai_api_key "xxxxxx" \
--save_dir "/temp/eval_results" \
--tasks hle \
--proc_num 60 \
--auto_extract_answer \
--max_new_tokens 128000 \
--checker_model_name "gpt-4o" \
--checker_url "xxxx" \
--checker_api_key "xxxx" \
--stream
In the LiveCodeBench evaluation task, you need to specify the test date range as `2407_2501`. Execute the following command to run the evaluation:
python3 evaluate.py \
--model_name "glm-4.6" \
--backbone "zai" \
--zai_api_key "xxxxxx" \
--save_dir "/temp/eval_results" \
--tasks lcb \
--lcb_date "2407_2501" \
--proc_num 60 \
--auto_extract_answer \
--max_new_tokens 128000 \
--stream \
--top_p 0.95
For the other evaluation tasks (AIME, GPQA, MATH 500, SciCode, MMLU Pro), `Meta-Llama-3.1-70B-Instruct` is used as the verification model. Execute the following command, replacing `aime2024` in `--tasks` with `gpqa`, `math500`, `mmlu_pro`, or `scicode` as needed:
python3 evaluate.py \
--model_name "glm-4.6" \
--backbone "zai" \
--zai_api_key "xxxxxx" \
--save_dir "/temp/eval_results" \
--tasks aime2024 \
--proc_num 60 \
--checker_model_name "Meta-Llama-3.1-70B-Instruct" \
--checker_url "xxxxx" \
--auto_extract_answer \
--max_new_tokens 128000 \
--stream
The following are descriptions of commonly used parameters in the evaluation script:
- `--model_name`: Target model name for evaluation
- `--backbone`: Model inference method, "zai" (Z.AI official SDK) or "openai" (OpenAI-compatible APIs)
- `--zai_api_key`: API key for the Z.AI API platform
- `--openai_api_key`: API key for other OpenAI-compatible providers
- `--openai_base_url`: API URL for other OpenAI-compatible providers
- `--save_dir`: Output directory for evaluation results
- `--tasks`: Evaluation benchmarks to run
- `--proc_num`: Number of concurrent processes
- `--auto_extract_answer`: Enable automatic answer extraction
- `--max_new_tokens`: Maximum number of generation tokens
- `--checker_model_name`: Verification model name
- `--checker_url`: API endpoint for the verification model
- `--checker_api_key`: API key for the verification model
- `--stream`: Enable streaming output
- `--lcb_date`: Test date range for LiveCodeBench evaluation, supports '2407_2501' or 'v6'
- Ensure you have obtained the necessary API keys and have properly configured them in the evaluation script.
- Different evaluation tasks may require specific verification models.
- Evaluation results will be saved in the specified `--save_dir` directory.
- Please adjust the `--proc_num` parameter according to your hardware resources to achieve optimal evaluation efficiency.