Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue¹*†, Zhiqi Chen¹*, Rui Lu¹, Andrew Zhao¹, Zhaokai Wang², Yang Yue¹, Shiji Song¹, Gao Huang¹‡

¹ Tsinghua University, LeapLab  ² Shanghai Jiao Tong University

* Equal Contribution  † Project Lead  ‡ Corresponding Author

Paper: https://arxiv.org/abs/2504.13837 | Project Page

News

  • 🎉🎉 2025.6.21: We are thrilled to announce that our paper Limit-of-RLVR has garnered over 120 citations on Semantic Scholar just two months after its release on 2025.4.21! 🎉🎉
  • 2025.6.20: Released evaluation code for DeepCoder.
  • 2025.5.24: Released evaluation code for Math and updated the README to reflect these changes.
  • 2025.5.17: Updated the paper on arXiv with new experiments involving DAPO and DeepScaler. Added detailed analysis on entropy, KL divergence, and the impact of rollout numbers.

Overview

Recent breakthroughs in reasoning-focused large language models (LLMs)—like OpenAI-o1, DeepSeek-R1, and Kimi-1.5—have heavily relied on Reinforcement Learning with Verifiable Rewards (RLVR). RLVR replaces human annotations with automated rewards (such as verified math answers or passed code tests) to enable scalable self-improvement. While RLVR enhances behaviors like self-reflection and iterative refinement, a critical question remains in the pursuit of continually self-evolving reasoning abilities:

Does RLVR actually expand LLMs' reasoning capabilities, or does it merely optimize existing ones?

To answer this, we evaluate models using the pass@k metric, where a problem counts as solved if at least one of k sampled attempts is correct.
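For reference, pass@k is commonly computed with the unbiased estimator introduced in the HumanEval/Codex evaluation; the formula is not spelled out in this README, so the following is a minimal sketch, assuming n samples per problem of which c are correct:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n sampled solutions, c of them correct.

    Equals 1 - C(n-c, k) / C(n, k): the probability that a random size-k
    subset of the n samples contains at least one correct solution.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 256 samples per problem, 3 of them correct
print(pass_at_k(n=256, c=3, k=1))    # ~0.012
print(pass_at_k(n=256, c=3, k=128))  # ~0.876
```

The large-k regime is what distinguishes base models from their RL-trained counterparts in our experiments: even a model with very few correct samples per problem reaches a high pass@k once k approaches the sample budget.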

Video Overview

Video: pass@k curves of base models and their zero-RL-trained counterparts across multiple mathematical benchmarks. When k is small, RL-trained models outperform their base versions. However, as k increases to the tens or hundreds, base models consistently catch up with RL-trained models across all benchmarks and LLM families without exception. Eventually, base models surpass RL-trained models.

Our conclusion:

  1. RL-trained models perform worse than base models in pass@k at large k.

  2. RL boosts sampling efficiency but reduces the reasoning capacity boundary.

  3. RLVR algorithms perform similarly and remain far from optimal.

  4. RLVR and distillation are fundamentally different.

Multiple Sampling in vLLM

In our experiments, we utilize two key mechanisms of vLLM to ensure response diversity both across different runs and within a single run's multiple samplings:

1. Cross-Run Diversity via Seed Control

When initializing the LLM engine:

LLM(seed=args.seed, ...)

vLLM uses the provided seed to initialize its internal random number generator, so runs launched with different seeds (e.g., --seed 1 vs. --seed 2) produce completely different response sequences and follow distinct sampling trajectories.
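A minimal sketch of how this looks in practice; the model path and CLI flags below are illustrative placeholders, not necessarily the repo's actual arguments:

```python
import argparse
from vllm import LLM

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--model", type=str, default="Qwen/Qwen2.5-Math-7B")  # placeholder checkpoint
args = parser.parse_args()

# Each distinct --seed initializes vLLM's internal RNG differently, so
# repeated runs generate different response sequences for the same prompts.
llm = LLM(model=args.model, seed=args.seed)
```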

2. Intra-Run Diversity

When performing multiple samplings in a single run (e.g., --n_sampling 32):

SamplingParams(n=32, temperature=0.6, ...)  # 32 samples per call

vLLM advances its internal random state for each sampled completion, keeping the sampling trajectories independent even though they share identical parameters. This ensures diversity across samplings without any manual seed adjustment.
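A minimal sketch of drawing 32 completions per prompt in a single call; the model, prompt, and decoding hyperparameters here are illustrative, and the repo's actual settings live in its evaluation scripts:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Math-7B", seed=0)  # placeholder checkpoint

# n=32 asks vLLM for 32 completions per prompt; the internal random state
# advances per sample, so the completions differ despite identical parameters.
params = SamplingParams(n=32, temperature=0.6, top_p=0.95, max_tokens=2048)
outputs = llm.generate(["Solve: what is the sum of the first 100 positive integers?"], params)

for i, completion in enumerate(outputs[0].outputs):
    print(f"--- sample {i} ---")
    print(completion.text)
```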

Evaluation

Math

Enter the math directory and follow its README.

Code

Enter the code directory and follow its README.

Acknowledgements

Our evaluation code is based on the following open-source projects:

We also extend our gratitude for the open-sourced checkpoints from:

Citation

If you use this work, please cite:

@article{yue2025limit-of-rlvr,
  title={Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?},
  author={Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao},
  journal={arXiv preprint arXiv:2504.13837},
  year={2025}
}
