Yang Yue¹*†, Zhiqi Chen¹*, Rui Lu¹, Andrew Zhao¹, Zhaokai Wang², Yang Yue¹, Shiji Song¹, Gao Huang¹‡
¹ Tsinghua University, LeapLab  ² Shanghai Jiao Tong University
* Equal Contribution † Project Lead ‡ Corresponding Author
- 🎉🎉 2025.6.21: We are thrilled to announce that our paper Limit-of-RLVR has garnered over 120 citations on Semantic Scholar just two months after its release on 2025.4.21! 🎉🎉
- 2025.6.20: Released evaluation code for DeepCoder.
- 2025.5.24: Released evaluation code for Math and updated the README to reflect these changes.
- 2025.5.17: Updated the paper on arXiv with new experiments involving DAPO and DeepScaler. Added detailed analysis on entropy, KL divergence, and the impact of rollout numbers.
Recent breakthroughs in reasoning-focused large language models (LLMs)—like OpenAI-o1, DeepSeek-R1, and Kimi-1.5—have heavily relied on Reinforcement Learning with Verifiable Rewards (RLVR). RLVR replaces human annotations with automated rewards (such as verified math answers or passed code tests) to enable scalable self-improvement. While RLVR enhances behaviors like self-reflection and iterative refinement, a critical question remains in the pursuit of continually self-evolving reasoning abilities:
Does RLVR actually expand LLMs' reasoning capabilities, or does it merely optimize existing ones?
To answer this, we evaluate models using the pass@k metric—where success requires only one correct solution among k attempts.
Video: pass@k curves of base models and their zero-RL-trained counterparts across multiple mathematical benchmarks. When k is small, RL-trained models outperform their base versions. However, as k increases to the tens or hundreds, base models consistently catch up with RL-trained models across all benchmarks and LLM families without exception. Eventually, base models surpass RL-trained models.
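For reference, pass@k is commonly estimated with the standard unbiased estimator: draw n ≥ k samples per problem, count the c samples that pass verification, and compute 1 − C(n−c, k)/C(n, k), averaged over problems. The sketch below is illustrative (the function name and example numbers are not taken from our evaluation scripts):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n attempts is correct, given c correct attempts."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 sampled solutions for one problem, 5 verified correct.
print(pass_at_k(n=200, c=5, k=1))    # 0.025  (small-k regime)
print(pass_at_k(n=200, c=5, k=128))  # ~0.99  (large-k regime)
```

As the example shows, any problem a model solves at least once pushes its pass@k toward 1 as k grows, which is exactly the regime where base models close the gap with RL-trained models.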
Our conclusions:
- RL-trained models perform worse than base models in pass@k at large k.
- RL boosts sampling efficiency but reduces the reasoning capacity boundary.
- RLVR algorithms perform similarly and remain far from optimal.
- RLVR and distillation are fundamentally different.
In our experiments, we utilize two key mechanisms of vLLM to ensure response diversity across different runs and within a single run's multiple samplings (a combined sketch follows the two points below):
When initializing the LLM engine:

```python
LLM(seed=args.seed, ...)
```

vLLM uses the provided seed to initialize its internal random number generator, so different runs with different seeds (e.g., `--seed 1` vs. `--seed 2`) follow distinct sampling trajectories and produce different response sequences.
When performing multiple samplings in a single run (e.g., `--n_sampling 32`):

```python
SamplingParams(n=32, temperature=0.6, ...)  # per-call sampling
```

vLLM advances its internal random state sequentially with each sampling call, keeping sampling trajectories independent even under identical parameters, so samples within a run stay diverse without manual seed adjustment.
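Putting the two together, here is a minimal end-to-end sketch (the model name, prompt, `top_p`, and `max_tokens` are illustrative placeholders, not our exact evaluation configuration):

```python
from vllm import LLM, SamplingParams

# Engine-level seed: rerunning with a different --seed gives a different run.
llm = LLM(model="Qwen/Qwen2.5-Math-7B", seed=1)

# Per-call sampling: 32 completions at temperature 0.6 from a single call;
# vLLM advances its internal random state so the 32 samples differ from each other.
params = SamplingParams(n=32, temperature=0.6, top_p=0.95, max_tokens=2048)

outputs = llm.generate(["Prove that the sum of two even integers is even."], params)
for sample in outputs[0].outputs:
    print(sample.text)
```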
Enter the `math` directory and follow its README.
Enter the `code` directory and follow its README.
Our evaluation code is based on the following open-source projects:
We also extend our gratitude to the following for their open-sourced checkpoints:
- DAPO: BytedTsinghua-SIA/DAPO-Qwen-32B
- Oat-Zero: sail/Qwen2.5-Math-7B-Oat-Zero
If you use this work, please cite:
@article{yue2025limit-of-rlvr,
  title={Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?},
  author={Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao},
  journal={arXiv preprint arXiv:2504.13837},
  year={2025}
}