Yang Yue¹*†, Zhiqi Chen¹*, Rui Lu¹, Andrew Zhao¹, Zhaokai Wang², Yang Yue¹, Shiji Song¹, Gao Huang¹‡
¹ Tsinghua University, LeapLab  ² Shanghai Jiao Tong University
* Equal Contribution † Project Lead ‡ Corresponding Author
- 🎉🎉 2025.6.21: We are thrilled to announce that our paper Limit-of-RLVR has garnered over 120 citations on Semantic Scholar just two months after its release on 2025.4.21! 🎉🎉
- 2025.6.20: Released evaluation code for DeepCoder.
- 2025.5.24: Released evaluation code for Math and updated the README to reflect these changes.
- 2025.5.17: Updated the paper on arXiv with new experiments involving DAPO and DeepScaler. Added detailed analysis on entropy, KL divergence, and the impact of rollout numbers.
Recent breakthroughs in reasoning-focused large language models (LLMs)—like OpenAI-o1, DeepSeek-R1, and Kimi-1.5—have heavily relied on Reinforcement Learning with Verifiable Rewards (RLVR). RLVR replaces human annotations with automated rewards (such as verified math answers or passed code tests) to enable scalable self-improvement. While RLVR enhances behaviors like self-reflection and iterative refinement, a critical question remains in the pursuit of continually self-evolving reasoning abilities:
Does RLVR actually expand LLMs' reasoning capabilities, or does it merely optimize existing ones?
To answer this, we evaluate models using the pass@k metric—where success requires only one correct solution among k attempts.
Video: pass@k curves of base models and their zero-RL-trained counterparts across multiple mathematical benchmarks. When k is small, RL-trained models outperform their base versions. However, as k increases to the tens or hundreds, base models consistently catch up with RL-trained models across all benchmarks and LLM families without exception. Eventually, base models surpass RL-trained models.
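For reference, pass@k is commonly estimated with the standard unbiased estimator: draw n ≥ k samples per problem, count the c samples that pass verification, and compute 1 − C(n−c, k)/C(n, k), averaged over problems. The sketch below is illustrative (the function name and example numbers are not taken from our evaluation scripts):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n attempts is correct, given c correct attempts."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 sampled solutions for one problem, 5 verified correct.
print(pass_at_k(n=200, c=5, k=1))    # 0.025  (small-k regime)
print(pass_at_k(n=200, c=5, k=128))  # ~0.99  (large-k regime)
```

As the example shows, any problem a model solves at least once pushes its pass@k toward 1 as k grows, which is exactly the regime where base models close the gap with RL-trained models.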
Our conclusions:
- RL-trained models perform worse than base models in pass@k at large k.
- RL boosts sampling efficiency but reduces the reasoning capacity boundary.
- RLVR algorithms perform similarly and remain far from optimal.
- RLVR and distillation are fundamentally different.
In our experiments, we utilize two key mechanisms of vLLM to ensure response diversity across different runs and within a single run's multiple samplings (a combined sketch follows the two points below):
When initializing the LLM engine:

```python
LLM(seed=args.seed, ...)
```

vLLM uses the provided seed to initialize its internal random number generator, so different runs with different seeds (e.g., `--seed 1` vs. `--seed 2`) follow distinct sampling trajectories and produce different response sequences.
When performing multiple samplings in a single run (e.g., `--n_sampling 32`):

```python
SamplingParams(n=32, temperature=0.6, ...)  # per-call sampling
```

vLLM advances its internal random state sequentially with each sampling call, keeping sampling trajectories independent even under identical parameters, so samples within a run stay diverse without manual seed adjustment.
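Putting the two together, here is a minimal end-to-end sketch (the model name, prompt, `top_p`, and `max_tokens` are illustrative placeholders, not our exact evaluation configuration):

```python
from vllm import LLM, SamplingParams

# Engine-level seed: rerunning with a different --seed gives a different run.
llm = LLM(model="Qwen/Qwen2.5-Math-7B", seed=1)

# Per-call sampling: 32 completions at temperature 0.6 from a single call;
# vLLM advances its internal random state so the 32 samples differ from each other.
params = SamplingParams(n=32, temperature=0.6, top_p=0.95, max_tokens=2048)

outputs = llm.generate(["Prove that the sum of two even integers is even."], params)
for sample in outputs[0].outputs:
    print(sample.text)
```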
Enter the `math` directory and follow its README.
Enter the `code` directory and follow its README.
Our evaluation code is based on the following open-source projects:
We also extend our gratitude to the following for their open-sourced checkpoints:
- DAPO: BytedTsinghua-SIA/DAPO-Qwen-32B
- Oat-Zero: sail/Qwen2.5-Math-7B-Oat-Zero
If you use this work, please cite:
@article{yue2025limit-of-rlvr,
  title={Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?},
  author={Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao},
  journal={arXiv preprint arXiv:2504.13837},
  year={2025}
}