This project explores an approach to language model optimization by focusing on guidance tokens in reasoning chains. Building upon the DeepSeekRL-Extended framework, we investigate how strategic token generation might improve model reasoning capabilities. Our implementation aims to be memory-efficient, requiring only 24GB GPU VRAM for training, making it accessible for research on consumer-grade hardware.
Language models often struggle with maintaining consistent reasoning chains in their responses. We hypothesize that by explicitly focusing on key transition points (guidance tokens) in the generation process, we might be able to improve the coherence and accuracy of model outputs. This project is currently in an experimental phase, aiming to test this hypothesis.
- 🎯 Focused Generation: Explores the impact of guidance tokens on reasoning
- 🚄 Resource Consideration: Designed to run on a single 24GB GPU
- 🎨 Three-Phase Approach: Structured generation process with specific focus on transition points
- 📊 Experimental Framework: Tools for analyzing the effectiveness of guided generation
- Prefix Phase: Generate initial text until punctuation
- Guidance Phase: Generate critical steering tokens
- Postfix Phase: Complete the remaining text
Input: "Bryan did 3*15= 45 push-ups in total without getting tired. He did 45-5= 40 push-ups in his third set."
Guidance: "So he" # Generated guidance token
Complete: "So he did 45+40= 85 push-ups in total."
Input: "Bryan started with 3 sets of 15 push-ups each, so 3 * 15 = 45 push-ups"
Guidance: "But at the end of" # Generated guidance token
Complete: "But at the end of the third set, he did 5 fewer, 45 - 5 = 40 push-ups"
```bash
# Clone the repository
git clone git@github.com:cnsdqd-dyb/Guide-GRPO.git
cd Guide-GRPO

# Install dependencies
pip install -r requirements.txt

# Run training
python train.py
```
Target: ~24GB of VRAM (a single RTX 4090); memory optimization is still in progress.
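Fitting training into this budget typically relies on standard memory-saving levers. The snippet below is an illustrative sketch of such settings (bf16 weights, gradient checkpointing, the default learning rate from the table below); the flags actually used by `train.py` may differ.

```python
# Illustrative memory-saving setup for a single 24GB GPU; not necessarily the
# exact configuration used by train.py.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,      # half-precision weights and activations
    attn_implementation="sdpa",      # memory-efficient attention kernel
)
model.gradient_checkpointing_enable()  # trade extra compute for activation memory
model.config.use_cache = False         # KV cache conflicts with checkpointing during training

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # default learning rate from the table
```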
Note: These are initial experimental results pending further validation and testing; a more comprehensive evaluation is ongoing.
| Parameter | Description | Default |
|---|---|---|
| `model_name` | Base model to fine-tune | Qwen/Qwen2.5-1.5B-Instruct |
| `num_chains` | Number of parallel chains | 16 |
| `temperature` | Sampling temperature | 1.0 |
| `learning_rate` | Initial learning rate | 5e-6 |
| `max_guide_tokens` | Maximum number of guidance tokens | 8 |
| `max_completion_tokens` | Maximum number of completion tokens | 786 |
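As an illustration of what `num_chains`, `temperature`, and `max_completion_tokens` control during rollouts, the hypothetical snippet below samples a group of parallel chains for a single prompt using the defaults above; the actual sampling loop lives in `train.py` and may differ.

```python
# Hypothetical rollout step: sample `num_chains` completions per prompt so they
# can later be scored against each other. Not the project's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

batch = tokenizer("Bryan does 3 sets of 15 push-ups ...", return_tensors="pt").to(model.device)
chains = model.generate(
    **batch,
    do_sample=True,
    temperature=1.0,           # `temperature`
    num_return_sequences=16,   # `num_chains`
    max_new_tokens=786,        # `max_completion_tokens`
)
completions = tokenizer.batch_decode(
    chains[:, batch["input_ids"].shape[1]:], skip_special_tokens=True
)
```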
```bibtex
@misc{guide-grpo,
  title={Guide-GRPO: LLM Reasoning Enhancement Inspired by DeepSeek},
  author={dongyubo},
  year={2025},
  publisher={GitHub}
}
```
This project builds upon the work from DeepSeekRL-Extended. We are grateful for their contributions to the field.
MIT License - see LICENSE for details
Research and Development
- Establish baseline performance metrics
- Optimize training stability and memory usage
- Investigate reward decay relationships for guide positions (one candidate formulation is sketched after this list)
- Address implementation challenges
- Conduct systematic ablation studies
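For the reward-decay item above, one purely hypothetical formulation would discount a chain's reward for guidance inserted at later positions, for example with an exponential decay:

```python
# One possible formulation of position-dependent reward decay for guidance
# tokens. This is an open roadmap question, not an implemented design; the
# function name, decay form, and default gamma are all assumptions.
def decayed_guide_reward(base_reward: float, guide_position: int, gamma: float = 0.9) -> float:
    """Scale a chain's reward by the position at which guidance was inserted.

    base_reward:    reward of the completed chain (e.g., answer correctness)
    guide_position: index of the reasoning step after which guidance was inserted
    gamma:          decay factor in (0, 1]
    """
    return base_reward * (gamma ** guide_position)
```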
We welcome contributions and feedback! Please feel free to submit a Pull Request or open an issue.
Made with ❤️ by ReLER Lab