This project explores an approach to language model optimization by focusing on guidance tokens in reasoning chains. Building upon the DeepSeekRL-Extended framework, we investigate how strategic token generation might improve model reasoning capabilities. Our implementation aims to be memory-efficient, requiring only 24GB GPU VRAM for training, making it accessible for research on consumer-grade hardware.
Language models often struggle with maintaining consistent reasoning chains in their responses. We hypothesize that by explicitly focusing on key transition points (guidance tokens) in the generation process, we might be able to improve the coherence and accuracy of model outputs. This project is currently in an experimental phase, aiming to test this hypothesis.
- 🎯 Focused Generation: Explores the impact of guidance tokens on reasoning
- 🚄 Resource Consideration: Designed to run on a single 24GB GPU
- 🎨 Three-Phase Approach: Structured generation process with specific focus on transition points
- 📊 Experimental Framework: Tools for analyzing the effectiveness of guided generation
- Prefix Phase: Generate initial text until punctuation
- Guidance Phase: Generate critical steering tokens
- Postfix Phase: Complete the remaining text
Input: "Bryan did 3*15= 45 push-ups in total without getting tired. He did 45-5= 40 push-ups in his third set."
Guidance: "So he" # Generated guidance token
Complete: "So he did 45+40= 85 push-ups in total."
Input: "Bryan started with 3 sets of 15 push-ups each, so 3 * 15 = 45 push-ups"
Guidance: "But at the end of" # Generated guidance token
Complete: "But at the end of the third set, he did 5 fewer, 45 - 5 = 40 push-ups"
```bash
# Clone the repository
git clone git@github.com:cnsdqd-dyb/Guide-GRPO.git
cd Guide-GRPO

# Install dependencies
pip install -r requirements.txt

# Run training
python train.py
```
Target: ~24GB of VRAM (a single RTX 4090); memory optimization is still in progress.
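Fitting training into this budget typically relies on standard memory-saving levers. The snippet below is an illustrative sketch of such settings (bf16 weights, gradient checkpointing, the default learning rate from the table below); the flags actually used by `train.py` may differ.

```python
# Illustrative memory-saving setup for a single 24GB GPU; not necessarily the
# exact configuration used by train.py.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,      # half-precision weights and activations
    attn_implementation="sdpa",      # memory-efficient attention kernel
)
model.gradient_checkpointing_enable()  # trade extra compute for activation memory
model.config.use_cache = False         # KV cache conflicts with checkpointing during training

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # default learning rate from the table
```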
Note: These are initial experimental results pending further validation and testing; a more comprehensive evaluation is ongoing.
| Parameter | Description | Default |
|---|---|---|
| `model_name` | Base model to fine-tune | Qwen/Qwen2.5-1.5B-Instruct |
| `num_chains` | Number of parallel chains | 16 |
| `temperature` | Sampling temperature | 1.0 |
| `learning_rate` | Initial learning rate | 5e-6 |
| `max_guide_tokens` | Maximum number of guidance tokens | 8 |
| `max_completion_tokens` | Maximum number of completion tokens | 786 |
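As an illustration of what `num_chains`, `temperature`, and `max_completion_tokens` control during rollouts, the hypothetical snippet below samples a group of parallel chains for a single prompt using the defaults above; the actual sampling loop lives in `train.py` and may differ.

```python
# Hypothetical rollout step: sample `num_chains` completions per prompt so they
# can later be scored against each other. Not the project's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

batch = tokenizer("Bryan does 3 sets of 15 push-ups ...", return_tensors="pt").to(model.device)
chains = model.generate(
    **batch,
    do_sample=True,
    temperature=1.0,           # `temperature`
    num_return_sequences=16,   # `num_chains`
    max_new_tokens=786,        # `max_completion_tokens`
)
completions = tokenizer.batch_decode(
    chains[:, batch["input_ids"].shape[1]:], skip_special_tokens=True
)
```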
```bibtex
@misc{guide-grpo,
  title={Guide-GRPO: LLM Reasoning Enhancement Inspired by DeepSeek},
  author={dongyubo},
  year={2025},
  publisher={GitHub}
}
```
This project builds upon the work from DeepSeekRL-Extended. We are grateful for their contributions to the field.
MIT License - see LICENSE for details
Research and Development
- Establish baseline performance metrics
- Optimize training stability and memory usage
- Investigate reward decay relationships for guide positions (one candidate formulation is sketched after this list)
- Address implementation challenges
- Conduct systematic ablation studies
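For the reward-decay item above, one purely hypothetical formulation would discount a chain's reward for guidance inserted at later positions, for example with an exponential decay:

```python
# One possible formulation of position-dependent reward decay for guidance
# tokens. This is an open roadmap question, not an implemented design; the
# function name, decay form, and default gamma are all assumptions.
def decayed_guide_reward(base_reward: float, guide_position: int, gamma: float = 0.9) -> float:
    """Scale a chain's reward by the position at which guidance was inserted.

    base_reward:    reward of the completed chain (e.g., answer correctness)
    guide_position: index of the reasoning step after which guidance was inserted
    gamma:          decay factor in (0, 1]
    """
    return base_reward * (gamma ** guide_position)
```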
We welcome contributions and feedback! Please feel free to submit a Pull Request or open an issue.
Made with ❤️ by ReLER Lab