💡 Key Findings | 📈 Scaling Results | 🔥 Models (infer & SFT) | 📝 Open Source List
- [2025-11-01] 🎉 Released the parathinker-math-6K dataset and training scripts.
- [2025-10-02] 🚀 Updated the inference engine and released the improved ParaThinker-1.5B model.
- Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes.
- However, this approach hits a bottleneck where further computation yields only marginal gains, a consequence of "Tunnel Vision": imperfect early steps lock the model into a suboptimal reasoning path.
- We introduce ParaThinker, an end-to-end framework that trains LLMs to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer.
- Scaling compute in parallel (width) proves more effective and efficient than scaling it sequentially (depth).
Here are the core insights from our analysis and evaluations:
📈 Superior Accuracy Gains: On challenging reasoning benchmarks (AIME 2024/2025, AMC 2023, MATH-500), ParaThinker achieves an average accuracy improvement of 12.3% for 1.5B models and 7.5% for 7B models with 8 parallel reasoning paths.
✅ Overcomes Tunnel Vision: The bottleneck in sequential reasoning arises from early token choices committing to flawed paths; parallelism enables diverse exploration to break through.
🧠 Native Parallelism in a Single Pass: Using specialized control tokens, thought-specific positional embeddings, and two-phase attention, ParaThinker generates and integrates reasoning paths end-to-end without external verifiers (see the illustrative sketch after this list).
⚡ Minimal Latency Overhead: Adds only 7.1% latency on average by batching parallel paths for hardware efficiency; generating 16 paths takes less than 2x the time of a single path.
🧱 Scalable SFT Training: Supervised fine-tuning with paths from a teacher model enables generalization to more paths at inference.
🔁 Smaller Models Outperform Larger Ones: ParaThinker-equipped small LLMs surpass larger sequential counterparts, offering a new scaling dimension.
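The two-phase attention mentioned above can be pictured as a block-structured attention mask. The snippet below is a minimal illustrative sketch, not the repository's implementation: it assumes a `[prompt | path_1 | ... | path_P | summary]` token layout and shows how path tokens stay isolated from each other during thinking, while summary tokens attend to everything when integrating the final answer.

```python
# Illustrative sketch of a two-phase attention mask for P parallel reasoning paths.
# Assumed layout: [prompt | path_1 | ... | path_P | summary]. Not the official code.
import torch

def two_phase_attention_mask(prompt_len: int, path_lens: list[int], summary_len: int) -> torch.Tensor:
    total = prompt_len + sum(path_lens) + summary_len
    mask = torch.zeros(total, total, dtype=torch.bool)  # True = attention allowed

    # Prompt: standard causal attention within the prompt.
    mask[:prompt_len, :prompt_len] = torch.tril(torch.ones(prompt_len, prompt_len, dtype=torch.bool))

    # Phase 1 (thinking): each path sees the prompt plus its own tokens (causally),
    # but never the other paths -- this is what enables diverse, independent exploration.
    offset = prompt_len
    for plen in path_lens:
        mask[offset:offset + plen, :prompt_len] = True
        mask[offset:offset + plen, offset:offset + plen] = torch.tril(torch.ones(plen, plen, dtype=torch.bool))
        offset += plen

    # Phase 2 (summarization): summary tokens attend to the prompt, every path,
    # and earlier summary tokens, so the final answer can integrate all paths.
    s0 = offset
    mask[s0:, :s0] = True
    mask[s0:, s0:] = torch.tril(torch.ones(summary_len, summary_len, dtype=torch.bool))
    return mask

# Example: a 4-token prompt, two 3-token paths, and a 2-token summary.
print(two_phase_attention_mask(4, [3, 3], 2).int())
```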
We will release the full code for training and inference, along with evaluation scripts. Checkpoints for ParaThinker-1.5B are available on 🤗 HuggingFace.
Evaluated on math reasoning benchmarks, scaling the number of parallel reasoning paths P from 1 to 8.
ParaThinker models are based on the DeepSeek-R1-Distill-Qwen series:
| Model | Description | Download |
|---|---|---|
| ParaThinker-1.5B | Fine-tuned for parallel reasoning | 🤗 Leslie04/ParaThinker-1.5B |
| ParaThinker-7B | Higher-capacity for complex tasks | 🤗 Leslie04/ParaThinker-7B (coming soon) |
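To pull a released checkpoint locally, a standard `huggingface_hub` download works; the snippet below is a minimal sketch using the repo ID from the table above (the target directory is arbitrary).

```python
# Minimal sketch: download the ParaThinker-1.5B checkpoint from the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Leslie04/ParaThinker-1.5B",
    local_dir="./checkpoints/ParaThinker-1.5B",  # arbitrary local path
)
print(f"Checkpoint downloaded to {local_dir}")
```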
For efficient parallel inference using our customized vLLM engine, refer to the Inference Submodule README. This submodule implements the native parallel thinking inference engine, leveraging PagedAttention for KV cache reuse. Also see the quick start example in inference/examples/parathinker/example.py for usage.
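For orientation, the sketch below shows roughly what inference looks like with a vLLM-style API. It is a hedged illustration, not the submodule's actual interface: the customized engine's options for requesting P parallel paths are not shown here, and the real entry point is inference/examples/parathinker/example.py.

```python
# Hedged sketch of serving ParaThinker with a vLLM-style API. This is NOT the
# submodule's actual interface: the customized engine adds native parallel
# thinking (requesting P parallel paths and reusing the prompt's KV cache via
# PagedAttention); see inference/examples/parathinker/example.py for the real usage.
from vllm import LLM, SamplingParams

llm = LLM(model="Leslie04/ParaThinker-1.5B")
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

prompt = "Find the remainder when 7^2024 is divided by 100. Think step by step."
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```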
We use a customized LLaMA-Factory to train the native parallel thinking model.
Build Conda Environment: The following is a minimal script to build a conda environment for ParaThinker training:
set -e
eval "$(conda shell.bash hook)"
if ! conda env list | grep -q "parathinker-sft"; then
conda create -y -n parathinker-sft python=3.11
fi
conda activate parathinker-sft
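# Install the customized LLaMA-Factory and the local transformers copy in editable mode.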
cd ./train/LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ../transformers
pip install -e .

Dataset Installation and SFT Running: Install the parathinker-math-6K dataset, then use the example training script to quickly start SFT on deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
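If the dataset is pulled from the Hugging Face Hub, a quick sanity check with the `datasets` library looks like the sketch below. The repo ID `Leslie04/parathinker-math-6K` is an assumption for illustration; substitute the actual Hub ID or local path expected by the example training script.

```python
# Hedged sketch: inspect the SFT data with the `datasets` library.
# The repo ID below is hypothetical; use the dataset location referenced
# by the example training script.
from datasets import load_dataset

ds = load_dataset("Leslie04/parathinker-math-6K", split="train")  # hypothetical repo ID
print(len(ds), "examples")
print(ds[0])  # each example should contain the prompt and teacher-generated reasoning used for SFT
```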
- Inference Engine based on vLLM
- ParaThinker-1.5B Model
- ParaThinker-7B Model
- SFT dataset and training scripts based on LLaMA-Factory
- Evaluation script


