This repository contains code to finetune a language model (Qwen/Qwen2.5-3B-Instruct) to solve 4x4 Mini Sudoku puzzles with GRPO (Group Relative Policy Optimization). The project demonstrates how to train a model to:
- Follow a specific XML output format
- Apply logical reasoning to solve Sudoku puzzles
- Output valid 4x4 Sudoku solutions
Clone the repository and install dependencies:

```bash
git clone https://github.com/Asad-Shahab/sudoku_finetuning.git
cd sudoku_finetuning
pip install -r requirements.txt
```
The project uses a custom dataset of 4x4 Mini Sudoku puzzles. You can generate the dataset with:

```bash
python dataset.py
```
This will:
- Generate 1000 unique 4x4 Mini Sudoku puzzles
- Split them into training (90%) and validation (10%) sets
- Save the formatted data in the `sudoku_dataset` directory (one way to generate such puzzles is sketched below)
To verify the dataset's integrity:

```bash
python verify_dataset.py
```
Each example in the dataset follows this structure:

```json
{
  "prompt": [
    {
      "role": "system",
      "content": "\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n"
    },
    {
      "role": "user",
      "content": "Solve this 4x4 Mini Sudoku puzzle:\n3 1 4 2\n_ 4 1 3\n1 2 _ _\n4 3 2 1"
    }
  ],
  "answer": "<reasoning> </reasoning>\n<answer>\n3 1 4 2\n2 4 1 3\n1 2 3 4\n4 3 2 1\n</answer>"
}
```
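If, as this structure suggests, the splits are saved with the HuggingFace `datasets` library, you can inspect them with `load_from_disk` (the `save_to_disk` layout and split names here are assumptions):

```python
from datasets import load_from_disk

# Assumes dataset.py saved a DatasetDict via save_to_disk("sudoku_dataset")
dataset = load_from_disk("sudoku_dataset")
print(dataset)                        # split names and sizes
print(dataset["train"][0]["prompt"])  # chat-style prompt messages
```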
The finetuning process uses GRPO with several reward functions to train the model (one is sketched below the list):
- `correctness_reward_func`: Rewards correct Sudoku solutions
- `int_reward_func`: Checks that all numbers in the solution are valid (1-4)
- `strict_format_reward_func` and `soft_format_reward_func`: Verify XML formatting
- `xmlcount_reward_func`: Rewards proper XML tag placement
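The actual implementations live in `finetune.py`; as a minimal sketch of what `correctness_reward_func` might look like under TRL's reward-function interface (the reward value and parsing details here are assumptions), TRL passes extra dataset columns such as `answer` as keyword arguments:

```python
import re

def extract_answer(text):
    # Pull the grid out of the <answer>...</answer> block, if present
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward_func(prompts, completions, answer, **kwargs):
    # Completions arrive in chat format: a list of [{"role": ..., "content": ...}]
    responses = [completion[0]["content"] for completion in completions]
    # Reward the completion when its predicted grid matches the reference solution
    return [
        2.0 if extract_answer(r) == extract_answer(a) else 0.0
        for r, a in zip(responses, answer)
    ]
```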
To run the finetuning:

```bash
python finetune.py
```
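Presumably `finetune.py` wires the reward functions above into TRL's `GRPOTrainer`; a rough sketch of that setup, assuming the reward functions are in scope and with illustrative (not actual) hyperparameters:

```python
from datasets import load_from_disk
from trl import GRPOConfig, GRPOTrainer

dataset = load_from_disk("sudoku_dataset")

training_args = GRPOConfig(
    output_dir="outputs",
    num_generations=8,          # completions sampled per prompt for the group baseline
    max_completion_length=256,
    report_to="wandb",          # Weights & Biases experiment tracking
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # finetune.py actually prepares this via unsloth
    reward_funcs=[
        correctness_reward_func,        # as sketched above; real versions in finetune.py
        int_reward_func,
        strict_format_reward_func,
        soft_format_reward_func,
        xmlcount_reward_func,
    ],
    args=training_args,
    train_dataset=dataset["train"],
)
trainer.train()
```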
For long training sessions, you can use tmux:

```bash
tmux new -s sudoku_training
python finetune.py
# Detach with Ctrl+b, then d
# Reconnect later with: tmux attach -t sudoku_training
```
To run inference with the finetuned model:

```bash
python inference.py
```
This will load the trained model and let you interactively enter 4x4 Sudoku puzzles for it to solve.
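Since the project uses unsloth, `inference.py` likely loads the checkpoint with `FastLanguageModel`; a minimal sketch (the checkpoint path is hypothetical):

```python
from unsloth import FastLanguageModel

# Load the finetuned checkpoint (path is an assumption)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-final",
    max_seq_length=1024,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable fast inference mode

messages = [
    {"role": "system", "content": "Respond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"},
    {"role": "user", "content": "Solve this 4x4 Mini Sudoku puzzle:\n3 1 4 2\n_ 4 1 3\n1 2 _ _\n4 3 2 1"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```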
To test the base model's performance (without finetuning):

```bash
python pretrain_test.py
```
This project builds on:
- The unsloth library for efficient finetuning
- The TRL (Transformer Reinforcement Learning) library for the GRPO implementation
- Weights & Biases for experiment tracking