This repository contains code to finetune a language model (Qwen/Qwen2.5-3B-Instruct) to solve 4x4 Mini Sudoku puzzles with GRPO (Group Relative Policy Optimization). The project demonstrates how to train a model to:
- Follow a specific XML output format
- Apply logical reasoning to solve Sudoku puzzles
- Output valid 4x4 Sudoku solutions
Clone the repository and install dependencies:

```bash
git clone https://github.com/Asad-Shahab/sudoku_finetuning.git
cd sudoku_finetuning
pip install -r requirements.txt
```
The project uses a custom dataset of 4x4 Mini Sudoku puzzles. You can generate the dataset with:

```bash
python dataset.py
```
This will:
- Generate 1000 unique 4x4 Mini Sudoku puzzles
- Split them into training (90%) and validation (10%) sets
- Save the formatted data in the `sudoku_dataset` directory (one way to generate such puzzles is sketched below)
To verify the dataset's integrity:

```bash
python verify_dataset.py
```
Each example in the dataset follows this structure:

```json
{
  "prompt": [
    {
      "role": "system",
      "content": "\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n"
    },
    {
      "role": "user",
      "content": "Solve this 4x4 Mini Sudoku puzzle:\n3 1 4 2\n_ 4 1 3\n1 2 _ _\n4 3 2 1"
    }
  ],
  "answer": "<reasoning> </reasoning>\n<answer>\n3 1 4 2\n2 4 1 3\n1 2 3 4\n4 3 2 1\n</answer>"
}
```
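If, as this structure suggests, the splits are saved with the HuggingFace `datasets` library, you can inspect them with `load_from_disk` (the `save_to_disk` layout and split names here are assumptions):

```python
from datasets import load_from_disk

# Assumes dataset.py saved a DatasetDict via save_to_disk("sudoku_dataset")
dataset = load_from_disk("sudoku_dataset")
print(dataset)                        # split names and sizes
print(dataset["train"][0]["prompt"])  # chat-style prompt messages
```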
The finetuning process uses GRPO with several reward functions to train the model (one is sketched below the list):
- `correctness_reward_func`: Rewards correct Sudoku solutions
- `int_reward_func`: Checks that all numbers in the solution are valid (1-4)
- `strict_format_reward_func` and `soft_format_reward_func`: Verify XML formatting
- `xmlcount_reward_func`: Rewards proper XML tag placement
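The actual implementations live in `finetune.py`; as a minimal sketch of what `correctness_reward_func` might look like under TRL's reward-function interface (the reward value and parsing details here are assumptions), TRL passes extra dataset columns such as `answer` as keyword arguments:

```python
import re

def extract_answer(text):
    # Pull the grid out of the <answer>...</answer> block, if present
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward_func(prompts, completions, answer, **kwargs):
    # Completions arrive in chat format: a list of [{"role": ..., "content": ...}]
    responses = [completion[0]["content"] for completion in completions]
    # Reward the completion when its predicted grid matches the reference solution
    return [
        2.0 if extract_answer(r) == extract_answer(a) else 0.0
        for r, a in zip(responses, answer)
    ]
```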
To run the finetuning:

```bash
python finetune.py
```
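Presumably `finetune.py` wires the reward functions above into TRL's `GRPOTrainer`; a rough sketch of that setup, assuming the reward functions are in scope and with illustrative (not actual) hyperparameters:

```python
from datasets import load_from_disk
from trl import GRPOConfig, GRPOTrainer

dataset = load_from_disk("sudoku_dataset")

training_args = GRPOConfig(
    output_dir="outputs",
    num_generations=8,          # completions sampled per prompt for the group baseline
    max_completion_length=256,
    report_to="wandb",          # Weights & Biases experiment tracking
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # finetune.py actually prepares this via unsloth
    reward_funcs=[
        correctness_reward_func,        # as sketched above; real versions in finetune.py
        int_reward_func,
        strict_format_reward_func,
        soft_format_reward_func,
        xmlcount_reward_func,
    ],
    args=training_args,
    train_dataset=dataset["train"],
)
trainer.train()
```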
For long training sessions, you can use tmux:

```bash
tmux new -s sudoku_training
python finetune.py
# Detach with Ctrl+b, then d
# Reconnect later with: tmux attach -t sudoku_training
```
To run inference with the finetuned model:

```bash
python inference.py
```
This will load the trained model and let you interactively enter 4x4 Sudoku puzzles for it to solve.
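Since the project uses unsloth, `inference.py` likely loads the checkpoint with `FastLanguageModel`; a minimal sketch (the checkpoint path is hypothetical):

```python
from unsloth import FastLanguageModel

# Load the finetuned checkpoint (path is an assumption)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-final",
    max_seq_length=1024,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable fast inference mode

messages = [
    {"role": "system", "content": "Respond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"},
    {"role": "user", "content": "Solve this 4x4 Mini Sudoku puzzle:\n3 1 4 2\n_ 4 1 3\n1 2 _ _\n4 3 2 1"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```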
To test the base model's performance (without finetuning):

```bash
python pretrain_test.py
```
This project builds on:
- The unsloth library for efficient finetuning
- The TRL (Transformer Reinforcement Learning) library for the GRPO implementation
- Weights & Biases for experiment tracking