This repository demonstrates how to adapt Gemma 3, a powerful vision-language model (VLM), for object detection. By treating bounding boxes as sequences of discrete `<locXXXX>` tokens, we enable the model to reason about spatial information in a language-native way, inspired by PaliGemma.
Here's a glimpse of what our fine-tuned model can do. These images are generated by the `predict.py` script:
*(Figure: four sample images with license plates detected by the fine-tuned model.)*
Most traditional object detection models output continuous bounding box coordinates using regression heads. In contrast, we follow the PaliGemma approach of treating bounding boxes as sequences of discrete tokens (e.g., `<loc0512>`), allowing detection to be framed as text generation.
However, unlike PaliGemma, Gemma 3 does not natively include these spatial tokens in its tokenizer.
We support two fine-tuning modes:

- **Without location tokens (the default).** The model is trained on `<locXXXX>` strings even though they are not in the tokenizer vocabulary, forcing it to learn spatial grounding implicitly. Although lightweight, this mode still produces interesting results.
- **With location tokens.** Passing the flag `--include_loc_tokens` extends the tokenizer to explicitly include all `<locXXXX>` tokens (from `<loc0000>` to `<loc1023>`) and fine-tunes their embeddings. After this, we fine-tune the entire model, following a two-stage training procedure. This enables the model to learn spatial grounding more effectively (see the tokenizer-extension sketch below).
> 💡 Both modes are supported; toggle with `--include_loc_tokens` in `train.py`.
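For the second mode, the tokenizer extension boils down to adding the 1,024 location tokens and resizing the model's embedding matrix. Below is a minimal sketch using the Hugging Face `transformers` API; the checkpoint name is an assumption, not necessarily what `train.py` uses, and the full two-stage schedule lives in the training script itself.

```python
# Minimal sketch: extending the tokenizer with <locXXXX> tokens.
# Assumption: the checkpoint name below is illustrative only.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3-4b-it"  # hypothetical choice of Gemma 3 checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

# Add <loc0000> ... <loc1023> as new tokens and grow the embeddings to match.
loc_tokens = [f"<loc{i:04d}>" for i in range(1024)]
num_added = processor.tokenizer.add_tokens(loc_tokens)
model.resize_token_embeddings(len(processor.tokenizer))
print(f"Added {num_added} location tokens; new vocab size: {len(processor.tokenizer)}")
```

After this step, the new embedding rows are the only randomly initialized weights, which is why they are tuned first before the rest of the model is unfrozen.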
We use the `ariG23498/license-detection-paligemma` dataset, a modified version of `keremberke/license-plate-object-detection`, reformatted to match the expectations of text-based object detection.
Each bounding box is encoded as a sequence of location tokens like `<loc0123>`, following the PaliGemma format. To reproduce the dataset or modify it for your use case, refer to the script `create_dataset.py`.
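To make the encoding concrete, here is a minimal sketch of how one box could be turned into location tokens. It assumes the standard PaliGemma convention (coordinates normalized by image size and quantized into 1,024 bins, ordered y_min, x_min, y_max, x_max); `create_dataset.py` is the authoritative implementation and may differ in details.

```python
# Sketch: encoding one bounding box as PaliGemma-style location tokens,
# assuming the usual 1024-bin quantization and y/x ordering.
def box_to_loc_tokens(xmin, ymin, xmax, ymax, width, height):
    """Quantize pixel coordinates into a string of <locXXXX> tokens."""
    def quantize(value, size):
        # Map a pixel coordinate onto one of 1024 bins, clamped to 1023.
        return min(int(value / size * 1024), 1023)

    return "".join(
        f"<loc{v:04d}>"
        for v in (
            quantize(ymin, height),
            quantize(xmin, width),
            quantize(ymax, height),
            quantize(xmax, width),
        )
    )

# Example: a plate at (120, 340)-(380, 410) in a 640x480 image.
print(box_to_loc_tokens(120, 340, 380, 410, 640, 480))
# -> <loc0725><loc0192><loc0874><loc0608>
```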
Get your environment ready to fine-tune Gemma 3:

```bash
git clone https://github.com/ariG23498/gemma3-object-detection.git
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
```
Follow these steps to configure, train, and run predictions:

- Configure your training via `config.py`. All major parameters, such as the learning rate, model path, and dataset, are defined here.
- Train the model using:

  ```bash
  python train.py --include_loc_tokens
  ```

  Toggle `--include_loc_tokens` based on your strategy (see the explanation above).
- Run inference with:

  ```bash
  python infer.py
  ```

  This script uses the fine-tuned model to detect license plates and writes the annotated images to the `outputs/` folder (the location tokens in the generated text are mapped back to pixel boxes; see the sketch below).
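For reference, decoding the generated text back into pixel coordinates is simply the inverse of the quantization shown earlier. The following is an illustrative sketch, not the repo's actual API; the function name and regex are assumptions.

```python
# Sketch: parsing <locXXXX> tokens out of generated text and undoing the
# 1024-bin quantization. Illustrative only; infer.py may differ.
import re

def loc_tokens_to_box(text, width, height):
    """Parse the first four <locXXXX> tokens into (xmin, ymin, xmax, ymax)."""
    values = [int(v) for v in re.findall(r"<loc(\d{4})>", text)[:4]]
    if len(values) < 4:
        return None  # the model produced fewer than four location tokens
    ymin, xmin, ymax, xmax = values
    return (
        xmin / 1024 * width,
        ymin / 1024 * height,
        xmax / 1024 * width,
        ymax / 1024 * height,
    )

print(loc_tokens_to_box("<loc0725><loc0192><loc0874><loc0608> plate", 640, 480))
# -> (120.0, 339.84..., 380.0, 409.68...)
```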
Here are some tasks we would like to investigate further:

- Low-Rank Adaptation (LoRA) training.
- Quantized Low-Rank Adaptation (QLoRA) training.
- Training with a bigger object detection dataset.
We welcome contributions to enhance this project! If you have ideas for improvements, bug fixes, or new features, please:

- Fork the repository.
- Create a new branch for your feature or fix: `git checkout -b feature/my-new-feature`
- Implement your changes.
- Commit your changes with clear messages: `git commit -am 'Add some amazing feature'`
- Push your branch to your fork: `git push origin feature/my-new-feature`
- Open a Pull Request against the main repository.
If you use our work, please cite us:

```bibtex
@misc{gosthipaty_gemma3_object_detection_2025,
  author       = {Aritra Roy Gosthipaty and Sergio Paniego},
  title        = {Fine-tuning Gemma 3 for Object Detection},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ariG23498/gemma3-object-detection.git}}
}
```