This repository demonstrates how to adapt Gemma 3, a powerful vision-language model (VLM), for object detection. By treating bounding boxes as sequences of discrete `<locXXXX>` tokens, we enable the model to reason about spatial information in a language-native way, inspired by PaliGemma.
Here's a glimpse of what our fine-tuned model can do. These images are generated by the `predict.py` script:
*(Figure: four sample images with license plates detected by the fine-tuned model.)*
Most traditional object detection models output continuous bounding box coordinates using regression heads. In contrast, we follow the PaliGemma approach of treating bounding boxes as sequences of discrete tokens (e.g., `<loc0512>`), allowing detection to be framed as text generation.
However, unlike PaliGemma, Gemma 3 does not natively include these spatial tokens in its tokenizer.
We support two fine-tuning modes:

- **Without location tokens (the default).** The model is trained on `<locXXXX>` strings even though they are not in the tokenizer vocabulary, forcing it to learn spatial grounding implicitly. Although lightweight, this mode still produces interesting results.
- **With location tokens.** Passing the flag `--include_loc_tokens` extends the tokenizer to explicitly include all `<locXXXX>` tokens (from `<loc0000>` to `<loc1023>`) and fine-tunes their embeddings. After this, we fine-tune the entire model, following a two-stage training procedure. This enables the model to learn spatial grounding more effectively (see the tokenizer-extension sketch below).
> 💡 Both modes are supported; toggle with `--include_loc_tokens` in `train.py`.
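For the second mode, the tokenizer extension boils down to adding the 1,024 location tokens and resizing the model's embedding matrix. Below is a minimal sketch using the Hugging Face `transformers` API; the checkpoint name is an assumption, not necessarily what `train.py` uses, and the full two-stage schedule lives in the training script itself.

```python
# Minimal sketch: extending the tokenizer with <locXXXX> tokens.
# Assumption: the checkpoint name below is illustrative only.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3-4b-it"  # hypothetical choice of Gemma 3 checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

# Add <loc0000> ... <loc1023> as new tokens and grow the embeddings to match.
loc_tokens = [f"<loc{i:04d}>" for i in range(1024)]
num_added = processor.tokenizer.add_tokens(loc_tokens)
model.resize_token_embeddings(len(processor.tokenizer))
print(f"Added {num_added} location tokens; new vocab size: {len(processor.tokenizer)}")
```

After this step, the new embedding rows are the only randomly initialized weights, which is why they are tuned first before the rest of the model is unfrozen.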
We use the `ariG23498/license-detection-paligemma` dataset, a modified version of `keremberke/license-plate-object-detection`, reformatted to match the expectations of text-based object detection.
Each bounding box is encoded as a sequence of location tokens like `<loc0123>`, following the PaliGemma format. To reproduce the dataset or modify it for your use case, refer to the script `create_dataset.py`.
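To make the encoding concrete, here is a minimal sketch of how one box could be turned into location tokens. It assumes the standard PaliGemma convention (coordinates normalized by image size and quantized into 1,024 bins, ordered y_min, x_min, y_max, x_max); `create_dataset.py` is the authoritative implementation and may differ in details.

```python
# Sketch: encoding one bounding box as PaliGemma-style location tokens,
# assuming the usual 1024-bin quantization and y/x ordering.
def box_to_loc_tokens(xmin, ymin, xmax, ymax, width, height):
    """Quantize pixel coordinates into a string of <locXXXX> tokens."""
    def quantize(value, size):
        # Map a pixel coordinate onto one of 1024 bins, clamped to 1023.
        return min(int(value / size * 1024), 1023)

    return "".join(
        f"<loc{v:04d}>"
        for v in (
            quantize(ymin, height),
            quantize(xmin, width),
            quantize(ymax, height),
            quantize(xmax, width),
        )
    )

# Example: a plate at (120, 340)-(380, 410) in a 640x480 image.
print(box_to_loc_tokens(120, 340, 380, 410, 640, 480))
# -> <loc0725><loc0192><loc0874><loc0608>
```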
Get your environment ready to fine-tune Gemma 3:

```bash
git clone https://github.com/ariG23498/gemma3-object-detection.git
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
```
Follow these steps to configure, train, and run predictions:

- Configure your training via `config.py`. All major parameters, such as the learning rate, model path, and dataset, are defined here.
- Train the model using:

  ```bash
  python train.py --include_loc_tokens
  ```

  Toggle `--include_loc_tokens` based on your strategy (see the explanation above).
- Run inference with:

  ```bash
  python infer.py
  ```

  This script uses the fine-tuned model to detect license plates and writes the annotated images to the `outputs/` folder (the location tokens in the generated text are mapped back to pixel boxes; see the sketch below).
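For reference, decoding the generated text back into pixel coordinates is simply the inverse of the quantization shown earlier. The following is an illustrative sketch, not the repo's actual API; the function name and regex are assumptions.

```python
# Sketch: parsing <locXXXX> tokens out of generated text and undoing the
# 1024-bin quantization. Illustrative only; infer.py may differ.
import re

def loc_tokens_to_box(text, width, height):
    """Parse the first four <locXXXX> tokens into (xmin, ymin, xmax, ymax)."""
    values = [int(v) for v in re.findall(r"<loc(\d{4})>", text)[:4]]
    if len(values) < 4:
        return None  # the model produced fewer than four location tokens
    ymin, xmin, ymax, xmax = values
    return (
        xmin / 1024 * width,
        ymin / 1024 * height,
        xmax / 1024 * width,
        ymax / 1024 * height,
    )

print(loc_tokens_to_box("<loc0725><loc0192><loc0874><loc0608> plate", 640, 480))
# -> (120.0, 339.84..., 380.0, 409.68...)
```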
Here are some tasks we would like to investigate further:

- Low-Rank Adaptation (LoRA) training.
- Quantized Low-Rank Adaptation (QLoRA) training.
- Training with a bigger object detection dataset.
We welcome contributions to enhance this project! If you have ideas for improvements, bug fixes, or new features, please:

- Fork the repository.
- Create a new branch for your feature or fix: `git checkout -b feature/my-new-feature`
- Implement your changes.
- Commit your changes with clear messages: `git commit -am 'Add some amazing feature'`
- Push your branch to your fork: `git push origin feature/my-new-feature`
- Open a Pull Request against the main repository.
If you use our work, please cite us:

```bibtex
@misc{gosthipaty_gemma3_object_detection_2025,
  author       = {Aritra Roy Gosthipaty and Sergio Paniego},
  title        = {Fine-tuning Gemma 3 for Object Detection},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ariG23498/gemma3-object-detection.git}}
}
```