
Fine-tuning Gemma 3 for Object Detection

| 🧠 Model Space | 📦 Release Collection |

This repository demonstrates how to adapt Gemma 3, a powerful vision-language model (VLM), for object detection. By treating bounding boxes as discrete <locXXXX> tokens, we enable the model to reason about spatial information in a language-native way — inspired by PaliGemma.

🔍 Model Predictions

Here's a glimpse of what our fine-tuned model can do. These images are generated by the predict.py script:

[Images: detected license plates]

🧠 Vision-Language Object Detection with Location Tokens

Most traditional object detection models output continuous bounding box coordinates using regression heads. In contrast, we follow the PaliGemma approach of treating bounding boxes as sequences of discrete tokens (e.g., <loc0512>), allowing detection to be framed as text generation.

However, unlike PaliGemma, Gemma 3 does not natively include these spatial tokens in its tokenizer.

We support two fine-tuning modes:

1. Without Extending the Tokenizer

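The model is trained on <locXXXX> strings even though they are not in the tokenizer vocabulary, so the tokenizer splits each one into existing subword pieces and the model has to learn spatial grounding implicitly. This mode is lightweight, since no new embeddings are added, and it still produces interesting results.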

2. With Extended Tokenizer

With the --include_loc_tokens flag, we extend the tokenizer to explicitly include all <locXXXX> tokens (from <loc0000> to <loc1023>). Training then follows a two-stage procedure: first we fine-tune only the embeddings of the new tokens, then we fine-tune the entire model. This lets the model learn spatial grounding more effectively.
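As a rough illustration of the tokenizer extension described above, the sketch below builds the 1024 location tokens and registers them. It assumes `tokenizer` and `model` are Hugging Face Transformers objects, whose `add_tokens` and `resize_token_embeddings` methods are used here. The helper names `build_loc_tokens` and `extend_with_loc_tokens` are hypothetical and are not taken from train.py.

```python
def build_loc_tokens(num_bins: int = 1024) -> list[str]:
    """Return the location tokens '<loc0000>' ... '<loc1023>' (for 1024 bins)."""
    return [f"<loc{i:04d}>" for i in range(num_bins)]


def extend_with_loc_tokens(tokenizer, model, num_bins: int = 1024) -> int:
    """Add the location tokens and grow the embedding matrix to match.

    Assumption: `tokenizer` and `model` follow the Hugging Face Transformers
    interface (tokenizer.add_tokens, model.resize_token_embeddings).
    Returns the number of tokens that were actually added.
    """
    num_added = tokenizer.add_tokens(build_loc_tokens(num_bins))
    # New embedding rows are freshly initialised, which is why the first
    # training stage focuses on them before the whole model is fine-tuned.
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```

How the two training stages freeze and unfreeze parameters is left to train.py; this sketch only covers the vocabulary change.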

💡 Both modes are supported — toggle with --include_loc_tokens in train.py.
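A boolean switch like this is commonly wired up with `argparse`. The snippet below is a minimal sketch of that pattern, not the actual contents of train.py; the `parse_args` function and the help text are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (hypothetical sketch, not the real train.py)."""
    parser = argparse.ArgumentParser(
        description="Fine-tune Gemma 3 for object detection"
    )
    # store_true makes the flag default to False and become True when present.
    parser.add_argument(
        "--include_loc_tokens",
        action="store_true",
        help="Extend the tokenizer with <loc0000>..<loc1023> before training",
    )
    return parser.parse_args(argv)
```

With this pattern, omitting the flag selects the first mode and passing it selects the second.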

Token strategy illustration

📁 Dataset

We use the ariG23498/license-detection-paligemma dataset, a modified version of keremberke/license-plate-object-detection, reformatted to match the expectations of text-based object detection.

Each bounding box is encoded as a sequence of location tokens like <loc0123>, following the PaliGemma format.
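To make the encoding concrete, here is a small sketch that turns a pixel-space box into a PaliGemma-style string. It assumes the usual PaliGemma ordering (y_min, x_min, y_max, x_max) and 1024 bins per axis; the function name and the exact rounding and clamping are illustrative choices, and create_dataset.py remains the authority on how the dataset is actually built.

```python
def box_to_loc_string(box, image_width, image_height, label, num_bins=1024):
    """Encode a pixel-space box as '<locYYYY><locXXXX><locYYYY><locXXXX> label'.

    `box` is (x_min, y_min, x_max, y_max) in pixels. The output order
    y_min, x_min, y_max, x_max follows the PaliGemma convention.
    The floor-and-clamp binning below is an assumption of this sketch.
    """
    x_min, y_min, x_max, y_max = box

    def to_bin(value, size):
        # Normalise to [0, 1], scale to the bin count, clamp to a valid index.
        index = int(value / size * num_bins)
        return max(0, min(num_bins - 1, index))

    bins = [
        to_bin(y_min, image_height),
        to_bin(x_min, image_width),
        to_bin(y_max, image_height),
        to_bin(x_max, image_width),
    ]
    return "".join(f"<loc{b:04d}>" for b in bins) + f" {label}"


print(box_to_loc_string((100, 200, 300, 400), 640, 480, "plate"))
# <loc0426><loc0160><loc0853><loc0480> plate
```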

To reproduce the dataset or modify it for your use case, refer to the script create_dataset.py.

⚙️ Setup and Installation

Get your environment ready to fine-tune Gemma 3:

git clone https://github.com/ariG23498/gemma3-object-detection.git
cd gemma3-object-detection
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt

🔧 Usage

Follow these steps to configure, train, and run predictions:

  1. Configure your training via config.py. All major parameters, such as learning rate, model path, and dataset, are defined here.

  2. Train the model using:
python train.py --include_loc_tokens

Toggle --include_loc_tokens based on your strategy (see explanation above).

  3. Run inference with:

python predict.py

This script uses the fine-tuned model to detect license plates and generates images in the outputs/ folder.
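Before boxes can be drawn, the generated text has to be mapped back to pixel coordinates. The sketch below shows one way to do that, assuming the model emits PaliGemma-style output of the form `<locYYYY><locXXXX><locYYYY><locXXXX> label`, with multiple detections separated by `;`. The function name, the regular expression, and the return structure are hypothetical and may differ from what the inference script does.

```python
import re

# Four consecutive location tokens followed by an optional label.
LOC_BOX_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]*)"
)


def parse_loc_string(text, image_width, image_height, num_bins=1024):
    """Decode location tokens in `text` into pixel-space boxes.

    Returns a list of dicts with a 'label' and a 'box' given as
    (x_min, y_min, x_max, y_max) in pixels. Assumes the token order is
    y_min, x_min, y_max, x_max, as in PaliGemma.
    """
    detections = []
    for match in LOC_BOX_PATTERN.finditer(text):
        y_min, x_min, y_max, x_max = (int(g) / num_bins for g in match.groups()[:4])
        detections.append(
            {
                "label": match.group(5).strip(),
                "box": (
                    x_min * image_width,
                    y_min * image_height,
                    x_max * image_width,
                    y_max * image_height,
                ),
            }
        )
    return detections
```

Text without a complete group of four location tokens yields an empty list, which is a convenient way to handle generations that do not contain a detection.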

🗺️ Roadmap

Here are some tasks we would like to investigate further:

  1. Low-Rank Adaptation (LoRA) training.
  2. Quantized Low-Rank Adaptation (QLoRA) training.
  3. Training with a bigger object detection dataset.

🤝 Contributions

We welcome contributions to enhance this project! If you have ideas for improvements, bug fixes, or new features, please:

  1. Fork the repository.
  2. Create a new branch for your feature or fix:
git checkout -b feature/my-new-feature
  3. Implement your changes.
  4. Commit your changes with clear messages:
git commit -am 'Add some amazing feature'
  5. Push your branch to your fork:
git push origin feature/my-new-feature
  6. Open a Pull Request against the main repository.

📜 Citation

If you use our work, please cite us.

@misc{gosthipaty_gemma3_object_detection_2025,
  author = {Aritra Roy Gosthipaty and Sergio Paniego},
  title = {Fine-tuning Gemma 3 for Object Detection},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ariG23498/gemma3-object-detection.git}}
}
