Description
We have seen some promising preliminary results on segmentation using the DINOv3 ViT-S/16 backbone with the ViTAdapter + Mask2Former head.
However, this setup (~89M params with hidden_dim=384) is larger and slower than our deployed model (~37M params), and we would like to compare them at approximately the same inference speed.
Reducing hidden_dim shrinks the model and speeds up inference, but not sufficiently. Even at hidden_dim=32, which seems to be the lower bound, the model is significantly slower than ours despite having fewer parameters (~29M).
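For reference, this is roughly how I measure inference speed (a minimal sketch in plain PyTorch; the input resolution, batch size, and iteration counts are placeholders, not our actual deployment settings):

```python
import time
import torch

@torch.no_grad()
def benchmark(model, input_size=(1, 3, 512, 512), warmup=10, iters=50):
    """Rough GPU latency: average milliseconds per forward pass."""
    model = model.cuda().eval()
    x = torch.randn(*input_size, device="cuda")
    # Warm-up so lazy initialization / cudnn autotuning doesn't skew the numbers.
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```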
I wanted to try using fewer activation layers from the backbone, but the head currently seems to be hard-coded to use exactly 4. There is an option to pass BackboneLayersSet.LAST to build_segmentation_decoder(), but the Mask2Former head does not appear to be compatible with it.
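Roughly what I attempted (a sketch from memory; build_segmentation_decoder() and BackboneLayersSet.LAST are from the repo, but the import path, hub entrypoint, and keyword-argument names below are guesses and may not match the actual API):

```python
import torch

# Import path is a guess -- adjust to wherever the repo exposes these.
from dinov3.eval.segmentation.models import build_segmentation_decoder, BackboneLayersSet

# Hub entrypoint name is a guess; weights arguments omitted.
backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vits16")

segmenter = build_segmentation_decoder(
    backbone,
    backbone_out_layers=BackboneLayersSet.LAST,  # use only the final backbone layer
    decoder_type="m2f",                          # Mask2Former head -- seems incompatible with LAST
    hidden_dim=32,                               # lower bound; still too slow for us
)
```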
The ConvNeXt backbone, on the other hand, seems to support only the BackboneLayersSet.LAST + linear decoder configuration. And even that setup (using convnext-tiny) is just a tad too slow for deployment. (The bigger models will still be useful for other purposes, so big thanks for those!)
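For comparison, the ConvNeXt baseline looks roughly like this (same caveats as above on names; whether build_segmentation_decoder() returns the full segmentation model or just the head, I'd double-check before timing it):

```python
# Sketch of the convnext-tiny + linear decoder configuration; the hub entrypoint
# and argument names are guesses, only the identifiers mentioned above come from the repo.
convnext_backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_convnext_tiny")

linear_segmenter = build_segmentation_decoder(
    convnext_backbone,
    backbone_out_layers=BackboneLayersSet.LAST,  # the only option supported here, it seems
    decoder_type="linear",
)
print(f"{benchmark(linear_segmenter):.1f} ms / image")  # using the benchmark() helper above
```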
So my questions are:
- Is convnext-tiny + linear head the absolute fastest setup? Is it realistic to make this even faster?
- Is it at all realistic to envision a DINOv3 model with a Mask2Former head that is as fast as, or slightly faster than, this? Perhaps by distilling an even smaller backbone, or something along those lines?