Description
We have seen some promising preliminary results on segmentation using the DINOv3 ViT-S/16 backbone with the ViTAdapter + Mask2Former head.
However, this setup (~89M params with hidden_dim=384) is larger and slower than our deployed model (~37M params), and we would like to compare them at approximately the same inference speed.
Reducing hidden_dim shrinks the model and speeds up inference, but not sufficiently. Even at hidden_dim=32, which seems to be the lower bound, the model is significantly slower than ours despite having fewer parameters (~29M).
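For reference, this is roughly how I measure inference speed (a minimal sketch in plain PyTorch; the input resolution, batch size, and iteration counts are placeholders, not our actual deployment settings):

```python
import time
import torch

@torch.no_grad()
def benchmark(model, input_size=(1, 3, 512, 512), warmup=10, iters=50):
    """Rough GPU latency: average milliseconds per forward pass."""
    model = model.cuda().eval()
    x = torch.randn(*input_size, device="cuda")
    # Warm-up so lazy initialization / cudnn autotuning doesn't skew the numbers.
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```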
I wanted to try using fewer activation layers from the backbone, but the head currently seems to be hard-coded to use exactly 4. There is an option to pass BackboneLayersSet.LAST to build_segmentation_decoder(), but the Mask2Former head does not appear to be compatible with it.
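Roughly what I attempted (a sketch from memory; build_segmentation_decoder() and BackboneLayersSet.LAST are from the repo, but the import path, hub entrypoint, and keyword-argument names below are guesses and may not match the actual API):

```python
import torch

# Import path is a guess -- adjust to wherever the repo exposes these.
from dinov3.eval.segmentation.models import build_segmentation_decoder, BackboneLayersSet

# Hub entrypoint name is a guess; weights arguments omitted.
backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vits16")

segmenter = build_segmentation_decoder(
    backbone,
    backbone_out_layers=BackboneLayersSet.LAST,  # use only the final backbone layer
    decoder_type="m2f",                          # Mask2Former head -- seems incompatible with LAST
    hidden_dim=32,                               # lower bound; still too slow for us
)
```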
The ConvNeXt backbone, on the other hand, seems to support only the BackboneLayersSet.LAST + linear decoder configuration. And even that setup (using convnext-tiny) is just a tad too slow for deployment. (The bigger models will still be useful for other purposes, so big thanks for those!)
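For comparison, the ConvNeXt baseline looks roughly like this (same caveats as above on names; whether build_segmentation_decoder() returns the full segmentation model or just the head, I'd double-check before timing it):

```python
# Sketch of the convnext-tiny + linear decoder configuration; the hub entrypoint
# and argument names are guesses, only the identifiers mentioned above come from the repo.
convnext_backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_convnext_tiny")

linear_segmenter = build_segmentation_decoder(
    convnext_backbone,
    backbone_out_layers=BackboneLayersSet.LAST,  # the only option supported here, it seems
    decoder_type="linear",
)
print(f"{benchmark(linear_segmenter):.1f} ms / image")  # using the benchmark() helper above
```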
So my questions are:
- Is convnext-tiny + linear head the absolute fastest setup? Is it realistic to make this even faster?
- Is it at all realistic to envision a DINOv3 model with a Mask2Former head that is as fast as, or slightly faster than, this? Perhaps by distilling an even smaller backbone, or something along those lines?