Reducing size and increasing inference speed #291

@Kodemannen

We have seen promising preliminary segmentation results using the DINOv3 ViT-S/16 backbone with the ViTAdapter + Mask2Former head.

However, this setup (~89M parameters with hidden_dim=384) is larger and slower than our deployed model (~37M parameters), and we would like to compare the two at approximately the same inference speed.

Reducing hidden_dim shrinks the model and lowers inference time, but not by enough. Even at hidden_dim=32, which seems to be the lower bound, the model is significantly slower than ours despite having fewer parameters (~29M).
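
For reference, this is roughly how we measure latency. It's a generic sketch, not tied to any particular model; the input shape and iteration counts are arbitrary choices:

```python
import time
import torch

@torch.inference_mode()
def mean_latency(model, input_shape=(1, 3, 512, 512), warmup=10, iters=50):
    """Rough per-forward latency in seconds. Sketch only: shape, warmup,
    and iteration counts are arbitrary."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Parameter counts quoted above come from:
# n_params = sum(p.numel() for p in model.parameters())
```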

I wanted to try using fewer activation layers from the backbone, but the head currently seems hardcoded to use exactly 4. There is an option to pass BackboneLayersSet.LAST to build_segmentation_decoder(), but the Mask2Former head does not appear to be compatible with it.
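
For concreteness, what I was hoping to do is feed the head a single (last) feature map instead of 4 pyramid levels. A minimal sketch, assuming the DINOv3 backbone exposes a DINOv2-style get_intermediate_layers(x, n, reshape=True) method (backbone loading omitted):

```python
import torch

def last_layer_features(backbone, x: torch.Tensor) -> torch.Tensor:
    # Assumption: the DINOv3 ViT exposes DINOv2-style get_intermediate_layers.
    # n=1 returns only the final block's patch features, reshaped to a map.
    (feat,) = backbone.get_intermediate_layers(x, n=1, reshape=True)
    return feat  # (B, 384, H/16, W/16) for ViT-S/16
```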

The ConvNeXt backbone, on the other hand, seems to support only the BackboneLayersSet.LAST + linear decoder configuration, and even that setup (with ConvNeXt-Tiny) is just a tad too slow for deployment. (The bigger models will still be useful for other purposes, so big thanks for those!)
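
My mental model of the BackboneLayersSet.LAST + linear decoder path is roughly the following. This is a sketch of the general technique, not the repo's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegHead(nn.Module):
    """Linear decoder sketch: a 1x1 conv over the last backbone feature map,
    then bilinear upsampling to the input resolution."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feat: torch.Tensor, out_hw: tuple) -> torch.Tensor:
        logits = self.classifier(feat)  # (B, num_classes, h, w)
        return F.interpolate(logits, size=out_hw, mode="bilinear",
                             align_corners=False)
```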

So my questions are:

  1. Is ConvNeXt-Tiny + linear head the absolute fastest setup? Is it realistic to make it even faster?
  2. Is it realistic to envision a DINOv3 model with a Mask2Former head that is as fast as, or slightly faster than, this? Perhaps by distilling an even smaller backbone (a sketch of what I mean is below)?
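
By "distilling", I mean something like plain feature distillation from a frozen DINOv3 teacher into a smaller student. A minimal sketch; all names here are placeholders, nothing from the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distill_loss(student_feat: torch.Tensor,
                         teacher_feat: torch.Tensor,
                         proj: nn.Module) -> torch.Tensor:
    """Match normalized student features to frozen teacher features.
    `proj` (e.g. a 1x1 conv) maps student channels to teacher channels."""
    s = F.normalize(proj(student_feat), dim=1)
    t = F.normalize(teacher_feat.detach(), dim=1)
    return F.mse_loss(s, t)
```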
