Unused arguments: n_queries and attn_pooler_heads in MultimodalCfg #434
Hi, you are right that the arguments are not passed down to the vision transformer; I will fix that. As for the query shape in the attentional pooler, I don't think there is a difference: in lucidrains' implementation the queries passed to the attentional pooler also number 256, same as here, and in both cases their length is decided at model initialisation ("embed_dim" here, "dim" in coca-pytorch) and they are used to compute attention against the output of the image encoder. Does that make sense to you?
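For reference, here is a minimal sketch of the attentional-pooling idea being discussed: a fixed set of learned queries cross-attends over the image-encoder tokens. The class and argument names below are illustrative, not the actual open_clip or coca-pytorch modules.

```python
import torch
import torch.nn as nn

class AttentionalPoolerSketch(nn.Module):
    """Minimal attentional pooler: learned queries cross-attend over the
    image-encoder tokens, reducing them to a fixed-length sequence
    (e.g. 256 tokens of width embed_dim)."""

    def __init__(self, embed_dim: int, n_queries: int = 256, n_heads: int = 8):
        super().__init__()
        # The query length (embed_dim here, "dim" in coca-pytorch) is fixed at init.
        self.query = nn.Parameter(torch.randn(n_queries, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, seq_len, embed_dim); seq_len can be anything.
        q = self.query.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, image_tokens, image_tokens)
        return pooled  # (batch, n_queries, embed_dim)
```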
Also, to avoid further confusion: while those arguments are outdated, the equivalent ones can be passed to the vision transformer and should be handled correctly. Thanks for pointing it out!
Hi @gpucce, thank you so much for the reply and clarification! Yes, the use of n_queries/attn_pooler_heads in the visual_cfg seems clear. Just to confirm: this currently only applies to models instantiated from the VisionTransformer class, correct? i.e. if we use TimmModel, an AttentionalPooler + attention-pooling setup will not be added and the arguments are ignored. Would it make sense to add AttentionalPooler to TimmModel as well, so that the behavior of the two arguments is consistent across all visual encoder models? Thanks again.
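As a concrete illustration of the distinction above, here is a hypothetical vision_cfg fragment. Only n_queries and attn_pooler_heads come from this thread; the other key names (attentional_pool in particular) are assumptions, so verify them against the current CLIPVisionCfg fields in open_clip.

```python
# Hypothetical vision_cfg fragments -- field names other than n_queries and
# attn_pooler_heads are assumptions; check CLIPVisionCfg for the real list.
vit_vision_cfg = {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 32,
    "attentional_pool": True,   # assumed flag enabling the AttentionalPooler
    "n_queries": 256,           # pooled image tokens seen by the text decoder
    "attn_pooler_heads": 8,     # heads used inside the pooler
}

timm_vision_cfg = {
    "timm_model_name": "convnext_base",  # routes through TimmModel
    "timm_model_pretrained": False,
    # n_queries / attn_pooler_heads would currently be ignored here, since
    # TimmModel does not build an AttentionalPooler (see discussion above).
}
```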
Yeah, indeed, I want to add support for the rest of the models, but unfortunately I don't think I will be able to do it shortly :( I will write here once I start!
Hi, thanks @gpucce for implementing CoCa! I am reading the code, and under MultimodalCfg (used by CoCa to construct the multimodal decoder) there are n_queries and attn_pooler_heads (here), but correct me if I am wrong, I don't see them used anywhere. It seems that currently the cross-attention layers in the multimodal decoder simply attend to an image embedding sequence of arbitrary length, whereas I believe in lucidrains' implementation the image tokens are first reduced to a sequence of fixed length by using a set number of queries (e.g. 256). Any clarification would be appreciated! Thanks.
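To make the two behaviours in this question concrete, here is a small illustrative sketch (plain PyTorch, not the actual open_clip or coca-pytorch code) contrasting cross-attention over raw image tokens of arbitrary length with first pooling them to a fixed length.

```python
import torch
import torch.nn as nn

# Illustrative only: contrasts the two behaviours discussed above.
embed_dim, n_heads, n_queries = 512, 8, 256
pooler_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
decoder_cross_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

text_tokens = torch.randn(2, 76, embed_dim)    # decoder-side sequence
image_tokens = torch.randn(2, 197, embed_dim)  # length set by the image encoder

# (a) As described in the question: the decoder cross-attends over the raw
#     image tokens, whose length varies with the image encoder.
out_a, _ = decoder_cross_attn(text_tokens, image_tokens, image_tokens)

# (b) lucidrains-style: learned queries pool the image tokens to a fixed
#     length first, so the decoder always sees exactly n_queries tokens.
queries = torch.randn(2, n_queries, embed_dim)  # learned nn.Parameter in practice
pooled, _ = pooler_attn(queries, image_tokens, image_tokens)
out_b, _ = decoder_cross_attn(text_tokens, pooled, pooled)
```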