Unused arguments: n_queries and attn_pooler_heads in MultimodalCfg #434
Hi, you are right that the arguments are not passed down to the vision transformer; I will fix that. As for the query shape in the attentional pooler, I don't think there is a difference: in lucidrains' implementation the queries passed to the attentional pooler also number 256, same as here, and in both cases their length is decided at model initialisation ("embed_dim" here, "dim" in coca-pytorch) and they are used to compute attention against the output of the image encoder. Does that make sense to you?
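For reference, here is a minimal sketch of the attentional-pooling idea being discussed: a fixed set of learned queries cross-attends over the image-encoder tokens. The class and argument names below are illustrative, not the actual open_clip or coca-pytorch modules.

```python
import torch
import torch.nn as nn

class AttentionalPoolerSketch(nn.Module):
    """Minimal attentional pooler: learned queries cross-attend over the
    image-encoder tokens, reducing them to a fixed-length sequence
    (e.g. 256 tokens of width embed_dim)."""

    def __init__(self, embed_dim: int, n_queries: int = 256, n_heads: int = 8):
        super().__init__()
        # The query length (embed_dim here, "dim" in coca-pytorch) is fixed at init.
        self.query = nn.Parameter(torch.randn(n_queries, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, seq_len, embed_dim); seq_len can be anything.
        q = self.query.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, image_tokens, image_tokens)
        return pooled  # (batch, n_queries, embed_dim)
```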
Also, to avoid further confusion: while those arguments are outdated, the equivalent ones can be passed to the vision transformer and should be handled correctly. Thanks for pointing it out!
Hi @gpucce, thank you so much for the reply and clarification! Yes, the use of n_queries/attn_pooler_heads in the visual_cfg seems clear. Just to confirm: this currently only applies to models instantiated from the VisionTransformer class, correct? i.e. if we use TimmModel, an AttentionalPooler + attention-pooling setup will not be added and the arguments are ignored. Would it make sense to add AttentionalPooler to TimmModel as well, so that the behavior of the two arguments is consistent across all visual encoder models? Thanks again.
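As a concrete illustration of the distinction above, here is a hypothetical vision_cfg fragment. Only n_queries and attn_pooler_heads come from this thread; the other key names (attentional_pool in particular) are assumptions, so verify them against the current CLIPVisionCfg fields in open_clip.

```python
# Hypothetical vision_cfg fragments -- field names other than n_queries and
# attn_pooler_heads are assumptions; check CLIPVisionCfg for the real list.
vit_vision_cfg = {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 32,
    "attentional_pool": True,   # assumed flag enabling the AttentionalPooler
    "n_queries": 256,           # pooled image tokens seen by the text decoder
    "attn_pooler_heads": 8,     # heads used inside the pooler
}

timm_vision_cfg = {
    "timm_model_name": "convnext_base",  # routes through TimmModel
    "timm_model_pretrained": False,
    # n_queries / attn_pooler_heads would currently be ignored here, since
    # TimmModel does not build an AttentionalPooler (see discussion above).
}
```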
Yeah, indeed, I want to add support for the rest of the models, but unfortunately I don't think I will be able to do it shortly :( I will write here once I start!
Hi, thanks @gpucce for implementing CoCa! I am reading the code, and under MultimodalCfg (used by CoCa to construct the multimodal decoder) there are n_queries and attn_pooler_heads (here), but correct me if I am wrong, I don't see them used anywhere. It seems that currently the cross-attention layers in the multimodal decoder simply attend to an image embedding sequence of arbitrary length, whereas I believe in lucidrains' implementation the image tokens are first reduced to a sequence of fixed length by using a set number of queries (e.g. 256). Any clarification would be appreciated! Thanks.
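To make the two behaviours in this question concrete, here is a small illustrative sketch (plain PyTorch, not the actual open_clip or coca-pytorch code) contrasting cross-attention over raw image tokens of arbitrary length with first pooling them to a fixed length.

```python
import torch
import torch.nn as nn

# Illustrative only: contrasts the two behaviours discussed above.
embed_dim, n_heads, n_queries = 512, 8, 256
pooler_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
decoder_cross_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

text_tokens = torch.randn(2, 76, embed_dim)    # decoder-side sequence
image_tokens = torch.randn(2, 197, embed_dim)  # length set by the image encoder

# (a) As described in the question: the decoder cross-attends over the raw
#     image tokens, whose length varies with the image encoder.
out_a, _ = decoder_cross_attn(text_tokens, image_tokens, image_tokens)

# (b) lucidrains-style: learned queries pool the image tokens to a fixed
#     length first, so the decoder always sees exactly n_queries tokens.
queries = torch.randn(2, n_queries, embed_dim)  # learned nn.Parameter in practice
pooled, _ = pooler_attn(queries, image_tokens, image_tokens)
out_b, _ = decoder_cross_attn(text_tokens, pooled, pooled)
```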