Questions about the efficiency and design of mixed attention #8
Description
Hi, first of all thank you for the great work and for releasing this project.
I have a couple of questions regarding the mixed attention mechanism:
- Efficiency / computational overhead
Have you run any experiments with a causal-only mask?
If so, does introducing mixed attention make the attention computation less efficient than the causal-only pattern? For example, are there drawbacks in terms of latency, throughput, or memory usage? Also, have you quantitatively evaluated the efficiency impact (e.g., training speed, inference latency, FLOPs) after enabling mixed attention?
If there is overhead, is it mitigated mainly through optimized CUDA kernels or other low-level optimizations?
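To make sure we are talking about the same thing, here is a minimal sketch of the two mask patterns I am comparing: a plain causal mask versus a causal mask with bidirectional blocks over image-token spans. The `image_spans` layout is my own assumption for illustration, not necessarily the repo's exact convention:

```python
import torch

def mixed_attention_mask(seq_len, image_spans):
    """Build a boolean attention mask where True means "may attend".

    Starts from a causal (lower-triangular) mask, then makes each
    image-token span fully bidirectional within itself. `image_spans`
    is a list of (start, end) index pairs with `end` exclusive —
    a hypothetical layout used only for this sketch.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for start, end in image_spans:
        # full attention inside the image block
        mask[start:end, start:end] = True
    return mask

# Example: 8 tokens, positions 2..5 are image tokens
m = mixed_attention_mask(8, [(2, 6)])
```

Since this only changes which entries of the score matrix are masked, I would naively expect the FLOPs to be essentially unchanged, which is why I am asking where the overhead (if any) comes from.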
- Attention computation design
From the implementation, it appears that mixed attention mainly modifies the attention mask, while image tokens and text tokens are still processed in a single attention computation (i.e., the attention matrix and softmax are computed jointly over both modalities).
I’m curious whether you have explored an alternative design where image and text attention are computed separately, each with their own softmax, and then combined afterwards.
Intuitively, computing them jointly vs. separately could lead to different normalization behaviors in softmax and potentially different interactions between modalities. Have you studied or compared these two approaches in practice?
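To illustrate what I mean by joint vs. separate normalization, here is a toy single-head sketch. The function names and the fixed mixing weight `alpha` are my own illustration (a real design might learn the weight per head or derive it from each modality's log-sum-exp), not anything from the repo:

```python
import torch
import torch.nn.functional as F

def joint_attention(q, k_text, k_img, v_text, v_img):
    """One softmax over the concatenated text+image keys (joint normalization)."""
    k = torch.cat([k_text, k_img], dim=0)
    v = torch.cat([v_text, v_img], dim=0)
    scores = q @ k.T / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def separate_attention(q, k_text, k_img, v_text, v_img, alpha=0.5):
    """Per-modality softmax, then a convex combination of the outputs.

    `alpha` is a hypothetical fixed mixing weight for illustration.
    """
    d = k_text.shape[-1] ** 0.5
    out_text = F.softmax(q @ k_text.T / d, dim=-1) @ v_text
    out_img = F.softmax(q @ k_img.T / d, dim=-1) @ v_img
    return alpha * out_text + (1 - alpha) * out_img
```

The normalization difference I am asking about: with a joint softmax, how much attention mass each modality receives depends on the relative score magnitudes across both modalities, whereas with separate softmaxes each modality's total mass is pinned by `alpha` regardless of its scores.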
Looking forward to your insights!