Questions about the efficiency and design of mixed attention #8
Description
Hi, first of all thank you for the great work and for releasing this project.
I have a couple of questions regarding the mixed attention mechanism:
- Efficiency / computational overhead
Have you run any experiments with a causal-only mask?
If so, does introducing mixed attention make the attention computation less efficient than the causal-only pattern? For example, are there drawbacks in terms of latency, throughput, or memory usage? Also, have you quantitatively evaluated the efficiency impact (e.g., training speed, inference latency, FLOPs) after enabling mixed attention?
If there is overhead, is it mitigated mainly through optimized CUDA kernels or other low-level optimizations?
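To make sure we are talking about the same thing, here is a minimal sketch of the two mask patterns I am comparing: a plain causal mask versus a causal mask with bidirectional blocks over image-token spans. The `image_spans` layout is my own assumption for illustration, not necessarily the repo's exact convention:

```python
import torch

def mixed_attention_mask(seq_len, image_spans):
    """Build a boolean attention mask where True means "may attend".

    Starts from a causal (lower-triangular) mask, then makes each
    image-token span fully bidirectional within itself. `image_spans`
    is a list of (start, end) index pairs with `end` exclusive —
    a hypothetical layout used only for this sketch.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for start, end in image_spans:
        # full attention inside the image block
        mask[start:end, start:end] = True
    return mask

# Example: 8 tokens, positions 2..5 are image tokens
m = mixed_attention_mask(8, [(2, 6)])
```

Since this only changes which entries of the score matrix are masked, I would naively expect the FLOPs to be essentially unchanged, which is why I am asking where the overhead (if any) comes from.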
- Attention computation design
From the implementation, it appears that mixed attention mainly modifies the attention mask, while image tokens and text tokens are still processed in a single attention computation (i.e., the attention matrix and softmax are computed jointly over both modalities).
I’m curious whether you have explored an alternative design where image and text attention are computed separately, each with their own softmax, and then combined afterwards.
Intuitively, computing them jointly vs. separately could lead to different normalization behaviors in softmax and potentially different interactions between modalities. Have you studied or compared these two approaches in practice?
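To illustrate what I mean by joint vs. separate normalization, here is a toy single-head sketch. The function names and the fixed mixing weight `alpha` are my own illustration (a real design might learn the weight per head or derive it from each modality's log-sum-exp), not anything from the repo:

```python
import torch
import torch.nn.functional as F

def joint_attention(q, k_text, k_img, v_text, v_img):
    """One softmax over the concatenated text+image keys (joint normalization)."""
    k = torch.cat([k_text, k_img], dim=0)
    v = torch.cat([v_text, v_img], dim=0)
    scores = q @ k.T / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def separate_attention(q, k_text, k_img, v_text, v_img, alpha=0.5):
    """Per-modality softmax, then a convex combination of the outputs.

    `alpha` is a hypothetical fixed mixing weight for illustration.
    """
    d = k_text.shape[-1] ** 0.5
    out_text = F.softmax(q @ k_text.T / d, dim=-1) @ v_text
    out_img = F.softmax(q @ k_img.T / d, dim=-1) @ v_img
    return alpha * out_text + (1 - alpha) * out_img
```

The normalization difference I am asking about: with a joint softmax, how much attention mass each modality receives depends on the relative score magnitudes across both modalities, whereas with separate softmaxes each modality's total mass is pinned by `alpha` regardless of its scores.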
Looking forward to your insights!