Questions about the efficiency and design of mixed attention #8

@zhangym0213

Description

Hi, first of all thank you for the great work and for releasing this project.

I have a couple of questions regarding the mixed attention mechanism:

  1. Efficiency / computational overhead
    Have you run any experiments with only a causal mask?
    If so, does the attention computation become less efficient after introducing mixed attention, compared to the causal-only pattern? For example, are there any drawbacks in terms of latency, throughput, or memory usage?

    Also, have you quantitatively evaluated the efficiency impact of enabling mixed attention (e.g., training speed, inference latency, FLOPs)?

    If there is overhead, is it mitigated mainly through optimized CUDA kernels or other low-level optimizations?

  2. Attention computation design

    From the implementation, it seems that mixed attention mainly modifies the attention mask, while the image tokens and text tokens are still processed together in the same attention computation (i.e., the attention matrix and softmax are computed jointly).

    I’m curious whether you have explored an alternative design where image and text attention are computed separately, each with their own softmax, and then combined afterwards.

    Intuitively, computing them jointly vs. separately could lead to different normalization behaviors in softmax and potentially different interactions between modalities. Have you studied or compared these two approaches in practice?
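To make the distinction concrete, here is a minimal NumPy sketch of the two designs I have in mind. The token layout (image tokens first, then text tokens), the mask pattern (bidirectional within the image block, causal elsewhere), and the 0.5/0.5 mixing weight are all illustrative assumptions on my part, not the project's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed layout: n_img image tokens followed by n_txt text tokens.
n_img, n_txt, d = 3, 4, 8
n = n_img + n_txt
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
scores = q @ k.T / np.sqrt(d)

# Mixed mask: causal everywhere, but fully bidirectional within the image block.
# Note the dense matmul above is identical either way, so a mask-only change
# does not alter FLOPs unless the kernel exploits the mask's sparsity.
mask = np.tril(np.ones((n, n), dtype=bool))
mask[:n_img, :n_img] = True

# (a) Joint design: one softmax over all keys a query may see (mask-only change).
joint = softmax(np.where(mask, scores, -np.inf), axis=-1)

# (b) Separate design, shown for one text query row i: per-modality softmax,
#     then a fixed 0.5/0.5 combination (the weight is an illustrative choice).
i = n_img + 2
img_attn = softmax(scores[i, :n_img])                 # over image keys only
txt_attn = softmax(np.where(mask[i, n_img:], scores[i, n_img:], -np.inf))
separate = 0.5 * np.concatenate([img_attn, txt_attn])

# Both rows sum to 1, but the normalization behaves differently: joint[i] lets
# the softmax decide how much total weight goes to images vs. text, while
# `separate` pins each modality's attention mass at 0.5 regardless of scores.
```

This is exactly the normalization difference I am asking about: the joint version couples the modalities through a shared partition function, whereas the separate version fixes the cross-modal balance by construction.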

Looking forward to your insights!
