MultiHeadAttentionWrapper should instantiate CausalSelfAttention with d_out = d_out // num_heads? #609
Unanswered
henrythe9th asked this question in Q&A
Replies: 1 comment · 1 reply
-
I believe the confusion lies in how we are interpreting … In your impl, … It's true that it's clearer in the sense that the …
1 reply
-
Since the `MultiHeadAttentionWrapper` class calls `torch.cat([head(x) for head in self.heads], dim=-1)`, shouldn't we be instantiating `CausalSelfAttention` with `d_out = d_out // num_heads`, so that the final `MultiHeadAttentionWrapper` output has the same shape and `d_out` as was specified in the input? In other words, is this a clearer implementation?
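The asker's own snippet is not preserved in this extract, so below is a minimal, self-contained sketch of the variant the question describes. The `CausalSelfAttention` constructor arguments (`d_in`, `d_out`, `context_length`, `dropout`, `qkv_bias`) and its internals are assumptions for illustration, not a copy of the book's code; the one change relative to the wrapper described above is that each head is built with `d_out // num_heads`, so concatenating the heads along the last dimension yields exactly `d_out` output features instead of `num_heads * d_out`.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    # Minimal single-head causal attention (assumed interface):
    # maps (batch, num_tokens, d_in) -> (batch, num_tokens, d_out).
    def __init__(self, d_in, d_out, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        # Causal (masked) scaled dot-product attention
        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(self.mask[:num_tokens, :num_tokens], float("-inf"))
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values


class MultiHeadAttentionWrapper(nn.Module):
    # Variant suggested in the question: give each head d_out // num_heads
    # so the concatenated output has exactly d_out features.
    def __init__(self, d_in, d_out, context_length, dropout=0.0, num_heads=2, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        head_dim = d_out // num_heads
        self.heads = nn.ModuleList(
            CausalSelfAttention(d_in, head_dim, context_length, dropout, qkv_bias)
            for _ in range(num_heads)
        )

    def forward(self, x):
        # num_heads outputs of size head_dim concatenated -> d_out features
        return torch.cat([head(x) for head in self.heads], dim=-1)


if __name__ == "__main__":
    torch.manual_seed(123)
    x = torch.randn(2, 6, 768)  # (batch, num_tokens, d_in)
    mha = MultiHeadAttentionWrapper(d_in=768, d_out=768, context_length=6, num_heads=12)
    print(mha(x).shape)  # torch.Size([2, 6, 768]) -- matches the requested d_out
```

Note that this variant requires `d_out` to be divisible by `num_heads`, whereas the wrapper as described above (each head instantiated with the full `d_out`) produces an output of `num_heads * d_out` features.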