
Conversation

@xinyu-intel (Contributor)

No description provided.

Copilot AI left a comment


Pull request overview

This PR optimizes the MLA (Multi-head Latent Attention) implementation by making weight tensors contiguous during the weight loading phase, eliminating the need for runtime transpose operations during graph execution.

Key Changes:

  • Added a process_weights_after_loading method that makes the W_UV and W_UK_T tensors contiguous after weight loading (see the sketch below)
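The following is a minimal, hypothetical sketch of the idea, not the actual vLLM code: the class name, attribute names, and tensor shapes are assumptions for illustration. It shows why materializing contiguous copies of W_UV and W_UK_T once after loading avoids a transpose/copy on every forward pass.

```python
import torch


class MLAAttentionSketch(torch.nn.Module):
    """Hypothetical stand-in for an MLA attention module.

    Holds the up-projection weight W_UV and the pre-transposed W_UK_T.
    Shapes here are illustrative, not vLLM's real layout.
    """

    def __init__(self, w_uv: torch.Tensor, w_uk: torch.Tensor):
        super().__init__()
        self.W_UV = w_uv
        # transpose() returns a non-contiguous view; using it directly in a
        # matmul inside a captured graph forces a copy at every execution.
        self.W_UK_T = w_uk.transpose(-2, -1)

    def process_weights_after_loading(self) -> None:
        # Pay the copy cost once, right after the checkpoint is loaded,
        # so graph execution never needs a runtime transpose/contiguity fix-up.
        self.W_UV = self.W_UV.contiguous()
        self.W_UK_T = self.W_UK_T.contiguous()


# Usage: load weights, then make them contiguous once before serving.
attn = MLAAttentionSketch(torch.randn(8, 128, 512), torch.randn(8, 128, 512))
assert not attn.W_UK_T.is_contiguous()
attn.process_weights_after_loading()
assert attn.W_UK_T.is_contiguous()
```

The trade-off is a one-time copy at load time in exchange for removing the per-step transpose from the hot path.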


@xinyu-intel force-pushed the dev/xinyu/contiguous-mla-weight branch from cf7a643 to 7f3a3b9 on November 28, 2025 at 09:05
Signed-off-by: Xinyu Chen <[email protected]>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

