
Conversation

xiangze-arm

Add the following functions for the Arm device:

  • moeFfnLayer
  • mlaContextAttention
  • mlaAbsorbAttention
  • layernormWithStride (sketched below)
  • mlaQKVGemm
  • slice
  • dispatch
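
Of these, layernormWithStride is the simplest to illustrate. The following is a minimal scalar sketch, assuming the op normalizes rows of `hidden_size` elements that sit `stride` floats apart in a larger buffer; the name, signature, and scalar loops are hypothetical, not the PR's actual interface (the real kernel would presumably be vectorized for Arm).

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch of a strided layernorm: each of `rows` rows has
// `hidden_size` valid elements, but consecutive rows are `stride` floats
// apart (stride >= hidden_size), e.g. when normalizing a view into a
// larger packed buffer. Not the actual layernormWithStride signature.
void layernorm_with_stride(float* out, const float* in,
                           const float* gamma, const float* beta,
                           std::size_t rows, std::size_t hidden_size,
                           std::size_t stride, float eps = 1e-5f) {
    for (std::size_t r = 0; r < rows; ++r) {
        const float* x = in + r * stride;
        float* y = out + r * stride;

        // Mean and variance over the hidden dimension only.
        float mean = 0.f;
        for (std::size_t i = 0; i < hidden_size; ++i) mean += x[i];
        mean /= hidden_size;

        float var = 0.f;
        for (std::size_t i = 0; i < hidden_size; ++i) {
            const float d = x[i] - mean;
            var += d * d;
        }
        var /= hidden_size;

        const float inv_std = 1.f / std::sqrt(var + eps);
        for (std::size_t i = 0; i < hidden_size; ++i) {
            y[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
        }
    }
}
```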

Upgrade the torch version from 2.1.2 to 2.6.0 for the Arm backend
Add DeepSeek V2 Lite support
Add DeepSeek V3 support by packing FP8 weights into INT4 and computing with KleidiAI (see the packing sketch after this list)
Improve the performance of the gated activation op
Add an optimized MoE path for a8w4 (8-bit activations, 4-bit weights)
Merge the shared expert into moeFfnLayer for DeepSeek V3
Optimize flash decoding by splitting the q dim for MLA absorb attention
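
For the DeepSeek V3 a8w4 path, the FP8 checkpoint weights have to be repacked into the INT4 blocks that KleidiAI's int4 matmul kernels consume. The sketch below is only a rough illustration of that conversion, not the PR's code and not KleidiAI's actual packing layout: it decodes FP8 E4M3 bytes to float, then re-quantizes each group of 32 values to symmetric signed INT4 with one scale per group, two nibbles per byte. Folding DeepSeek's original per-block FP8 weight scales into the new group scales is omitted here, and every name below is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Decode an FP8 E4M3 (e4m3fn) byte: 1 sign bit, 4 exponent bits (bias 7),
// 3 mantissa bits; exponent 15 with mantissa 7 is NaN, there is no infinity.
float fp8_e4m3_to_float(uint8_t v) {
    const int sign = (v >> 7) & 1;
    const int exp  = (v >> 3) & 0xF;
    const int man  = v & 0x7;
    float f;
    if (exp == 0) {
        f = std::ldexp(static_cast<float>(man), -9);   // subnormal: (man/8) * 2^-6
    } else if (exp == 15 && man == 7) {
        f = std::numeric_limits<float>::quiet_NaN();
    } else {
        f = std::ldexp(1.0f + man / 8.0f, exp - 7);    // normal: (1 + man/8) * 2^(exp-7)
    }
    return sign ? -f : f;
}

// Quantize one group of `group_size` floats (group_size even) to symmetric
// signed INT4 in [-7, 7] with a single scale, packing two nibbles per byte.
void quantize_group_to_int4(const float* w, std::size_t group_size,
                            uint8_t* packed, float& scale) {
    float amax = 0.f;
    for (std::size_t i = 0; i < group_size; ++i) amax = std::max(amax, std::fabs(w[i]));
    scale = amax > 0.f ? amax / 7.f : 1.f;

    auto q = [&](float x) {
        int v = static_cast<int>(std::lround(x / scale));
        v = std::min(7, std::max(-7, v));
        return static_cast<uint8_t>(v & 0xF);          // two's-complement nibble
    };
    for (std::size_t i = 0; i < group_size; i += 2) {
        packed[i / 2] = static_cast<uint8_t>(q(w[i]) | (q(w[i + 1]) << 4));
    }
}

// Repack one row of FP8 weights (n a multiple of group_size) into INT4 groups.
void repack_fp8_row_to_int4(const uint8_t* fp8_row, std::size_t n,
                            uint8_t* packed, float* scales,
                            std::size_t group_size = 32) {
    std::vector<float> tmp(group_size);
    for (std::size_t g = 0; g * group_size < n; ++g) {
        for (std::size_t i = 0; i < group_size; ++i)
            tmp[i] = fp8_e4m3_to_float(fp8_row[g * group_size + i]);
        quantize_group_to_int4(tmp.data(), group_size,
                               packed + g * group_size / 2, scales[g]);
    }
}
```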

This is a stacked PR based on #142 because flash attention/flash decoding are used in the MLA implementation. The code changes for DeepSeek support are in the second commit, 335f3ab.

xiangze-arm and others added 2 commits August 19, 2025 11:08
- Implement flash attention for context attention
- Implement flash decoding for decoder self attention (the partial-result merge is sketched after this commit message)
- Avoid KV cache assembly and use the blocked KV cache directly
- Compute GQA by groups of heads in flash decoding

Signed-off-by: Zhang Xiangze <[email protected]>
Co-authored-by: Ruifeng Wang <[email protected]>
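
Flash decoding splits the key/value sequence into chunks, computes attention over each chunk independently, and then merges the per-chunk partial results. The merge is sketched below in scalar C++ for a single query and head: each chunk contributes an un-normalized weighted value sum together with its local softmax max and sum, and the partials are rescaled to the global max before the final normalization. Names and layout are hypothetical; the real kernels work on the blocked KV cache and handle a whole group of GQA query heads per KV head at once.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One chunk's partial result for a single (query, head) pair:
//   m   = max attention score inside the chunk
//   sum = sum_j exp(score_j - m) over the chunk's keys
//   acc = sum_j exp(score_j - m) * V_j   (un-normalized, length head_dim)
struct PartialAttn {
    float m;
    float sum;
    std::vector<float> acc;
};

// Numerically stable merge: rescale each chunk's partial sums from its local
// max to the global max, accumulate, then normalize once at the end.
std::vector<float> combine_partials(const std::vector<PartialAttn>& parts,
                                    std::size_t head_dim) {
    float m_global = -std::numeric_limits<float>::infinity();
    for (const auto& p : parts) m_global = std::max(m_global, p.m);

    std::vector<float> out(head_dim, 0.f);
    float sum_global = 0.f;
    for (const auto& p : parts) {
        const float alpha = std::exp(p.m - m_global);  // rescale factor for this chunk
        sum_global += alpha * p.sum;
        for (std::size_t d = 0; d < head_dim; ++d) out[d] += alpha * p.acc[d];
    }
    for (std::size_t d = 0; d < head_dim; ++d) out[d] /= sum_global;
    return out;
}
```
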
Add the following functions for the Arm device:
  - moeFfnLayer
  - mlaContextAttention
  - mlaAbsorbAttention
  - layernormWithStride
  - mlaQKVGemm
  - slice
  - dispatch
Upgrade the torch version from 2.1.2 to 2.6.0 for the Arm backend
Add DeepSeek V2 Lite support
Add DeepSeek V3 support by packing FP8 weights into INT4 and computing with KleidiAI
Improve the performance of the gated activation op
Add an optimized MoE path for a8w4
Merge the shared expert into moeFfnLayer for DeepSeek V3
Optimize flash decoding by splitting the q dim for MLA absorb attention

Signed-off-by: Zhang Xiangze <[email protected]>
Co-authored-by: Tianyu Li <[email protected]>