Add Deepseek V3 support for Arm #144
Open
Add the following features for the Arm backend:

- Upgrade the torch version from 2.1.2 to 2.6.0 for the Arm backend
- Add DeepSeek V2 Lite support
- Add DeepSeek V3 support by packing the FP8 weights to INT4 and computing with KleidiAI (a packing sketch follows this list)
- Improve the performance of the gated activation op
- Add an optimized MoE path for a8w4 (8-bit activations, 4-bit weights)
- Merge the shared expert into moeFfnLayer for DeepSeek V3 (see the MoE sketch below)
- Optimize flash decoding by splitting the q dim for MLA absorb attention (see the decode sketch below)
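A minimal sketch of the FP8-to-INT4 repack, assuming DeepSeek V3's tile-scaled float8_e4m3 checkpoint format; the function name, tile/group sizes, and nibble layout here are illustrative, and the real path hands the packed weights to KleidiAI's a8w4 kernels (INT8 activations against INT4 weights) rather than this plain-PyTorch layout:

```python
import torch

def pack_fp8_weight_to_int4(w_fp8: torch.Tensor, tile_scales: torch.Tensor,
                            tile: int = 128, group: int = 32):
    """Dequantize tile-scaled FP8 weights, requantize to INT4, pack nibbles."""
    # DeepSeek V3 checkpoints store float8_e4m3 weights with one scale per
    # (tile x tile) block; upcast and apply those scales first.
    w = w_fp8.float()
    rows, cols = w.shape
    for i in range(0, rows, tile):
        for j in range(0, cols, tile):
            w[i:i + tile, j:j + tile] *= tile_scales[i // tile, j // tile]

    # Symmetric INT4 requantization with one scale per `group` input channels.
    wg = w.reshape(rows, cols // group, group)
    scale = (wg.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(wg / scale), -8, 7).reshape(rows, cols)

    # Pack two 4-bit values per byte (low nibble = even column).
    q = (q + 8).to(torch.uint8)               # shift [-8, 7] into [0, 15]
    return q[:, 0::2] | (q[:, 1::2] << 4), scale.squeeze(-1)
```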
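A sketch of the merged MoE forward pass, in PyTorch for readability rather than the PR's C++ moeFfnLayer: the shared expert is computed inside the same forward so it reuses the routed experts' input, and the SwiGLU block shows the gated activation (silu(gate) * up) that the PR fuses into a single elementwise pass. Class names and the softmax top-k gating are simplifying assumptions, not DeepSeek V3's exact routing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGluFfn(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated activation: silu(gate(x)) * up(x); the elementwise step
        # the PR optimizes into one pass over the hidden buffer.
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoeFfn(nn.Module):
    """Routed experts plus a shared expert merged into one forward."""
    def __init__(self, dim: int, hidden: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGluFfn(dim, hidden) for _ in range(n_experts)])
        self.shared = SwiGluFfn(dim, hidden)   # always-active shared expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, dim]
        weights, idx = torch.topk(self.router(x).softmax(dim=-1),
                                  self.top_k, dim=-1)
        out = self.shared(x)                   # merged shared-expert path
        for k in range(self.top_k):
            for e in idx[:, k].unique():
                sel = idx[:, k] == e
                out[sel] += weights[sel, k:k + 1] * self.experts[e](x[sel])
        return out
```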
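An illustrative sketch of the split-q idea, using naive attention in place of the actual flash-decoding kernel: in MLA absorb mode every query head attends to the same compressed KV cache, so the query-head dimension can be split into independent chunks (the RoPE half of the query is omitted for brevity; names and shapes are assumptions):

```python
import torch

def mla_absorb_decode(q: torch.Tensor, ckv: torch.Tensor,
                      n_splits: int = 4) -> torch.Tensor:
    """q: [n_heads, d_c] single-token query already absorbed into the
    compressed-KV space; ckv: [seq_len, d_c] compressed KV cache."""
    scale = q.shape[-1] ** -0.5
    outs = []
    # Split along the query-head dim: each chunk reads the same shared
    # ckv cache, so the chunks are independent and can run in parallel
    # (shown sequentially here; the PR does this inside the kernel).
    for qc in q.chunk(n_splits, dim=0):
        attn = torch.softmax((qc @ ckv.T) * scale, dim=-1)
        outs.append(attn @ ckv)
    return torch.cat(outs, dim=0)              # [n_heads, d_c]
```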
This is a stacked PR based on #142, because flash attention/flash decoding are used in the MLA implementation. The code changes for DeepSeek support are in the second commit, 335f3ab.