
Conversation

Nancheng-11
Collaborator

feature - add fmha ut & fix build

feature - add torch mla in pymodel

fix - align deepseekv2 output using hack layer!!

fix - align deepseek v2 output using lite-chat

feature - support prefill & decode mla cpp ops

refactor - mv flashinfer mla ops to fmha.py

fix - add deps in BUILD

@CLAassistant

CLAassistant commented Oct 10, 2025

CLA assistant check
All committers have signed the CLA.

@Nancheng-11 Nancheng-11 force-pushed the feature/pymodel_deepseek branch from 1074bfb to bf3806f Compare October 10, 2025 08:16
@LLLLKKKK
Collaborator

This needs a smoke test.

@Nancheng-11 Nancheng-11 force-pushed the feature/pymodel_deepseek branch from bf3806f to f0b8cc7 Compare October 10, 2025 12:39
@Nancheng-11
Collaborator Author

> This needs a smoke test.

The smoke test and some image dependency packages will be submitted to the main-internal branch together later.

@Nancheng-11 Nancheng-11 force-pushed the feature/pymodel_deepseek branch from 04eeba7 to 51212f5 Compare October 11, 2025 13:23
@LLLLKKKK
Collaborator

> This needs a smoke test.
>
> The smoke test and some image dependency packages will be submitted to the main-internal branch together later.

Submit it to the open_merge branch and run CI there together.

@Nancheng-11 Nancheng-11 force-pushed the feature/pymodel_deepseek branch 10 times, most recently from c12c735 to f860657 Compare October 16, 2025 06:29
@LLLLKKKK LLLLKKKK enabled auto-merge (rebase) October 16, 2025 07:25
        self.token_per_block = token_per_block

    def prepare(self, attention_inputs: PyAttentionInputs):
        return rtp_llm_ops.FlashInferMlaAttnParams().fill_mla_params(
Collaborator

Wouldn't something like rtp_llm_ops.fill_mla_params or rtp_llm_ops.mla.fill_mla_params be enough?

Collaborator Author

@Nancheng-11 Nancheng-11 Oct 16, 2025

I tried that and it didn't work; I was referencing this code:

pybind11::class_<FlashInferPrefillOp>(m, "FlashInferPrefillOp")
    .def(pybind11::init<GptInitParameter>(), py::arg("gpt_init_parameter"))
    .def("support", &FlashInferPrefillOp::support, py::arg("attn_inputs"))
    .def("prepare", &FlashInferPrefillOp::prepare, py::arg("attn_inputs"))
    .def("forward", &FlashInferPrefillOp::forward, py::arg("q"), py::arg("kv_cache"), py::arg("params"));

On the Python side, the pattern is the same: first construct a FlashInferPrefillOp(config.gpt_init_params) object, then call fmha_impl.forward.
Unless we turn this FlashInferMlaAttnParams class into a function; it was originally written that way and was changed into a class later.
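
For reference, here is a rough, hypothetical sketch of the Python-side call pattern described above; `config`, `attn_inputs`, `q`, and `kv_cache` are placeholders rather than the PR's actual code:

```python
# Hypothetical sketch of the call pattern described above, not the PR's code.
# rtp_llm_ops is assumed to be the project's compiled pybind11 extension module;
# config, attn_inputs, q and kv_cache stand in for objects built elsewhere.
import rtp_llm_ops  # actual import path may differ

fmha_impl = rtp_llm_ops.FlashInferPrefillOp(config.gpt_init_params)  # bound ctor takes GptInitParameter

if fmha_impl.support(attn_inputs):             # bound FlashInferPrefillOp::support
    params = fmha_impl.prepare(attn_inputs)    # bound FlashInferPrefillOp::prepare
    out = fmha_impl.forward(q, kv_cache, params)  # bound FlashInferPrefillOp::forward
```

Since the params object is bound the same way (as a class), the Python side also instantiates it before calling its fill method, which is why exposing a bare rtp_llm_ops.fill_mla_params function would require changing the binding.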


PREFILL_MHA_IMPS.append(FlashInferPrefillImpl)

class MlaFlashInferPrefillImpl(FMHAPrefillImplBase):
Collaborator

In the general case, shouldn't prefill just use FlashAttention directly?

Collaborator Author

For MLA prefill, flashinfer recommends using BatchPrefillWithRaggedKVCacheWrapper. The previous C++ MLA prefill used TRTV2, and we could also keep using TRTV2 directly by wrapping a layer of parameter preprocessing around it.
Which of the two is more suitable probably needs a performance comparison.
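
For illustration, here is a minimal sketch of driving flashinfer's BatchPrefillWithRaggedKVCacheWrapper on a ragged prefill batch; the head counts, head dim, and sequence lengths below are assumptions rather than values from this PR, and the exact plan/run signatures may vary across flashinfer versions:

```python
# Minimal sketch of flashinfer ragged-KV prefill (not the PR's implementation).
# Head counts, head_dim, and lengths are illustrative assumptions; an MLA path
# would typically feed its absorbed/compressed Q and K tensors here.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 16, 16, 192   # assumed dims
qo_indptr = torch.tensor([0, 7, 15], dtype=torch.int32, device="cuda")  # 2 requests: 7 and 8 tokens
kv_indptr = qo_indptr.clone()  # ragged KV has the same lengths as Q during prefill

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(workspace, "NHD")

# plan() builds the scheduling metadata once per batch; run() executes the kernel.
wrapper.plan(qo_indptr, kv_indptr, num_qo_heads, num_kv_heads, head_dim, causal=True)

total = int(qo_indptr[-1])
q = torch.randn(total, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(total, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(total, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

out = wrapper.run(q, k, v)  # [total, num_qo_heads, head_dim]
```

The TRTV2 path mentioned above would instead wrap parameter preprocessing around the existing C++ kernel, which is why the comment suggests comparing the two on performance.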

namespace rtp_llm {

MlaParams
FlashInferMlaAttnParams::fillParams(torch::Tensor t_prefix_lengths,
Collaborator

This also shouldn't be MLA-specific, right? Isn't it generic flashinfer params?

Collaborator Author

It was adapted from the generic version. The idea behind factoring out a separate MlaAttnParams is that if the MLA params need special changes later, they can be made in this dedicated FlashInferMlaAttnParams without touching the generic one.
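
As a purely hypothetical illustration of that layering (a Python stand-in for the C++ classes; names and fields are made up), the MLA params start as a thin specialization of the generic flashinfer params and only grow MLA-specific fields if they become necessary:

```python
# Hypothetical Python illustration of the layering described above; the real
# code is C++, and these class/field names are illustrative only.
from dataclasses import dataclass
import torch


@dataclass
class FlashInferAttnParamsBase:
    """Generic flashinfer scheduling params shared by regular attention and MLA."""
    qo_indptr: torch.Tensor
    kv_indptr: torch.Tensor
    page_indices: torch.Tensor


@dataclass
class FlashInferMlaAttnParamsSketch(FlashInferAttnParamsBase):
    """Currently identical to the generic params; any future MLA-only metadata
    would be added here without touching the shared base."""
    pass
```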

@Nancheng-11 Nancheng-11 force-pushed the feature/pymodel_deepseek branch 4 times, most recently from f49b636 to d881269 Compare October 17, 2025 06:57
@Nancheng-11 Nancheng-11 force-pushed the feature/pymodel_deepseek branch from 97ed9be to c0cead4 Compare October 17, 2025 08:22