feature - adapt deepseek in model py #207
base: main
Conversation
Force-pushed from 1074bfb to bf3806f.
A smoke test is needed.
Force-pushed from bf3806f to f0b8cc7.
The smoke test and some image dependency packages will be committed to the main-internal branch together later.
Force-pushed from f0b8cc7 to 7c8e9da.
Force-pushed from 04eeba7 to 51212f5.
Pushed to the open_merge branch to run CI together.
Force-pushed from c12c735 to f860657.
self.token_per_block = token_per_block

def prepare(self, attention_inputs: PyAttentionInputs):
    return rtp_llm_ops.FlashInferMlaAttnParams().fill_mla_params(
Wouldn't rtp_llm_ops.fill_mla_params or rtp_llm_ops.mla.fill_mla_params be enough here?
I tried that and it doesn't work. This is the binding code I referenced:

pybind11::class_<FlashInferPrefillOp>(m, "FlashInferPrefillOp")
    .def(pybind11::init<GptInitParameter>(), py::arg("gpt_init_parameter"))
    .def("support", &FlashInferPrefillOp::support, py::arg("attn_inputs"))
    .def("prepare", &FlashInferPrefillOp::prepare, py::arg("attn_inputs"))
    .def("forward", &FlashInferPrefillOp::forward, py::arg("q"), py::arg("kv_cache"), py::arg("params"));

On the Python side it likewise constructs a FlashInferPrefillOp(config.gpt_init_params) instance first and then calls fmha_impl.forward.
The only way would be to turn this FlashInferMlaAttnParams class into a function; it was written that way before and was later changed into a class.
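For illustration, a minimal sketch of the two Python-side call shapes being discussed: the class-style binding as in this PR, and the hypothetical free-function binding raised by the reviewer (the function form and the attention_inputs argument are assumptions, not taken from this PR):

def build_mla_params(rtp_llm_ops, attention_inputs):
    # Current shape in this PR: FlashInferMlaAttnParams is bound as a class,
    # so the Python side constructs an instance and then fills the params.
    # attention_inputs is assumed to be a PyAttentionInputs, as in prepare() above.
    return rtp_llm_ops.FlashInferMlaAttnParams().fill_mla_params(attention_inputs)

# Hypothetical alternative from the review (not implemented here): bind a free
# function via m.def("fill_mla_params", ...) so callers skip the intermediate object:
# def build_mla_params(rtp_llm_ops, attention_inputs):
#     return rtp_llm_ops.fill_mla_params(attention_inputs)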
PREFILL_MHA_IMPS.append(FlashInferPrefillImpl)

class MlaFlashInferPrefillImpl(FMHAPrefillImplBase):
In the general case, shouldn't prefill just use FlashAttention directly?
For MLA prefill, flashinfer recommends BatchPrefillWithRaggedKVCacheWrapper. The previous C++ MLA prefill used TRTV2, and another option is to wrap the parameter preprocessing in a thin layer and keep using TRTV2 directly.
Which of the two is more suitable probably needs a performance comparison.
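For reference, a rough sketch of how flashinfer's ragged-KV prefill wrapper is typically driven. This is a sketch only: exact plan/run keyword names differ between flashinfer releases (older ones use begin_forward/forward), and the head dims 192/128 and toy shapes are assumptions, not values from this PR.

import torch
import flashinfer  # assumption: a flashinfer release with the plan()/run() API

# Two sequences of lengths 3 and 5; ragged layout uses CSR-style indptr offsets.
qo_indptr = torch.tensor([0, 3, 8], dtype=torch.int32, device="cuda")
kv_indptr = qo_indptr.clone()            # prefill: KV length == query length here

num_qo_heads, num_kv_heads = 16, 16      # toy values, not the real model config
head_dim_qk, head_dim_vo = 192, 128      # MLA-style split (nope 128 + rope 64) -- assumption

q = torch.randn(8, num_qo_heads, head_dim_qk, dtype=torch.float16, device="cuda")
k = torch.randn(8, num_kv_heads, head_dim_qk, dtype=torch.float16, device="cuda")
v = torch.randn(8, num_kv_heads, head_dim_vo, dtype=torch.float16, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(workspace, "NHD")

# plan() does the parameter preprocessing once per batch; run() executes attention.
# Keyword names (head_dim_vo, causal) may vary across flashinfer versions.
wrapper.plan(qo_indptr, kv_indptr, num_qo_heads, num_kv_heads,
             head_dim_qk, head_dim_vo=head_dim_vo, causal=True)
out = wrapper.run(q, k, v)               # [8, num_qo_heads, head_dim_vo]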
namespace rtp_llm {

MlaParams
FlashInferMlaAttnParams::fillParams(torch::Tensor t_prefix_lengths,
This shouldn't be MLA-specific either, right? Aren't these the generic flashinfer params?
It was adapted from the generic version. The idea behind splitting out MlaAttnParams is that if the MLA params need special modifications later, they can be made in this separate FlashInferMlaAttnParams without touching the generic path.
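To illustrate the design choice described above (a sketch only; the real classes live on the C++ side, and these Python names are invented for illustration):

class FlashInferAttnParams:
    """Generic flashinfer attention params: shared preprocessing for all backends."""
    def fill_params(self, attention_inputs):
        # common preprocessing shared by MHA and MLA would go here
        return self

class FlashInferMlaAttnParams(FlashInferAttnParams):
    """MLA-specific extension point: override only what MLA needs to change."""
    def fill_params(self, attention_inputs):
        params = super().fill_params(attention_inputs)
        # future MLA-only adjustments would be made here, leaving the
        # generic path untouched
        return params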
Force-pushed from f49b636 to d881269.
Force-pushed from 97ed9be to c0cead4.
feature - add fmha ut & fix build
feature - add torch mla in pymodel
fix - align deepseekv2 output using hack layer!!
fix - align deepseek v2 output using lite-chat
feature - support prefill & decode mla cpp ops
refactor - mv flashinfer mla ops to fmha.py
fix - add deps in BUILD