Description
Thanks for the great work on dsv32!
From analyzing the fp8_mqa_logits and fp8_paged_mqa_logits functions, it looks like after the q@k computation, no causal masking is applied before the top-k selection?
I know the q/k fed to the mqa_logits kernel differ from the q/k fed to the MLA attention kernel, but the output of the mqa_logits kernel (after the top-k 2048 selection) is used as an index into the real MLA kv-cache, so the real attention computation still has to respect causality, in both the prefill and decode (MTP) cases.
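For context, here is a minimal sketch (PyTorch) of the indexer flow as I understand it; the function name, shapes, and top-k default are my own illustrative assumptions, not the actual DeepGEMM/vLLM code:

```python
import torch

def select_kv_indices(logits: torch.Tensor, topk: int = 2048) -> torch.Tensor:
    # logits: [num_q_tokens, num_kv_tokens], i.e. the q @ k output of the
    # fp8_mqa_logits kernel. Without a causal mask here, top-k is free to pick
    # kv positions that come *after* a query token, and those indices are then
    # used to gather from the real MLA kv-cache for the actual attention.
    topk = min(topk, logits.size(-1))
    return logits.topk(topk, dim=-1).indices  # [num_q_tokens, topk]
```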
vLLM's prefill dispatch, which uses torch.ops._C.top_k_per_row, does not seem to consider causality, while the decode dispatch looks like it does handle the causal case for MTP, after the logits kernel and before the top-k.
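If the masking is indeed meant to happen on the framework side, I would expect something like the following sketch before the top-k (the names, position tensors, and shapes are hypothetical, just to make the question concrete):

```python
import torch

def causal_topk(logits: torch.Tensor, q_pos: torch.Tensor, k_pos: torch.Tensor,
                topk: int = 2048) -> torch.Tensor:
    # logits: [num_q, num_kv]; q_pos: [num_q] and k_pos: [num_kv] are the
    # token positions of the queries and kv entries (hypothetical inputs).
    # Mask out kv positions that lie in the future of each query token, so
    # the subsequent top-k can only select causally valid indices.
    mask = k_pos.unsqueeze(0) > q_pos.unsqueeze(1)           # [num_q, num_kv]
    masked = logits.masked_fill(mask, float("-inf"))
    return masked.topk(min(topk, logits.size(-1)), dim=-1).indices
```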
I'm not sure whether the framework side is supposed to apply the causal mask before the top-k, or whether causality actually doesn't matter inside the indexer kernel?