
Why vcache fix to "None" #46

Open
FlyingPotatoZ opened this issue Feb 27, 2025 · 2 comments
Open

Why vcache fix to "None" #46

FlyingPotatoZ opened this issue Feb 27, 2025 · 2 comments

Comments

@FlyingPotatoZ

Why does flash_mla_with_kvcache fix vcache to None when calling flash_mla_cuda.fwd_kvcache_mla? In the fwd_kvcache_mla C++ code, when no vcache is passed, vcache falls back to kcache:
at::Tensor vcache = vcache_.has_value() ? vcache_.value() : kcache;
It seems that only one cache is needed to complete the MLA calculation. However, I could not find anything in the paper or the open-source deepseek-v3 library saying that the two caches can share a piece of memory. Could you explain this? Thanks a lot!!!
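For reference, the fallback described above can be rendered in Python roughly as follows. This is only an illustration of the quoted C++ line, not the real kernel signature, and the cache shape is made up for the example.

```python
import torch
from typing import Optional

def pick_vcache(kcache: torch.Tensor, vcache: Optional[torch.Tensor]) -> torch.Tensor:
    # Mirrors the quoted C++ ternary:
    #   at::Tensor vcache = vcache_.has_value() ? vcache_.value() : kcache;
    return vcache if vcache is not None else kcache

kcache = torch.randn(1, 4, 576)                # hypothetical [batch, seq, head_dim] cache
vcache = pick_vcache(kcache, None)             # the Python wrapper passes None here
assert vcache.data_ptr() == kcache.data_ptr()  # V is read from the K cache's storage
```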

@wbn03
Copy link

wbn03 commented Feb 27, 2025

FlashMLA/flash_mla/flash_mla_interface.py, line 58 in 480405a:

None,

This implementation uses the absorb method. The input to attention is the compressed kv_lora, which is shared by the k and v caches; the content of the v cache is already contained in the k cache.
Here is the MLA implementation in DeepSeek-V3: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py#L393
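To make the absorb formulation concrete, here is a minimal, non-paged sketch. It assumes DeepSeek-V3-style sizes (kv_lora_rank = 512 plus a 64-dim decoupled RoPE key) and queries already projected into the latent space by absorbing W^UK / W^UV; with that, a single 576-wide cache acts as both K and V.

```python
import torch

kv_lora_rank, rope_dim = 512, 64          # assumed DeepSeek-V3-style sizes
batch, heads, seqlen = 1, 16, 32

# Single cache per token: compressed latent kv_lora plus the decoupled RoPE key.
kv_cache = torch.randn(batch, seqlen, kv_lora_rank + rope_dim)   # 576 wide

# Queries in the absorbed space: the no-RoPE part is already multiplied by W^UK,
# so it attends directly against kv_lora; the RoPE part attends against k_pe.
q = torch.randn(batch, heads, kv_lora_rank + rope_dim)

scale = (kv_lora_rank + rope_dim) ** -0.5      # placeholder scale for this sketch
scores = torch.einsum("bhd,bsd->bhs", q, kv_cache) * scale
attn = scores.softmax(dim=-1)

# V is just the latent part of the same cache; per-head values are recovered
# later by multiplying with W^UV (absorbed into the output projection).
out_latent = torch.einsum("bhs,bsc->bhc", attn, kv_cache[..., :kv_lora_rank])
print(out_latent.shape)   # torch.Size([1, 16, 512])
```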

@FlyingPotatoZ
Author


OK, I understand, thanks for your explanation. So the two caches (kv_lora and kpe), as I understand it, are merged into one cache? If my understanding is correct, I think this issue can be closed. Thank you very much!
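If it helps, the merged cache can be pictured as a single buffer whose last features are the RoPE key; kv_lora and kpe are then just views of the same storage and no separate v cache is ever written. The sizes below are assumed from the DeepSeek-V3 config, and the layout is illustrative rather than FlashMLA's exact paged format.

```python
import torch

kv_lora_rank, rope_dim = 512, 64                   # assumed DeepSeek-V3 sizes
cache = torch.zeros(8, kv_lora_rank + rope_dim)    # one 576-wide row per cached token

def append_token(cache, pos, kv_lora, k_pe):
    # Write both pieces into the same row: one cache, no separate v storage.
    cache[pos, :kv_lora_rank] = kv_lora
    cache[pos, kv_lora_rank:] = k_pe

append_token(cache, 0, torch.randn(kv_lora_rank), torch.randn(rope_dim))

kv_lora_view = cache[:, :kv_lora_rank]   # the compressed latent (also serves as V)
k_pe_view    = cache[:, kv_lora_rank:]   # the decoupled RoPE key
assert kv_lora_view.data_ptr() == cache.data_ptr()   # views of one buffer, not copies
```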
