
Why vcache fix to "None" #46

Open
FlyingPotatoZ opened this issue Feb 27, 2025 · 2 comments
Open

Why vcache fix to "None" #46

FlyingPotatoZ opened this issue Feb 27, 2025 · 2 comments

Comments

@FlyingPotatoZ

Why does flash_mla_with_kvcache fix vcache to None when calling flash_mla_cuda.fwd_kvcache_mla? In the fwd_kvcache_mla C++ code, when no vcache is passed, vcache falls back to kcache:
at::Tensor vcache = vcache_.has_value() ? vcache_.value() : kcache;
It seems that only one cache is needed to complete the MLA calculation. However, I could not find anything in the paper or the open-source deepseek-v3 library saying that the two caches can share a piece of memory. Could you explain this? Thanks a lot!!!
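For reference, the fallback described above can be rendered in Python roughly as follows. This is only an illustration of the quoted C++ line, not the real kernel signature, and the cache shape is made up for the example.

```python
import torch
from typing import Optional

def pick_vcache(kcache: torch.Tensor, vcache: Optional[torch.Tensor]) -> torch.Tensor:
    # Mirrors the quoted C++ ternary:
    #   at::Tensor vcache = vcache_.has_value() ? vcache_.value() : kcache;
    return vcache if vcache is not None else kcache

kcache = torch.randn(1, 4, 576)                # hypothetical [batch, seq, head_dim] cache
vcache = pick_vcache(kcache, None)             # the Python wrapper passes None here
assert vcache.data_ptr() == kcache.data_ptr()  # V is read from the K cache's storage
```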

@wbn03
Copy link

wbn03 commented Feb 27, 2025

FlashMLA/flash_mla/flash_mla_interface.py, line 58 in 480405a:

None,

This implementation uses the absorb method. The input to attention is the compressed kv_lora, which is shared by the k and v caches; the content of the v cache is already contained in the k cache.
Here is the MLA implementation in DeepSeek-V3: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py#L393
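To make the absorb formulation concrete, here is a minimal, non-paged sketch. It assumes DeepSeek-V3-style sizes (kv_lora_rank = 512 plus a 64-dim decoupled RoPE key) and queries already projected into the latent space by absorbing W^UK / W^UV; with that, a single 576-wide cache acts as both K and V.

```python
import torch

kv_lora_rank, rope_dim = 512, 64          # assumed DeepSeek-V3-style sizes
batch, heads, seqlen = 1, 16, 32

# Single cache per token: compressed latent kv_lora plus the decoupled RoPE key.
kv_cache = torch.randn(batch, seqlen, kv_lora_rank + rope_dim)   # 576 wide

# Queries in the absorbed space: the no-RoPE part is already multiplied by W^UK,
# so it attends directly against kv_lora; the RoPE part attends against k_pe.
q = torch.randn(batch, heads, kv_lora_rank + rope_dim)

scale = (kv_lora_rank + rope_dim) ** -0.5      # placeholder scale for this sketch
scores = torch.einsum("bhd,bsd->bhs", q, kv_cache) * scale
attn = scores.softmax(dim=-1)

# V is just the latent part of the same cache; per-head values are recovered
# later by multiplying with W^UV (absorbed into the output projection).
out_latent = torch.einsum("bhs,bsc->bhc", attn, kv_cache[..., :kv_lora_rank])
print(out_latent.shape)   # torch.Size([1, 16, 512])
```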

@FlyingPotatoZ
Author


OK, I understand, thanks for your explanation. So the two caches (kv_lora and kpe), as I understand it, are merged into one cache? If my understanding is correct, I think this issue can be closed. Thank you very much!
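If it helps, the merged cache can be pictured as a single buffer whose last features are the RoPE key; kv_lora and kpe are then just views of the same storage and no separate v cache is ever written. The sizes below are assumed from the DeepSeek-V3 config, and the layout is illustrative rather than FlashMLA's exact paged format.

```python
import torch

kv_lora_rank, rope_dim = 512, 64                   # assumed DeepSeek-V3 sizes
cache = torch.zeros(8, kv_lora_rank + rope_dim)    # one 576-wide row per cached token

def append_token(cache, pos, kv_lora, k_pe):
    # Write both pieces into the same row: one cache, no separate v storage.
    cache[pos, :kv_lora_rank] = kv_lora
    cache[pos, kv_lora_rank:] = k_pe

append_token(cache, 0, torch.randn(kv_lora_rank), torch.randn(rope_dim))

kv_lora_view = cache[:, :kv_lora_rank]   # the compressed latent (also serves as V)
k_pe_view    = cache[:, kv_lora_rank:]   # the decoupled RoPE key
assert kv_lora_view.data_ptr() == cache.data_ptr()   # views of one buffer, not copies
```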
