Why is vcache fixed to "None"? #46
Comments
This implementation uses the absorb method. The input to attention is the compressed kv_lora, which is shared by the K and V caches: the content of the V cache is already contained in the K cache.
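A minimal sketch of the absorption trick may make this concrete (it is not FlashMLA's kernel code; single-head shapes are assumed and the decoupled RoPE part is omitted). W_UK / W_UV are the key/value up-projections from the DeepSeek-V2 paper, and c_kv is the compressed latent, which is the only thing that needs to be cached:

```python
import torch

torch.manual_seed(0)
d_c, d_head, seq = 512, 128, 16
W_UK = torch.randn(d_c, d_head)   # key up-projection
W_UV = torch.randn(d_c, d_head)   # value up-projection
c_kv = torch.randn(seq, d_c)      # compressed KV latent (the shared cache)
q    = torch.randn(1, d_head)     # query for the current decode step

# Naive path: materialize K and V from the latent, then attend.
k = c_kv @ W_UK
v = c_kv @ W_UV
p = torch.softmax(q @ k.T / d_head**0.5, dim=-1)
out_naive = p @ v

# Absorbed path: fold W_UK into the query and W_UV into the output, so the
# attention itself only ever reads c_kv -- K and V are the same buffer.
q_abs = q @ W_UK.T
p = torch.softmax(q_abs @ c_kv.T / d_head**0.5, dim=-1)
out_absorbed = (p @ c_kv) @ W_UV

print(torch.allclose(out_naive, out_absorbed, atol=1e-3))  # True
```

Because the attention scores and outputs can both be computed directly against c_kv, a separate V cache would just duplicate the K cache's latent part.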
OK, I understand, thanks for your explanation. So the two caches (kv_lora and kpe) are, as I understand it, merged into one cache? If my understanding is correct, I think this issue can be closed. Thank you very much!
FlashMLA/flash_mla/flash_mla_interface.py
Line 58 in 480405a
Why does flash_mla_with_kvcache fix vcache to None when calling flash_mla_cuda.fwd_kvcache_mla? In the fwd_kvcache_mla C++ code, when vcache is not provided, vcache is set equal to kcache:
at::Tensor vcache = vcache_.has_value() ? vcache_.value() : kcache;
It seems that only one cache is needed to complete the MLA calculation? However, in the paper and the open-source deepseek-v3 library, I did not find that two caches can share a piece of memory. Could you explain this? Thanks a lot!
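For illustration, here is a hedged sketch of how two logical caches can share one piece of memory (the dimensions follow the DeepSeek-V3 / FlashMLA defaults of a 512-dim compressed latent plus a 64-dim decoupled RoPE key; this is not the library's internal code, just a picture of the aliasing):

```python
import torch

num_blocks, block_size, d_latent, d_rope = 4, 64, 512, 64
# One paged cache whose last dim holds kv_lora (512) followed by k_pe (64).
kv_cache = torch.zeros(num_blocks, block_size, 1, d_latent + d_rope)

# The "K cache" is the whole 576-dim entry (latent + RoPE part) ...
k_view = kv_cache
# ... while the "V cache" is just the first 512 dims of the same storage,
# which is why the wrapper can pass vcache=None and let the kernel alias kcache.
v_view = kv_cache[..., :d_latent]

assert v_view.data_ptr() == kv_cache.data_ptr()  # same memory, no copy
```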