Thank you for your impressive work on AnDPro!
I am currently trying to reproduce the results presented in your paper. I have successfully reproduced the performance on Mistral-7B-Instruct-v0.2, matching the results reported in Table 11.
However, I am facing difficulties reproducing the results on Llama-3.1-8B-Instruct. According to Table 3 in Appendix C.12, AnDPro should also achieve SOTA performance on Llama-3.1. In my experiments, however, the performance drops significantly on Llama-3.1 compared to the full-cache baseline, whereas the same setup works perfectly on Mistral.
My Setup:
● Model: Llama-3.1-8B-Instruct
● Hyperparameters: Aligned with the paper (Window Size=32, Chunk Size=4, $b=0$).
● Method: Using cross-head budget allocation and chunking as described in the implementation details.
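For concreteness, here is a minimal sketch of the configuration I am running. The dictionary keys and the `chunk_indices` helper are my own naming for illustration, not AnDPro's actual API; please let me know if any value diverges from your reference setup.

```python
# Hypothetical reproduction settings (names are mine, not AnDPro's API).
config = {
    "model": "Llama-3.1-8B-Instruct",
    "window_size": 32,   # local observation window, as in the paper
    "chunk_size": 4,     # chunk granularity for selection
    "b": 0,              # the paper's b hyperparameter
}

def chunk_indices(n_tokens: int, chunk_size: int) -> list[list[int]]:
    """Group token positions into contiguous chunks, matching my
    understanding of the chunking step (last chunk may be shorter)."""
    return [
        list(range(start, min(start + chunk_size, n_tokens)))
        for start in range(0, n_tokens, chunk_size)
    ]
```

If the intended chunking differs (e.g. chunks aligned to the window rather than to position 0), that could already explain part of the gap I am seeing.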
Any guidance or reference implementation details regarding Llama-3.1 would be greatly appreciated.