Thank you for your impressive work on AnDPro!
I am currently trying to reproduce the results presented in your paper. I have successfully reproduced the performance on Mistral-7B-Instruct-v0.2, matching the results reported in Table 11.
However, I am facing difficulties reproducing the results on Llama-3.1-8B-Instruct. According to Table 3 in Appendix C.12, AnDPro should also achieve SOTA performance on Llama-3.1. In my experiments, however, the performance drops significantly on Llama-3.1 compared to the full-cache baseline, whereas the same setup works perfectly on Mistral.
My Setup:
● Model: Llama-3.1-8B-Instruct
● Hyperparameters: Aligned with the paper (Window Size=32, Chunk Size=4, $b=0$).
● Method: Using cross-head budget allocation and chunking as described in the implementation details.
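For concreteness, here is a minimal sketch of the configuration I am running. The dictionary keys and the `chunk_indices` helper are my own naming for illustration, not AnDPro's actual API; please let me know if any value diverges from your reference setup.

```python
# Hypothetical reproduction settings (names are mine, not AnDPro's API).
config = {
    "model": "Llama-3.1-8B-Instruct",
    "window_size": 32,   # local observation window, as in the paper
    "chunk_size": 4,     # chunk granularity for selection
    "b": 0,              # the paper's b hyperparameter
}

def chunk_indices(n_tokens: int, chunk_size: int) -> list[list[int]]:
    """Group token positions into contiguous chunks, matching my
    understanding of the chunking step (last chunk may be shorter)."""
    return [
        list(range(start, min(start + chunk_size, n_tokens)))
        for start in range(0, n_tokens, chunk_size)
    ]
```

If the intended chunking differs (e.g. chunks aligned to the window rather than to position 0), that could already explain part of the gap I am seeing.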
Any guidance or reference implementation details regarding Llama-3.1 would be greatly appreciated.