Reproducing the results with LLaMa and TriviaQA (Figure 8) #12

YasamanJafari · 2024-07-23T19:55:07Z

Hi,

Thank you for the excellent paper and for providing the code! I have been trying to reproduce the results from Figure 8 of the paper using LLaMa-7B and LLaMa-13B on the TriviaQA dataset, which I downloaded using the command in the README.
However, I get the following values:

7B:
0 docs: 50.8, 1 doc: 54.1, 2 docs: 55.9, 3 docs: 56.4

13B:
0 docs: 57.8, 1 doc: 58.8, 2 docs: 59.8, 3 docs: 60.4

Could you share any insight into what might explain this discrepancy?
(The numbers for 1-3 documents are similar to those in the paper, but there is a ~3% gap for 0 documents.)

oriram · 2024-07-24T12:50:42Z

Hey Yasaman, thanks for reaching out! I'm not sure what causes this discrepancy. Two directions to look at:

1. Are the results for NQ the same as reported?
2. Did you use LLaMA 2? The results in the paper are for LLaMA 1.
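
One rough way to check the second point locally is to inspect the checkpoint's configuration. The sketch below is only a heuristic, not part of this repository: it assumes a Hugging Face-style checkpoint directory containing a `config.json`, and it assumes that LLaMA 1 conversions report `max_position_embeddings` of 2048 while Llama 2 conversions report 4096. The function name `guess_llama_generation` is hypothetical, and some converted checkpoints may carry a different value, so treat the answer as a hint rather than proof.

```python
import json
from pathlib import Path


def guess_llama_generation(checkpoint_dir):
    """Heuristically guess the LLaMA generation from a config.json file.

    Assumption (not confirmed by this thread): LLaMA 1 conversions use a
    context length of 2048 and Llama 2 conversions use 4096.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    context_length = config.get("max_position_embeddings")
    if context_length == 2048:
        return "likely LLaMA 1"
    if context_length == 4096:
        return "likely Llama 2"
    # Missing key or an unexpected value: do not guess.
    return "unknown"
```

Calling `guess_llama_generation("/path/to/checkpoint")` on the directory passed to the evaluation script gives a quick sanity check before rerunning the experiments.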


YasamanJafari · 2024-07-26T17:40:00Z

Thank you for your response!

The same thing happens when experimenting with the NQ dataset. The results for 1-3 documents are very similar, but there is a noticeable difference between the results for 0 documents. The results I get are as follows:

LLaMa1-7B:
0 docs: 14.6%, 1 doc: 28.4%, 2 docs: 28.6%, 3 docs: 28.1%

LLaMa1-13B:
0 docs: 18.3%, 1 doc: 30.4%, 2 docs: 30.3%, 3 docs: 30.5%

I am using LLaMa 1. Is there a specific checkpoint you are using that may explain the discrepancy?
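
For easier comparison against Figure 8, the reproduced numbers from this thread can be tabulated and the gain of each retrieval setting over the 0-document baseline computed. This is a minimal sketch that uses only the scores posted above (the paper's own values are not included, since they are not quoted in this thread), and the helper name `gains_over_zero_docs` is made up for illustration.

```python
# Reproduced scores copied from this thread, ordered as 0, 1, 2, 3 docs.
reproduced = {
    ("TriviaQA", "LLaMA-7B"): [50.8, 54.1, 55.9, 56.4],
    ("TriviaQA", "LLaMA-13B"): [57.8, 58.8, 59.8, 60.4],
    ("NQ", "LLaMA-7B"): [14.6, 28.4, 28.6, 28.1],
    ("NQ", "LLaMA-13B"): [18.3, 30.4, 30.3, 30.5],
}


def gains_over_zero_docs(scores):
    """Return each k-doc score minus the 0-doc score, rounded to 0.1."""
    baseline = scores[0]
    return [round(score - baseline, 1) for score in scores[1:]]


retrieval_gains = {
    key: gains_over_zero_docs(scores) for key, scores in reproduced.items()
}

for (dataset, model), gains in retrieval_gains.items():
    print(f"{dataset:9s} {model:10s} gain over 0 docs (1, 2, 3 docs): {gains}")
```

Because only the 0-document scores differ from the paper, any such difference carries over directly into the apparent benefit of retrieval, which is why the baseline is worth pinning down first.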

YasamanJafari · 2024-08-05T17:11:45Z

Hi again, I just wanted to follow up on this and check if there have been any updates about this discrepancy!
