Hello, thank you for the great repo.

Unfortunately, I'm having trouble reproducing the accuracy reported in the paper, particularly for the Llama-2 13B chat model. I downloaded the synthetic trajectories provided on Google Drive and followed all the steps. Self-differentiation and group planning seem to work, but the accuracy is lower than in the paper.

The Llama-2 13B chat model's average F1 score on the easy HotpotQA data reaches ~38% (trajectories generated by the 13B model) and ~44% (trajectories generated by the 70B model). The number of trajectories is 200, so I expected ~50% and ~60%, as shown in Figure 3(b) and (f).

Could you please provide the 13B model files (PEFT adapters), by any chance?

Also, just to make sure, could you please confirm that I'm loading the correct model for fine-tuning? My command was: first, `python3 -m fastchat.serve.model_worker --port 21002 --worker http://localhost:21002 --model-names llama-2-13b-chat --model-path meta-llama/Llama-2-13b-chat-hf`, and then `Scripts/fastchat_lora.sh`.
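For context, the full launch sequence I'm assuming is roughly the following (the controller command comes from the FastChat docs rather than from this repo, so please correct me if the repo expects a different setup):

```bash
# Start the FastChat controller (listens on http://localhost:21001 by default).
python3 -m fastchat.serve.controller &

# Register Llama-2 13B chat as a model worker on port 21002.
python3 -m fastchat.serve.model_worker \
    --port 21002 --worker http://localhost:21002 \
    --model-names llama-2-13b-chat \
    --model-path meta-llama/Llama-2-13b-chat-hf &

# Then run the repo's LoRA fine-tuning script.
bash Scripts/fastchat_lora.sh
```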
Hi, we will address this issue as soon as possible. We suggest you retry running it a few times, as different GPUs and environments may introduce some variance.
Thank you for the prompt response and suggestion! I'm rerunning it now. I see some variance (e.g., ~41%, ~42%, ~44% for the 70B trajectories with the 13B model), but it's hard to reach a similar level of accuracy (~60%). It would be really helpful if we could use the PEFT model checkpoints to check how the losses evolve and to evaluate them.
Appreciate it! I understand there are scripts for loading models (under Scripts). But could you please share the LoRA checkpoints as well (i.e., the files saved after LoRA fine-tuning)? My email is [email protected].
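To be concrete about which files I mean: I'd expect the LoRA output directory to contain the standard PEFT adapter files, roughly like this (the directory name below is just a placeholder):

```bash
# Hypothetical checkpoint directory; the layout is the usual PEFT adapter output.
ls lora-checkpoints/llama-2-13b-chat/
# adapter_config.json
# adapter_model.bin   # or adapter_model.safetensors, depending on the PEFT version
# trainer_state.json  # if the HF Trainer state was saved; useful for inspecting losses
```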
I think the trajectory results are already shared in the repo (via Google Drive).