
unable to reproduce the same accuracy #16

Open
beomseokg opened this issue Feb 3, 2025 · 4 comments
Labels
question Further information is requested

Comments

@beomseokg

Hello, thank you for the great repo.

Unfortunately, I'm having trouble reproducing accuracy similar to the paper's, particularly for the Llama-2 13B chat model. I downloaded the synthetic trajectories provided on Google Drive and followed all the steps. Self-differentiation and group planning seem to work, but the accuracy is lower than reported in the paper.

The Llama-2 13B chat model's average F1 score on easy HotpotQA data reaches ~38% (trajectories generated from 13B) and ~44% (trajectories generated from 70B). The number of trajectories is 200, so I expected ~50% and ~60%, as shown in Figure 3(b) and (f).
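For reference, this is how I'm computing F1 — a minimal sketch of the standard SQuAD/HotpotQA token-level F1, in case a metric mismatch explains part of the gap (I'm assuming the repo's evaluation uses this normalization; please correct me if it differs):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the Eiffel Tower", "Eiffel Tower"))  # → 1.0
```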

  1. Could you please provide the 13B model files (PEFT adapters), by any chance?
  2. Just to make sure, could you confirm I'm loading the correct model for fine-tuning? My commands were:

First:

```shell
python3 -m fastchat.serve.model_worker --port 21002 --worker http://localhost:21002 \
    --model-names llama-2-13b-chat --model-path meta-llama/Llama-2-13b-chat-hf
```

and then:

```shell
Scripts/fastchat_lora.sh
```

@zxlzr
Contributor

zxlzr commented Feb 3, 2025

Hi, we will address this issue as soon as possible. We suggest retrying the run a few times, as different GPUs and environments may introduce some variance.

@zxlzr zxlzr added the question Further information is requested label Feb 3, 2025
@beomseokg
Author

Thank you for the prompt response and suggestion! I'm retrying it now. I do see some variance (e.g., ~41%, ~42%, ~44% for the 70B trajectories with the 13B model), but it's hard to reach a similar level of accuracy (~60%). It would be really helpful if we could use the PEFT model checkpoints to check how the losses evolve and to evaluate them.
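In the meantime, this is roughly how I'd inspect loss evolution from a checkpoint, assuming the fine-tuning goes through the Hugging Face Trainer, which writes a trainer_state.json next to the adapter weights (the sample log_history below is made up purely for illustration):

```python
def loss_curve(trainer_state: dict) -> list:
    """Extract (step, training loss) pairs from a HF Trainer state dict.

    Each log_history entry with a "loss" key is a training-loss log step;
    eval entries use "eval_loss" instead and are skipped here.
    """
    return [
        (entry["step"], entry["loss"])
        for entry in trainer_state.get("log_history", [])
        if "loss" in entry
    ]

# In practice: state = json.load(open("checkpoint-200/trainer_state.json"))
# Made-up example in the shape Trainer writes:
state = {
    "log_history": [
        {"step": 10, "loss": 1.92, "learning_rate": 2e-4},
        {"step": 20, "loss": 1.45, "learning_rate": 2e-4},
        {"step": 20, "eval_loss": 1.50},
    ]
}
print(loss_curve(state))  # → [(10, 1.92), (20, 1.45)]
```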

@Rolnand
Contributor

Rolnand commented Feb 3, 2025

You can refer to this link to deploy the model: Scripts

If you still have problems, you can leave your email address and we will send you the trajectory results we saved in the experiment.

@beomseokg
Author

Appreciate it! I understand there are Scripts for loading models, but could you please share the LoRA checkpoints as well (i.e., the files saved after LoRA fine-tuning)? My email is [email protected].

I think the trajectory results are already shared in the repo (Google Drive).
