I evaluated the model using lm-evaluation-harness on MedMCQA, MedQA-USMLE, and PubMedQA, and it performs barely above Llama 2 7B: only 38% on MedQA-USMLE, 36% on MedMCQA, and 73.9% on PubMedQA.
Could you describe how you got your results?
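For reference, an lm-evaluation-harness run over these tasks looks roughly like the sketch below. The task names, model path, and harness version are assumptions (they differ between releases), so treat this as an illustration rather than the exact command used:

```python
# Rough sketch of an lm-evaluation-harness run (v0.4-style Python API).
# Task names and the pretrained model path below are assumptions, not the
# exact values from the original evaluation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llSourcell/DoctorGPT",        # hypothetical HF path
    tasks=["medmcqa", "medqa_4options", "pubmedqa"],      # names vary by version
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. acc / acc_norm
```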
Hey Doc! The evaluation function I used is in the .ipynb attached in the repository. I scored answers with a semantic-similarity threshold against the possible USMLE answer choices, so a response doesn't have to match verbatim to count as correct; that's why the accuracy was higher. I'm also about to release a new fine-tuned model next week. The goal here is to keep on improving. I just merged my first PR and posted a paid bounty last week for UI issues. Would love your help!
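In case it helps anyone reading along, the idea is roughly like the sketch below: the model's free-text answer is matched to the closest answer choice by embedding similarity instead of requiring an exact match. The embedding model and the 0.75 threshold are placeholders; the actual logic lives in the notebook.

```python
# Minimal sketch of similarity-based grading: map a free-text answer to the
# nearest answer choice by cosine similarity of sentence embeddings, and count
# it correct only if that nearest choice is the gold one and is similar enough.
# The encoder and threshold are placeholders, not the notebook's exact values.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def grade(model_answer: str, choices: list[str], gold_index: int,
          threshold: float = 0.75) -> bool:
    """Return True if the closest choice is the gold one and clears the threshold."""
    answer_emb = encoder.encode(model_answer, convert_to_tensor=True)
    choice_embs = encoder.encode(choices, convert_to_tensor=True)
    sims = util.cos_sim(answer_emb, choice_embs)[0]  # similarity to each choice
    best = int(sims.argmax())
    return best == gold_index and float(sims[best]) >= threshold

# Example: a paraphrased answer can still count as the right choice.
choices = ["Lisinopril", "Metformin", "Atorvastatin", "Warfarin"]
print(grade("I would start the patient on an ACE inhibitor such as lisinopril",
            choices, gold_index=0))
```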
pterameta pushed a commit to pterameta/DoctorGPT that referenced this issue on Sep 20, 2023.