LLAMA2-7b PPL Scores #8

Open
jzhoubu opened this issue Mar 23, 2024 · 0 comments

jzhoubu commented Mar 23, 2024

Hi, thank you for sharing this interesting and insightful work. I am reaching out to ask about the perplexity scores for the LLAMA2-7b model reported in Table 2. I followed the evaluation script in your repository and evaluated the model with the settings below:

MODEL_NAME=meta-llama/Llama-2-7b-hf
OUTPUT_DIR=./eval_output   # placeholder; set to wherever the results should be written
CUDA_VISIBLE_DEVICES=2 python eval_lm.py \
    --model_name $MODEL_NAME \
    --dataset_path wikitext \
    --dataset_name wikitext-103-v1 \
    --dataset_split test \
    --output_dir $OUTPUT_DIR \
    --stride 32 \
    --max_length 1024
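
For context, my understanding of how --stride and --max_length interact is roughly the sliding-window evaluation below (a minimal sketch in the style of the HuggingFace perplexity example; eval_lm.py may differ in details such as tokenization, NLL aggregation, or how words are counted):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from datasets import load_dataset

    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()

    text = "\n\n".join(load_dataset("wikitext", "wikitext-103-v1", split="test")["text"])
    encodings = tokenizer(text, return_tensors="pt")

    max_length, stride = 1024, 32
    seq_len = encodings.input_ids.size(1)

    nlls, n_scored_tokens, prev_end = [], 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end                 # only score tokens not scored in a previous window
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100          # mask the overlapping context tokens

        with torch.no_grad():
            # loss is the mean NLL over the unmasked target tokens
            # (approximate: the label shift drops one position per window)
            loss = model(input_ids, labels=target_ids).loss

        nlls.append(loss * trg_len)
        n_scored_tokens += trg_len
        prev_end = end
        if end == seq_len:
            break

    token_ppl = torch.exp(torch.stack(nlls).sum() / n_scored_tokens)
    print(f"token-level PPL: {token_ppl.item():.2f}")
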

The results I obtained were as follows:

  • Word-level PPL: 15.3
  • Token-level PPL: 7.54

Additionally, with a stride of 4, I observed:

  • Word-level PPL: 15.31

I was wondering if you could provide some insight into the observed discrepancy with the reported scores. Is there a particular aspect of the evaluation setup that I might have overlooked, or are such differences within an expected range under certain conditions (e.g., GPU device)?
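
For what it's worth, the two numbers above look internally consistent if word-level PPL is simply the same total NLL renormalized by the word count instead of the token count (I'm assuming that is the definition used; please correct me if not). A small sanity check:

    import math

    def word_level_ppl(token_ppl: float, n_tokens: int, n_words: int) -> float:
        # Same total NLL, renormalized per word instead of per token.
        total_nll = n_tokens * math.log(token_ppl)
        return math.exp(total_nll / n_words)

    # With the numbers above, ln(15.3) / ln(7.54) ~= 1.35, i.e. roughly
    # 1.35 Llama tokens per WikiText word under this assumption.
    print(math.log(15.3) / math.log(7.54))
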
