LLAMA2-7b PPL Scores #8

Open
jzhoubu opened this issue Mar 23, 2024 · 0 comments

jzhoubu commented Mar 23, 2024

Hi, thank you for sharing this interesting and insightful work. I am reaching out to ask about the perplexity scores for the LLAMA2-7b model reported in Table 2. I followed the evaluation script in your repository and evaluated the model with the settings below:

MODEL_NAME=meta-llama/Llama-2-7b-hf
OUTPUT_DIR=./eval_output   # placeholder; set to wherever the results should be written
CUDA_VISIBLE_DEVICES=2 python eval_lm.py \
    --model_name $MODEL_NAME \
    --dataset_path wikitext \
    --dataset_name wikitext-103-v1 \
    --dataset_split test \
    --output_dir $OUTPUT_DIR \
    --stride 32 \
    --max_length 1024
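
For context, my understanding of how --stride and --max_length interact is roughly the sliding-window evaluation below (a minimal sketch in the style of the HuggingFace perplexity example; eval_lm.py may differ in details such as tokenization, NLL aggregation, or how words are counted):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from datasets import load_dataset

    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()

    text = "\n\n".join(load_dataset("wikitext", "wikitext-103-v1", split="test")["text"])
    encodings = tokenizer(text, return_tensors="pt")

    max_length, stride = 1024, 32
    seq_len = encodings.input_ids.size(1)

    nlls, n_scored_tokens, prev_end = [], 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end                 # only score tokens not scored in a previous window
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100          # mask the overlapping context tokens

        with torch.no_grad():
            # loss is the mean NLL over the unmasked target tokens
            # (approximate: the label shift drops one position per window)
            loss = model(input_ids, labels=target_ids).loss

        nlls.append(loss * trg_len)
        n_scored_tokens += trg_len
        prev_end = end
        if end == seq_len:
            break

    token_ppl = torch.exp(torch.stack(nlls).sum() / n_scored_tokens)
    print(f"token-level PPL: {token_ppl.item():.2f}")
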

The results I obtained were as follows:

  • Word-level PPL: 15.3
  • Token-level PPL: 7.54

Additionally, with a stride of 4, I observed:

  • Word-level PPL: 15.31

I was wondering if you could provide some insight into the observed discrepancy with the reported scores. Is there a particular aspect of the evaluation setup that I might have overlooked, or are such differences within an expected range under certain conditions (e.g., GPU device)?
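
For what it's worth, the two numbers above look internally consistent if word-level PPL is simply the same total NLL renormalized by the word count instead of the token count (I'm assuming that is the definition used; please correct me if not). A small sanity check:

    import math

    def word_level_ppl(token_ppl: float, n_tokens: int, n_words: int) -> float:
        # Same total NLL, renormalized per word instead of per token.
        total_nll = n_tokens * math.log(token_ppl)
        return math.exp(total_nll / n_words)

    # With the numbers above, ln(15.3) / ln(7.54) ~= 1.35, i.e. roughly
    # 1.35 Llama tokens per WikiText word under this assumption.
    print(math.log(15.3) / math.log(7.54))
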
