Hi YaRN team,
I hope this message finds you well. I've been using your repository jquesnelle/yarn to evaluate perplexity on the PG19 dataset. While reviewing the eval.sh script, I noticed some definitions related to PG19, but the code that actually computes the perplexity results is unclear to me.
Settings:
Base Model: llama2-7b
Base Context Size: 4096
Sliding Window: 256, 4096
Scale to: 8192
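For the PI and dy-ntk baselines in this setup, I scale the 4096 base context to 8192, i.e. a factor of 2. Below is a minimal sketch of how I assume such baselines can be loaded via the standard transformers rope_scaling option; whether this matches the repo's own patching code is my assumption.

```python
from transformers import AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-7b-hf"
FACTOR = 8192 / 4096  # scale the 4096 base context to 8192

# PI baseline: linear position interpolation.
pi_model = AutoModelForCausalLM.from_pretrained(
    MODEL, rope_scaling={"type": "linear", "factor": FACTOR}, torch_dtype="auto"
)

# dy-ntk baseline: dynamic NTK-aware scaling.
dyntk_model = AutoModelForCausalLM.from_pretrained(
    MODEL, rope_scaling={"type": "dynamic", "factor": FACTOR}, torch_dtype="auto"
)
```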
In eval.sh, I found the following definition for the PG19 dataset:
# python eval/perplexity.py -m meta-llama/Llama-2-7b-hf --dataset pg19 --split test --feature text --save-tokenized output/pg19-test-tokenized
PG19="--tokenized emozilla/pg19-test-tokenized"
However, I did not find the code that actually computes the perplexity results, so I tested with my own evaluation code.
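My script is not part of the repository, so as a point of reference, here is a minimal sketch of the kind of sliding-window perplexity loop I mean (illustrative only; the function name and the exact label masking are my own, following the usual fixed-length-model perplexity recipe):

```python
import torch

def sliding_window_ppl(model, input_ids, context_size=8192, sliding_window=256):
    """Illustrative sliding-window perplexity over one long tokenized document.

    input_ids: 1-D LongTensor of token ids.
    context_size: evaluation context length (8192 in my experiments).
    sliding_window: stride between windows (256 or 4096 in my experiments).
    """
    device = next(model.parameters()).device
    seq_len = input_ids.size(0)
    nlls, prev_end = [], 0
    for begin in range(0, seq_len, sliding_window):
        end = min(begin + context_size, seq_len)
        ids = input_ids[begin:end].unsqueeze(0).to(device)
        labels = ids.clone()
        # Only score tokens not already scored by the previous window,
        # so each token contributes one negative log-likelihood term.
        new_tokens = end - prev_end
        labels[:, :-new_tokens] = -100
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nlls.append(loss * new_tokens)
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / seq_len)
```

In my runs, only the sliding_window argument differs between the 256 and 4096 settings; everything else is identical.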
I observed that the results differ noticeably between a sliding window of 4096 and one of 256. Compared with the PI and dy-ntk methods, the results for meta-llama/Llama-2-7b-hf are unstable with a sliding window of 256 but stable with a sliding window of 4096.
Results (perplexity at 8192 tokens):
--sliding-window 4096: meta-llama/Llama-2-7b-hf: 8192 = 9.89344
--sliding-window 256: meta-llama/Llama-2-7b-hf: 8192 = 32.76145
In contrast, the PI and dy-ntk methods maintain relatively stable perplexity with the sliding window set to either 4096 or 256:
Sliding window: 4096 / 256
PI: 10.79598 / 10.65644
dy-ntk: 10.19125 / 10.214816
I would appreciate your insights on this phenomenon. Is this behavior considered normal, or could there be a configuration issue on my side? If possible, could you provide more detail about the PG19 testing script to help me better understand and adjust my evaluation configuration?
Thank you very much for your time and assistance. I look forward to your response.
Best regards,
Yiran