-
Notifications
You must be signed in to change notification settings - Fork 484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent prompt_eval_count
for Large Prompts in Ollama Python Library
#271
Comments
This sounds like you've exceeded the context buffer and the value is the number of tokens that were processed in the last slot window. Try adding |
thanks for pointing this out. |
Also, by default |
Context buffer is expensive in VRAM cost which grows quadratically on length. I mentioned pushing layers off to CPU above, if that happens inference speed drops dramatically, so the default value is meant to preserve performance. If the user wants a larger context, it can be extended with Flash attention can reduce the VRAM cost, but it doesn't work for all models. |
What is the issue?
Inconsistent
prompt_eval_count
for Large Prompts in Ollama Python LibraryFor larger prompts, when using the Ollama Python library with the
llama3.1:8b-instruct-fp16
model, theprompt_eval_count
remains constant at fixed value (1026) tokens, even when the input prompt size varies significantly. This behavior is observed when using theollama.chat()
method.Sample output:
Tokens: (1026, 15, 1041)
Total_prompt_length: 57788
Tokens: (1026, 20, 1046)
Total_prompt_length: 57172
Tokens: (1026, 18, 1044)
Total_prompt_length: 57744
Current Behavior
prompt_eval_count
consistently returns same value (1026), regardless of the actual prompt length.eval_count
(output tokens) varies as expected. (this might also give fixed value once larger text is generated )Expected Behavior
prompt_eval_count
should accurately reflect the number of tokens in the input prompt.OS
macOS
GPU
Apple
CPU
Apple
Ollama version
0.3.9
The text was updated successfully, but these errors were encountered: