
Model's output from recipe with prompt reuse optimization does not match the non-cached generation #78

Open
sannat17 opened this issue Oct 21, 2024 · 1 comment · May be fixed by #79

Comments

sannat17 commented Oct 21, 2024

In performance_optimization/prompt_reuse.py, the current method of storing the cached prompt does not discard the KV cache entry for the last prompt token; instead it caches the entire prompt, following the same caching recipe that model.generate itself requires.

For context, look at these comments and discussions:

In some preliminary tests, the current prompt_reuse.py recipe consistently generated outputs that differ from non-cached generation, while the method from the linked GitHub issue produced matching generations.
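For reference, below is a minimal sketch of a corrected caching pattern, where the KV entries for the last prompt token are deliberately left out of the reusable cache so that token is recomputed together with the suffix at generation time. The checkpoint, prompt, and suffix are placeholders, and this only illustrates the idea rather than the exact change proposed in #79:

```python
# Sketch only: prompt reuse that excludes the last prompt token from the cache.
# Checkpoint, prompt, and suffix are placeholders; assumes a recent transformers
# version with the DynamicCache API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

INITIAL_PROMPT = "You are a helpful assistant. Answer the following question concisely."

# Build the reusable cache from all but the last prompt token. The last token is
# the one whose identity can change once a suffix is appended (see the note below),
# so its key/values must not be frozen into the cache.
prompt_inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(model.device)
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(
        input_ids=prompt_inputs.input_ids[:, :-1],
        past_key_values=prompt_cache,
    ).past_key_values

# Reuse the cache: generate() tokenizes the full string and only computes the
# positions not already covered by the cache (the last prompt token + the suffix).
full_prompt = INITIAL_PROMPT + " What is the capital of France?"
new_inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**new_inputs, past_key_values=prompt_cache, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the same cache is meant to be reused across several generate() calls, it should also be copied first (e.g. with copy.deepcopy), since generate() extends the cache in place.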

sannat17 (Author) commented

Note:

  • The problem does not present itself if INITIAL_PROMPT ends in special tokens (e.g. when the prompt comes from a chat_template, where the last token may be a role-based token).
  • It is only a problem when the tokenizer's treatment of the last token of INITIAL_PROMPT changes once a suffix is appended, i.e. when the first few characters of the suffix merge into INITIAL_PROMPT's last token, which is not that uncommon (see the check sketched below).
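A quick way to check whether a particular prompt/suffix pair triggers this boundary effect (the prompt, suffix, and checkpoint below are arbitrary placeholders):

```python
# Diagnostic sketch: verify that the tokens of INITIAL_PROMPT remain an exact prefix
# of the tokens of INITIAL_PROMPT + suffix. If not, the cached KV entry for the last
# prompt token is stale once the suffix is appended.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder

INITIAL_PROMPT = "You are a helpful assistant. The user said"
suffix = ": hello there!"  # no clean token boundary, so the last prompt token may change

prefix_ids = tokenizer(INITIAL_PROMPT, add_special_tokens=False).input_ids
full_ids = tokenizer(INITIAL_PROMPT + suffix, add_special_tokens=False).input_ids

print("prefix tokens preserved:", full_ids[: len(prefix_ids)] == prefix_ids)
```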
