In performance_optimization/prompt_reuse.py, the current method of storing the cached prompt does not correctly discard the KV cache for the last token (it instead follows the same caching recipe as is required for model.generate()).

For context, look at these comments and discussions:
generate()
transformers#24841 (comment)

After running some preliminary tests, the current prompt_reuse.py recipe consistently generates different outputs than the non-cached generation, while the method from the linked GitHub issue produces consistent generations.
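For reference, here is a minimal sketch of the approach from transformers#24841, not the recipe's current code: the prefix cache is cropped by one token before it is reused, so the boundary token is recomputed together with the suffix. The checkpoint name and prompts are placeholders, and it assumes a transformers release where DynamicCache.crop is available.

```python
# Minimal sketch: reuse a cached prompt but drop the KV entries of its last
# token, so that token is recomputed together with whatever suffix follows.
# Checkpoint and prompt strings are placeholders.
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

ckpt = "meta-llama/Llama-3.2-1B"  # placeholder checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto").to(device)

INITIAL_PROMPT = "You are a helpful assistant. Answer the question below concisely."

# One forward pass over the shared prompt fills the cache.
prompt_inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(device)
with torch.no_grad():
    prompt_cache = model(
        **prompt_inputs, past_key_values=DynamicCache()
    ).past_key_values

# Discard the cache entry for the last prompt token: once a suffix is appended,
# that position may tokenize differently, so it must not be served from cache.
prompt_cache.crop(prompt_inputs["input_ids"].shape[1] - 1)

# Reuse the cropped cache for prompt + suffix. deepcopy keeps the stored cache
# pristine so it can be reused again with other suffixes.
full_prompt = INITIAL_PROMPT + " What is the capital of France?"
new_inputs = tokenizer(full_prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **new_inputs,
    past_key_values=copy.deepcopy(prompt_cache),
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With this change, only the tokens before the boundary are served from the cache, so a boundary token whose tokenization changes after the suffix is appended gets recomputed instead of being silently reused.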
The problem does not present itself if your INITIAL_PROMPT ends in special tokens (e.g. when generating from a chat_template, where the last token may be a role-based token). It only occurs when the tokenizer's handling of the last token of INITIAL_PROMPT changes once a suffix is appended, i.e. when the first few characters of the suffix end up merging into the last token of INITIAL_PROMPT, which is not that uncommon.
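To make the boundary condition concrete, here is a small illustration (the tokenizer and strings are my own, not from the recipe) of how the last token of INITIAL_PROMPT can change once a suffix is appended:

```python
# Illustration of the boundary effect with a BPE tokenizer (GPT-2 here; the
# tokenizer and strings are illustrative, not taken from the recipe).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

INITIAL_PROMPT = "The quick brown fox jumped over the lazy d"
SUFFIX = "og and kept running."

prefix_ids = tokenizer(INITIAL_PROMPT)["input_ids"]
full_ids = tokenizer(INITIAL_PROMPT + SUFFIX)["input_ids"]

# Count how many leading tokens the two encodings actually share.
n_shared = 0
for a, b in zip(prefix_ids, full_ids):
    if a != b:
        break
    n_shared += 1

# With GPT-2's BPE, the last prefix token " d" merges with the start of the
# suffix into " dog", so the shared prefix is one token shorter than prefix_ids
# and the cached KV entry for that last token is stale.
print(f"prefix tokens: {len(prefix_ids)}, shared with full encoding: {n_shared}")
print(
    "boundary token:",
    tokenizer.convert_ids_to_tokens(prefix_ids[-1]),
    "->",
    tokenizer.convert_ids_to_tokens(full_ids[n_shared]),
)
```

This is exactly the case where reusing the full prefix cache (including its last token) diverges from non-cached generation, and why dropping that last entry before reuse restores matching outputs.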