Does INITIAL_PROMPT have a meaning, or is it just a placeholder for prompt re-use? #75

Open
OmriKaduri opened this issue Oct 16, 2024 · 3 comments

Comments

@OmriKaduri

Thanks for your great work!

I looked into the prompt_reuse script.

It basically first feeds the "INITIAL_PROMPT" through the model:

# prompt_cache is the KV cache object (a transformers Cache, e.g. DynamicCache)
# initialized earlier in the script
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
with torch.no_grad():
    prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values

Then, to use this cache with another prompt suffix, they use:

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values,max_new_tokens=20) 

However, I am trying to understand whether the INITIAL_PROMPT tokens have any meaning beyond acting as a "position_ids" placeholder. For example, if we changed INITIAL_PROMPT to some random prompt with the same token length, would we expect different results? I assume that since the KV states for these tokens are taken from the cache, they are only placeholders.

Thanks!

sannat17 commented Oct 22, 2024

It seems like the only purpose of the INITIAL_PROMPT tokens in subsequent calls is so that the output of model.generate decodes to the correct text for those positions (not really useful if you are only interested in the newly generated tokens and never need to decode the INITIAL_PROMPT tokens again).

You can verify that for yourself by creating an example scenario where you fill the first few tokens of new_inputs with random input_ids, set do_sample=False (to make the model's generation deterministic), and compare the outputs of the regular input with the randomized input. Doing so yielded matching newly generated tokens, which supports the intuition that they are only placeholders.
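
Something along these lines (a rough sketch of that check, assuming the model, tokenizer, INITIAL_PROMPT, prompt and prompt_cache from the snippets above):

import copy
import torch

new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
prefix_len = tokenizer(INITIAL_PROMPT, return_tensors="pt").input_ids.shape[1]

# Regular run: real INITIAL_PROMPT token ids in front of the cached prefix
regular = model.generate(
    **new_inputs,
    past_key_values=copy.deepcopy(prompt_cache),
    do_sample=False,
    max_new_tokens=20,
)

# Randomized run: overwrite the prefix token ids, keep the same cached KV states
randomized_inputs = {k: v.clone() for k, v in new_inputs.items()}
randomized_inputs["input_ids"][:, :prefix_len] = torch.randint(
    0, model.config.vocab_size, (1, prefix_len), device="cuda"
)
randomized = model.generate(
    **randomized_inputs,
    past_key_values=copy.deepcopy(prompt_cache),
    do_sample=False,
    max_new_tokens=20,
)

# Compare only the continuation; the prefix tokens in the output simply echo
# whatever input_ids were passed in
input_len = new_inputs["input_ids"].shape[1]
print(torch.equal(regular[:, input_len:], randomized[:, input_len:]))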

@OmriKaduri (Author)

Thanks! It does seem to be the case. If so, I think the current API is a bit confusing. IMHO it should be changed to feeding only the new input + the cache (past_key_values).

sannat17 commented Oct 22, 2024

The current API provided by the transformers library is geared towards KV caching when generating multiple tokens from a prompt, and since the model itself isn't stateful you have to pass in all the previous tokens on subsequent calls (so that the final output decodes to the correct text you need).
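
For example (a sketch, reusing outputs and new_inputs from the snippet above):

# The returned sequence is the passed-in prompt ids followed by the new ids,
# so decoding the full output depends on the prompt ids being correct
full_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# If only the continuation matters, slice the prompt off first
new_token_ids = outputs[:, new_inputs["input_ids"].shape[1]:]
continuation = tokenizer.batch_decode(new_token_ids, skip_special_tokens=True)[0]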

There have been discussions in the past about the library adding prompt caching as a built-in feature with a cleaner API (which already led to a simpler interface for accessing and reusing the KV cache than what we had to do before), but until they add that we are stuck with the current approach.

Also, the current prompt_reuse logic has a small edge-case bug. See issue #78, which I raised recently.
