Server: openai-style lookup decoding #12127

Open
eeroel wants to merge 1 commit into master
Conversation


eeroel commented on Mar 1, 2025

This is a proposal for lookup decoding in the server example, similar to the "predicted outputs" feature in the OpenAI API. The idea is that the user includes a string in the request that is then used for lookup decoding. This gives major speedups on code-refactoring or text-editing tasks where most of the output is unchanged. See https://platform.openai.com/docs/guides/predicted-outputs#code-refactoring-example for examples.
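As a rough illustration, a request might look like the sketch below (Python with the `requests` library). The `prediction` field follows the OpenAI predicted-outputs shape; whether the server accepts exactly this field name and shape is my assumption.

```python
import requests

code = open("main.py").read()

# Sketch of a request against a local llama-server. The "prediction" field
# mirrors the OpenAI predicted-outputs shape; whether this PR accepts the
# exact same shape is an assumption.
payload = {
    "messages": [
        {
            "role": "user",
            "content": "Rename the function load() to load_config():\n\n" + code,
        }
    ],
    # most of the output should match this string, so it seeds the lookup
    "prediction": {"type": "content", "content": code},
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```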

I'm aware of the open PR #6828, and maybe that could be used here as well, but for now I implemented a custom lookup algorithm. I think the main difference is that with this OAI-style API the user controls when lookup is used, so if they choose not to use it, there is no performance impact in either direction.

I implemented the feature on top of the existing speculative decoding code, and to keep things simple, lookup is not used when speculative decoding is enabled. It reuses the n_max parameter from the speculative settings, but for best results this should be set much higher than when drafting with a model.
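If the existing per-request speculative parameters also apply to lookup (my assumption), the payload above could raise that cap accordingly:

```python
# Hypothetical: reuse the existing per-request speculative setting so lookup
# can draft far more tokens than a draft model typically would.
payload["speculative.n_max"] = 256
```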

The algorithm itself is probably not optimal, but it gives good results anyway (a Python sketch follows the list):

  1. Find the first token in the "prediction" that matches the last decoded token.
  2. Use the tokens that follow it as the draft. The draft window has an adaptive size: it starts at 1 token and grows every time a full draft is accepted; when only part of the draft, or none of it, is accepted, it resets to 1.
  3. After checking the draft, remove the already-used tokens from the prediction.
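Here is a minimal, self-contained sketch of those three steps, with toy integer "tokens" and a stand-in verification step. It illustrates the description above rather than the PR's actual implementation; the window here starts at 16 and resets to 16, matching the values in the next paragraph rather than the 1-token start in step 2.

```python
def lookup_draft(prediction, last_token, window):
    """One lookup step as described above (a sketch, not the PR's code)."""
    try:
        i = prediction.index(last_token)           # 1. first matching token
    except ValueError:
        return [], prediction                      # no match: empty draft
    draft = prediction[i + 1 : i + 1 + window]     # 2. following tokens as draft
    return draft, prediction[i + 1 + len(draft):]  # 3. drop already-used tokens

def verify(draft):
    # Stand-in for the batched decode pass that accepts some prefix of the
    # draft; this toy model happens to accept everything.
    return len(draft)

prediction = list(range(100))  # stand-in for the tokenized prediction string
last_token, window = 0, 16
while prediction:
    draft, prediction = lookup_draft(prediction, last_token, window)
    if not draft:
        break
    n_accepted = verify(draft)
    # double the window after a fully accepted draft, otherwise reset it
    window = window * 2 if n_accepted == len(draft) else 16
    # the decode pass also samples the token after the accepted ones; in this
    # toy run the model keeps following the prediction
    last_token = prediction[0] if prediction else -1
    print(f"accepted {n_accepted} drafted tokens, next window = {window}")
```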

For the adaptive window size I simply double it every time a full draft is accepted, starting from a window size of 16. I'm not sure what actually governs performance here, but I get pretty nice speedups with these values. If the lookup example already contains a cleverer algorithm, maybe it could be substituted here.

One limitation to OpenAI compatibility: the rejected-token count is not included in the response. (I'm not sure what that count would mean exactly here, but I figure the accepted-token count is the more important information anyway.)

eeroel requested a review from ngxson as a code owner Mar 1, 2025 09:25
github-actions bot added the examples, python python script changes, and server labels Mar 1, 2025
eeroel changed the title from "feat: openai-style lookup decoding for server" to "Server: openai-style lookup decoding" Mar 1, 2025