Server: openai-style lookup decoding #12127
This is a proposal for lookup decoding in the server example, similar to the "predicted outputs" feature in the OpenAI API. The idea is that the user includes a string in the request that is used for lookup decoding. This gives major speedups on code refactoring or text editing tasks, where most of the output matches the provided prediction. See https://platform.openai.com/docs/guides/predicted-outputs#code-refactoring-example for examples.
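For illustration, a request could look roughly like the following. This is only a sketch: the exact request schema of this PR isn't reproduced here, and the `prediction` object is assumed to mirror the OpenAI format from the linked docs.

```json
{
  "messages": [
    { "role": "user", "content": "Rename the variable `count` to `total` in the file below and return the full file.\n\n<file contents>" }
  ],
  "prediction": {
    "type": "content",
    "content": "<the original file contents, most of which will be reused verbatim>"
  }
}
```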
I'm aware of the open PR #6828, and maybe that could be used here as well, but for now I implemented a custom lookup algorithm for this. I think the main difference is that in this OAI-style API the user is in control of when to use lookup, so if they choose not to, there is no performance impact in either direction.
I implemented the feature based on the existing speculative decoding code, and to keep it simple I made it so that lookup is not used if speculative decoding is enabled. It reuses the `n_max` parameter from speculative decoding, but for best results this should be set much higher than when using a draft model. The algorithm itself is probably not optimal, but gives good results anyway.
For the adaptive window size I simply double the window every time, starting from a window size of 16. I'm not sure what actually governs the performance here, but I get pretty nice speedups with these values. If there's a cleverer algorithm already in the lookup example, maybe that could be substituted here. A rough sketch of the idea follows.
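This is not the PR's actual code, just a minimal sketch of a greedy lookup with a doubling window under a few assumptions: the helper name, the `n_match` length, and the exact doubling policy are all hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using llama_tokens = std::vector<int32_t>;

// Hypothetical sketch: propose draft tokens by matching the tail of the
// generated context against the user-supplied prediction tokens.
// 'window' starts at 16 and is doubled on every call, bounded by n_max.
static llama_tokens lookup_draft(const llama_tokens & context,
                                 const llama_tokens & prediction,
                                 size_t & window,
                                 size_t n_max,
                                 size_t n_match = 8) {
    llama_tokens draft;

    if (context.size() < n_match || prediction.size() < n_match) {
        return draft;
    }

    // search for the last n_match context tokens inside the prediction
    for (size_t i = 0; i + n_match <= prediction.size(); ++i) {
        bool match = true;
        for (size_t j = 0; j < n_match; ++j) {
            if (prediction[i + j] != context[context.size() - n_match + j]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }

        // copy up to 'window' tokens following the match as the draft
        const size_t start  = i + n_match;
        const size_t n_copy = std::min(window, prediction.size() - start);
        draft.assign(prediction.begin() + start, prediction.begin() + start + n_copy);
        break;
    }

    // adaptive window: double it each time, capped by n_max
    window = std::min(window * 2, n_max);

    return draft;
}
```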
One limitation to OpenAI compatibility: the rejected token count is not included in the response (I'm not sure what this would mean exactly, but I figure the accepted token count is the more important info anyway).