[router] Use string instead of token ids #673
Introducing tokenization also brings some complexity in tokenizer management (if we want every model to use its own tokenizer). We need to weigh the benefits and, at the very least, make this part configurable for now.
Cross-posting my comments here for future consideration: We can make it pluggable. It should be a system-wide variable that shouldn't be changed at runtime; otherwise it will mess up all the cache. I think TokenizeInputText shouldn't live in each routing algorithm implementation. Currently it is done in each Route function; it can be decoupled and done on the common execution path (somewhere in gateway.go) before Route. Tokenization itself has two minor issues: overhead (not sure before testing)
My previous proposal to move the tokenizer to the gateway was wrong: I missed the fact that tokenization is only used in prefix-aware routing. I measured the tokenizer overhead and it is non-negligible, actually much higher than the microseconds of latency mentioned above: it is 50ms - 100ms (the exact overhead will depend on the tokenizer library, but it will still be there). I don't see a fundamental reason to use the tokenizer, and it makes sense to avoid the processing overhead by not tokenizing at all.
To be honest, I'm not entirely clear on the benefits of the token-ID-based approach. Could someone shed some light on its advantages or explain why it's used? Personally, I think we could eliminate it and transition to a string-based approach instead (no need to keep two implementations).
@gaocegege Originally, I thought the token-based solution would be more aligned with the paged tokens in vLLM, and chunk-by-chunk alignment would be tidier than mixing two different approaches (tokens vs. characters/strings). We use a similar token-based approach inside the engine for the prefix cache, but things are different on the gateway side: the overhead is too large. In this case, we should fully get rid of it.
Makes sense. Thanks for the explanation.
Ref #641 (comment)
Currently, we use token IDs to support prefix cache-aware routing, which requires encoding first. This introduces several microseconds of latency to the requests, and the benefits aren't substantial.
In Q&A scenarios, the input string could be equivalent to token IDs. For RAG scenarios, we might benefit from something like CacheBlend for better performance rather than solely relying on token IDs.
Therefore, I propose using strings in the router.