[RFC] prefix-cache-aware routing #59

Open
KuntaiDu opened this issue Feb 4, 2025 · 32 comments

@KuntaiDu
Collaborator

KuntaiDu commented Feb 4, 2025

We are planning to add prefix-cache-aware routing support, as mentioned in #26. Here is an initial version of the design. It focuses on building the fundamental APIs for prefix-cache-aware routing, without requiring large API changes to vLLM.

Design choices and APIs:

  • Use string matching instead of token-ID-based matching. Token-ID-based matching is accurate, but it is difficult to reproduce exactly the same sequence of token IDs because of various issues (e.g. chat templates), so we use string matching instead. Tokenization is also slow (it takes several microseconds), so we don't want to run it for every request on the router side.
  • Build a string server to keep track of the prefix cache information inside each vLLM pod. Two APIs:
    • query(request: str, server_ids: List[Int], t: Timestamp): query which server_id in the list of server_ids this request should be forwarded to.
    • notify(request: str, server_id: int, t: Timestamp): notify the string server which server is now executing the request, so that it can update its internal state.

An initial implementation:

  • We will implement the string server using a SQL database.
  • When notify(request: str, server_id: int, t: Timestamp) is called, we will chop the request string into fixed-size chunks c0, c1, ..., cn. For each chunk we create a content hash via content_hash(ci) = hash(c0 + c1 + ... + ci), store these hashes in the string server, and evict the least-recently-used chunks for the corresponding server id so that the total number of chunks stored for that server stays below a pre-defined constant C (see the sketch after this list).
  • During query(request: str, server_ids: List[Int], t: Timestamp), we will check each server in the server list and see which one matches the maximum number of chunks.
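A minimal Python sketch of what this could look like (hedged: the names StringServer, CHUNK_SIZE and MAX_CHUNKS are hypothetical, and a plain in-memory dict stands in for the real database backend):

import hashlib
from collections import OrderedDict
from typing import Dict, List

CHUNK_SIZE = 128   # hypothetical fixed chunk size, in characters
MAX_CHUNKS = 4096  # the pre-defined constant C (max chunks tracked per server)

class StringServer:
    """In-memory sketch; a real deployment could back this with Redis or a SQL database."""

    def __init__(self):
        # server_id -> LRU-ordered mapping of content hash -> last-used timestamp
        self.chunks: Dict[int, OrderedDict] = {}

    @staticmethod
    def _content_hashes(request: str) -> List[str]:
        # content_hash(ci) = hash(c0 + c1 + ... + ci), computed with one running hash
        hashes, running = [], hashlib.sha256()
        for i in range(0, len(request), CHUNK_SIZE):
            running.update(request[i:i + CHUNK_SIZE].encode())
            hashes.append(running.hexdigest())
        return hashes

    def notify(self, request: str, server_id: int, t: float) -> None:
        store = self.chunks.setdefault(server_id, OrderedDict())
        for h in self._content_hashes(request):
            store[h] = t
            store.move_to_end(h)          # mark as most recently used
        while len(store) > MAX_CHUNKS:    # evict least-recently-used chunks
            store.popitem(last=False)

    def query(self, request: str, server_ids: List[int], t: float) -> int:
        hashes = self._content_hashes(request)

        def matched_prefix_chunks(sid: int) -> int:
            store = self.chunks.get(sid, {})
            n = 0
            for h in hashes:
                if h not in store:
                    break
                n += 1
            return n

        # Forward the request to the server matching the most prefix chunks.
        return max(server_ids, key=matched_prefix_chunks)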

Feel free to leave your feedback and comments!

@gaocegege
Collaborator

Thanks for the RFC!

Token-ID-based matching is accurate, but it is difficult to reproduce exactly the same sequence of token IDs because of various issues (e.g. chat templates)

Could you elaborate on how chat templates contribute to the differences in token IDs?

Build a string server to keep track of the prefix cache information inside each vLLM pod

I'm curious why this server operates at the pod level when the APIs it provides seem to function at the cluster level.

We will implement the string server using SQL.

What does SQL mean here? A SQL database like Postgres?

@KuntaiDu
Collaborator Author

KuntaiDu commented Feb 5, 2025

Could you elaborate on how chat templates contribute to the differences in token IDs?

Basically, the chat template converts a JSON message list like

[
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "How are you?"} 
] 

to the plain text input to the model

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a friendly chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Chat templates are typically different between models and can differ between versions of the same model.
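For concreteness, a hedged example of applying a chat template with Hugging Face transformers (assuming a Llama-style instruct model; the exact special tokens depend on the model's own template):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "How are you?"},
]

# Render the chat template to plain text without tokenizing.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # a <|begin_of_text|>... string similar to the one shown above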

That said, the key reason not to do tokenization at the router side is that tokenization itself is pretty slow (it takes several microseconds), so running it for every request creates huge overhead.

I'm curious why this server operates at the pod level when the APIs it provides seem to function at the cluster level.

The string server will be an independent pod in the cluster.

What does SQL mean here, the SQL database like postgres?

Exactly.

@AlexXi19

AlexXi19 commented Feb 5, 2025

  1. On the pod affinity selection algorithm, why use cumulative hash chunking instead of a trie, even though there are a few tradeoffs in compute / storage? On the same note, if you were to use fixed chunking, it might be more efficient to just use hash(c(i-1) + ci) instead of hash(c0 + c1 + ... + ci)?
  2. For this first version of the routing algorithm, what's preventing it from sending similarly prefixed requests repeatedly to the same pod 😆? For this first version to be usable in production, I think you either have to make the affinity algorithm a weight boost on top of the random / RR load balancer (instead of a strict selection), or read the underlying metrics to reduce the routing weight / remove servers from selection.
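A toy Python sketch of the "weight boost on top of the random / RR load balancer" idea from point 2 (all names and constants here are made up for illustration, not part of the proposal):

import random

def pick_server(servers, prefix_match_len, in_flight, boost=0.01, load_penalty=0.1):
    """Score = random base + prefix-affinity boost - load penalty.

    servers: list of server ids; prefix_match_len / in_flight: dicts keyed by server id.
    The random base keeps cache-cold servers in play instead of strictly selecting
    the best prefix match.
    """
    def score(s):
        return (random.random()
                + boost * prefix_match_len.get(s, 0)
                - load_penalty * in_flight.get(s, 0))
    return max(servers, key=score)

# Example: server "a" has a long cached prefix but is busy; "b" is idle but cache-cold.
print(pick_server(["a", "b"], {"a": 2000, "b": 0}, {"a": 30, "b": 2}))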

@simon-mo
Contributor

simon-mo commented Feb 5, 2025

One item should be minimal latency overhead, which can rule out a remote SQL database based solution

@KuntaiDu
Collaborator Author

KuntaiDu commented Feb 6, 2025

  1. On the pod affinity selection algorithm, why use cumulative hash chunking instead of a trie, even though there are a few tradeoffs in compute / storage? On the same note, if you were to use fixed chunking, it might be more efficient to just use hash(c(i-1) + ci) instead of hash(c0 + c1 + ... + ci)?
  2. For this first version of the routing algorithm, what's preventing it from sending similarly prefixed requests repeatedly to the same pod 😆? For this first version to be usable in production, I think you either have to make the affinity algorithm a weight boost on top of the random / RR load balancer (instead of a strict selection), or read the underlying metrics to reduce the routing weight / remove servers from selection.
  1. Since the current router is based on Python, I am assuming that the pod affinity selection algorithm is also implemented in Python, and in Python a hash-based solution is likely faster than a normal trie. Also, vLLM internally uses a hash-based method for prefix caching, so with hashing we can better approximate vLLM's actual prefix cache behavior. As for the computation cost, we can compute the hash incrementally as hash(c0 + c1 + ... + ci) := hash(hash(c0 + ... + c(i-1)), ci) (a small sketch follows below this list).
  2. The router can prevent sending similarly prefixed requests repeatedly to the same pod by excluding that pod from the server_ids list when performing the query.
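A small sketch of the incremental form, assuming Python's hashlib (feeding the previous digest plus the new chunk into a fresh hash plays the role of hash(prev, ci)):

import hashlib

def chunk_hashes(chunks):
    """content_hash(ci) = hash(content_hash(c(i-1)) + ci), so each step only rehashes one chunk."""
    prev, out = b"", []
    for c in chunks:
        prev = hashlib.sha256(prev + c.encode()).digest()
        out.append(prev.hex())
    return out

# Two requests sharing the first chunk share the first content hash.
print(chunk_hashes(["You are a friendly chatbot.", "How are you?"]))
print(chunk_hashes(["You are a friendly chatbot.", "Tell me a joke."]))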

@KuntaiDu
Collaborator Author

KuntaiDu commented Feb 6, 2025

One item should be minimal latency overhead, which can rule out a remote SQL database based solution

Oh yeah, you're right. In that case I should use some in-memory database (e.g. Redis).

@gaocegege
Collaborator

Oh yeah, you're right. In that case I should use some in-memory database (e.g. Redis).

Agree. We don't need a SQL database with transaction support. Also, what if we made the storage more abstract? That way, users could plug in their solutions if they wanted. Just a suggestion.

@gaocegege
Collaborator

The string server will be an independent pod in the cluster.

If I understand correctly, would it be stateless, so that it could scale to multiple instances? Since we’re storing the state in Redis, we could have a single deployment with, say, 5 pods to help reduce the load.

@KuntaiDu
Collaborator Author

KuntaiDu commented Feb 9, 2025

Oh yeah, you're right. In that case I should use some in-memory database (e.g. Redis).

Agree. We don't need a SQL database with transaction support. Also, what if we made the storage more abstract? That way, users could plug in their solutions if they wanted. Just a suggestion.

Totally agree. The main purpose of this implementation is to build a set of interfaces, so that people can flexibly replace different components with their own code.

The string server will be an independent pod in the cluster.

If I understand correctly, would it be stateless, so that it could scale to multiple instances? Since we’re storing the state in Redis, we could have a single deployment with, say, 5 pods to help reduce the load.

Yes. But as @simon-mo suggested, communication latency might be a concern, so I need to benchmark and see whether it is OK to put the string server in a separate pod or whether we should co-locate it with the router.

@gaocegege
Collaborator

Could we combine the router and string server into a single unit, given that we maintain the cache in a KV store like Redis? The relationship between the router, string server, and Redis (M:N:K) seems complex, and I'm not seeing the advantages of this setup.

@KuntaiDu
Collaborator Author

KuntaiDu commented Feb 9, 2025

Could we combine the router and string server into a single unit, given that we maintain the cache in a KV store like Redis? The relationship between the router, string server, and Redis (M:N:K) seems complex, and I'm not seeing the advantages of this setup.

Redis is the storage backend of the string server and they will be co-located in the same pod, so the setup is more like (M:N). I am mainly worried about router scalability if we put the router and the string server (with its Redis backend) into a single pod: we can scale the router by simply having multiple router replicas, but if we have multiple string server replicas we need to keep their storage backends in sync (which is tricky).

@wizenheimer

I'd like to propose an alternative approach based on consistent hashing with bounded load (CHWBL) that could provide efficient prefix-aware routing. The core idea is to augment the consistent hash ring with cache-awareness while preserving O(1) routing decisions and bounded load guarantees.

While string matching provides precise cache hit detection, the hash ring approach offers better load distribution guarantees, much lower routing latency, and doesn't need separate routing infrastructure. The tradeoff, however, is slightly less precise cache affinity, but in my opinion this is outweighed by the gains mentioned above.


Data Structures

  • Hash ring maintains virtual nodes for each server (typically 100-200 replicas)
  • Each server tracks: in-flight requests, load thresholds
  • Ring state: hash → server mapping + sorted hash array for efficient lookup

Request Flow

  1. Hash the request prefix to get its ring position.
  2. A binary search in the sorted hash array finds the next server.
  3. For the chosen server, verify: current load < (avg_load × load_factor).
  4. If it is overloaded, walk the ring clockwise to the next non-overloaded server.
  5. Track the request and provide a cleanup callback.

The key insight is that similar prefixes naturally map to nearby ring positions, providing cache locality, while the bounded load ensures no server gets overwhelmed (a minimal sketch follows after the list below).

This gives us:

  • O(1) routing with no network calls
  • Automatic load balancing
  • Natural failure handling (only affected requests move)
  • No central coordination needed
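A minimal Python sketch of this flow (hedged: VIRTUAL_NODES, LOAD_FACTOR and the helper names are hypothetical, and bisect over a sorted list stands in for the sorted hash array described above):

import bisect
import hashlib
from collections import defaultdict

VIRTUAL_NODES = 150   # virtual nodes per server
LOAD_FACTOR = 1.25    # bounded-load threshold relative to the average load

def _h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class BoundedLoadRing:
    def __init__(self, servers):
        self.servers = list(servers)
        self.ring = sorted((_h(f"{s}#{i}"), s) for s in servers for i in range(VIRTUAL_NODES))
        self.keys = [k for k, _ in self.ring]
        self.in_flight = defaultdict(int)   # server -> current in-flight requests

    def route(self, prefix: str) -> str:
        avg = max(1.0, sum(self.in_flight.values()) / len(self.servers))
        idx = bisect.bisect_left(self.keys, _h(prefix)) % len(self.ring)
        # Walk the ring clockwise until a server is under the load bound.
        for step in range(len(self.ring)):
            server = self.ring[(idx + step) % len(self.ring)][1]
            if self.in_flight[server] < avg * LOAD_FACTOR:
                self.in_flight[server] += 1
                return server
        return self.ring[idx][1]            # fallback if every server is "overloaded"

    def done(self, server: str) -> None:
        self.in_flight[server] -= 1         # cleanup callback when the request finishes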

@liu-cong

I like the idea of consistent hashing with load awareness. One of its benefits is that it doesn't require a potentially large memory space to store the prefixes.

The key insight is that similar prefixes naturally map to nearby ring positions

What hashing algorithm do you propose to get this behavior? @wizenheimer

@wizenheimer

What hashing algorithm do you propose to get this behavior? @wizenheimer

I was considering picking a hash algo that has low avalanche effect and at the same time has good enough distribution to avoid hotspots.

We could pick SimHash if we want a hash that is more locality-aware, or go with a fixed-window xxHash to trim down the hash's avalanche effect.
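A tiny sketch of the locality intuition, assuming the 1e0ng/simhash Python package that comes up later in this thread (smaller Hamming distance between fingerprints means nearby ring positions):

from simhash import Simhash

a = Simhash("You are a friendly chatbot. How are you?")
b = Simhash("You are a friendly chatbot. Can you help me?")
c = Simhash("Summarize the following contract in three bullet points.")

print(a.distance(b))  # shared prefix -> typically a small Hamming distance
print(a.distance(c))  # unrelated text -> typically a larger distance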

Curious, are there any other algos that could be a much better fit?

@gaocegege
Collaborator

gaocegege commented Feb 14, 2025

Thanks for the proposal, I like the key insights here.

sorted hash array for efficient lookup

Could you please provide more details about the hash array? Specifically, what will it store: a sorted array of servers arranged by their load, or an array of virtual nodes sorted by hash value?

I think this design makes the router scalable as well. Where do you plan to store the hash ring? In a centralized component like etcd or Consul, or by using a P2P gossip-based protocol (e.g. https://github.com/hashicorp/memberlist) to synchronize among all routers?

@gaocegege
Collaborator

or go with a fixed-window xxHash to trim down the hash's avalanche effect.

LSH is designed to support prefix-based matching by nature. Could you explain how xxHash would handle such a scenario, for example, with inputs like abcd and abcdefg?

@wizenheimer

LSH is designed to support prefix-based matching by nature.

Indeed, LSH family of hashes (like simhash shared earlier) would be the ideal fit for this.

Could you explain how xxHash would handle such a scenario, for example, with inputs like abcd and abcdefg?

// Problem with regular xxHash:
input1 := "abcd"
input2 := "abcdefg"
// Even though they share prefix "abcd", their hashes are completely different
hash1 := xxhash.Sum64([]byte(input1))  // something like 14872408901234
hash2 := xxhash.Sum64([]byte(input2))  // totally different: 98123074123123

So we go with a fixed window, dampening the avalanches from subsequent bits.

// Break the input into fixed windows (windowSize = 4):
//   "abcdefg" -> ["abcd", "efg"]

// Process each window:
FirstWindow := "abcd"
hash1 := xxhash.Sum64([]byte(FirstWindow))
// Give more weight to the first window:
weightedHash1 := hash1 << 32 // shift left by 32 bits

SecondWindow := "efg"
hash2 := xxhash.Sum64([]byte(SecondWindow))
// Reduce the weight of later windows:
weightedHash2 := hash2 >> 8 // shift right based on window position

// Combine the per-window hashes:
finalHash := weightedHash1 ^ weightedHash2

So xxhash based approach might only make sense when we need optimality in terms of both prefix matching AND uniform distribution. But I agree with the take, if prefix matching is our primary goal, using an LSH family algorithm directly would be more appropriate than trying to modify xxHash to do something it wasn't designed for :D

@wizenheimer

Where do you plan to store the hash ring? In a centralized component like etcd or Consul, or by using a P2P gossip-based protocol (e.g. https://github.com/hashicorp/memberlist) to synchronize among all routers?

Not super sure if we need distributed consensus for routing decisions. I believe each router can independently make good decisions based on its local view of the pods, given that the state is ephemeral - just for intermittent routing decisions. If a router restarts, it rebuilds its state from the pod list.

Again, my take might be less informed, but I guess in-memory should be alright.

@gaocegege
Collaborator

gaocegege commented Feb 14, 2025

So an xxHash-based approach might only make sense when we need optimality in terms of both prefix matching AND uniform distribution. But I agree with the take: if prefix matching is our primary goal, using an LSH-family algorithm directly would be more appropriate than trying to modify xxHash to do something it wasn't designed for :D

Makes sense. I think xxHash may be helpful in our session router rather than the prefix-cache-aware router: https://github.com/vllm-project/production-stack/blob/main/src/vllm_router/routing_logic.py#L85

Right now, the session router is built on a hash ring, but I haven't looked into which hash algorithm it's using.

@gaocegege
Collaborator

Not super sure if we need distributed consensus for routing decisions. I believe each router can independently make good decisions based on its local view of the pods, given that the state is ephemeral - just for intermittent routing decisions. If a router restarts, it rebuilds its state from the pod list.

Yes, it makes sense. We can rely on the K8s control plane for that. It should work.

@wizenheimer

Perfect. If we are in agreement, I would be happy to take a shot at implementing it next.

Curious what the test scenarios should look like. I was planning to have these two; are there any more we'd like to add?

  1. Test consistent endpoint selection for same/similarly prefixed keys
  2. Test bounded-load behavior

@gaocegege
Collaborator

Let's wait for @KuntaiDu to share his thoughts. Also, cc @ApostaC.

@KuntaiDu
Collaborator Author

KuntaiDu commented Feb 14, 2025

@wizenheimer since a wide range of chat applications have a long system prompt as the prefix of all requests, when the user's chat history is relatively short, I guess all requests will have a very similar "position" in the hash space, and this design will overload the servers that are close to that "position" in the hashing space and leave the servers that are far away from that "position" under-utilized. Is my understanding correct?

@KuntaiDu
Collaborator Author

But I definitely see that @wizenheimer 's design works really well in other types of workloads where the prefixes of requests are really diverse (e.g. long-document QA).

@gaocegege
Collaborator

I ran a quick simulation for the chat use case: https://gist.github.com/gaocegege/11cb5a0acf370ea8ca72a05eb69da0f8

It uses a relatively long prefix of about 1900 characters, along with some varying suffixes. This was simulated across 4 servers using SimHash from this repository: https://github.com/1e0ng/simhash.

Here are the simulation results (... represents the shared prefix). It appears that SimHash doesn't perform well in this use case. I'm not sure whether there are LSH variants that support this long-prefix case.

Request: ...<|User|>Hello! How are you?<|end▁of▁sentence|>
Server: Server2
---
Request: ...<|User|>Hello! How are you? What's new?<|end▁of▁sentence|>
Server: Server2
---
Request: ...<|User|>Can you help me?<|end▁of▁sentence|>
Server: Server2
---
Request: ...<|User|>Can you help me with Python programming?<|end▁of▁sentence|>
Server: Server2
---
Request: ...<|User|>How do I implement a binary search tree?<|end▁of▁sentence|>
Server: Server2
---
Request: ...<|User|>How do I implement a binary search tree? Can you help?<|end▁of▁sentence|>
Server: Server2
---
Request: ...<|User|>What's the time complexity?<|end▁of▁sentence|>
Server: Server2
---
Request: ...<|User|>What's the time complexity of these algorithms?<|end▁of▁sentence|>
Server: Server3
---
Request: ...<|User|>Can you compare bubble sort and insertion sort?<|end▁of▁sentence|>
Server: Server3
---
Request: ...<|User|>Can you compare bubble sort and insertion sort? Which is better?<|end▁of▁sentence|>
Server: Server3
---
Request: ...<|User|>Hello! How are you?<|end▁of▁sentence|><|Assistant|>I'm doing well, thank you! How can I assist you today?<|end▁of▁sentence|>
Server: Server3
---
Request: ...<|User|>Hello! How are you?<|end▁of▁sentence|><|Assistant|>I'm doing well, thank you! How can I assist you today?<|end▁of▁sentence|><|User|>Can you help me?<|end▁of▁sentence|>
Server: Server1
---
Server Load Distribution:
Server1: 1 requests
Server2: 7 requests
Server3: 4 requests
Server4: 0 requests

@wizenheimer

Thanks for the note @KuntaiDu @gaocegege, appreciate it!

when the user's chat history is relatively short, I guess all requests will have a very similar "position" in the hash space, and this design will overload the servers that are close to that "position" in the hashing space and leave the servers that are far away from that "position" under-utilized. Is my understanding correct?

Absolutely. If we solely use SimHash-based routing this is definitely possible: the initial prefix would dominate the SimHash fingerprint, leading to hotspots in the hash space.
To counter this we have a secondary bounded-load criterion, which prevents request build-up and makes sure requests are sufficiently distributed. With this we'd be able to have a "cluster with overflow" pattern.

SimHash(prefix) -> Initial Anchor Point
Bounded Load   -> Neighborhood Spread

Intended behavior

  1. Similar prefixes want to cluster (SimHash)
  2. But can't all go to the same server (Bounded Load)
  3. Result: They spread across a "neighborhood" of nearby servers
  4. This maintains APC benefits while preventing overload

Here are the simulation results (... represents the shared prefix). It appears that SimHash doesn't perform well in this use case. I'm not sure whether there are LSH variants that support this long-prefix case.

I might need to follow up on the LSH bit, but if we want to exploit APC, these clustered requests would be more efficient than if they were evenly distributed.
Please let me know if my understanding is correct, but the "clustering" that seems problematic might actually help us, since processing the shared prefix only needs to be done once per cluster.

@wizenheimer

With this we'd be able to have a "cluster with overflow" pattern

Created a Colab Notebook with a sample implementation, closely resembling our LLD.

https://colab.research.google.com/drive/1IfWccwyJQySWzIADv5HS9U8YB8JGjpom?usp=sharing

A standout feature of prefix-aware routing is that breaking every traditional load-balancing rule leads to better performance.

  1. We intentionally cluster similar requests using LSH. Unlike traditional load balancing, where load distribution is paramount, here KV cache reuse is the objective.
  2. We create controlled hot spots based on prefix similarity (the "cluster with overflow" pattern).
  3. Routing is heavily influenced by previous similar requests, unlike traditional load balancing where each request is routed independently.

Clustering Phase:

# First try the LSH-selected node
target_url = self.hash_ring.get_node(str(hash_value))

# If not overloaded, stick with clustering
if url_to_rif[target_url] <= rif_threshold:
    return target_url

Overflow Phase

# If primary node is overloaded:
for _, url in self.hash_ring.iterate_nodes(str(hash_value), distinct=True):
    if url_to_rif[url] <= rif_threshold:  # the candidate's current requests-in-flight
        return url  # Found a nearby non-overloaded node

When a server gets too busy (RIF > threshold):

  1. Start walking the hash ring from that position
  2. Find the next server that isn't overloaded
  3. Because we're walking the ring from the hash position, overflow maintains prefix locality:
     • Nearby servers = similar hash values
     • Similar hash values = similar prefixes

Visualized it here: https://claude.site/artifacts/3cfbbf91-d555-4c7e-bb6f-89a23d78bddc

cc: @gaocegege @KuntaiDu

@gaocegege
Collaborator

gaocegege commented Feb 15, 2025

@wizenheimer Could you also share some sampling results of the routing? I’d like to understand if we can ensure that requests with the same prefix are routed to the same server as consistently as possible. In my simulation, I noticed that some requests don’t adhere to this behavior:

Request: ...<|User|>Hello! How are you?<|end▁of▁sentence|><|Assistant|>I'm doing well, thank you! How can I assist you today?<|end▁of▁sentence|>
Server: Server3
---
Request: ...<|User|>Hello! How are you?<|end▁of▁sentence|><|Assistant|>I'm doing well, thank you! How can I assist you today?<|end▁of▁sentence|><|User|>Can you help me?<|end▁of▁sentence|>
Server: Server1
---

Perhaps we could calculate the ratio:
(number of same-prefix requests routed to different servers) / (total number of same-prefix requests)
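A minimal sketch of computing that ratio, assuming we log one (prefix, server) pair per routed request (the helper name is made up):

def prefix_split_ratio(routed):
    """routed: list of (prefix, server) pairs, one per request.

    Counts same-prefix requests (repeats of a prefix seen before) and how many of
    them landed on a different server than the first request with that prefix.
    """
    first_server = {}
    same_prefix = mismatched = 0
    for prefix, server in routed:
        if prefix not in first_server:
            first_server[prefix] = server
        else:
            same_prefix += 1
            if server != first_server[prefix]:
                mismatched += 1
    return mismatched / same_prefix if same_prefix else 0.0

# e.g. three requests with the same prefix, one routed elsewhere -> 0.5
print(prefix_split_ratio([("p", "Server2"), ("p", "Server2"), ("p", "Server3")]))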

@wizenheimer

Perhaps we could calculate the ratio:
(number of same-prefix requests routed to different servers) / (total number of same-prefix requests)

This seems like a great metric to have; it would help us quantify collocation.

Created a notebook using the chat data from here, focussed on this statistic:

https://colab.research.google.com/drive/1zck6yIq-ZkUmazyKIXIucVQiMFG39029

I'm observing mixed results here: in some cases there's really strong collocation, while in others I see a spread. Truncation length is another variable in the mix. Not sure if this stems from SimHash's feature explosion.

Thanks for the nudge @gaocegege. Appreciate it!

Is there anything we could improve on? Other approaches we could try and benchmark against?

@KuntaiDu
Collaborator Author

KuntaiDu commented Feb 15, 2025

(number of same-prefix requests routed to different servers) / (total number of same-prefix requests)

Another potential metric is the average time-to-first-token (TTFT), which can be roughly estimated by ttft(request) = c * len(the suffix of the request that does not hit the prefix cache), where c is a constant that does not vary between requests; we then average ttft(request) across requests.

(Background: TTFT is roughly proportional to the string length of the part of the incoming request that does not hit the prefix cache; see the measurement here.)
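A hedged sketch of computing that estimate, assuming we know the cached-prefix length for each request (c is left at 1.0 since only relative comparisons matter):

def estimated_ttft(request: str, cached_prefix_len: int, c: float = 1.0) -> float:
    # TTFT ~ c * length of the part of the request that misses the prefix cache
    return c * (len(request) - cached_prefix_len)

def average_estimated_ttft(pairs, c: float = 1.0) -> float:
    """pairs: iterable of (request, cached_prefix_len) tuples."""
    pairs = list(pairs)
    return sum(estimated_ttft(r, h, c) for r, h in pairs) / len(pairs)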


I'll also draft my design based on @gaocegege 's repo so that we can compare. But my guess is that both solutions (hash-based and string-match-based) will have some pros and cons, and we can consider having both (and adding some Helm chart values so that we can toggle between them).

@wizenheimer

Perfect, sounds good @KuntaiDu. It seems much more involved, but it would align more closely with the end objective, i.e. reducing TTFT.

Quick query: we might need to parse the incoming request and extract the user and system prompts. Should we introduce a dependency on vLLM or copy over the protocol.py file?

  1. vLLM protocol.py - https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/protocol.py#L1238
  2. Prefix Aware Routing - https://gist.github.com/wizenheimer/5dd676f355fdd138ef40fcd2e2ae4811

@KuntaiDu
Collaborator Author

Perfect, sounds good @KuntaiDu. It seems much more involved, but it would align more closely with the end objective, i.e. reducing TTFT.

Quick query: we might need to parse the incoming request and extract the user and system prompts. Should we introduce a dependency on vLLM or copy over the protocol.py file?

  1. vLLM protocol.py - https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/protocol.py#L1238
  2. Prefix Aware Routing - https://gist.github.com/wizenheimer/5dd676f355fdd138ef40fcd2e2ae4811

Maybe we can directly dump the messages attribute to a JSON string and do a JSON-string-based match, which does not require extra libraries (assuming the HTTP request complies with the OpenAI API format).
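A minimal sketch of that, assuming the request body already follows the OpenAI chat-completions format (matching_key is a hypothetical helper name):

import json

def matching_key(body: dict) -> str:
    # Use the raw messages list, dumped to a JSON string, as the string to match on,
    # so the router needs no tokenizer and no vLLM protocol classes.
    return json.dumps(body.get("messages", []), separators=(",", ":"))

body = {
    "model": "some-model",
    "messages": [
        {"role": "system", "content": "You are a friendly chatbot."},
        {"role": "user", "content": "How are you?"},
    ],
}
print(matching_key(body))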
