[RFC] prefix-cache-aware routing #59
Thanks for the RFC!
Could you elaborate on how chat templates contribute to the differences in token IDs?
I'm curious why this server operates at the pod level when the APIs it provides seem to function at the cluster level.
What does SQL mean here, a SQL database like Postgres?
Basically, the chat template converts a JSON string into the plain-text input to the model, and chat templates are typically different between models and can differ between versions of the same model. That said, the key reason not to do tokenization at the router side is that tokenization itself is pretty slow (it takes several microseconds), so running it for every request creates huge overhead.
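For illustration, a minimal sketch of what a chat template does, assuming a Hugging Face tokenizer (the model name here is just an example, not something from the RFC):

```python
from transformers import AutoTokenizer

# Hypothetical model, used only for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is prefix caching?"},
]

# The chat template turns the structured messages into the plain-text prompt
# that actually gets tokenized; different models (and model versions) render
# this text differently, so token IDs differ even for identical messages.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```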
The string server will be an independent pod in the cluster.
Exactly.
One requirement should be minimal latency overhead, which would rule out a remote SQL-database-based solution.
Oh yeah, you're right. In that case I should use some in-memory database (e.g. Redis).
Agree. We don't need a SQL database with transaction support. Also, what if we made the storage more abstract? That way, users could plug in their own solutions if they wanted. Just a suggestion.
If I understand correctly, would it be stateless so it can scale to multiple instances? Since we're storing the state in Redis, we could have a single deployment with, say, 5 pods to help reduce the load.
Totally agree. The main purpose of this implementation is to build a set of interfaces, so that people can flexibly replace different components with their own code.
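To make the pluggable-storage idea concrete, here is a minimal sketch of what such an interface could look like; the class and method names are hypothetical, not part of the RFC:

```python
from abc import ABC, abstractmethod
from typing import Optional


class KVBackend(ABC):
    """Hypothetical storage interface the string server could depend on."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def set(self, key: str, value: str) -> None: ...


class InMemoryBackend(KVBackend):
    """Default backend: a plain dict, good enough for a single replica."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class RedisBackend(KVBackend):
    """Redis-backed implementation for state shared across replicas."""

    def __init__(self, url: str = "redis://localhost:6379/0") -> None:
        import redis  # assumes the redis-py client is installed

        self._client = redis.Redis.from_url(url, decode_responses=True)

    def get(self, key: str) -> Optional[str]:
        return self._client.get(key)

    def set(self, key: str, value: str) -> None:
        self._client.set(key, value)
```

With something like this, users could swap the default in-memory backend for Redis (or anything else) without touching the routing logic.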
Yes. But as @simon-mo suggested, the communication latency might be a concern, so I need to benchmark and see whether it is OK to put the string server in a separate pod or whether we should co-locate it with the router.
Could we combine the router and string server into a single unit, given that we maintain the cache in a KV store like Redis? The relationship between the router, string server, and Redis (M:N:K) seems complex, and I'm not seeing the advantages of this setup.
Redis is like a storage backend of the string server and they will be co-located in the same pod, so the setup is more like (M:N). I am mainly worried about router scalability if we put the router and the string server (with the Redis backend) into a single pod: we can scale the router by simply having multiple router replicas, but if we have multiple string server replicas we need to align their storage backends (which is tricky).
I'd like to propose an alternative approach based on consistent hashing with bounded load (CHWBL) that could provide efficient prefix-aware routing. The core idea is to augment the consistent hash ring with cache-awareness while preserving O(1) routing decisions and bounded-load guarantees. While string matching provides precise cache-hit detection, the hash-ring approach offers better load-distribution guarantees and much lower routing latency, and doesn't need separate routing infrastructure. The tradeoff, however, is slightly less precise cache affinity, but IMO this is outweighed by the gains mentioned above (a rough sketch follows below).

Data Structures

Request Flow
- Hash the request prefix to get its ring position
- Track the request and provide a cleanup callback
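This isn't from the proposal itself, just a minimal sketch of how CHWBL with a prefix hash could look; the class name, virtual-node count, and load bound are all made up for illustration:

```python
import bisect
import hashlib
from collections import defaultdict


def _hash(key: str) -> int:
    # Any stable hash works here; sha1 keeps the example dependency-free.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)


class BoundedLoadRing:
    def __init__(self, servers: list[str], vnodes: int = 100, load_bound: int = 8):
        self.load_bound = load_bound
        self.load: dict[str, int] = defaultdict(int)  # in-flight requests per server
        # Place vnodes virtual nodes per server on the ring, sorted by hash.
        self._ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def route(self, prefix: str) -> str:
        """Pick the first server at/after the prefix's ring position that is under the bound."""
        start = bisect.bisect(self._keys, _hash(prefix)) % len(self._ring)
        for i in range(len(self._ring)):
            _, server = self._ring[(start + i) % len(self._ring)]
            if self.load[server] < self.load_bound:
                self.load[server] += 1  # caller must release() when the request finishes
                return server
        raise RuntimeError("all servers are at the load bound")

    def release(self, server: str) -> None:
        """Cleanup callback once the request completes."""
        self.load[server] -= 1
```

Same-prefix requests land on the same ring position and thus the same server, until that server hits the bound and requests spill over to its ring neighbors.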
I like the idea of consistent hashing with load awareness. One of its benefits is that it doesn't require a potentially large memory space to store the prefixes.
What hashing algorithm do you propose to get this behavior? @wizenheimer
I was considering picking a hash algorithm that has a low avalanche effect and, at the same time, a good enough distribution to avoid hotspots. We could pick xxHash. Curious, are there any other algorithms that could be a much better fit?
Thanks for the proposal, I like the key insights here.
Could you please provide more details about the hash array? Specifically, what will it store: a sorted array of servers arranged by their load, or an array of virtual nodes sorted by hash values? I think this design makes the router scalable as well. Where do you plan to store the hash ring? In a centralized component like etcd or Consul, or by using a P2P gossip-based protocol (e.g. https://github.com/hashicorp/memberlist) to synchronize among all routers?
LSH is designed to support prefix-based matching by nature. Could you explain how xxHash would handle such a scenario, for example, with inputs like "abcd" and "abcdefg"?
Indeed, the LSH family of hashes (like SimHash) supports prefix similarity by nature, while plain xxHash does not:

```go
// Problem with regular xxHash:
input1 := "abcd"
input2 := "abcdefg"
// Even though they share the prefix "abcd", their hashes are completely different
hash1 := xxhash.Sum64([]byte(input1)) // something like 14872408901234
hash2 := xxhash.Sum64([]byte(input2)) // totally different: 98123074123123
```

So we go with fixed windows, dampening the avalanche effect from subsequent bits:

```go
// Break the input into fixed windows:
// "abcdefg" -> ["abcd", "efg"]  (windowSize = 4)

// Process each window:
firstWindow := "abcd"
hash1 := xxhash.Sum64([]byte(firstWindow))
// Give more weight to the first window:
weightedHash1 := hash1 << 32 // shift left by 32 bits

secondWindow := "efg"
hash2 := xxhash.Sum64([]byte(secondWindow))
// Reduce the weight of later windows:
weightedHash2 := hash2 >> 8 // shift right based on window position

// Combine the hashes:
finalHash := weightedHash1 ^ weightedHash2
```
Not super sure if we need distributed consensus for routing decisions. I believe each router can independently make good decisions based on its local view of the pods, given the state is ephemeral, just for intermittent routing decisions. If a router restarts, it rebuilds state from the pod list. Again, my take might be less informed, but I guess in-memory should be alright.
Makes sense. I think xxhash may be helpful in our session router rather than the prefix-cache-aware router: https://github.com/vllm-project/production-stack/blob/main/src/vllm_router/routing_logic.py#L85 Right now, the session router is built on a hash ring, but I haven't looked into which hash algorithm it's using.
Yes, it makes sense. We can rely on the K8s control plane for that. It should work.
Perfect, if we are in agreement, I would be happy to take a shot at implementing it next. Curious what the test scenarios should look like? I was planning to have these. Are there any more we'd like to have?
@wizenheimer since a wide range of chat applications have a long system prompt as the prefix of all requests, when the user's chat history is relatively short, I guess all requests will have a very similar "position" in the hash space, and this design will overload the servers that are close to that "position" and leave the servers that are far away from it under-utilized. Is my understanding correct?
But I definitely see that @wizenheimer's design works really well in other types of workloads where the prefixes of requests are really diverse (e.g. long-document QA).
I ran a quick simulation for the chat use case: https://gist.github.com/gaocegege/11cb5a0acf370ea8ca72a05eb69da0f8 It uses a relatively long prefix of about 1900 characters, along with some varying suffixes, simulated across 4 servers using SimHash from this repository: https://github.com/1e0ng/simhash. Here are the simulation results (... represents the shared prefix). It appears that it doesn't perform well in this use case. Not sure if there are some LSH variants that support this long-prefix case.
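For readers who don't want to open the gist, here is a rough sketch of that kind of hotspot check; the prefix text, suffixes, and routing rule here are illustrative stand-ins, not the exact simulation from the gist:

```python
from collections import Counter

from simhash import Simhash  # pip install simhash (https://github.com/1e0ng/simhash)

servers = ["server-0", "server-1", "server-2", "server-3"]
# Stand-in for a ~1900-character shared system prompt.
shared_prefix = "You are a helpful assistant. " * 64

counts = Counter()
for i in range(1000):
    request = shared_prefix + f"User question number {i}"
    # Route purely by the SimHash fingerprint, modulo the number of servers.
    fingerprint = Simhash(request).value
    counts[servers[fingerprint % len(servers)]] += 1

# With a long shared prefix and short suffixes, the fingerprints barely move,
# so most requests land on one or two servers.
print(counts)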
Thanks for the note @KuntaiDu @gaocegege, appreciate it!
Absolutely, if we solely use SimHash-based routing this is definitely possible, where the initial prefix would dominate the SimHash fingerprint and lead to hotspots in the hash space.

SimHash(prefix) -> Initial Anchor Point
Bounded Load -> Neighborhood Spread

Intended behavior:
I might need to follow up on the LSH bit, but if we want to exploit APC (automatic prefix caching), it would mean that these clustered requests are more efficient than if they were evenly distributed.
Created a Colab notebook with a sample implementation, closely resembling our LLD: https://colab.research.google.com/drive/1IfWccwyJQySWzIADv5HS9U8YB8JGjpom?usp=sharing A standout feature of prefix-aware routing is that breaking every traditional load-balancing rule leads to better performance.
Clustering Phase:

```python
# First try the LSH-selected node
target_url = self.hash_ring.get_node(str(hash_value))
# If not overloaded, stick with clustering
if url_to_rif[target_url] <= rif_threshold:
    return target_url
```

Overflow Phase:

```python
# If the primary node is overloaded, walk nearby nodes on the ring:
for _, url in self.hash_ring.iterate_nodes(str(hash_value), distinct=True):
    if url_to_rif[url] <= rif_threshold:
        return url  # Found a nearby non-overloaded node
```

- When a server gets too busy (RIF > threshold), requests overflow to nearby nodes on the ring
- Overflow maintains prefix locality
Visualized it here: https://claude.site/artifacts/3cfbbf91-d555-4c7e-bb6f-89a23d78bddc cc: @gaocegege @KuntaiDu
@wizenheimer Could you also share some sampling results of the routing? I’d like to understand if we can ensure that requests with the same prefix are routed to the same server as consistently as possible. In my simulation, I noticed that some requests don’t adhere to this behavior:
Perhaps we could calculate the ratio:
This seems like a great metric to have; it would help us quantify collocation. Created a notebook using the chat data from here, focused on this statistic: https://colab.research.google.com/drive/1zck6yIq-ZkUmazyKIXIucVQiMFG39029 I'm observing mixed results: in some cases there's really strong collocation, while in others I see a spread. Truncation length is another variable in the mix. Not sure if this stems from SimHash's feature explosion. Thanks for the nudge @gaocegege, appreciate it! Is there anything we could improve on? Other approaches we could try and benchmark against?
Another potential metric is the average time-to-first-token (TTFT), which can be roughly estimated from the string length of the part of the incoming request that does not hit the prefix cache (background: TTFT is roughly proportional to that length; see the measurement here). I'll also draft my design based on @gaocegege's repo so that we can compare. But my guess is that both solutions (hash-based and string-match-based) will have some pros and cons, and we can consider having both (and adding some Helm chart values so that we can toggle between them).
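A minimal sketch of that kind of proxy metric, under the stated assumption that TTFT is roughly proportional to the length of the non-cached part of the request; the function name and chunk size are illustrative:

```python
def estimated_ttft_cost(request: str, cached_prefixes: set[str], chunk_size: int = 128) -> int:
    """Return the number of characters of `request` that would miss the prefix cache.

    `cached_prefixes` holds the chunk-prefix strings already cached on the chosen
    server; in the real system this would come from the string server's state.
    """
    matched = 0
    for end in range(chunk_size, len(request) + 1, chunk_size):
        if request[:end] in cached_prefixes:
            matched = end
        else:
            break
    return len(request) - matched


# Example: the first 256 characters are cached, so only the tail contributes to TTFT.
request = "a" * 300
cached = {"a" * 128, "a" * 256}
print(estimated_ttft_cost(request, cached))  # -> 44
```

Averaging this over a request trace, per routing policy, would give a rough TTFT comparison without running the actual servers.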
Perfect, sounds good @KuntaiDu. Seems much more involved, but it would align closer to the end objective, i.e. reducing TTFT. Quick query: we might need to parse the incoming request and extract the user and system prompt. Should we introduce a dependency on vLLM or copy over the protocol.py file?
Maybe we can directly dump the
We are planning to add prefix-cache-aware routing support, as mentioned in #26. Here is an initial version of the design. This design focuses on building the fundamental APIs for prefix-cache-aware routing, without requiring large API changes to vLLM.
Design choices and APIs:
- Match on strings rather than token IDs: it is difficult to get exactly the same sequence of token IDs because of various issues (e.g. chat templates), so we use string matching.
- No tokenization at the router side: tokenization is slow (it takes several microseconds), so we don't want to run it for every request at the router side.
- `query(request: str, server_ids: List[Int], t: Timestamp)`: query which `server_id` inside the list of `server_ids` we should forward this `request` to.
- `notify(request: str, server_id: int, t: Timestamp)`: notify which server the `request` is now executed by, so that the string server can update its internal status.

An initial implementation (see the sketch after this list):

- On `notify(request: str, server_id: int, t: Timestamp)`, we will chop the request string into fixed-size chunks `c0, c1, ..., cn`; for each chunk we create a content hash by `content_hash(ci) = hash(c0 + c1 + ... + ci)`; we store these chunks into the token server, and evict least-recently-used chunks in the corresponding server id to make sure the total number of chunks for that server id is smaller than a pre-defined constant C.
- On `query(request: str, server_ids: List[Int], t: Timestamp)`, we will query each server in the server list and see which one matches the maximum number of chunks.

Feel free to leave your feedback and comments!
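To make the chunk-hashing scheme concrete, here is a minimal, hypothetical sketch of the string server's `notify`/`query` logic; the class name, chunk size, and capacity `C` are illustrative, and timestamps are used only for LRU ordering:

```python
import hashlib
from collections import OrderedDict
from typing import Dict, List


class StringServer:
    def __init__(self, chunk_size: int = 128, capacity_c: int = 1024):
        self.chunk_size = chunk_size
        self.capacity_c = capacity_c  # max number of chunks tracked per server id
        # server_id -> OrderedDict of content_hash -> last-used timestamp (LRU order)
        self.chunks: Dict[int, OrderedDict] = {}

    def _content_hashes(self, request: str) -> List[str]:
        # content_hash(ci) = hash(c0 + c1 + ... + ci), i.e. a hash of the cumulative prefix.
        return [
            hashlib.sha256(request[:end].encode()).hexdigest()
            for end in range(self.chunk_size, len(request) + 1, self.chunk_size)
        ]

    def notify(self, request: str, server_id: int, t: float) -> None:
        store = self.chunks.setdefault(server_id, OrderedDict())
        for h in self._content_hashes(request):
            store.pop(h, None)
            store[h] = t  # move to the most-recently-used position
        while len(store) > self.capacity_c:
            store.popitem(last=False)  # evict the least-recently-used chunk

    def query(self, request: str, server_ids: List[int], t: float) -> int:
        # Pick the server whose stored chunks match the longest prefix of this request.
        hashes = self._content_hashes(request)
        best_server, best_matched = server_ids[0], -1
        for sid in server_ids:
            store = self.chunks.get(sid, {})
            matched = 0
            for h in hashes:
                if h in store:
                    matched += 1
                else:
                    break
            if matched > best_matched:
                best_server, best_matched = sid, matched
        return best_server
```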