Online serving benchmarks [multiturn chat, shared prefix] for multi-tier KV caching #2665
Motivation
This PR extends `bench_serving.py` to include more request patterns that can be used for multi-tier KV caching. These patterns also cover more datasets with different characteristics (e.g., input length, output length, number of rounds, etc.).
Two patterns are included so far, each of which supports (or will support) multiple datasets:

- Multiturn Chat
- Shared Prefix
Modifications
Code Structure

- `bench_serving.py`: the main entrance to the benchmark
- `data_processing`: processes the downloaded datasets according to the args
- `bench/nextqa/`: gadgets for testing the NExT-QA video benchmark

New options in `bench_serving.py`
- `--disable-shuffle`: disable random shuffling of the dataset to get more stable results
- `--enable-multiturn`: turn on multiturn chat for the datasets mentioned above
- `--enable-shared-prefix`: turn on shared prefix for the datasets mentioned above
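A hypothetical invocation combining these flags (only the three flags above come from this PR; whatever backend, model, or dataset arguments `bench_serving.py` normally requires are omitted here):

```bash
# Illustrative only: the three new flags are from this PR; the usual
# backend/model/dataset arguments are omitted for brevity.
python bench_serving.py --enable-multiturn --disable-shuffle
python bench_serving.py --enable-shared-prefix
```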
Main changes

I introduced `inputs_requests_queue`, which is an `asyncio.Queue`. At the beginning, the first user prompt of every conversation is enqueued to `inputs_requests_queue`. `get_requests` takes one prompt at a time from the queue, following the request rate. At the end of `request_func`, when it finishes processing a prompt, `request_func` pushes the conversation's next request to `inputs_requests_queue` for the next round. A simplified sketch of this flow follows.
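Below is a minimal, self-contained sketch of this flow. It is an illustration, not the PR's actual code: `Turn`, the fixed round counts, and the stubbed backend call are assumptions; only `inputs_requests_queue` and the re-enqueue-on-completion behavior of `request_func` mirror the design described above.

```python
import asyncio
from dataclasses import dataclass

# Illustrative stand-in for one conversation turn; the real benchmark
# carries prompts, chat history, sampling params, etc.
@dataclass
class Turn:
    conv: str
    round: int
    total_rounds: int

inputs_requests_queue: asyncio.Queue = asyncio.Queue()

async def request_func(turn: Turn) -> None:
    # Stub for the real backend call made by bench_serving.py.
    await asyncio.sleep(0.01)
    # Once this turn finishes, push the conversation's next turn so it
    # gets picked up in a later round.
    if turn.round < turn.total_rounds:
        await inputs_requests_queue.put(
            Turn(turn.conv, turn.round + 1, turn.total_rounds)
        )

async def main(request_rate: float = 100.0) -> None:
    rounds = {"A": 2, "B": 3, "C": 4}
    # At the beginning, enqueue the first user prompt of every
    # conversation.
    for conv, n in rounds.items():
        await inputs_requests_queue.put(Turn(conv, 1, n))
    # get_requests-style loop: take one prompt at a time from the
    # queue, paced by the request rate.
    tasks = []
    for _ in range(sum(rounds.values())):
        turn = await inputs_requests_queue.get()
        print(f"send {turn.conv}{turn.round}")
        tasks.append(asyncio.create_task(request_func(turn)))
        await asyncio.sleep(1.0 / request_rate)
    await asyncio.gather(*tasks)

asyncio.run(main())
```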
Notes on the request order:
Multiturn: For now, since we use a queue, multiturn chat sends requests in a round-robin fashion. For example, if we have 3 conversations A, B, C with [2, 3, 4] rounds respectively, multiturn chat will send the requests to the backend in the following order: [A1, B1, C1, A2, B2, C2, B3, C3, C4]. This has implications for the cache reuse pattern: the cache reuse distance is the largest under this request order, which means a prefix-aware local scheduler in the backend can yield the most benefit compared to a FIFO scheduler. The snippet below reproduces this ordering.
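To make the ordering concrete, here is a tiny simulation (illustrative only, not code from this PR) of the order a FIFO queue produces for rounds [2, 3, 4]:

```python
from collections import deque

# Conversations A, B, C with 2, 3, and 4 rounds respectively.
rounds = {"A": 2, "B": 3, "C": 4}

# Enqueue the first turn of every conversation, then re-enqueue the
# next turn each time a request "finishes" -- mirroring how
# request_func re-enqueues into inputs_requests_queue.
queue = deque((conv, 1) for conv in rounds)
order = []
while queue:
    conv, turn = queue.popleft()
    order.append(f"{conv}{turn}")
    if turn < rounds[conv]:
        queue.append((conv, turn + 1))

print(order)  # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'B3', 'C3', 'C4']
```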
Shared Prefix: For now, requests that share the same prefix are sent together in the benchmark. For example, if we have 3 shared prefixes A, B, C with [2, 3, 4] questions respectively, the shared prefix benchmark will send the requests to the backend in the following order: [A+Q1, A+Q2, B+Q1, B+Q2, B+Q3, C+Q1, C+Q2, C+Q3, C+Q4]. If this is not ideal, we can either follow the same round-robin pattern as multiturn above, or shuffle all the requests randomly again. A sketch of the grouped ordering follows.
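A small sketch of the grouped ordering (illustrative only, not the PR's code; relies on Python dicts preserving insertion order):

```python
# Each shared prefix and its number of questions.
prefix_questions = {"A": 2, "B": 3, "C": 4}

# Requests sharing a prefix are grouped and sent back to back.
order = [
    f"{prefix}+Q{i}"
    for prefix, n in prefix_questions.items()
    for i in range(1, n + 1)
]
print(order)
# ['A+Q1', 'A+Q2', 'B+Q1', 'B+Q2', 'B+Q3',
#  'C+Q1', 'C+Q2', 'C+Q3', 'C+Q4']
```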
TODO list