Online serving benchmarks [multiturn chat, shared prefix] for multi-tier KV caching #2665
Motivation
This PR extends `bench_serving.py` to include more request patterns that can be used for multi-tier KV caching. These patterns also cover more datasets with different characteristics (e.g., input length, output length, number of rounds, etc.).
Two patterns are included so far, each of which supports (or will support) multiple datasets:

- Multiturn Chat
- Shared Prefix
Modifications
Code Structure

- `bench_serving.py`: the main entrance to the benchmark
- `data_processing`: processes the downloaded datasets according to the args
- `bench/nextqa/`: gadgets for testing the NExT-QA video benchmark

New options in `bench_serving.py`
- `--disable-shuffle`: disable random shuffling of the dataset to get more stable results
- `--enable-multiturn`: turn on multiturn chat for the datasets mentioned above
- `--enable-shared-prefix`: turn on shared prefix for the datasets mentioned above
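A hypothetical invocation combining these flags (only the three flags above come from this PR; whatever backend, model, or dataset arguments `bench_serving.py` normally requires are omitted here):

```bash
# Illustrative only: the three new flags are from this PR; the usual
# backend/model/dataset arguments are omitted for brevity.
python bench_serving.py --enable-multiturn --disable-shuffle
python bench_serving.py --enable-shared-prefix
```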
Main changes

I introduced `inputs_requests_queue`, which is an `asyncio.Queue`. At the beginning, the first user prompt of every conversation is enqueued to `inputs_requests_queue`. `get_requests` takes one prompt at a time from the queue, following the request rate. At the end of `request_func`, when it finishes processing a prompt, `request_func` pushes the conversation's next request to `inputs_requests_queue` for the next round. A simplified sketch of this flow follows.
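Below is a minimal, self-contained sketch of this flow. It is an illustration, not the PR's actual code: `Turn`, the fixed round counts, and the stubbed backend call are assumptions; only `inputs_requests_queue` and the re-enqueue-on-completion behavior of `request_func` mirror the design described above.

```python
import asyncio
from dataclasses import dataclass

# Illustrative stand-in for one conversation turn; the real benchmark
# carries prompts, chat history, sampling params, etc.
@dataclass
class Turn:
    conv: str
    round: int
    total_rounds: int

inputs_requests_queue: asyncio.Queue = asyncio.Queue()

async def request_func(turn: Turn) -> None:
    # Stub for the real backend call made by bench_serving.py.
    await asyncio.sleep(0.01)
    # Once this turn finishes, push the conversation's next turn so it
    # gets picked up in a later round.
    if turn.round < turn.total_rounds:
        await inputs_requests_queue.put(
            Turn(turn.conv, turn.round + 1, turn.total_rounds)
        )

async def main(request_rate: float = 100.0) -> None:
    rounds = {"A": 2, "B": 3, "C": 4}
    # At the beginning, enqueue the first user prompt of every
    # conversation.
    for conv, n in rounds.items():
        await inputs_requests_queue.put(Turn(conv, 1, n))
    # get_requests-style loop: take one prompt at a time from the
    # queue, paced by the request rate.
    tasks = []
    for _ in range(sum(rounds.values())):
        turn = await inputs_requests_queue.get()
        print(f"send {turn.conv}{turn.round}")
        tasks.append(asyncio.create_task(request_func(turn)))
        await asyncio.sleep(1.0 / request_rate)
    await asyncio.gather(*tasks)

asyncio.run(main())
```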
Notes on the request order:
Multiturn: For now, since we use a queue, multiturn chat sends requests in a round-robin fashion. For example, if we have 3 conversations A, B, C with [2, 3, 4] rounds respectively, multiturn chat will send the requests to the backend in the following order: [A1, B1, C1, A2, B2, C2, B3, C3, C4]. This has implications for the cache reuse pattern: the cache reuse distance is the largest under this request order, which means a prefix-aware local scheduler in the backend can yield the most benefit compared to a FIFO scheduler. The snippet below reproduces this ordering.
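To make the ordering concrete, here is a tiny simulation (illustrative only, not code from this PR) of the order a FIFO queue produces for rounds [2, 3, 4]:

```python
from collections import deque

# Conversations A, B, C with 2, 3, and 4 rounds respectively.
rounds = {"A": 2, "B": 3, "C": 4}

# Enqueue the first turn of every conversation, then re-enqueue the
# next turn each time a request "finishes" -- mirroring how
# request_func re-enqueues into inputs_requests_queue.
queue = deque((conv, 1) for conv in rounds)
order = []
while queue:
    conv, turn = queue.popleft()
    order.append(f"{conv}{turn}")
    if turn < rounds[conv]:
        queue.append((conv, turn + 1))

print(order)  # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'B3', 'C3', 'C4']
```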
Shared Prefix: For now, requests that share the same prefix are sent together in the benchmark. For example, if we have 3 shared prefixes A, B, C with [2, 3, 4] questions respectively, the shared prefix benchmark will send the requests to the backend in the following order: [A+Q1, A+Q2, B+Q1, B+Q2, B+Q3, C+Q1, C+Q2, C+Q3, C+Q4]. If this is not ideal, we can either follow the same round-robin pattern as multiturn above, or shuffle all the requests randomly again. A sketch of the grouped ordering follows.
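A small sketch of the grouped ordering (illustrative only, not the PR's code; relies on Python dicts preserving insertion order):

```python
# Each shared prefix and its number of questions.
prefix_questions = {"A": 2, "B": 3, "C": 4}

# Requests sharing a prefix are grouped and sent back to back.
order = [
    f"{prefix}+Q{i}"
    for prefix, n in prefix_questions.items()
    for i in range(1, n + 1)
]
print(order)
# ['A+Q1', 'A+Q2', 'B+Q1', 'B+Q2', 'B+Q3',
#  'C+Q1', 'C+Q2', 'C+Q3', 'C+Q4']
```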
TODO list