
Online serving benchmarks [multiturn chat, shared prefix] to multi-tier KV caching #2665

Open
wants to merge 7 commits into main

Conversation

PanJason
Contributor

Motivation

This PR extends bench_serving.py to include more patterns that can be used
for multi-tier KV caching. These patterns also cover more datasets with different
characteristics (e.g., input length, output length, number of rounds).
So far I have included two patterns, each of which supports (or will support) multiple datasets:

  • Multiturn Chat

    • Ultrachat (short input)
    • ShareGPT (medium input)
    • LooGLE (long input)
    • NextQA (multi-modal)
  • Shared Prefix

    • LooGLE (shared text)
    • NextQA (shared video)

Modifications

Code Structure

  • bench_serving.py: the main entrance to the benchmark
  • data_processing: processes the downloaded datasets according to the arguments
  • bench/nextqa/: utilities for testing the NextQA video benchmark

New options in bench_serving.py

  • --disable-shuffle: disable random shuffling of the dataset to get more stable results
  • --enable-multiturn: enable multiturn chat for the datasets listed above
  • --enable-shared-prefix: enable shared prefix mode for the datasets listed above (see the example invocation below)
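As an illustration, a multiturn run might be launched as follows. Only the three flags above are introduced by this PR; the remaining arguments are an assumed invocation style for bench_serving.py and may differ from the actual interface:

```bash
# --enable-multiturn and --disable-shuffle are the new flags from this PR;
# the backend, dataset, and rate arguments are assumed for illustration.
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name ultrachat \
    --enable-multiturn \
    --disable-shuffle \
    --num-prompts 256 \
    --request-rate 4
```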

Main changes

I introduced inputs_requests_queue, which is an asyncio.Queue. At the beginning,
the first user prompt of every conversation is enqueued to inputs_requests_queue.
get_requests takes one prompt from the queue, following the request rate. When
request_func finishes processing a prompt, it pushes the follow-up request for the
next round of that conversation onto inputs_requests_queue.
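In pseudocode terms, the control flow looks roughly like the following minimal sketch. The names are simplified, the backend call is replaced by a sleep, and fixed-rate pacing stands in for the actual arrival schedule, so this is not the PR's code, just its shape:

```python
import asyncio

async def get_requests(queue: asyncio.Queue, request_rate: float):
    """Pop one prompt at a time from the queue, pacing at the request rate."""
    while True:
        request = await queue.get()
        if request is None:  # sentinel marking the end of the benchmark
            return
        yield request
        await asyncio.sleep(1.0 / request_rate)  # simplified fixed-rate pacing

async def request_func(request: dict, queue: asyncio.Queue):
    """Process one prompt, then enqueue the next round of the conversation."""
    await asyncio.sleep(0.01)  # stand-in for the actual backend request
    if request["round"] < request["total_rounds"]:  # more rounds remain
        await queue.put({**request, "round": request["round"] + 1})
```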

Notes on the request order:

Multiturn: For now, since we use a queue, multiturn chat sends requests in a round-robin fashion.
For example, if we have 3 conversations A, B, C with [2, 3, 4] rounds respectively,
multiturn chat will send the requests to the backend in the following order: [A1, B1, C1, A2, B2, C2, B3, C3, C4].
This has implications for cache reuse patterns: the cache reuse distance is the largest
under this request pattern, which means a prefix-aware local scheduler in the backend can
yield the most benefit compared to a FIFO scheduler.
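To make the interleaving concrete, here is a small standalone sketch (not the benchmark code itself) that reproduces this round-robin order:

```python
from itertools import zip_longest

def multiturn_order(rounds: dict[str, int]) -> list[str]:
    """Interleave conversation rounds: one request per conversation per pass."""
    per_conv = [[f"{name}{i}" for i in range(1, n + 1)] for name, n in rounds.items()]
    return [r for wave in zip_longest(*per_conv) for r in wave if r is not None]

assert multiturn_order({"A": 2, "B": 3, "C": 4}) == [
    "A1", "B1", "C1", "A2", "B2", "C2", "B3", "C3", "C4"
]
```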

Shared Prefix: For now, requests that share the same prefix are sent together
in the benchmark. For example, if we have 3 shared prefixes A, B, C with [2, 3, 4]
questions respectively, the shared prefix benchmark will send the requests to the
backend in the following order: [A+Q1, A+Q2, B+Q1, B+Q2, B+Q3, C+Q1, C+Q2, C+Q3, C+Q4].
If this is not ideal, we can either follow the same round-robin pattern as multiturn chat above, or
shuffle all the requests randomly again.
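And the corresponding sketch for the grouped shared-prefix order:

```python
def shared_prefix_order(questions: dict[str, int]) -> list[str]:
    """Keep all questions under the same prefix together."""
    return [f"{p}+Q{i}" for p, n in questions.items() for i in range(1, n + 1)]

assert shared_prefix_order({"A": 2, "B": 3, "C": 4}) == [
    "A+Q1", "A+Q2", "B+Q1", "B+Q2", "B+Q3", "C+Q1", "C+Q2", "C+Q3", "C+Q4"
]
```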

TODO list

  • Make NextQA work
  • Add zipfian distribution when generating requests
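For the zipfian TODO, one possible direction (purely illustrative, not part of this PR) is to skew which conversation receives the next request using numpy's Zipf sampler:

```python
import numpy as np

def pick_conversation(num_conversations: int, a: float = 1.5, rng=None) -> int:
    """Pick a conversation index with Zipf-skewed popularity.

    Low indices are drawn far more often, mimicking a few hot conversations.
    """
    rng = rng or np.random.default_rng()
    while True:
        idx = rng.zipf(a) - 1          # Zipf(a) is supported on {1, 2, ...}
        if idx < num_conversations:    # rejection-sample into the valid range
            return idx
```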

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@xiezhq-hermann xiezhq-hermann self-assigned this Dec 30, 2024
@xiezhq-hermann xiezhq-hermann self-requested a review December 30, 2024 08:27
@xiezhq-hermann
Collaborator

Thank you for the contribution! Can you organize all the benchmark scripts and datasets under the benchmark directory to keep it cleaner and avoid changing the CI execution?

@PanJason
Contributor Author

> Thank you for the contribution! Can you organize all the benchmark scripts and datasets under the benchmark directory to keep it cleaner and avoid changing the CI execution?

Yeah, working on it. I will also drop the parts duplicated from bench_serving.py in this change.
