basically finish serve layer docs
SiriusNEO committed Sep 12, 2024
1 parent d784375 commit d2dac99
Showing 5 changed files with 46 additions and 1 deletion.
Binary file added docs/images/flow_scheduling.png
1 change: 1 addition & 0 deletions docs/sys_design/README.md
@@ -8,6 +8,7 @@ Parrot is a distributed serving system for LLM-based Applications. It can be div
- [ServeCore](serve_layer/core.md), a.k.a. Parrot Manager.
- [Global Scheduler](serve_layer/global_scheduler.md).
- [Parrot's Graph Representation](serve_layer/graph.md).
- [Parrot's Graph Executor](serve_layer/executor.md), read how Parrot efficiently executes a DAG of requests.
- [Context](serve_layer/context.md), read the cluster-level memory management of Parrot.
- [Engines](serve_layer/engines.md), read the management of engines.
- [Sessions](serve_layer/sessions.md), read the management of sessions.
10 changes: 9 additions & 1 deletion docs/sys_design/serve_layer/executor.md
@@ -1 +1,9 @@
# Graph Executor

Parrot adopts a graph executor to automatically parallelize and batch the LLM requests in the DAG. Each `Session` has its own executor, which dynamically maintains a `ComputeGraph`.

## Graph-based Execution with Coroutines

Parrot's `ComputeGraph` is a data-dependency graph. The basic unit of execution in the graph is the `CompletionChain` (see [Graph](graph.md)), which is executed once it becomes ready, i.e., once all of its dependencies have been executed.

To implement this, we continuously poll the graph and pop out chains with zero in-degree. Parrot assigns a `Coroutine` to each `CompletionChain` and wraps it as a task in the polling loop. Different `CompletionChain`s communicate with each other using `Event`s from Python's asynchronous programming framework (asyncio).
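Below is a minimal, self-contained sketch of this pattern using asyncio. The class and variable names are illustrative, not Parrot's actual implementation: each chain's coroutine waits on the `Event`s of its dependencies, runs, and then fires its own `Event` so that downstream chains become ready.

```python
import asyncio

class ChainTask:
    def __init__(self, name, deps):
        self.name = name
        self.deps = deps
        self.done_event = asyncio.Event()  # fired once this chain has been executed

    async def run(self):
        # Wait until every dependency has finished, i.e. the chain becomes ready.
        await asyncio.gather(*(dep.done_event.wait() for dep in self.deps))
        print(f"executing chain {self.name}")
        await asyncio.sleep(0.1)  # stands in for the real LLM request
        self.done_event.set()     # wake up every chain waiting on this one

async def main():
    a = ChainTask("A", deps=[])
    b = ChainTask("B", deps=[a])
    c = ChainTask("C", deps=[a])  # B and C become ready together and run in parallel
    await asyncio.gather(a.run(), b.run(), c.run())

asyncio.run(main())
```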
20 changes: 20 additions & 0 deletions docs/sys_design/serve_layer/global_scheduler.md
@@ -1,3 +1,23 @@
# Global Scheduler

Parrot's **Global Scheduler** primarily addresses the problem of deciding which machine to dispatch requests to (note the distinction from the **Local Scheduler** on the Engine, which is responsible for scheduling each iteration of the LLM, a.k.a. continuous batching).

Thanks to Semantic Variables, we can extract a wealth of useful high-level information, which allows us to design various sophisticated scheduling strategies. This section introduces some of Parrot's built-in scheduling strategies.

## QoS-aware Scheduling based on DAG

Due to batching, the per-token latency of an Engine is typically determined by the load of requests on it. We define our QoS as **the per-token latency of the generation process** (this matters in many streaming applications, most notably chat). Parrot's Global Scheduler uses the DAG information to schedule requests so that no Engine violates the QoS requirement.

The detailed technique can be found in Section 5.2 of our paper.
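As a rough illustration of the idea only (not the actual algorithm; the linear latency model, constants, and names below are made up for the sketch), a scheduler can estimate the per-token latency an Engine would have after accepting a request and dispatch only to Engines that stay within the QoS bound:

```python
# Toy latency model: per-token latency grows linearly with the tokens batched on an engine.
TOKEN_LAT_PER_BATCHED_TOKEN = 0.05  # ms per batched token (made-up constant)

def estimated_per_token_latency(engine_load_tokens: int) -> float:
    return TOKEN_LAT_PER_BATCHED_TOKEN * engine_load_tokens

def pick_engine(engines: dict, request_tokens: int, qos_ms: float):
    """Pick an engine whose per-token latency stays under the QoS bound
    after accepting this request; return None if no engine qualifies."""
    candidates = []
    for name, load in engines.items():
        if estimated_per_token_latency(load + request_tokens) <= qos_ms:
            candidates.append((load, name))
    if not candidates:
        return None
    # Prefer the least-loaded qualifying engine.
    return min(candidates)[1]

# Example: two engines, one already heavily loaded.
engines = {"engine-0": 4000, "engine-1": 800}
print(pick_engine(engines, request_tokens=500, qos_ms=100.0))  # -> "engine-1"
```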

## Application-level FIFO (Flow Scheduling)

When Parrot faces multiple analytic tasks (i.e., multiple chains of the same length), App-FIFO is a free lunch.

In the following example, we have N applications (N analytic tasks), each composed of 2 steps / 2 LLM requests, denoted as A -> B. Intuitively, App-FIFO achieves the best overall JCT (Job Completion Time). App-FIFO vs. request-level FIFO is analogous to depth-first order vs. breadth-first order. In some cases, such as multiple chains with different lengths, the situation may differ and the latter strategy may be better (a toy calculation follows the figure below).

![](../../images/flow_scheduling.png)
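The toy calculation below (not Parrot code; all numbers are illustrative) makes the JCT difference concrete for N = 3 applications, each with two unit-time requests A -> B:

```python
N = 3
apps = [f"app{i}" for i in range(1, N + 1)]

# Request-level FIFO (breadth-first): app1-A, app2-A, app3-A, app1-B, app2-B, app3-B
request_fifo = [(a, "A") for a in apps] + [(a, "B") for a in apps]
# App-FIFO (depth-first): app1-A, app1-B, app2-A, app2-B, app3-A, app3-B
app_fifo = [(a, step) for a in apps for step in ("A", "B")]

def avg_jct(schedule):
    # An application finishes when its "B" request completes (1-indexed time slot).
    finish = {app: t + 1 for t, (app, step) in enumerate(schedule) if step == "B"}
    return sum(finish.values()) / len(finish)

print(avg_jct(request_fifo))  # 5.0 -> every app waits for all A steps first
print(avg_jct(app_fifo))      # 4.0 -> earlier apps finish much sooner
```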

## Context-aware Scheduling

Some scheduling strategies are co-designed between the high-level and low-level layers. Since the built-in `Engine` is equipped with a [Shared Attention Kernel](../engine_layer/shared_attention_kernel.md), it is better to co-locate requests that share the same prefix (i.e., the same prefix `Context`) on the same machine whenever possible.
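A minimal sketch of this co-location heuristic, with illustrative names rather than Parrot's actual API:

```python
def pick_engine_for_prefix(engines: dict, prefix_context_id: str, fallback):
    """engines maps engine name -> set of prefix Context ids cached on it."""
    for name, cached_prefixes in engines.items():
        if prefix_context_id in cached_prefixes:
            return name          # co-locate with the engine that already holds the prefix
    return fallback(engines)     # otherwise fall back to the normal dispatch policy

engines = {
    "engine-0": {"ctx-few-shot-examples"},
    "engine-1": set(),
}
choice = pick_engine_for_prefix(
    engines, "ctx-few-shot-examples",
    fallback=lambda e: min(e, key=lambda n: len(e[n])),
)
print(choice)  # -> "engine-0"
```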
16 changes: 16 additions & 0 deletions docs/sys_design/serve_layer/sv_manage.md
@@ -0,0 +1,16 @@
# Semantic Variable Management

Semantic Variable management focuses on the generation and assignment of Variable `id`s.

## Namespace

A Semantic Variable namespace is simply an `id` generator. Parrot ensures that Variable `id`s do not collide within the same namespace. Overall, there are two kinds of namespaces in the Semantic Variable manager (a sketch follows this list):
- Global namespace, a.k.a. the namespace for Constant Prefix variables.
- Local namespace of each `Session`.
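A minimal sketch of the namespace idea (class and variable names are illustrative, not Parrot's actual implementation): each namespace hands out `id`s that are unique within it.

```python
import uuid

class VarNamespace:
    def __init__(self, scope: str):
        self.scope = scope          # e.g. "global" or a session id
        self.issued = set()

    def new_id(self) -> str:
        var_id = f"{self.scope}-{uuid.uuid4().hex[:8]}"
        assert var_id not in self.issued   # no duplicates within this namespace
        self.issued.add(var_id)
        return var_id

global_ns = VarNamespace("global")          # for Constant Prefix variables
session_ns = VarNamespace("session-42")     # local namespace of one Session
print(global_ns.new_id(), session_ns.new_id())
```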

## Generate `id`

The Semantic Variable manager is responsible for assigning a Semantic Variable to each node in a `RequestChain` (see [Graph](graph.md)). The manager treats different node types differently (a sketch follows this list):
- `ConstantFill`: To keep management consistent, we also assign Semantic Variables to constant text (not just to input/output fields). The `id` of this type of node is generated by hashing the text content itself, so that text chunks with the same content can reuse the same Semantic Variable.
- `PlaceholderFill`: Nodes of this type must already be bound to an existing variable.
- `PlaceholderGen`: We generate a fresh unique `id` (hashed with the given `var_name`).
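A minimal sketch of these three rules, again with illustrative names rather than Parrot's actual API:

```python
import hashlib
import uuid
from typing import Optional

def constant_fill_id(text: str) -> str:
    # Hash the text itself so identical constant chunks reuse one variable.
    return "const-" + hashlib.sha256(text.encode()).hexdigest()[:16]

def placeholder_fill_id(bound_var_id: Optional[str]) -> str:
    # A PlaceholderFill node must already be bound to an existing variable.
    if bound_var_id is None:
        raise ValueError("PlaceholderFill node is not bound to a variable")
    return bound_var_id

def placeholder_gen_id(var_name: str) -> str:
    # A fresh unique id, salted with the given var_name.
    return f"gen-{var_name}-{uuid.uuid4().hex[:8]}"

print(constant_fill_id("You are a helpful assistant."))
print(placeholder_gen_id("answer"))
```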
