basically finish serve layer docs
SiriusNEO committed Sep 12, 2024
1 parent d784375 commit d2dac99
Showing 5 changed files with 46 additions and 1 deletion.
Binary file added docs/images/flow_scheduling.png
1 change: 1 addition & 0 deletions docs/sys_design/README.md
@@ -8,6 +8,7 @@ Parrot is a distributed serving system for LLM-based Applications. It can be div
- [ServeCore](serve_layer/core.md), a.k.a. Parrot Manager.
- [Global Scheduler](serve_layer/global_scheduler.md).
- [Parrot's Graph Representation](serve_layer/graph.md).
- [Parrot's Graph Executor](serve_layer/executor.md), read how Parrot efficiently executes a DAG of requests.
- [Context](serve_layer/context.md), read the cluster-level memory management of Parrot.
- [Engines](serve_layer/engines.md), read the management of engines.
- [Sessions](serve_layer/sessions.md), read the management of sessions.
10 changes: 9 additions & 1 deletion docs/sys_design/serve_layer/executor.md
@@ -1 +1,9 @@
# Graph Executor

Parrot adopts a graph executor to automatically parallelize and batch the LLM requests in the DAG. Each `Session` has its own executor, which dynamically maintains a `ComputeGraph`.

## Graph-based Execution with Coroutines

Parrot's `ComputeGraph` is a data-dependency graph. The basic unit of execution in the graph is the `CompletionChain` (see [Graph](graph.md)), which is executed once it becomes ready, i.e., once all of its dependencies have been executed.

To implement this, we continuously poll the graph and pop out chains with zero in-degree. Parrot assigns a `Coroutine` to each `CompletionChain` and wraps it as a task in the polling loop. Different `CompletionChain`s communicate with each other using `Event`s from Python's asynchronous programming framework (asyncio).
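Below is a minimal, self-contained sketch of this pattern using asyncio. The class and variable names are illustrative, not Parrot's actual implementation: each chain's coroutine waits on the `Event`s of its dependencies, runs, and then fires its own `Event` so that downstream chains become ready.

```python
import asyncio

class ChainTask:
    def __init__(self, name, deps):
        self.name = name
        self.deps = deps
        self.done_event = asyncio.Event()  # fired once this chain has been executed

    async def run(self):
        # Wait until every dependency has finished, i.e. the chain becomes ready.
        await asyncio.gather(*(dep.done_event.wait() for dep in self.deps))
        print(f"executing chain {self.name}")
        await asyncio.sleep(0.1)  # stands in for the real LLM request
        self.done_event.set()     # wake up every chain waiting on this one

async def main():
    a = ChainTask("A", deps=[])
    b = ChainTask("B", deps=[a])
    c = ChainTask("C", deps=[a])  # B and C become ready together and run in parallel
    await asyncio.gather(a.run(), b.run(), c.run())

asyncio.run(main())
```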
20 changes: 20 additions & 0 deletions docs/sys_design/serve_layer/global_scheduler.md
@@ -1,3 +1,23 @@
# Global Scheduler

Parrot's **Global Scheduler** primarily addresses the problem of deciding which machine to dispatch requests to (note the distinction from the **Local Scheduler** on the Engine, which is responsible for scheduling each iteration of the LLM, a.k.a. continuous batching).

Thanks to Semantic Variables, we can extract a wealth of useful high-level information, which allows us to design various sophisticated scheduling strategies. This section introduces some of Parrot's built-in scheduling strategies.

## QoS-aware Scheduling based on DAG

Due to batching, the per-token latency of an Engine is typically determined by the load of requests on it. We define our QoS as **the per-token latency of the generation process** (this matters in many streaming applications, most notably chat). Parrot's Global Scheduler uses the DAG information to schedule requests so that no Engine violates the QoS requirement.

The detailed technique can be found in Section 5.2 of our paper.
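As a rough illustration of the idea only (not the actual algorithm; the linear latency model, constants, and names below are made up for the sketch), a scheduler can estimate the per-token latency an Engine would have after accepting a request and dispatch only to Engines that stay within the QoS bound:

```python
# Toy latency model: per-token latency grows linearly with the tokens batched on an engine.
TOKEN_LAT_PER_BATCHED_TOKEN = 0.05  # ms per batched token (made-up constant)

def estimated_per_token_latency(engine_load_tokens: int) -> float:
    return TOKEN_LAT_PER_BATCHED_TOKEN * engine_load_tokens

def pick_engine(engines: dict, request_tokens: int, qos_ms: float):
    """Pick an engine whose per-token latency stays under the QoS bound
    after accepting this request; return None if no engine qualifies."""
    candidates = []
    for name, load in engines.items():
        if estimated_per_token_latency(load + request_tokens) <= qos_ms:
            candidates.append((load, name))
    if not candidates:
        return None
    # Prefer the least-loaded qualifying engine.
    return min(candidates)[1]

# Example: two engines, one already heavily loaded.
engines = {"engine-0": 4000, "engine-1": 800}
print(pick_engine(engines, request_tokens=500, qos_ms=100.0))  # -> "engine-1"
```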

## Application-level FIFO (Flow Scheduling)

When Parrot faces multiple analytic tasks (i.e., multiple chains of the same length), App-FIFO is a free lunch.

In the following example, we have N applications (N analytic tasks), each composed of 2 steps / 2 LLM requests, denoted as A -> B. Intuitively, App-FIFO achieves the best overall JCT (Job Completion Time). App-FIFO vs. request-level FIFO is analogous to depth-first order vs. breadth-first order. In some cases, such as multiple chains with different lengths, the situation may differ and the latter strategy may be better (a toy calculation follows the figure below).

![](../../images/flow_scheduling.png)
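The toy calculation below (not Parrot code; all numbers are illustrative) makes the JCT difference concrete for N = 3 applications, each with two unit-time requests A -> B:

```python
N = 3
apps = [f"app{i}" for i in range(1, N + 1)]

# Request-level FIFO (breadth-first): app1-A, app2-A, app3-A, app1-B, app2-B, app3-B
request_fifo = [(a, "A") for a in apps] + [(a, "B") for a in apps]
# App-FIFO (depth-first): app1-A, app1-B, app2-A, app2-B, app3-A, app3-B
app_fifo = [(a, step) for a in apps for step in ("A", "B")]

def avg_jct(schedule):
    # An application finishes when its "B" request completes (1-indexed time slot).
    finish = {app: t + 1 for t, (app, step) in enumerate(schedule) if step == "B"}
    return sum(finish.values()) / len(finish)

print(avg_jct(request_fifo))  # 5.0 -> every app waits for all A steps first
print(avg_jct(app_fifo))      # 4.0 -> earlier apps finish much sooner
```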

## Context-aware Scheduling

Some scheduling strategies are co-designed between the high-level and low-level layers. Since the built-in `Engine` is equipped with a [Shared Attention Kernel](../engine_layer/shared_attention_kernel.md), it is better to co-locate requests that share the same prefix (i.e., the same prefix `Context`) on the same machine whenever possible.
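A minimal sketch of this co-location heuristic, with illustrative names rather than Parrot's actual API:

```python
def pick_engine_for_prefix(engines: dict, prefix_context_id: str, fallback):
    """engines maps engine name -> set of prefix Context ids cached on it."""
    for name, cached_prefixes in engines.items():
        if prefix_context_id in cached_prefixes:
            return name          # co-locate with the engine that already holds the prefix
    return fallback(engines)     # otherwise fall back to the normal dispatch policy

engines = {
    "engine-0": {"ctx-few-shot-examples"},
    "engine-1": set(),
}
choice = pick_engine_for_prefix(
    engines, "ctx-few-shot-examples",
    fallback=lambda e: min(e, key=lambda n: len(e[n])),
)
print(choice)  # -> "engine-0"
```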
16 changes: 16 additions & 0 deletions docs/sys_design/serve_layer/sv_manage.md
@@ -0,0 +1,16 @@
# Semantic Variable Management

Semantic Variable management focuses on the generation and assignment of Variable `id`s.

## Namespace

A Semantic Variable namespace is simply an `id` generator. Parrot ensures that Variable `id`s do not collide within the same namespace. Overall, there are two kinds of namespaces in the Semantic Variable manager (a sketch follows this list):
- Global namespace, a.k.a. the namespace for Constant Prefix variables.
- Local namespace of each `Session`.
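A minimal sketch of the namespace idea (class and variable names are illustrative, not Parrot's actual implementation): each namespace hands out `id`s that are unique within it.

```python
import uuid

class VarNamespace:
    def __init__(self, scope: str):
        self.scope = scope          # e.g. "global" or a session id
        self.issued = set()

    def new_id(self) -> str:
        var_id = f"{self.scope}-{uuid.uuid4().hex[:8]}"
        assert var_id not in self.issued   # no duplicates within this namespace
        self.issued.add(var_id)
        return var_id

global_ns = VarNamespace("global")          # for Constant Prefix variables
session_ns = VarNamespace("session-42")     # local namespace of one Session
print(global_ns.new_id(), session_ns.new_id())
```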

## Generate `id`

The Semantic Variable manager is responsible for assigning a Semantic Variable to each node in a `RequestChain` (see [Graph](graph.md)). The manager treats different node types differently (a sketch follows this list):
- `ConstantFill`: To keep management consistent, we also assign Semantic Variables to constant text (not just to input/output fields). The `id` of this type of node is generated by hashing the text content itself, so that text chunks with the same content can reuse the same Semantic Variable.
- `PlaceholderFill`: Nodes of this type must already be bound to an existing variable.
- `PlaceholderGen`: We generate a fresh unique `id` (hashed with the given `var_name`).
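A minimal sketch of these three rules, again with illustrative names rather than Parrot's actual API:

```python
import hashlib
import uuid
from typing import Optional

def constant_fill_id(text: str) -> str:
    # Hash the text itself so identical constant chunks reuse one variable.
    return "const-" + hashlib.sha256(text.encode()).hexdigest()[:16]

def placeholder_fill_id(bound_var_id: Optional[str]) -> str:
    # A PlaceholderFill node must already be bound to an existing variable.
    if bound_var_id is None:
        raise ValueError("PlaceholderFill node is not bound to a variable")
    return bound_var_id

def placeholder_gen_id(var_name: str) -> str:
    # A fresh unique id, salted with the given var_name.
    return f"gen-{var_name}-{uuid.uuid4().hex[:8]}"

print(constant_fill_id("You are a helpful assistant."))
print(placeholder_gen_id("answer"))
```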
