V3 is the third version of our Parrot system. It refactors many components (indeed, we rewrite the whole code repository) to match the claims in paper and brings a more clear, user-friendly and efficient implementation.
🌟 Major new features:
- RESTful APIs with semantic variables.
- Better dispatcher (a.k.a OS-scheduler).
- Better DAG builder.
- Performance deduction and Task Group.
- Native function.
- Frontend: We provide a RESTful API w/ Semantic Variable for flexibility and extensibility. And the semantic function will be parsed into API request instead of being serialized.
- Frontend (2): We change the native function to string transformations.
- OS: Token streaming/chunked fill/Async Fill are deprecated due to low gain.
- API: Stateful call (call w/ Context)
- Frontend (3): Remove
vm heartbeat
- Engine: Only reserve builtin engines (give up MLC support)
- Three layers: Frontend Layer (Adaptors), Serve Layer, Engine Layer.
Serve layer (original “OS”)
Entry: class ParrotServeCore
{system}. This is a request. The user input is {input}. The output is {output}.
The prompt is split up to several chunks.
Dataflow:
-
Split the prompt according to
{{}}
marks and create several chunks. (ChunkedRequest - RequestTextChunk - FunctionParameter - RequestMetadata) -
Prefix matching (TODO: Radix-style, semantic variable split). Matching the request body in global prefix matcher, and split according to match positions.
-
Creating RequestChain from ChunkedRequest, and insert it to the ComputeGraph of session. This process contains two part:
-
- Convert chunks to nodes and link them;
-
- Create semantic variables according to previous
cached_sv
marks. The first constant prefix (if no hit) will be global semantic variable, while the remaining ones are local semantic variables (in session).
- Create semantic variables according to previous
After this, the request becomes a chain in the graph (a.k.a RequestChain).
-
-
The basic unit of scheduling is CompletionChain (a sub chain of RequestChain,
Fill -> Fill -> Fill -> Gen
). When a CompletionChain is out, it will be:- Analyzed by Graph.
- Tokenized by TokenizerManager.
- Scheduling by TaskDispatcher (Using ContextManager as a info).
- Assigned Context by ContextManager, according to scheduled result.
- Build PrimitiveRequest using PrimitiveBuilder.
Multiple Outputs:
- TBD
Each session is created manually through /session
API. Different sessions are isolated (except for some global-sharing resources).
The following resources are owned by a session, and should be recycled after destroying the session.
- A session scope in the SemanticVariableManager.
- A session memory space in the ContextManager.
- ComputeGraph of the session.
- Executor of the session.
- When a request is submitted to the system, it will be routed to corresponding session to be executed. A request will be split into several
GenTask
,each task is a (Fill, Fill, Fill, Gen). - There is a
Graph
in each session.- Edge type A: For each
GenTask
, we analyze its body and add edges from Fill SV to Gen SV. And forGenTasks
in the same request, we add an edge from Gen SV (in 1 GenTask) to the first Fill SV (in 2 GenTask). - Edge type B: For a Gen node and a Fill node with the same SV, we add an edge from Gen SV to Fill SV.
- Edge type A: For each
- TaskGroup: Tasks linked to the same node will be packed as a group when submitted. The system should optimize e2e latency of the entire group, instead of a single Request.
- Execution in Graph Topo Order:
- The executor maintains a list of SVs with 0 in-degree (activated). The Tasks of these nodes are submitted to the TaskDispatcher queue to do scheduling. Then each activated task is labelled with a scheduled engine, means they are running.
- When an activated Task is finished, it will be removed from the Graph. Node activated by another node will have a higher priority in the dispatch queue for a App-level FIFO. (to discuss: policy?)
This manager manages both intra-session and inter-session Contexts. Hence, it must be a global component in OS.
- Intra-session context: constant prefix / system prompt.
- Inter-session context: any middle results (Fill, Gen) in the requests. Will be freed directly after the request / session ends.
Some notes:
- Context is a sequence of tokens and their KV cache tensors. The OS is only used for managing Context. Real implementation is in the engine side, a.k.a
Low-Level Context
. - Context has the ability called
fork
. Each context may have a parent context, hence creating a tree-like structure. - Heuristic strategy for preserving Constant Prefix (like system prompts of BingCopilot/GPTs).
PrefixMatcher: Split string (chunked request)
Cache in SemanticVariableManager (map the same text to the same variable, if global/local matched?)
Cache in ContextManager (map the same semantic variable array to the same context). It’s a query since the request may not be scheduled to the engine.
Caching in ContextManager
When a GenTask is activated, it will be dispatched to certain engine. We use a comprehensive scheduling strategy when dispatching. (Ordered by priority)
- TaskGroup: If a
GenTask
is is in a TaskGroup, the dispatcher will schedule the group together. - Ctx-aware (1): If Tasks share the same prefix in the queue, we will do a grouping before scheduling.
- Ctx-aware (2): The dispatcher will invoke Context manager’s prefix-matching, and dispatch the request to the engine with longest matched prefix. (If the capacity is OK).
- By default, we will find a Engine with the least capacity impact for a request which satisfies its constraint. This optimizes the whole throughput.
Engines manager. The OS manages multiple engines.
- The connections between OS and engines are maintained by heartbeats. The heartbeat also returns some information for monitoring engines. (NOTE: The method of collecting data should not be too time-consuming.
- The OS send requests to the engine by http request. We can also replace it with a more efficient way (since they are in a Cloud).
(Deprecated): MLC-LLM
Parrot’s engine should implement the following features:
- F1: Able to execute
Fill
/Gen
primitive. (Fill
primitive is context-aware). - F2: Support paged attention & paged KV cache management.
- F3: Based on paged memory management, support context management. [Sharing]
- F4: Customized shared prefix kernel.
Problem:
- How to padding parent_context?