V3

V3 is the third version of our Parrot system. It refactors many components (indeed, we rewrote the whole code repository) to match the claims in the paper, and it brings a clearer, more user-friendly, and more efficient implementation.

🌟 Major new features:

  • RESTful APIs with semantic variables.
  • Better dispatcher (a.k.a OS-scheduler).
  • Better DAG builder.
  • Performance deduction and Task Group.
  • Native function.

Deprecated?

  • Frontend: We provide a RESTful API with Semantic Variables for flexibility and extensibility. A semantic function is now parsed into an API request instead of being serialized.
  • Frontend (2): Native functions are changed to string transformations.
  • OS: Token streaming, chunked fill, and async fill are deprecated due to low gain.
  • API: Stateful calls (calls with a Context).
  • Frontend (3): The VM heartbeat is removed.
  • Engine: Only the builtin engines are kept (MLC support is dropped).

New Arch

  • Three layers: Frontend Layer (Adaptors), Serve Layer, Engine Layer.

Serve Layer

Overview

Serve layer (formerly the “OS”)

Entry: class ParrotServeCore

Request

{system}. This is a request. The user input is {input}. The output is {output}.

The prompt is split into several chunks.
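As a rough illustration of this chunking step, the sketch below splits the example prompt at its placeholder marks. The names TextChunk, Placeholder, and split_prompt are hypothetical stand-ins and not Parrot's actual ChunkedRequest types. The regex accepts both single and double braces, which is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextChunk:
    """Constant text between placeholders (hypothetical stand-in)."""
    text: str


@dataclass
class Placeholder:
    """A named variable slot in the prompt (hypothetical stand-in)."""
    name: str


# Assumption: a placeholder is a word wrapped in one or more braces.
PLACEHOLDER_RE = re.compile(r"\{+(\w+)\}+")


def split_prompt(template: str) -> List[Union[TextChunk, Placeholder]]:
    """Split a prompt template into text chunks and placeholders, in order."""
    chunks: List[Union[TextChunk, Placeholder]] = []
    pos = 0
    for match in PLACEHOLDER_RE.finditer(template):
        if match.start() > pos:
            chunks.append(TextChunk(template[pos:match.start()]))
        chunks.append(Placeholder(match.group(1)))
        pos = match.end()
    if pos < len(template):
        chunks.append(TextChunk(template[pos:]))
    return chunks


chunks = split_prompt(
    "{system}. This is a request. The user input is {input}. The output is {output}."
)
for chunk in chunks:
    print(chunk)
```

For the example request this yields six chunks: three placeholders interleaved with three pieces of constant text.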

Dataflow:

  • Split the prompt at the {{}} marks to create several chunks. (ChunkedRequest - RequestTextChunk - FunctionParameter - RequestMetadata)

  • Prefix matching (TODO: Radix-style, semantic variable split). Match the request body in the global prefix matcher, and split it at the match positions.

  • Create a RequestChain from the ChunkedRequest and insert it into the ComputeGraph of the session. This process has two parts:

      1. Convert chunks to nodes and link them;
      2. Create semantic variables according to the previous cached_sv marks. The first constant prefix (if there is no hit) becomes a global semantic variable, while the remaining ones are local semantic variables (in the session).

    After this, the request becomes a chain in the graph (a.k.a. RequestChain).

  • The basic unit of scheduling is the CompletionChain (a sub-chain of a RequestChain: Fill -> Fill -> Fill -> Gen). When a CompletionChain is produced, it is:

    • Analyzed by the Graph.
    • Tokenized by the TokenizerManager.
    • Scheduled by the TaskDispatcher (using the ContextManager for information).
    • Assigned a Context by the ContextManager, according to the scheduling result.
    • Built into a PrimitiveRequest by the PrimitiveBuilder.
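To make the relation between a RequestChain and its CompletionChains more concrete, here is a minimal sketch that cuts a linear chain of nodes after every Gen node. The Node class, the split_completion_chains function, and the handling of trailing Fill nodes are assumptions made for illustration and are not taken from the Parrot code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    kind: str  # "fill" or "gen"
    sv: str    # label of the semantic variable (hypothetical naming)


def split_completion_chains(
    chain: List[Node],
) -> Tuple[List[List[Node]], List[Node]]:
    """Cut a request chain into sub-chains that each end with a Gen node.

    Returns the completed sub-chains plus any trailing Fill nodes that are
    not followed by a Gen (how those are treated is an assumption here).
    """
    chains: List[List[Node]] = []
    current: List[Node] = []
    for node in chain:
        current.append(node)
        if node.kind == "gen":
            chains.append(current)
            current = []
    return chains, current


# The example request from above: {system}, text, {input}, text, {output}, text.
request_chain = [
    Node("fill", "system"), Node("fill", "text_0"), Node("fill", "input"),
    Node("fill", "text_1"), Node("gen", "output"), Node("fill", "text_2"),
]
chains, trailing = split_completion_chains(request_chain)
print(len(chains), [n.kind for n in chains[0]], trailing)
```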

Multiple Outputs:

  • TBD

Session

Each session is created manually through the /session API. Different sessions are isolated from each other (except for some globally shared resources).

The following resources are owned by a session and should be recycled when the session is destroyed.

  • A session scope in the SemanticVariableManager.
  • A session memory space in the ContextManager.
  • ComputeGraph of the session.
  • Executor of the session.

Compute Graph

  • When a request is submitted to the system, it is routed to the corresponding session for execution. A request is split into several GenTasks; each task is a (Fill, Fill, Fill, Gen) sequence.
  • There is a Graph in each session.
    • Edge type A: For each GenTask, we analyze its body and add edges from each Fill SV to the Gen SV. For GenTasks in the same request, we add an edge from the Gen SV of one GenTask to the first Fill SV of the next GenTask.
    • Edge type B: For a Gen node and a Fill node with the same SV, we add an edge from the Gen SV to the Fill SV.
  • TaskGroup: Tasks linked to the same node are packed as a group when submitted. The system should optimize the end-to-end latency of the entire group instead of that of a single request.
  • Execution in Graph Topo Order:
    • The executor maintains a list of SVs with in-degree 0 (activated). The Tasks of these nodes are submitted to the TaskDispatcher queue for scheduling. Each activated task is then labelled with its scheduled engine, which means it is running.
    • When an activated Task finishes, it is removed from the Graph. A node activated by another node gets a higher priority in the dispatch queue, to achieve App-level FIFO. (To discuss: policy?)
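The topological execution order can be sketched with Kahn's algorithm, as below. This is only an illustration under assumptions: the graph is given as a plain successor map, tasks finish instantly in dispatch order, and the priority rule is modelled by pushing newly activated nodes to the front of the queue. The function name topo_dispatch_order is hypothetical.

```python
from collections import deque
from typing import Dict, List


def topo_dispatch_order(edges: Dict[str, List[str]]) -> List[str]:
    """Return one dispatch order for a DAG given as node -> successors.

    Nodes with in-degree 0 are "activated". A node activated by a finished
    predecessor goes to the front of the queue (assumed priority policy).
    """
    in_degree = {node: 0 for node in edges}
    for successors in edges.values():
        for succ in successors:
            in_degree[succ] = in_degree.get(succ, 0) + 1

    ready = deque(node for node, deg in in_degree.items() if deg == 0)
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)  # "dispatch" the node, then treat it as finished
        for succ in edges.get(node, []):
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                ready.appendleft(succ)  # activated by another node
    if len(order) != len(in_degree):
        raise ValueError("graph contains a cycle")
    return order


# "c" depends on "a"; "b" is independent.
order = topo_dispatch_order({"a": ["c"], "b": [], "c": []})
print(order)
```

In this small example "c" is dispatched before "b" because it was activated by "a", which mirrors the priority rule described above. A plain FIFO would use append instead of appendleft.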

Context Manager

This manager handles both intra-session and inter-session Contexts. Hence, it must be a global component in the OS.

  • Inter-session context: constant prefix / system prompt, which can be shared across sessions.
  • Intra-session context: any intermediate results (Fill, Gen) in the requests. These are freed directly after the request / session ends.

Some notes:

  • A Context is a sequence of tokens and their KV cache tensors. The OS only manages Contexts; the real implementation lives on the engine side (a.k.a. Low-Level Context).
  • A Context can be forked. Each context may have a parent context, which creates a tree-like structure.
  • Heuristic strategy for preserving Constant Prefix (like system prompts of BingCopilot/GPTs).
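The fork relation between contexts can be pictured with the following minimal sketch. It tracks token ids only and leaves out the KV cache tensors, which live on the engine side. The Context class shown here, its fields, and the full_tokens helper are hypothetical and only illustrate the tree structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Context:
    """OS-side view of a context: its own tokens plus an optional parent."""
    context_id: int
    tokens: List[int] = field(default_factory=list)
    parent: Optional["Context"] = None

    def fork(self, context_id: int) -> "Context":
        """Create a child that shares this context as its prefix."""
        return Context(context_id=context_id, parent=self)

    def full_tokens(self) -> List[int]:
        """All tokens visible to this context, from the root down."""
        prefix = self.parent.full_tokens() if self.parent else []
        return prefix + self.tokens


# A shared system prompt with two independent continuations.
root = Context(context_id=0, tokens=[1, 2, 3])
child_a = root.fork(context_id=1)
child_a.tokens.extend([4])
child_b = root.fork(context_id=2)
child_b.tokens.extend([5, 6])
print(child_a.full_tokens(), child_b.full_tokens())
```

Both children see the parent's tokens as a prefix without copying them, and appending to one child does not affect the other.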

PrefixMatcher: Split string (chunked request)

Cache in SemanticVariableManager (map the same text to the same variable, if global/local matched?)

Cache in ContextManager (map the same semantic variable array to the same context). It’s a query since the request may not be scheduled to the engine.
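One way to picture this cache is a table keyed by the sequence of semantic variable ids, queried for the longest cached prefix. The sketch below is an assumption about the idea and not the actual ContextManager code. PrefixCache and its methods are hypothetical names.

```python
from typing import Dict, Optional, Sequence, Tuple


class PrefixCache:
    """Map a sequence of semantic variable ids to a context id."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, ...], int] = {}

    def insert(self, sv_ids: Sequence[str], context_id: int) -> None:
        self._table[tuple(sv_ids)] = context_id

    def longest_match(self, sv_ids: Sequence[str]) -> Tuple[int, Optional[int]]:
        """Return (matched length, context id) for the longest cached prefix.

        This is a pure query: nothing is allocated or scheduled here.
        """
        for length in range(len(sv_ids), 0, -1):
            context_id = self._table.get(tuple(sv_ids[:length]))
            if context_id is not None:
                return length, context_id
        return 0, None


cache = PrefixCache()
cache.insert(["sv_system"], context_id=1)
cache.insert(["sv_system", "sv_question"], context_id=2)
print(cache.longest_match(["sv_system", "sv_question", "sv_answer"]))
```

A linear scan over prefix lengths is enough for a sketch. A radix-style structure, as mentioned in the TODO above, would avoid rebuilding the key for every length.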

Caching in ContextManager

Task Dispatcher

When a GenTask is activated, it is dispatched to a certain engine. The dispatcher uses a comprehensive scheduling strategy, with the following rules ordered by priority:

  • TaskGroup: If a GenTask is in a TaskGroup, the dispatcher schedules the group together.
  • Ctx-aware (1): If Tasks in the queue share the same prefix, we group them before scheduling.
  • Ctx-aware (2): The dispatcher invokes the Context manager’s prefix matching and dispatches the request to the engine with the longest matched prefix (if the capacity is OK).
  • By default, we find an engine that satisfies the request’s constraint with the least capacity impact. This optimizes the overall throughput.
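The last two rules can be sketched as a small selection function. This is a simplified illustration under stated assumptions: capacity is counted in tokens, "least capacity impact" is interpreted as picking the feasible engine with the most remaining capacity, and TaskGroup handling is left out. Engine and pick_engine are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Engine:
    name: str
    capacity: int  # total token capacity (assumed unit)
    used: int = 0

    def remaining(self) -> int:
        return self.capacity - self.used


def pick_engine(
    engines: List[Engine],
    request_tokens: int,
    matched_prefix_len: Dict[str, int],
) -> Optional[Engine]:
    """Pick an engine for one request.

    1. Keep only engines with enough remaining capacity.
    2. Prefer the engine with the longest matched prefix, if any matches.
    3. Otherwise fall back to the engine with the most remaining capacity
       (one possible reading of "least capacity impact").
    """
    feasible = [e for e in engines if e.remaining() >= request_tokens]
    if not feasible:
        return None
    best = max(feasible, key=lambda e: matched_prefix_len.get(e.name, 0))
    if matched_prefix_len.get(best.name, 0) > 0:
        return best
    return max(feasible, key=lambda e: e.remaining())


engines = [Engine("e0", 100, used=90), Engine("e1", 100, used=20), Engine("e2", 100, used=50)]
chosen = pick_engine(engines, request_tokens=30, matched_prefix_len={"e2": 12})
print(chosen.name if chosen else None)
```

Here e2 wins because it holds a matching prefix and still has room. Without any match the request would go to e1, the least loaded feasible engine.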

Engine Manager

The engine manager. The OS manages multiple engines.

  • The connections between the OS and the engines are maintained by heartbeats. A heartbeat also returns some information for monitoring the engine. (NOTE: The method of collecting this data should not be too time-consuming.)
  • The OS sends requests to the engines over HTTP. This could be replaced with a more efficient mechanism (since they are in a Cloud).
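Heartbeat-based liveness tracking could look roughly like the sketch below. It is an assumption about the general pattern and not Parrot's engine manager. EngineRegistry, the timeout value, and the num_running_jobs field are hypothetical. Timestamps can be passed in explicitly so the example is deterministic.

```python
import time
from typing import Dict, List, Optional


class EngineRegistry:
    """Track engines by their last heartbeat and drop the ones that time out."""

    def __init__(self, timeout_s: float = 10.0) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}
        self._info: Dict[str, dict] = {}

    def heartbeat(self, engine_id: str, info: dict, now: Optional[float] = None) -> None:
        """Record a heartbeat together with the monitoring info it carries."""
        now = time.monotonic() if now is None else now
        self._last_seen[engine_id] = now
        self._info[engine_id] = info

    def sweep(self, now: Optional[float] = None) -> List[str]:
        """Remove and return the engines whose last heartbeat is too old."""
        now = time.monotonic() if now is None else now
        dead = [e for e, seen in self._last_seen.items() if now - seen > self.timeout_s]
        for engine_id in dead:
            del self._last_seen[engine_id]
            del self._info[engine_id]
        return dead

    def alive(self) -> List[str]:
        return sorted(self._last_seen)


registry = EngineRegistry(timeout_s=5.0)
registry.heartbeat("engine-0", {"num_running_jobs": 2}, now=0.0)
registry.heartbeat("engine-1", {"num_running_jobs": 0}, now=3.0)
expired = registry.sweep(now=6.0)
print(expired, registry.alive())
```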

Engine

(Deprecated): MLC-LLM

Parrot’s engine should implement the following features:

  • F1: Able to execute Fill / Gen primitive. (Fill primitive is context-aware).
  • F2: Support paged attention & paged KV cache management.
  • F3: Based on paged memory management, support context management. [Sharing]
  • F4: Customized shared prefix kernel.

Problem:

  • How to pad parent_context?