Code: parrot/protocol/internal/
.
/free_context
, arguments:context_id: int
. Free a Low-level context in the engine./ping_engine
, arguments: None. Ping an engine to make sure it's alive.
/register_engine
, arguments:engine_config: EngineConfig
. Register an Engine in the ServeCore./engine_heartbeat
, arguments:engine_id: int, engine_name: str, runtime_info: EngineRuntimeInfo
. Heartbeats from Engine, also update the runtime information of the engine.
Primitive requests are APIs we have defined for implementing the basic functions of LLMs, primarily to support Contextual Prefill/Generate functionalities.
class Primitve:
# The session id.
session_id: int
# Its task id. Since two primitive requests belonging to the same CompletionTask cannot appear simultaneously in the Engine.
task_id: int
# Specify the Context this primitive operates on
context_id: int
parent_context_id: int
-
Fill
object (post on/fill
):class Fill(Primitive): token_ids: Optional[List[int]] text: Optional[str]
A
Fill
can use a untokenized textstring
or a list of tokenizedtoken_ids
as the fill content, depending on the backend type the user choose. We don't call itPrefill
since we support contextualFill
here, i.e., we can perform aFill
even after aGenerate
.When the ServeCore send a
Fill
request to an Engine, the Engine will calculate the KV cache on the specifiedContext
(which can be viewed as we "extend" theContext
by some tokens). -
Generate
object (post on/generate
).class Generate(Primitive): sampling_configs: SamplingConfig
This request will trigger a completion action based on the
specified
Context on the target Engine. TheContext
is also "extended" As tokens are generated one by one and the KV are appended to the corresponding KV cache./generate_stream
(TODO)
Note: In fact, free_context
can also be considered a type of primitive request, as it provides basic functionality for managing the context.