# Plugin Architecture

The Spyre plugin extends or replaces three main components in vLLM:

1. Scheduler
2. Model worker and model runner
3. Modeling code

To better understand these modifications, it helps to first consider
the architecture of native vLLM on GPUs.

The API server, the engine core, and the workers live in
different processes. All three refer to the platform API for
backend-specific concerns.

In vLLM-Spyre, we implement a platform API that is
loaded when vLLM starts and that bootstraps all other components.
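The bootstrapping idea can be sketched in a few lines. This is a toy model, not vLLM's or vLLM-Spyre's actual code: every name below (`Platform`, `ToySpyrePlatform`, `PLATFORM_CLASSES`, `resolve_platform`, and the `toy_plugin.*` paths) is hypothetical, and the rule that exactly one plugin must be active is a simplification chosen for the example. It only illustrates how a single platform object, selected at startup, can tell the rest of the system which backend-specific classes to use.

```python
from typing import Callable, Dict, Optional

# A plugin is modelled as a function that returns the dotted path of its
# platform class, or None when the backend is not available (assumption).
PluginFn = Callable[[], Optional[str]]


class Platform:
    """Toy base class: backend-specific choices sit behind one interface."""

    name = "base"

    def worker_class(self) -> str:
        raise NotImplementedError

    def scheduler_class(self) -> str:
        raise NotImplementedError


class ToySpyrePlatform(Platform):
    """Hypothetical platform that points at plugin-provided components."""

    name = "spyre"

    def worker_class(self) -> str:
        return "toy_plugin.worker.ToySpyreWorker"

    def scheduler_class(self) -> str:
        return "toy_plugin.scheduler.ToySpyreScheduler"


# Stand-in for importing a class from its dotted path.
PLATFORM_CLASSES = {"toy_plugin.platform.ToySpyrePlatform": ToySpyrePlatform}


def register_spyre() -> Optional[str]:
    return "toy_plugin.platform.ToySpyrePlatform"


def register_unavailable() -> Optional[str]:
    return None


def resolve_platform(plugins: Dict[str, PluginFn]) -> Platform:
    """Call every plugin once and instantiate the single active platform."""
    active = {}
    for plugin_name, plugin_fn in plugins.items():
        path = plugin_fn()
        if path is not None:
            active[plugin_name] = path
    if len(active) != 1:
        raise RuntimeError(
            f"expected exactly one active platform plugin, found {sorted(active)}"
        )
    (path,) = active.values()
    return PLATFORM_CLASSES[path]()


platform = resolve_platform(
    {"spyre": register_spyre, "other": register_unavailable}
)
print(platform.name, platform.scheduler_class(), platform.worker_class())
```

Once the platform object is resolved, the other processes only need to ask it for class paths, which is what lets one plugin redirect the scheduler and the worker together.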

As the diagram shows, the plugin mainly modifies the engine core
and worker processes. The platform API also includes request validation
hooks, which the API server invokes to ensure that the backend
can handle each request.
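A validation hook of this kind might look like the sketch below. The limits, the `BackendLimits` type, the `UnsupportedRequestError` exception, and the `validate_request` signature are all assumptions made for illustration. They are not taken from vLLM or vLLM-Spyre, and the real hooks may check different properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendLimits:
    """Hypothetical limits; real values would come from the backend config."""

    max_prompt_tokens: int
    max_new_tokens: int


class UnsupportedRequestError(ValueError):
    """Raised when the backend cannot serve a request (hypothetical)."""


def validate_request(
    prompt_tokens: int, max_new_tokens: int, limits: BackendLimits
) -> None:
    """Reject a request early, before it reaches the engine core."""
    if prompt_tokens > limits.max_prompt_tokens:
        raise UnsupportedRequestError(
            f"prompt has {prompt_tokens} tokens, "
            f"limit is {limits.max_prompt_tokens}"
        )
    if max_new_tokens > limits.max_new_tokens:
        raise UnsupportedRequestError(
            f"requested {max_new_tokens} new tokens, "
            f"limit is {limits.max_new_tokens}"
        )


limits = BackendLimits(max_prompt_tokens=2048, max_new_tokens=256)
validate_request(prompt_tokens=100, max_new_tokens=20, limits=limits)  # accepted
```

Running the check in the API server means an unsupported request fails with a clear error instead of failing later inside a worker.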

In the engine core, we customize the scheduler to handle the constraints
of static batching and continuous batching.
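To make the difference between the two modes concrete, here is a generic sketch of their admission rules. It is not the Spyre scheduler: `admit_static` and `admit_continuous` are hypothetical helpers, and real schedulers also track token budgets, memory, and other backend constraints that are omitted here.

```python
from collections import deque
from typing import Deque, List


def admit_static(
    waiting: Deque[str], running: List[str], max_batch_size: int
) -> List[str]:
    """Static batching: form a new batch only when no batch is running."""
    if running:
        return []
    admitted = []
    while waiting and len(admitted) < max_batch_size:
        admitted.append(waiting.popleft())
    return admitted


def admit_continuous(
    waiting: Deque[str], running: List[str], max_batch_size: int
) -> List[str]:
    """Continuous batching: fill whatever slots are free at this step."""
    free_slots = max_batch_size - len(running)
    admitted = []
    while waiting and len(admitted) < free_slots:
        admitted.append(waiting.popleft())
    return admitted


# One request is still running and two are waiting, with room for four.
print(admit_static(deque(["c", "d"]), ["a"], max_batch_size=4))
print(admit_continuous(deque(["c", "d"]), ["a"], max_batch_size=4))
```

With one request still running, the static rule admits nothing until the batch drains, while the continuous rule admits both waiting requests immediately.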

The changes are broader in the worker process, where most of the main
classes have Spyre-specific implementations. From the vLLM code, we mainly
reuse the sampling code (including logits processing) and the pooling
code for non-generative use cases.

We provide model runners for three cases: static batching, continuous
batching, and pooling. The pooling model runner is very similar to the
static batching one, except that it performs pooling instead of sampling
and uses the `transformers` modeling code instead of the
`foundation-model-stack` code.
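The three cases can be summarized as a small dispatch table. The sketch below is illustrative only: `RunnerKind`, `RunnerTraits`, `select_runner`, and its parameters are hypothetical names, and the entry for continuous batching (sampling with `foundation-model-stack` code) is an assumption, since the text above states the modeling code only for the static batching and pooling runners.

```python
from dataclasses import dataclass
from enum import Enum


class RunnerKind(Enum):
    STATIC_BATCHING = "static_batching"
    CONTINUOUS_BATCHING = "continuous_batching"
    POOLING = "pooling"


@dataclass(frozen=True)
class RunnerTraits:
    output_step: str    # "sampling" or "pooling"
    modeling_code: str  # library that provides the model implementation


RUNNER_TRAITS = {
    RunnerKind.STATIC_BATCHING: RunnerTraits(
        "sampling", "foundation-model-stack"
    ),
    # Assumption: not stated in the text for continuous batching.
    RunnerKind.CONTINUOUS_BATCHING: RunnerTraits(
        "sampling", "foundation-model-stack"
    ),
    RunnerKind.POOLING: RunnerTraits("pooling", "transformers"),
}


def select_runner(is_pooling_model: bool, use_continuous_batching: bool) -> RunnerKind:
    """Pick a runner kind from two hypothetical configuration flags."""
    if is_pooling_model:
        return RunnerKind.POOLING
    if use_continuous_batching:
        return RunnerKind.CONTINUOUS_BATCHING
    return RunnerKind.STATIC_BATCHING


kind = select_runner(is_pooling_model=True, use_continuous_batching=False)
print(kind.value, RUNNER_TRAITS[kind])
```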