
# Features of Mini-SGLang

## Online Serving

Mini-SGLang supports online serving with an OpenAI-compatible API server. It provides the standard `/v1/chat/completions` endpoint, allowing seamless integration with existing tools and clients. For detailed command-line arguments and configuration options, run `python -m minisgl --help`.
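As a minimal sketch, the request body below follows the standard OpenAI chat-completions schema; the host, port, and model name are illustrative assumptions, not Mini-SGLang defaults.

```python
import json

# Build a request payload for the OpenAI-compatible endpoint.
# Field names follow the standard /v1/chat/completions schema.
payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
        {"role": "user", "content": "What is chunked prefill?"},
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)

# To send it (assuming the server listens on localhost:8000):
#   curl http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d "$body"
print(body)
```

Because the endpoint is OpenAI-compatible, any client that speaks this schema (including the official OpenAI SDKs pointed at the local base URL) should work unchanged.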

## Interactive Shell Mode

For demonstration and testing purposes, an interactive shell mode is available. In this mode, users can input prompts directly, and the LLM will generate responses in real time. The shell automatically caches chat history to maintain context. To clear the conversation history and start a new session, use the `/reset` command.

Example:

```shell
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

## Distributed Serving

To scale performance across multiple GPUs, Mini-SGLang supports Tensor Parallelism (TP). You can enable distributed serving by specifying the number of GPUs with the `--tp n` argument, where `n` is the degree of parallelism.
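The core idea of tensor parallelism can be sketched for a single linear layer: the weight matrix is split column-wise across `tp` workers, each worker computes its shard, and the results are concatenated. NumPy stands in for the GPUs here; this is a conceptual sketch, not Mini-SGLang internals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations: [batch, hidden]
w = rng.standard_normal((8, 16))   # weight: [hidden, out]

tp = 2
shards = np.split(w, tp, axis=1)             # one column shard per worker
partials = [x @ shard for shard in shards]   # each "GPU" computes its part
y_tp = np.concatenate(partials, axis=1)      # gather the shards

# Identical to the single-device result:
assert np.allclose(y_tp, x @ w)
```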

## Supported Models

Our framework currently supports the following dense model architectures:

## Chunked Prefill

Chunked Prefill, a technique introduced by Sarathi-Serve, is enabled by default. This feature splits long prompts into smaller chunks during the prefill phase, significantly reducing peak memory usage and preventing Out-Of-Memory (OOM) errors in long-context serving. The chunk size can be configured using `--max-prefill-length n`. Note that setting `n` to a very small value (e.g., 128) is not recommended, as it may significantly degrade performance.
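The splitting itself is simple to illustrate: only one chunk of the prompt occupies activation memory per forward pass, which bounds peak memory. The helper below is a sketch with illustrative names, not the Mini-SGLang implementation.

```python
def chunk_prefill(prompt_tokens, max_prefill_length):
    """Yield the prompt in chunks, one forward pass per chunk."""
    for start in range(0, len(prompt_tokens), max_prefill_length):
        yield prompt_tokens[start:start + max_prefill_length]

tokens = list(range(10_000))                 # a long prompt
chunks = list(chunk_prefill(tokens, 4096))
print([len(c) for c in chunks])              # → [4096, 4096, 1808]
```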

## Page Size

You can specify the page size of the KV cache, i.e., the number of tokens stored per cache page, using the `--page-size` argument.
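In a paged KV cache, memory is allocated in fixed-size pages rather than per token, so a sequence occupies a whole number of pages. The helper below is an illustrative sketch of that accounting, not a Mini-SGLang API.

```python
import math

def pages_needed(seq_len, page_size):
    # A sequence of seq_len tokens occupies ceil(seq_len / page_size)
    # fixed-size pages; the last page may be partially filled.
    return math.ceil(seq_len / page_size)

print(pages_needed(1000, 16))  # → 63
```

Larger pages mean fewer allocation operations but more wasted space in the final partial page; smaller pages trade the reverse.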

## Attention Backends

Mini-SGLang integrates high-performance attention kernels, including FlashAttention (`fa`), FlashInfer (`fi`), and TensorRT-LLM fmha (`trtllm`). It supports using different backends for the prefill and decode phases to maximize efficiency. For example, on NVIDIA Hopper GPUs, FlashAttention 3 is used for prefill and FlashInfer for decode by default.

You can specify the backend using the `--attn` argument. If two values are provided (e.g., `--attn fa,fi`), the first specifies the prefill backend and the second the decode backend. Note that some attention backends may override the user-provided page size (e.g., `trtllm` only supports page sizes of 16, 32, and 64).
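The argument convention above can be sketched as a small parser: one value applies to both phases, while two comma-separated values set the prefill and decode backends separately. This is an illustrative sketch, not the actual Mini-SGLang argument handling.

```python
def parse_attn(arg):
    # "fa"     → same backend for prefill and decode
    # "fa,fi"  → prefill backend, then decode backend
    parts = arg.split(",")
    if len(parts) == 1:
        return parts[0], parts[0]
    prefill, decode = parts
    return prefill, decode

print(parse_attn("fa,fi"))   # → ('fa', 'fi')
print(parse_attn("trtllm"))  # → ('trtllm', 'trtllm')
```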

## CUDA Graph

To minimize CPU launch overhead during decoding, Mini-SGLang supports capturing and replaying CUDA graphs. This feature is enabled by default. The maximum batch size for CUDA graph capture can be set with `--cuda-graph-max-bs n`. Setting `n` to 0 disables this feature.
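A common pattern with CUDA graph serving (an assumption about the mechanism, not a description of Mini-SGLang internals) is to capture graphs for a fixed set of batch sizes up to the configured maximum, then at decode time pick the smallest captured size that fits the live batch and pad up to it; batches that exceed the maximum fall back to eager execution. The capture sizes below are illustrative.

```python
def pick_graph_bs(batch_size, captured=(1, 2, 4, 8, 16, 32)):
    # Find the smallest pre-captured graph that can hold this batch;
    # the batch is padded up to that size before replay.
    for bs in captured:
        if bs >= batch_size:
            return bs
    return None  # no graph large enough: run eagerly

print(pick_graph_bs(3))   # → 4
print(pick_graph_bs(40))  # → None
```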

## Radix Cache

Adopting the original design from SGLang, Mini-SGLang implements a Radix Cache to manage the Key-Value (KV) cache. This allows the reuse of KV cache for shared prefixes across requests, reducing redundant computation. This feature is enabled by default but can be switched to a naive cache management strategy using `--cache naive`.
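The prefix-reuse idea can be sketched with a minimal trie over token IDs: on a new request, the length of the longest cached prefix is the number of tokens whose KV cache can be reused instead of recomputed. This sketch omits everything the real design handles (KV page ownership, eviction, reference counting).

```python
class RadixCacheSketch:
    """Toy trie over token IDs; each inserted path models cached KV."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        # Length of the longest cached prefix: that many tokens of
        # KV cache can be reused for this request.
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = RadixCacheSketch()
cache.insert([1, 2, 3, 4])            # first request's prompt
print(cache.match_prefix([1, 2, 3, 9]))  # → 3 (reuses 3 tokens of KV)
```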

*Illustration of Radix Attention from the LMSYS blog.*

## Overlap Scheduling

To further reduce CPU overhead, Mini-SGLang employs overlap scheduling, a technique proposed in NanoFlow. This approach overlaps the CPU scheduling overhead with GPU computation, improving overall system throughput.
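The pipelining pattern can be sketched with a worker thread: while the "GPU" executes step `i`, the CPU already schedules step `i + 1`, so scheduling latency hides behind compute. The sleeps stand in for real work; this is a conceptual sketch, not the Mini-SGLang scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def schedule(step):
    time.sleep(0.01)  # CPU work: pick requests, build batch metadata
    return f"batch-{step}"

def run_on_gpu(batch):
    time.sleep(0.01)  # stand-in for the GPU forward pass
    return f"{batch}-done"

results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    next_batch = pool.submit(schedule, 0)       # schedule the first step
    for step in range(3):
        batch = next_batch.result()
        # Kick off scheduling of the NEXT step before running this one,
        # so it overlaps with the "GPU" computation below.
        next_batch = pool.submit(schedule, step + 1)
        results.append(run_on_gpu(batch))

print(results)  # → ['batch-0-done', 'batch-1-done', 'batch-2-done']
```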

*Illustration of Overlap Scheduling from the LMSYS blog.*