| What | Decompose LLM serving latency into network · queue · prefill · decode. |
| Why | "It's slow" is not actionable. Knowing which phase is slow is. |
| How | Stream a request, time first-byte / first-token / last-token, subtract. |
| Status | 🟡 Experimental — works against any OpenAI-compatible endpoint (vLLM, SGLang, TGI, OpenAI, Together, DeepSeek). |
| Stack | Python 3.10+, httpx, numpy, click, rich. |
| License | Apache-2.0 |
pip install prefilltap
prefilltap probe \
--base-url http://localhost:8000/v1 \
--model qwen2.5-7b-instruct \
--network-rtt-ms 0.4total: 1843.2 ms
network: 0.4 ms ( 0.0%)
queue: 12.1 ms ( 0.7%)
prefill: 421.8 ms ( 22.9%)
decode: 1409.0 ms ( 76.4%)
output_tokens: 142
decode tps: 100.4
| Phase | Definition | If this is your bottleneck … |
|---|---|---|
| network | RTT to the server (probe it separately, pass --network-rtt-ms). |
Move closer; check TCP/TLS overhead. |
| queue | First-byte time minus network RTT. | You're saturated — add replicas or batch. |
| prefill | First-byte → first-token. Scales with input length × model size. | Reduce prompt; enable prefix caching; chunked prefill. |
| decode | First-token → last-token. Roughly output_tokens / decode_tps. |
Speculative decoding; smaller model; quantization. |
Find the cliff:
prefilltap sweep \
--base-url http://localhost:8000/v1 \
--model qwen2.5-7b-instruct \
--input-lengths 64,512,2048,8192 \
--concurrencies 1,4,16,32 \
--samples 12 \
--output sweep.csvsweep.csv has p50/p95 total, p50/p95 TTFT, and decode tokens-per-second per
cell. Drop it into a notebook, plot, ship.
from prefilltap import Probe, decompose
probe = Probe("http://localhost:8000/v1")
result = await probe.run("qwen2.5-7b-instruct", "explain quicksort", max_tokens=256)
b = decompose(result, network_rtt_ms=0.4)
print(b.share()) # {'network': 0.0001, 'queue': 0.007, 'prefill': 0.23, 'decode': 0.76}Because the fixes are completely different. A serving stack with 90% prefill time wants chunked prefill or a smaller model. A stack with 90% decode time wants speculative decoding, lower precision, or a bigger batch size. A stack with 90% queue time doesn't need any model changes — it needs more replicas. Optimizing the wrong phase is the most common waste of engineering time I see in serving infra.
- No GPU profiling. Use
nsysortorch.profilerfor that. - No client-side per-token tokenization. We count chunks; for exact tokens use the server's reported usage.
- Doesn't measure cold-start. Probe in steady state.
- OpenAI-compatible streaming
- sweep × CSV
- Per-server header parsing (
x-vllm-queue-time,x-tgi-queue) - HTML report with plotly waterfall
- Speculative-decoding aware decode-ms calculation
PRs welcome. Open an issue first if it's a big change. Tests live in tests/,
run with pytest.