prefilltap


What	Decompose LLM serving latency into network · queue · prefill · decode.
Why	"It's slow" is not actionable. Knowing which phase is slow is.
How	Stream a request, time first-byte / first-token / last-token, subtract.
Status	🟡 Experimental — works against any OpenAI-compatible endpoint (vLLM, SGLang, TGI, OpenAI, Together, DeepSeek).
Stack	Python 3.10+, `httpx`, `numpy`, `click`, `rich`.
License	Apache-2.0

TL;DR

pip install prefilltap

prefilltap probe \
  --base-url http://localhost:8000/v1 \
  --model    qwen2.5-7b-instruct \
  --network-rtt-ms 0.4

total:      1843.2 ms
  network:    0.4 ms (  0.0%)
  queue:     12.1 ms (  0.7%)
  prefill:  421.8 ms ( 22.9%)
  decode:  1409.0 ms ( 76.4%)
output_tokens: 142
decode tps:    100.4
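
The `--network-rtt-ms` value has to be measured separately. One rough way to get it, offered as an assumption rather than something prefilltap is documented to do, is to time a few TCP handshakes against the server's host and port and take the median. `tcp_connect_rtt_ms` is a hypothetical helper name. A connect costs roughly one round trip, plus DNS lookup time if you pass a hostname, so prefer an IP address or treat the result as an upper bound.

```python
import socket
import statistics
import time


def tcp_connect_rtt_ms(host: str, port: int, samples: int = 5, timeout: float = 2.0) -> float:
    """Median TCP connect time in ms (hypothetical helper, not part of prefilltap)."""
    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns once the three-way handshake completes,
        # which costs about one network round trip.
        with socket.create_connection((host, port), timeout=timeout):
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms)
```

For the local endpoint in the example above, `tcp_connect_rtt_ms("127.0.0.1", 8000)` would be the call to try, assuming the server is listening there.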

What each phase means

Phase	Definition	If this is your bottleneck …
network	RTT to the server (probe it separately, pass `--network-rtt-ms`).	Move closer; check TCP/TLS overhead.
queue	First-byte time minus network RTT.	You're saturated — add replicas or batch.
prefill	First-byte → first-token. Scales with input length × model size.	Reduce prompt; enable prefix caching; chunked prefill.
decode	First-token → last-token. Roughly `output_tokens / decode_tps`.	Speculative decoding; smaller model; quantization.
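
The three timestamps behind this table can be captured from any line-oriented SSE stream. The sketch below is a guess at the general shape, not prefilltap's actual code: `StreamMarks` and `mark_stream` are hypothetical names, the first line received stands in for first-byte, and the first `data:` chunk with non-empty `choices[0].delta.content` stands in for first-token, which assumes the OpenAI chat-completions streaming format. Pass `start` as a clock reading taken just before the request is sent; otherwise timing begins when the function is called.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamMarks:
    first_byte_ms: Optional[float] = None
    first_token_ms: Optional[float] = None
    last_token_ms: Optional[float] = None
    chunks: int = 0


def mark_stream(
    lines: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
    start: Optional[float] = None,
) -> StreamMarks:
    """Record first-byte / first-token / last-token times in ms since `start`."""
    if start is None:
        start = clock()
    marks = StreamMarks()
    for line in lines:
        now_ms = (clock() - start) * 1000.0
        if marks.first_byte_ms is None:
            marks.first_byte_ms = now_ms  # first line of any kind off the wire
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        choices = json.loads(payload).get("choices") or []
        delta = (choices[0].get("delta") or {}) if choices else {}
        if delta.get("content"):
            if marks.first_token_ms is None:
                marks.first_token_ms = now_ms
            marks.last_token_ms = now_ms
            marks.chunks += 1
    return marks
```

With `httpx`, `lines` could be `response.iter_lines()` inside a `client.stream("POST", ...)` block. Injecting `clock` keeps the function testable without a live server.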

Sweep mode

Find the cliff:

prefilltap sweep \
  --base-url http://localhost:8000/v1 \
  --model qwen2.5-7b-instruct \
  --input-lengths 64,512,2048,8192 \
  --concurrencies 1,4,16,32 \
  --samples 12 \
  --output sweep.csv

For each cell, i.e. each (input length, concurrency) pair, `sweep.csv` records p50/p95 total latency, p50/p95 TTFT (time to first token), and decode tokens per second. Drop it into a notebook, plot, ship.
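
If you collect raw timings yourself, a per-cell p50/p95 summary of the same kind can be sketched with the standard library. The tuple layout and function names below are illustrative assumptions and say nothing about the actual columns in `sweep.csv`. `percentile` interpolates linearly between sorted samples.

```python
from collections import defaultdict


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted samples."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty sample")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(samples):
    """Group (input_length, concurrency, total_ms) tuples into per-cell p50/p95."""
    cells = defaultdict(list)
    for input_length, concurrency, total_ms in samples:
        cells[(input_length, concurrency)].append(total_ms)
    return {
        cell: {"p50_ms": percentile(v, 50), "p95_ms": percentile(v, 95)}
        for cell, v in sorted(cells.items())
    }
```

A cliff then shows up as a cell whose p95 jumps well above its neighbor at the next smaller input length or concurrency.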

Programmatic use

import asyncio

from prefilltap import Probe, decompose


async def main() -> None:
    probe = Probe("http://localhost:8000/v1")
    # probe.run is a coroutine, so it has to be awaited inside an async function.
    result = await probe.run("qwen2.5-7b-instruct", "explain quicksort", max_tokens=256)
    b = decompose(result, network_rtt_ms=0.4)
    print(b.share())   # {'network': 0.0001, 'queue': 0.007, 'prefill': 0.23, 'decode': 0.76}


asyncio.run(main())
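
To make the subtraction concrete, here is a rough sketch of the arithmetic implied by the phase definitions in this README. It is not prefilltap's implementation: `PhaseBreakdown` and `split_phases` are hypothetical names, and the inputs are elapsed milliseconds since the request was sent. The sample numbers are chosen to land close to the TL;DR breakdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseBreakdown:
    network_ms: float
    queue_ms: float
    prefill_ms: float
    decode_ms: float

    @property
    def total_ms(self) -> float:
        return self.network_ms + self.queue_ms + self.prefill_ms + self.decode_ms

    def share(self) -> dict[str, float]:
        """Fraction of total latency spent in each phase."""
        total = self.total_ms
        if total <= 0:
            raise ValueError("total latency must be positive")
        return {
            "network": self.network_ms / total,
            "queue": self.queue_ms / total,
            "prefill": self.prefill_ms / total,
            "decode": self.decode_ms / total,
        }


def split_phases(first_byte_ms, first_token_ms, last_token_ms, network_rtt_ms=0.0):
    # Cap the RTT at first-byte time so an overestimated RTT
    # cannot produce a negative queue time.
    network = min(network_rtt_ms, first_byte_ms)
    return PhaseBreakdown(
        network_ms=network,
        queue_ms=first_byte_ms - network,      # first-byte minus network RTT
        prefill_ms=first_token_ms - first_byte_ms,  # first-byte -> first-token
        decode_ms=last_token_ms - first_token_ms,   # first-token -> last-token
    )


b = split_phases(first_byte_ms=12.5, first_token_ms=434.3, last_token_ms=1843.3, network_rtt_ms=0.4)
print({k: round(v, 3) for k, v in b.share().items()})
# {'network': 0.0, 'queue': 0.007, 'prefill': 0.229, 'decode': 0.764}
```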

Why care about decomposition?

Because the fixes are completely different. A serving stack with 90% prefill time wants chunked prefill or a smaller model. A stack with 90% decode time wants speculative decoding, lower precision, or a bigger batch size. A stack with 90% queue time doesn't need any model changes — it needs more replicas. Optimizing the wrong phase is the most common waste of engineering time I see in serving infra.
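
As a small illustration of acting on the breakdown, the sketch below picks the dominant phase from a share dictionary shaped like the `b.share()` output and looks up the matching hint from the phase table. `HINTS` and `triage` are hypothetical names, and the hint strings only restate this README.

```python
HINTS = {
    "network": "Move closer; check TCP/TLS overhead.",
    "queue": "Saturated: add replicas or batch.",
    "prefill": "Reduce prompt; enable prefix caching; chunked prefill.",
    "decode": "Speculative decoding; smaller model; quantization.",
}


def triage(share: dict[str, float]) -> tuple[str, str]:
    """Return (dominant_phase, hint) for a phase-share mapping."""
    unknown = set(share) - set(HINTS)
    if unknown:
        raise KeyError(f"unknown phases: {sorted(unknown)}")
    phase = max(share, key=share.get)  # phase with the largest share
    return phase, HINTS[phase]


phase, hint = triage({"network": 0.0001, "queue": 0.007, "prefill": 0.23, "decode": 0.76})
print(f"{phase}: {hint}")  # decode: Speculative decoding; smaller model; quantization.
```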

What it doesn't do

No GPU profiling. Use nsys or torch.profiler for that.
No client-side per-token tokenization. We count chunks; for exact tokens use the server's reported usage.
Doesn't measure cold-start. Probe in steady state.
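
The gap between chunk counts and token counts can be checked when the server reports usage. The sketch below assumes an OpenAI-style stream where the request included `"stream_options": {"include_usage": true}`, so a late chunk carries a `usage` object with `completion_tokens`. Whether a given server honors that option is something to verify. `count_output` and `decode_tps` are hypothetical helpers, not prefilltap functions.

```python
import json
from typing import Iterable, Optional


def count_output(lines: Iterable[str]) -> tuple[int, Optional[int]]:
    """Return (content_chunks, server-reported completion_tokens or None)."""
    chunks = 0
    reported = None
    for line in lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        choices = event.get("choices") or []
        if choices and (choices[0].get("delta") or {}).get("content"):
            chunks += 1  # one streamed chunk, which may hold several tokens
        usage = event.get("usage")
        if usage and "completion_tokens" in usage:
            reported = usage["completion_tokens"]
    return chunks, reported


def decode_tps(chunks: int, reported: Optional[int], decode_ms: float) -> float:
    """Tokens per second, preferring the server's count over the chunk count."""
    tokens = reported if reported is not None else chunks
    return tokens / (decode_ms / 1000.0)
```

When the two counts differ, the server's number is the one to trust for tokens-per-second figures.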

Roadmap

OpenAI-compatible streaming
sweep × CSV
Per-server header parsing (x-vllm-queue-time, x-tgi-queue)
HTML report with plotly waterfall
Speculative-decoding aware decode-ms calculation

Contributing

PRs welcome. Open an issue first if it's a big change. Tests live in tests/, run with pytest.
