fragres/prefilltap

prefilltap

What: Decompose LLM serving latency into network · queue · prefill · decode.
Why: "It's slow" is not actionable. Knowing which phase is slow is.
How: Stream a request, time first-byte / first-token / last-token, subtract.
Status: 🟡 Experimental; works against any OpenAI-compatible endpoint (vLLM, SGLang, TGI, OpenAI, Together, DeepSeek).
Stack: Python 3.10+, httpx, numpy, click, rich.
License: Apache-2.0

TL;DR

pip install prefilltap

prefilltap probe \
  --base-url http://localhost:8000/v1 \
  --model    qwen2.5-7b-instruct \
  --network-rtt-ms 0.4

total:      1843.2 ms
  network:    0.4 ms (  0.0%)
  queue:     12.1 ms (  0.7%)
  prefill:  421.8 ms ( 22.9%)
  decode:  1409.0 ms ( 76.4%)
output_tokens: 142
decode tps:    100.4

What each phase means

| Phase | Definition | If this is your bottleneck … |
|---|---|---|
| network | RTT to the server (probe it separately, pass --network-rtt-ms). | Move closer; check TCP/TLS overhead. |
| queue | First-byte time minus network RTT. | You're saturated: add replicas or batch. |
| prefill | First-byte → first-token. Scales with input length × model size. | Reduce prompt; enable prefix caching; chunked prefill. |
| decode | First-token → last-token. Roughly output_tokens / decode_tps. | Speculative decoding; smaller model; quantization. |
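The definitions in the table reduce to three subtractions. The sketch below spells them out, assuming all three timestamps are measured in milliseconds from the moment the request was sent. `Breakdown` and `split_phases` are hypothetical names used for illustration and are not claimed to match prefilltap's internals.

```python
from dataclasses import dataclass


@dataclass
class Breakdown:
    """Per-phase latency in milliseconds (hypothetical container for this sketch)."""
    network: float
    queue: float
    prefill: float
    decode: float

    @property
    def total(self) -> float:
        return self.network + self.queue + self.prefill + self.decode

    def share(self) -> dict[str, float]:
        # Fraction of the total spent in each phase.
        total = self.total
        return {name: getattr(self, name) / total
                for name in ("network", "queue", "prefill", "decode")}


def split_phases(first_byte_ms: float, first_token_ms: float,
                 last_token_ms: float, network_rtt_ms: float) -> Breakdown:
    """Split one streamed request into the four phases from the table."""
    return Breakdown(
        # Never attribute more to the network than was observed before first byte.
        network=min(network_rtt_ms, first_byte_ms),
        # Clamp at zero in case the supplied RTT exceeds the first-byte time.
        queue=max(first_byte_ms - network_rtt_ms, 0.0),
        prefill=first_token_ms - first_byte_ms,
        decode=last_token_ms - first_token_ms,
    )


# Timestamps chosen so the phases match the TL;DR output.
b = split_phases(first_byte_ms=12.5, first_token_ms=434.3,
                 last_token_ms=1843.3, network_rtt_ms=0.4)
print(b)
```

Because queue time is derived rather than measured, an inaccurate `--network-rtt-ms` shifts time between the network and queue rows; prefill and decode are unaffected.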

Sweep mode

Find the cliff, i.e. the input length and concurrency at which latency starts to degrade sharply:

prefilltap sweep \
  --base-url http://localhost:8000/v1 \
  --model qwen2.5-7b-instruct \
  --input-lengths 64,512,2048,8192 \
  --concurrencies 1,4,16,32 \
  --samples 12 \
  --output sweep.csv

sweep.csv has p50/p95 total, p50/p95 TTFT, and decode tokens-per-second per cell. Drop it into a notebook, plot, ship.
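One way to locate the cliff from the sweep output is to look for the first concurrency level whose p95 jumps well past the previous level at the same input length. The sketch below does that with the standard library. The column names (`input_length`, `concurrency`, `p95_total_ms`) and the sample rows are assumptions made for illustration; check the header of your own sweep.csv and adjust.

```python
import csv
import io

# Synthetic rows for illustration only; the column names are assumed.
SAMPLE = """input_length,concurrency,p95_total_ms
512,1,900
512,4,950
512,16,1100
512,32,4200
"""


def find_cliff(rows, factor=2.0):
    """Return the first (input_length, concurrency) whose p95 exceeds
    `factor` times the previous concurrency level at the same input length,
    or None if no such jump exists."""
    by_length = {}
    for row in rows:
        key = int(row["input_length"])
        by_length.setdefault(key, []).append(
            (int(row["concurrency"]), float(row["p95_total_ms"])))
    for length in sorted(by_length):
        cells = sorted(by_length[length])  # ascending concurrency
        for (_, prev), (conc, cur) in zip(cells, cells[1:]):
            if cur > factor * prev:
                return length, conc
    return None


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(find_cliff(rows))
```

To run it against a real sweep, replace `io.StringIO(SAMPLE)` with `open("sweep.csv", newline="")`. The `factor` threshold is a judgment call; 2.0 is only a starting point.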

Programmatic use

import asyncio

from prefilltap import Probe, decompose


async def main() -> None:
    probe = Probe("http://localhost:8000/v1")
    # probe.run is awaited, so it has to be called from inside a coroutine
    result = await probe.run("qwen2.5-7b-instruct", "explain quicksort", max_tokens=256)
    b = decompose(result, network_rtt_ms=0.4)
    print(b.share())   # e.g. {'network': 0.0001, 'queue': 0.007, 'prefill': 0.23, 'decode': 0.76}


asyncio.run(main())
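For readers curious how the three timestamps might be captured, here is a minimal sketch that walks the lines of an OpenAI-style server-sent-event stream. It is not prefilltap's implementation. It assumes "first byte" means the first line received and "first token" means the first delta with non-empty `content`; `time_stream` and its injectable `clock` are hypothetical names.

```python
import json
import time
from typing import Callable, Iterable, Optional


def time_stream(lines: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Record first-byte, first-token and last-token times as read from `clock`."""
    first_byte: Optional[float] = None
    first_token: Optional[float] = None
    last_token: Optional[float] = None
    chunks = 0
    for line in lines:
        now = clock()
        if first_byte is None:
            first_byte = now              # something has arrived from the server
        if not line.startswith("data:"):
            continue                      # blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        choices = json.loads(payload).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        if delta.get("content"):          # role-only or empty deltas are not tokens
            chunks += 1
            if first_token is None:
                first_token = now
            last_token = now
    return {"first_byte": first_byte, "first_token": first_token,
            "last_token": last_token, "chunks": chunks}
```

With httpx, `lines` could be `response.iter_lines()` inside a `with httpx.stream("POST", url, json=body) as response:` block. The returned values are raw clock readings, so subtract the reading taken just before sending the request to get elapsed times.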

Why care about decomposition?

Because the fixes are completely different. A serving stack with 90% prefill time wants chunked prefill or a smaller model. A stack with 90% decode time wants speculative decoding or lower precision; a bigger batch size helps only if the goal is throughput, since it does not shorten any single request's decode. A stack with 90% queue time doesn't need any model changes; it needs more replicas. Optimizing the wrong phase is the most common waste of engineering time I see in serving infra.
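The cost of optimizing the wrong phase is easy to put in numbers. The sketch below takes the phase timings from the TL;DR output and asks what the total would be if a single phase became twice as fast while the others stayed fixed. `total_after_speedup` is a hypothetical helper, not part of prefilltap.

```python
def total_after_speedup(phases_ms: dict[str, float], phase: str, speedup: float) -> float:
    """Total latency if one phase becomes `speedup` times faster and the rest stay put."""
    saved = phases_ms[phase] * (1.0 - 1.0 / speedup)
    return sum(phases_ms.values()) - saved


# Phase timings copied from the TL;DR probe output.
phases = {"network": 0.4, "queue": 12.1, "prefill": 421.8, "decode": 1409.0}
baseline = sum(phases.values())

for name in phases:
    new_total = total_after_speedup(phases, name, speedup=2.0)
    gain = 1.0 - new_total / baseline
    print(f"2x faster {name:8s} -> {new_total:7.1f} ms ({gain:.1%} lower)")
```

For this request, doubling decode speed removes roughly 38% of the total, doubling prefill speed roughly 11%, and doubling queue speed well under 1%.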

What it doesn't do

  • No GPU profiling. Use nsys or torch.profiler for that.
  • No client-side per-token tokenization. We count chunks; for exact tokens use the server's reported usage.
  • Doesn't measure cold-start. Probe in steady state.
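On the chunk-versus-token point: a streamed chunk can carry more than one token, so the chunk count is only an approximation of output length. The OpenAI chat completions API returns a `usage` object on the final streamed event when the request sets `stream_options` to `{"include_usage": true}`; whether a given compatible server honors that option is worth verifying. The sketch below compares the two counts over already-decoded events. `count_output` is a hypothetical helper.

```python
from typing import Iterable, Optional


def count_output(events: Iterable[dict]) -> tuple[int, Optional[int]]:
    """Return (content_chunks, server-reported completion_tokens or None)."""
    chunks = 0
    reported: Optional[int] = None
    for event in events:
        for choice in event.get("choices") or []:
            if (choice.get("delta") or {}).get("content"):
                chunks += 1
        usage = event.get("usage")
        if usage:  # typically present only on the final event, if at all
            reported = usage.get("completion_tokens", reported)
    return chunks, reported


# Hand-written events for illustration: two content chunks, then a usage-only event.
events = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo wor"}}]},
    {"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8}},
]
print(count_output(events))
```

When the second value is not `None`, prefer it over the chunk count for any tokens-per-second figure.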

Roadmap

  • OpenAI-compatible streaming
  • sweep × CSV
  • Per-server header parsing (x-vllm-queue-time, x-tgi-queue)
  • HTML report with plotly waterfall
  • Speculative-decoding aware decode-ms calculation

Contributing

PRs welcome. Open an issue first if it's a big change. Tests live in tests/, run with pytest.
