Add bench-serve workload contracts #406
Conversation
I pushed one more harness increment at af365f7. Workload mode now honors `--repetitions`, records each case repetition, and adds per-case variance summaries for latency, TTFT, generation throughput, content size, policy-timeout failures, and quality failure rate. Validation is green locally and in CI run 24879357737. Local checks run:

```
/opt/ai-runtime/venv-live/bin/python -m pytest -q tests/test_bench_serve.py tests/test_metrics.py
/opt/ai-runtime/venv-live/bin/python -m black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
/opt/homebrew/bin/uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
```
I pushed one more small addition at fa40b8f: direct SQLite output for both prompt sweeps and workload runs. Additional local validation:

```
/opt/ai-runtime/venv-live/bin/python -m pytest -q tests/test_bench_serve.py tests/test_metrics.py
/opt/ai-runtime/venv-live/bin/python -m black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
/opt/homebrew/bin/uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
/opt/ai-runtime/venv-live/bin/python -m vllm_mlx.cli bench-serve --workload examples/bench_serve_workload.json --model test-model --url http://127.0.0.1:1 --request-timeout-s 1 --output /tmp/bench-workload-out.db --format sqlite --scrape-metrics false
sqlite3 /tmp/bench-workload-out.db "select count(*), min(case_id), max(repetition) from bench_serve_workload;"
```

CI is green on run 24879845684.
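For reference, a workload file of the shape exercised by the command above might look like the following. The field names (`defaults`, `cases`, `request_path`, `policy_timeout_ms`, `forbidden_regex`) are taken from this discussion, but the exact schema shown here is illustrative, not authoritative:

```json
{
  "defaults": {
    "cache_policy": "before-case"
  },
  "cases": [
    {
      "id": "summarize-short",
      "request_path": "prompts/summarize_short.json",
      "policy_timeout_ms": 2000,
      "quality": {
        "forbidden_regex": ["I cannot"]
      }
    }
  ]
}
```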
janhilgard left a comment
Review: Add bench-serve workload contracts
Nice feature — this fills a real gap between raw prompt sweeps and production qualification. The design is well-thought-out: keeping policy_timeout_ms as comparison evidence rather than a hard pass/fail is the right call.
Things I like
- Clean separation between sweep mode and workload mode; no existing behavior changed.
- `request_path` support avoids duplicating large prompt bodies in the workload JSON.
- Variance tracking with `--repetitions` and per-case min/median/max summaries.
- SQLite output for longitudinal comparisons is practical and well implemented.
- Cache policy controls (`before-run`, `before-case`) are critical for meaningful cold-start benchmarks.
- Backward-compatible `parse_status_response` accepting both `active_memory_gb` and legacy `active_gb` keys.
- Comprehensive test coverage: ~600 lines of new tests covering loading, quality checks, summaries, runners, formatters, and SQLite.
Suggestions
- SQL injection in `_write_sqlite_rows` (line ~1596): the `table` and column names are interpolated directly into SQL strings. While they are currently controlled by constants, consider using an allowlist check, or at minimum an assertion that `table` matches `^[a-z_]+$`. This is a defense-in-depth concern in case someone later passes user input here.
- `quality_failures` vs `failures` duplication in `summarize_workload_results` (lines ~1381 and ~1388): both compute the same list (`not r["quality"]["ok"]`). One of them can be removed:

  ```python
  quality_failures = [r for r in results if not r["quality"]["ok"]]
  # ...
  failures = [r for r in results if not r["quality"]["ok"]]  # same as quality_failures
  ```

- Missing `request_timeout_s` propagation to `httpx.AsyncClient`: in `run_workload_case`, the `client` is created in `run_bench_serve_workload` with the timeout, but `stream_chat_completion` receives the client without any per-request timeout override. If a case has a very long expected runtime (e.g. a 100K-token generation), the global transport timeout applies. This is probably fine for now, but it is worth documenting that `request_timeout_s` is the HTTP transport ceiling for all cases.
- `--format` default changed from `"table"` to `None` (cli.py line ~1453): the fallback logic (`args.format or "json"` for workloads, `args.format or "table"` for sweeps) works, but `None` as a default with a downstream `or` feels fragile. Consider using a sentinel or documenting the tri-state clearly.
- Workload example (`examples/bench_serve_workload.json`): the `"I cannot"` entry in `forbidden_regex` could be confusing; it is a plain string, not a regex. It works fine with `re.search`, but a note in the docs that patterns are Python regexes (where literal strings are valid) would help users.
- Minor: `_normalize_cache_policy` silently normalizes underscores to hyphens. The CLI `choices` only list hyphenated forms. If someone passes `before_case` in the workload JSON defaults, it works but is not documented.
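The allowlist/assertion idea from the `_write_sqlite_rows` suggestion could look roughly like this (a defensive sketch only; `safe_identifier` is a hypothetical helper name, not code from the PR):

```python
import re

# Identifiers end up interpolated into SQL strings, so validate them first.
_IDENTIFIER_RE = re.compile(r"^[a-z_]+$")

def safe_identifier(name: str) -> str:
    """Return name unchanged if it looks like a safe SQL identifier, else raise."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name

# Usage at the SQL-building site, e.g.:
#   sql = f"insert into {safe_identifier(table)} (...) values (?, ?, ?)"
```

Values still go through parameter substitution (`?` placeholders); only identifiers, which SQLite cannot parameterize, need this check.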
Questions
- Is there a plan to add a `--workload` option to the `vllm-mlx-bench` standalone tool as well, or is this intentionally limited to `bench-serve` (running server only)?
- For the `json` check in `validate_quality_checks`: should it also validate against a JSON schema (e.g. checking required keys), or is that considered out of scope?
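On the JSON-schema question: a lightweight middle ground, short of full schema validation, is a required-keys check. A sketch (the key names and function name here are made up for illustration):

```python
import json

REQUIRED_KEYS = {"summary", "items"}  # hypothetical contract keys

def json_check(text: str, required: set[str] = REQUIRED_KEYS) -> bool:
    """Parse text as JSON and verify the required top-level keys are present."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required <= obj.keys()
```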
Overall this is solid work. The concerns above are minor — none are blocking.
Thanks for the careful review. I addressed these in the current head.
For the questions: I intentionally scoped this to `bench-serve` (running server only). Local validation after the current head:

```
uv run --extra dev pytest -q tests/test_bench_serve.py
uv run --extra dev black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
```

Result:
What changed
- `vllm-mlx bench-serve --workload` for declarative product-style serving contracts, including per-case `policy_timeout_ms`.
- Records `/metrics` deltas, `/v1/status` Metal/cache data, quality failures, and policy-timeout pass/fail separately.
- `--format` surface: JSON, CSV, SQLite, and table.
Why
Prompt sweeps are useful for raw serving performance, but they are not enough to qualify real application contracts. This adds a durable upstream path for repeatable workload qualification where quality gates, cache metrics, runtime metadata, and product policy comparisons are captured in one artifact.
`policy_timeout_ms` is deliberately reported as comparison evidence, not as an implicit hardware SLA. That avoids turning arbitrary app guardrails into benchmark truth.
Validation
```
uvx black --fast --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
python -m pytest -q tests/test_bench_serve.py
python -m py_compile vllm_mlx/bench_serve.py vllm_mlx/cli.py
uvx ruff check --select I vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
```
Note
I also tried `python -m pytest -q tests/test_metrics.py tests/test_bench_serve.py`; `tests/test_metrics.py` is blocked in this local checkout by missing `fastapi`. CI installs `fastapi`, so I am leaving that to CI rather than changing dependency setup in this PR.