Add bench-serve workload contracts #406
Conversation
I pushed one more harness increment at af365f7. Workload mode now honors `--repetitions`, records each case repetition, and adds per-case variance summaries for latency, TTFT, generation throughput, content size, policy-timeout failures, and quality failure rate. Validation is green locally and in CI run 24879357737. Local checks run:

```
/opt/ai-runtime/venv-live/bin/python -m pytest -q tests/test_bench_serve.py tests/test_metrics.py
/opt/ai-runtime/venv-live/bin/python -m black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
/opt/homebrew/bin/uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
```
I pushed one more small addition at fa40b8f: direct SQLite output for both prompt sweeps and workload runs. Additional local validation:

```
/opt/ai-runtime/venv-live/bin/python -m pytest -q tests/test_bench_serve.py tests/test_metrics.py
/opt/ai-runtime/venv-live/bin/python -m black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
/opt/homebrew/bin/uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
/opt/ai-runtime/venv-live/bin/python -m vllm_mlx.cli bench-serve --workload examples/bench_serve_workload.json --model test-model --url http://127.0.0.1:1 --request-timeout-s 1 --output /tmp/bench-workload-out.db --format sqlite --scrape-metrics false
sqlite3 /tmp/bench-workload-out.db "select count(*), min(case_id), max(repetition) from bench_serve_workload;"
```

CI is green on run 24879845684.
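For reference, a workload file of the shape exercised by the command above might look like the following. The field names (`defaults`, `cases`, `request_path`, `policy_timeout_ms`, `forbidden_regex`) are taken from this discussion, but the exact schema shown here is illustrative, not authoritative:

```json
{
  "defaults": {
    "cache_policy": "before-case"
  },
  "cases": [
    {
      "id": "summarize-short",
      "request_path": "prompts/summarize_short.json",
      "policy_timeout_ms": 2000,
      "quality": {
        "forbidden_regex": ["I cannot"]
      }
    }
  ]
}
```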
janhilgard left a comment
Review: Add bench-serve workload contracts
Nice feature — this fills a real gap between raw prompt sweeps and production qualification. The design is well-thought-out: keeping policy_timeout_ms as comparison evidence rather than a hard pass/fail is the right call.
Things I like
- Clean separation between sweep mode and workload mode; no existing behavior changed.
- `request_path` support avoids duplicating large prompt bodies in the workload JSON.
- Variance tracking with `--repetitions` and per-case min/median/max summaries.
- SQLite output for longitudinal comparisons is practical and well implemented.
- Cache policy controls (`before-run`, `before-case`) are critical for meaningful cold-start benchmarks.
- Backward-compatible `parse_status_response` accepting both `active_memory_gb` and legacy `active_gb` keys.
- Comprehensive test coverage: ~600 lines of new tests covering loading, quality checks, summaries, runners, formatters, and SQLite.
Suggestions
- SQL injection in `_write_sqlite_rows` (line ~1596): the `table` and column names are interpolated directly into SQL strings. While they are currently controlled by constants, consider using an allowlist check, or at minimum an assertion that `table` matches `^[a-z_]+$`. This is a defense-in-depth concern in case someone later passes user input here.
- `quality_failures` vs `failures` duplication in `summarize_workload_results` (lines ~1381 and ~1388): both compute the same list (`not r["quality"]["ok"]`). One of them can be removed:

  ```python
  quality_failures = [r for r in results if not r["quality"]["ok"]]
  # ...
  failures = [r for r in results if not r["quality"]["ok"]]  # same as quality_failures
  ```

- Missing `request_timeout_s` propagation to `httpx.AsyncClient`: in `run_workload_case`, the `client` is created in `run_bench_serve_workload` with the timeout, but `stream_chat_completion` receives the client without any per-request timeout override. If a case has a very long expected runtime (e.g. a 100K-token generation), the global transport timeout applies. This is probably fine for now, but it is worth documenting that `request_timeout_s` is the HTTP transport ceiling for all cases.
- `--format` default changed from `"table"` to `None` (cli.py line ~1453): the fallback logic (`args.format or "json"` for workloads, `args.format or "table"` for sweeps) works, but `None` as a default with a downstream `or` feels fragile. Consider using a sentinel or documenting the tri-state clearly.
- Workload example (`examples/bench_serve_workload.json`): the `"I cannot"` entry in `forbidden_regex` could be confusing; it is a plain string, not a regex. It works fine with `re.search`, but a note in the docs that patterns are Python regexes (where literal strings are valid) would help users.
- Minor: `_normalize_cache_policy` silently normalizes underscores to hyphens. The CLI `choices` only list hyphenated forms. If someone passes `before_case` in the workload JSON defaults, it works but is not documented.
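The allowlist/assertion idea from the `_write_sqlite_rows` suggestion could look roughly like this (a defensive sketch only; `safe_identifier` is a hypothetical helper name, not code from the PR):

```python
import re

# Identifiers end up interpolated into SQL strings, so validate them first.
_IDENTIFIER_RE = re.compile(r"^[a-z_]+$")

def safe_identifier(name: str) -> str:
    """Return name unchanged if it looks like a safe SQL identifier, else raise."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name

# Usage at the SQL-building site, e.g.:
#   sql = f"insert into {safe_identifier(table)} (...) values (?, ?, ?)"
```

Values still go through parameter substitution (`?` placeholders); only identifiers, which SQLite cannot parameterize, need this check.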
Questions
- Is there a plan to add a `--workload` option to the `vllm-mlx-bench` standalone tool as well, or is this intentionally limited to `bench-serve` (running server only)?
- For the `json` check in `validate_quality_checks`: should it also validate against a JSON schema (e.g. checking required keys), or is that considered out of scope?
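On the JSON-schema question: a lightweight middle ground, short of full schema validation, is a required-keys check. A sketch (the key names and function name here are made up for illustration):

```python
import json

REQUIRED_KEYS = {"summary", "items"}  # hypothetical contract keys

def json_check(text: str, required: set[str] = REQUIRED_KEYS) -> bool:
    """Parse text as JSON and verify the required top-level keys are present."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required <= obj.keys()
```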
Overall this is solid work. The concerns above are minor — none are blocking.
Thanks for the careful review. I addressed these in the current head.
For the questions: I intentionally scoped this to `bench-serve` (running server only). Local validation after the current head:

```
uv run --extra dev pytest -q tests/test_bench_serve.py
uv run --extra dev black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
```

Result:
What changed
- `vllm-mlx bench-serve --workload` for declarative product-style serving contracts, including per-case `policy_timeout_ms`.
- Records `/metrics` deltas, `/v1/status` Metal/cache data, quality failures, and policy-timeout pass/fail separately.
- `--format` surface: JSON, CSV, SQLite, and table.
Why
Prompt sweeps are useful for raw serving performance, but they are not enough to qualify real application contracts. This adds a durable upstream path for repeatable workload qualification where quality gates, cache metrics, runtime metadata, and product policy comparisons are captured in one artifact.
`policy_timeout_ms` is deliberately reported as comparison evidence, not as an implicit hardware SLA. That avoids turning arbitrary app guardrails into benchmark truth.
Validation
```
uvx black --fast --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
python -m pytest -q tests/test_bench_serve.py
python -m py_compile vllm_mlx/bench_serve.py vllm_mlx/cli.py
uvx ruff check --select I vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
```
Note
I also tried `python -m pytest -q tests/test_metrics.py tests/test_bench_serve.py`; `tests/test_metrics.py` is blocked in this local checkout by missing `fastapi`. CI installs `fastapi`, so I am leaving that to CI rather than changing dependency setup in this PR.