Add bench-serve workload contracts#406

Closed
Thump604 wants to merge 8 commits into waybarrios:main from Thump604:codex/bench-serve-workload-contract

Conversation

@Thump604
Collaborator

What changed

  • Added vllm-mlx bench-serve --workload for declarative product-style serving contracts.
  • Workload cases can define request settings, quality checks, tags, and comparison-only policy_timeout_ms.
  • The runner records runtime provenance, hardware fingerprint, /metrics deltas, /v1/status Metal/cache data, quality failures, and policy-timeout pass/fail separately.
  • Added workload output formats using the existing --format surface: JSON, CSV, SQL, and table.
  • Added an example workload and docs for using contract workloads as model and feature-stack qualification inputs.

Why

Prompt sweeps are useful for raw serving performance, but they are not enough to qualify real application contracts. This adds a durable upstream path for repeatable workload qualification where quality gates, cache metrics, runtime metadata, and product policy comparisons are captured in one artifact.

policy_timeout_ms is deliberately reported as comparison evidence, not as an implicit hardware SLA. That avoids turning arbitrary app guardrails into benchmark truth.
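
To make the distinction above concrete, here is a minimal sketch of how a policy timeout can be reported next to the quality verdict without being folded into it. It is not the code in this PR: `CaseOutcome`, `evaluate`, and the result keys are hypothetical names chosen for illustration, and only `policy_timeout_ms` comes from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseOutcome:
    """Hypothetical per-case record; field names are illustrative only."""

    case_id: str
    latency_ms: float
    quality_ok: bool
    policy_timeout_ms: Optional[float] = None  # comparison-only guardrail


def evaluate(outcome: CaseOutcome) -> dict:
    # Quality gates alone decide whether the case passed.
    # The policy timeout is reported alongside and never changes "ok".
    if outcome.policy_timeout_ms is None:
        within_policy = None  # no policy declared, nothing to compare
    else:
        within_policy = outcome.latency_ms <= outcome.policy_timeout_ms
    return {
        "case_id": outcome.case_id,
        "ok": outcome.quality_ok,
        "policy_timeout_ok": within_policy,
    }
```

With this shape, a case that answers correctly in 1500 ms against a 1000 ms app guardrail still has `ok` set to true, and the missed guardrail shows up only in `policy_timeout_ok`.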

Validation

  • uvx black --fast --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
  • python -m pytest -q tests/test_bench_serve.py
  • python -m py_compile vllm_mlx/bench_serve.py vllm_mlx/cli.py
  • uvx ruff check --select I vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
  • CLI smoke for workload JSON, CSV, and SQL output against an unavailable local URL to verify error and policy-timeout recording

Note

I also tried python -m pytest -q tests/test_metrics.py tests/test_bench_serve.py; tests/test_metrics.py is blocked in this local checkout by missing fastapi. CI installs fastapi, so I am leaving that to CI rather than changing dependency setup in this PR.

@Thump604 Thump604 marked this pull request as ready for review April 24, 2026 04:24
@Thump604
Collaborator Author

I pushed one more harness increment at af365f7. Workload mode now honors --repetitions, records each case repetition, and adds per-case variance summaries for latency, TTFT, generation throughput, content size, policy-timeout failures, and quality failure rate.
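
For readers who want a picture of what a per-case variance summary looks like, the following is a rough sketch under assumed row keys (`case_id`, `latency_ms`, `quality_ok`). The function name `summarize_repetitions` and the output keys are hypothetical and do not mirror the actual summary code in this PR.

```python
import statistics
from collections import defaultdict


def summarize_repetitions(rows):
    """Group rows by case_id and report min/median/max latency plus failure rate."""
    by_case = defaultdict(list)
    for row in rows:
        by_case[row["case_id"]].append(row)

    summary = {}
    for case_id, reps in by_case.items():
        latencies = [r["latency_ms"] for r in reps]
        failures = sum(1 for r in reps if not r["quality_ok"])
        summary[case_id] = {
            "repetitions": len(reps),
            "latency_ms_min": min(latencies),
            "latency_ms_median": statistics.median(latencies),
            "latency_ms_max": max(latencies),
            "quality_failure_rate": failures / len(reps),
        }
    return summary
```

The same grouping would extend to TTFT, throughput, and content size by adding one list per metric.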

Validation is green locally and in CI run 24879357737. Local checks run:

/opt/ai-runtime/venv-live/bin/python -m pytest -q tests/test_bench_serve.py tests/test_metrics.py
/opt/ai-runtime/venv-live/bin/python -m black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
/opt/homebrew/bin/uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check

@Thump604
Collaborator Author

I pushed one more small addition at fa40b8f: direct SQLite output for both prompt sweeps and workload runs. --format sqlite --output bench.db appends prompt rows into bench_serve and workload rows into bench_serve_workload, using the same schemas as the SQL output.
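
As an illustration of the append behavior described above, here is a small sketch using the standard `sqlite3` module. The table name `bench_serve_workload` and the `case_id` and `repetition` columns come from this thread; the `latency_ms` column, the three-column schema, and the `append_rows` helper are stand-ins, since the real schema is wider.

```python
import sqlite3


def append_rows(db_path, rows):
    """Append workload rows to a fixed table, creating it on first use.

    Returns the total number of rows in the table after the append.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS bench_serve_workload "
                "(case_id TEXT, repetition INTEGER, latency_ms REAL)"
            )
            conn.executemany(
                "INSERT INTO bench_serve_workload "
                "(case_id, repetition, latency_ms) "
                "VALUES (:case_id, :repetition, :latency_ms)",
                rows,
            )
        return conn.execute(
            "SELECT COUNT(*) FROM bench_serve_workload"
        ).fetchone()[0]
    finally:
        conn.close()
```

Because the table is created only if missing and rows are inserted rather than replaced, repeated runs against the same file accumulate history, which is what makes longitudinal comparison possible.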

Additional local validation:

/opt/ai-runtime/venv-live/bin/python -m pytest -q tests/test_bench_serve.py tests/test_metrics.py
/opt/ai-runtime/venv-live/bin/python -m black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
/opt/homebrew/bin/uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check
/opt/ai-runtime/venv-live/bin/python -m vllm_mlx.cli bench-serve --workload examples/bench_serve_workload.json --model test-model --url http://127.0.0.1:1 --request-timeout-s 1 --output /tmp/bench-workload-out.db --format sqlite --scrape-metrics false
sqlite3 /tmp/bench-workload-out.db "select count(*), min(case_id), max(repetition) from bench_serve_workload;"

CI is green on run 24879845684.

@Thump604 Thump604 changed the title from "[codex] Add bench-serve workload contracts" to "Add bench-serve workload contracts" Apr 24, 2026
Collaborator

@janhilgard janhilgard left a comment

Review: Add bench-serve workload contracts

Nice feature — this fills a real gap between raw prompt sweeps and production qualification. The design is well-thought-out: keeping policy_timeout_ms as comparison evidence rather than a hard pass/fail is the right call.

Things I like

  • Clean separation between sweep mode and workload mode — no existing behavior changed.
  • request_path support — avoids duplicating large prompt bodies in the workload JSON.
  • Variance tracking with --repetitions and per-case min/median/max summaries.
  • SQLite output for longitudinal comparisons is practical and well-implemented.
  • Cache policy controls (before-run, before-case) — critical for meaningful cold-start benchmarks.
  • Backward-compatible parse_status_response accepting both active_memory_gb and legacy active_gb keys.
  • Comprehensive test coverage — ~600 lines of new tests covering loading, quality checks, summaries, runners, formatters, and SQLite.
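
The backward-compatible key handling mentioned in the list above can be pictured roughly as follows. This is a sketch, not the body of parse_status_response: the helper name `read_active_memory_gb` and the assumption that both keys sit in one flat dict are hypothetical.

```python
def read_active_memory_gb(metal_status):
    """Prefer the current key and fall back to the legacy one.

    Returns None when neither key is present.
    """
    for key in ("active_memory_gb", "active_gb"):
        value = metal_status.get(key)
        if value is not None:
            return float(value)
    return None
```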

Suggestions

  1. SQL injection in _write_sqlite_rows (line ~1596): The table and column names are interpolated directly into SQL strings. While they're currently controlled by constants, consider using an allowlist check or at minimum an assertion that table matches ^[a-z_]+$. This is a defense-in-depth concern if someone later passes user input here.

  2. quality_failures vs failures duplication in summarize_workload_results (lines ~1381 and ~1388): Both compute the same list (not r["quality"]["ok"]). One of them can be removed:

    quality_failures = [r for r in results if not r["quality"]["ok"]]
    # ...
    failures = [r for r in results if not r["quality"]["ok"]]  # same as quality_failures

  3. Missing request_timeout_s propagation to httpx.AsyncClient: In run_workload_case, the client is created in run_bench_serve_workload with the timeout, but stream_chat_completion receives the client without any per-request timeout override. If a case has a very long expected runtime (e.g. 100K-token generation), the global transport timeout applies. This is probably fine for now, but worth documenting that request_timeout_s is the HTTP transport ceiling for all cases.

  4. --format default changed from "table" to None (cli.py line ~1453): The fallback logic (args.format or "json" for workloads, args.format or "table" for sweeps) works, but a None default that relies on downstream "or" fallbacks feels fragile. Consider using a sentinel or documenting the tri-state clearly.

  5. Workload example (examples/bench_serve_workload.json): The "I cannot" entry in forbidden_regex could be confusing — it's a plain string, not a regex. Works fine with re.search, but a comment in the docs noting that patterns are Python regex (where literal strings are valid) would help users.

  6. Minor: _normalize_cache_policy silently normalizes underscores to hyphens. The CLI choices only list hyphenated forms. If someone passes before_case in the workload JSON defaults, it works but isn't documented.
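
To illustrate suggestion 1, here is one way the allowlist check could look. It is a sketch rather than a patch against _write_sqlite_rows: `checked_identifier` and `insert_row` are hypothetical helpers, and only the `^[a-z_]+$` pattern is taken from the suggestion.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r"^[a-z_]+$")


def checked_identifier(name):
    """Return name unchanged if it is a plain lowercase identifier, else raise."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def insert_row(conn, table, row):
    # Identifiers cannot be bound as parameters, so they are validated and
    # then interpolated; values always go through "?" placeholders.
    table = checked_identifier(table)
    columns = [checked_identifier(c) for c in row]
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.execute(sql, list(row.values()))
```

A pattern this strict would reject column names containing digits, so a real schema may need a slightly wider character class.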

Questions

  • Is there a plan to add a --workload option to the vllm-mlx-bench standalone tool as well, or is this intentionally limited to bench-serve (running server only)?
  • For the json check in validate_quality_checks: should it also validate against a JSON schema (e.g. checking required keys), or is that considered out of scope?

Overall this is solid work. The concerns above are minor — none are blocking.

@Thump604
Collaborator Author

Thanks for the careful review. I addressed these in the current head ab33faa:

  • added SQLite identifier validation before interpolating table or column names
  • removed the duplicate quality-failure list in workload summaries
  • documented that --request-timeout-s is the HTTP transport ceiling and policy_timeout_ms is product-policy evidence
  • made --format explicitly use auto, with table for prompt sweeps and JSON for workloads
  • documented that workload regex checks are Python regex patterns, so plain literal strings are valid
  • documented underscore cache-policy normalization for workload JSON
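
The explicit auto default from the list above can be sketched with `argparse` as shown here. The parser is a stand-in, not the real cli.py: `build_parser` and `resolve_format` are hypothetical names, and the choice list is assembled from the formats mentioned in this thread.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="bench-serve-sketch")
    parser.add_argument("--workload", default=None)
    parser.add_argument(
        "--format",
        choices=["auto", "table", "json", "csv", "sql", "sqlite"],
        default="auto",
    )
    return parser


def resolve_format(args):
    """Map the explicit "auto" choice to a concrete format for the active mode."""
    if args.format != "auto":
        return args.format
    # Workload runs default to JSON, prompt sweeps to a table.
    return "json" if args.workload else "table"
```

Resolving the format in one place keeps the mode-dependent default out of the individual output paths.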

For the questions: I intentionally scoped this to bench-serve because workload contracts need a running server, /metrics, cache reset behavior, and real HTTP streaming behavior. I would keep offline vllm-mlx-bench separate unless we add a distinct non-server artifact contract later. JSON schema validation is a good next increment, but I kept this PR to syntax-valid JSON plus regex/content gates so the initial workload contract stays model-agnostic and low-friction.
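
As a rough picture of what "syntax-valid JSON plus regex/content gates" means in practice, consider the sketch below. It is not validate_quality_checks itself: `check_quality` and its parameters are hypothetical, and the only names borrowed from the thread are `forbidden_regex` and the `ok` flag.

```python
import json
import re


def check_quality(content, forbidden_regex=(), require_json=False):
    """Return {"ok": bool, "failures": [...]} for one response body."""
    failures = []
    for pattern in forbidden_regex:
        # Patterns are Python regexes; a plain string such as "I cannot"
        # has no metacharacters, so it matches itself literally.
        if re.search(pattern, content):
            failures.append(f"forbidden_regex matched: {pattern}")
    if require_json:
        try:
            json.loads(content)  # syntax check only, no schema validation
        except json.JSONDecodeError as exc:
            failures.append(f"invalid JSON: {exc.msg}")
    return {"ok": not failures, "failures": failures}
```

A literal that does contain metacharacters, for example a string with parentheses or a question mark, would need `re.escape` before being used as a pattern.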

Local validation after the current head:

uv run --extra dev pytest -q tests/test_bench_serve.py
uv run --extra dev black --check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py
uvx ruff check vllm_mlx/bench_serve.py vllm_mlx/cli.py tests/test_bench_serve.py --select E,F,W --ignore E402,E501,E731,F811,F841
git diff --check

Result: 84 passed, 2 skipped, formatting clean, ruff clean, diff clean.

@Thump604 Thump604 closed this Apr 24, 2026
@Thump604 Thump604 deleted the codex/bench-serve-workload-contract branch April 24, 2026 15:48
@Thump604
Collaborator Author

This PR was accidentally closed when I renamed its source branch from the codex/ prefix to the feat/ prefix. GitHub records the deletion of the old ref as head_ref_deleted and auto-closes the PR. The same happened to PRs #405, #409, #417, #418, and #419.

All affected PRs have been recreated with the same code on the new branch names:

Review history on the originals was lost but all code is intact. Sorry for the confusion.
