6 changes: 6 additions & 0 deletions README.md
@@ -188,6 +188,12 @@ python examples/tts_multilingual.py "Hola mundo" --lang es --play

```bash
vllm-mlx bench-serve --url http://localhost:8000 --concurrency 5 --prompts prompts.txt --output results.csv

# Product-style workload with quality checks and metrics deltas
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --output results.json

# Append workload rows into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --format sqlite --output bench.db
```

### Prometheus metrics
93 changes: 93 additions & 0 deletions docs/benchmarks/README.md
@@ -19,6 +19,99 @@ vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit

# Video benchmark
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit --video

# Running-server prompt sweep with Prometheus metric deltas
vllm-mlx bench-serve --url http://localhost:8000 --prompts short,long \
--concurrency 1,4 --output bench.json --format json

# Running-server product-style workload with quality checks
vllm-mlx bench-serve --url http://localhost:8000 \
--workload ./workload.json --output workload-results.json
```

## Contract Workloads

`vllm-mlx bench-serve --workload` runs declarative cases against an already
running OpenAI-compatible server. This is intended for model and feature-stack
qualification, where raw speed is not enough and every run needs provenance,
quality checks, Prometheus metric deltas, and policy-timeout evidence.
Use `--repetitions` to measure variance; workload summaries report per-case
sample counts, failure rates, and min/median/max latency and throughput.
`required_regex` and `forbidden_regex` entries are Python regular expressions;
a plain literal string (one containing no regex metacharacters) is itself a
valid pattern and matches verbatim. Workload `cache_policy` accepts
`preserve`, `before-run`, and `before-case`; JSON/YAML-style underscore
spellings such as `before_case` are normalized to the same values.
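As a minimal illustration of how such checks behave — a sketch using Python's `re` module directly, not the bench-serve implementation, with a made-up `output` string:

```python
import re

# Hypothetical model output being checked.
output = "def is_palindrome(s: str) -> bool: ..."

# A plain literal such as "-> bool" is a valid regex and matches verbatim,
# so required_regex entries can often be written as ordinary strings.
assert re.search(r"-> bool", output)

# Strings containing regex metacharacters are NOT matched literally:
# "(s:" alone would be a pattern syntax error, so escape it first.
assert re.search(re.escape("(s: str)"), output)

# A forbidden_regex-style check passes when the pattern is absent.
assert re.search(r"<think>", output) is None
```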

Example workload:

```json
{
  "name": "writing-contract",
  "description": "Representative long-form writing requests",
  "defaults": {
    "max_tokens": 32768,
    "enable_thinking": true,
    "policy_timeout_ms": 180000,
    "checks": {
      "finish_reason": "stop",
      "forbidden_regex": ["<think>", "prompt leakage"],
      "min_chars": 500
    }
  },
  "cases": [
    {
      "id": "resume-golden-1",
      "messages": [
        {"role": "user", "content": "Write the requested artifact..."}
      ],
      "tags": ["resume", "quality-floor"]
    }
  ]
}
```

Cases can also reference an existing OpenAI-compatible request JSON instead of
duplicating a large prompt body:

```json
{
  "name": "writing-contract",
  "cases": [
    {
      "id": "resume-golden-1",
      "request_path": "./fixtures/job543_resume_precise_request.json",
      "checks": {
        "finish_reason": "stop",
        "forbidden_regex": ["<think>"]
      }
    }
  ]
}
```

When `request_path` is used, `messages`, `max_tokens`, `enable_thinking`, and
extra request-body fields such as `thinking_token_budget` are read from that
file. Case-level `extra_body` values override request-file values.

`policy_timeout_ms` is recorded as comparison evidence. It is not treated as a
hardware capability claim. Use it to answer "would this run fit my product
policy?" after first measuring what the model and serving stack can actually do.
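That fit question can then be answered after the fact from recorded samples; a sketch with made-up latency numbers, not harness output:

```python
from statistics import median

policy_timeout_ms = 180_000                      # from the workload (evidence only)
latency_samples_ms = [92_340, 104_210, 176_905]  # hypothetical end-to-end latencies

# Samples are flagged against the policy, not failed by the harness.
within_policy = [s <= policy_timeout_ms for s in latency_samples_ms]
print(f"median={median(latency_samples_ms)} ms, "
      f"{sum(within_policy)}/{len(within_policy)} samples within policy")
```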

Workload output defaults to JSON for full provenance. Use `--format csv` for
flat per-case rows, `--format sql` to emit importable SQL, or
`--format sqlite --output bench.db` to append rows directly into a local
benchmark database.
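One use of the SQLite output is longitudinal queries across runs. The table and column names below are hypothetical — inspect the real file with `.schema` in the `sqlite3` shell — but the query shape is the point:

```python
import sqlite3

# In-memory stand-in for bench.db; the actual table/column names depend on
# bench-serve's sqlite writer, so treat this schema as an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (run_ts TEXT, case_id TEXT, median_latency_ms REAL)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("2024-05-01", "resume-golden-1", 1210.0),
        ("2024-05-08", "resume-golden-1", 1145.0),
    ],
)

# Longitudinal comparison: best and worst median latency per case across runs.
row = conn.execute(
    "SELECT case_id, MIN(median_latency_ms), MAX(median_latency_ms) "
    "FROM results GROUP BY case_id"
).fetchone()
print(row)  # → ('resume-golden-1', 1145.0, 1210.0)
```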

`--request-timeout-s` is the HTTP transport ceiling for each request in
workload mode. Product policy timeouts belong in the workload as
`policy_timeout_ms` and are recorded as comparison evidence.

```bash
vllm-mlx bench-serve --url http://localhost:8000 \
--workload ./workload.json --repetitions 5 --output workload-results.json

vllm-mlx bench-serve --url http://localhost:8000 \
--workload ./workload.json --repetitions 5 --format sqlite --output bench.db
```

## Standalone Test Defaults
57 changes: 57 additions & 0 deletions docs/reference/cli.md
@@ -5,6 +5,7 @@
| Command | Description |
|---------|-------------|
| `vllm-mlx serve` | Start OpenAI-compatible server |
| `vllm-mlx bench-serve` | Benchmark a running server with prompt sweeps or workload contracts |
| `vllm-mlx-bench` | Run performance benchmarks |
| `vllm-mlx-chat` | Start Gradio chat interface |

@@ -132,6 +133,62 @@ curl http://localhost:8000/v1/models \
-H "Authorization: Bearer your-secret-key"
```

## `vllm-mlx bench-serve`

Benchmark a running vllm-mlx server over HTTP. Prompt-sweep mode measures
TTFT, TPOT, throughput, cache deltas, and Metal memory. Workload mode adds
per-case quality checks, repeated samples for variance, and comparison-only
product policy timeouts. Workload cases can embed `messages` directly or point
`request_path` at an existing OpenAI-compatible request JSON.

### Usage

```bash
vllm-mlx bench-serve --url http://localhost:8000 [options]
```

### Options

| Option | Description | Default |
|--------|-------------|---------|
| `--url` | Running server base URL | `http://127.0.0.1:8080` |
| `--model` | API model id | Auto-detect |
| `--prompts` | Comma-separated prompt sets or files for sweep mode | `short,medium,long` |
| `--workload` | Declarative workload JSON for contract mode | None |
| `--concurrency` | Comma-separated concurrency levels for sweep mode | `1,4` |
| `--max-tokens` | Max tokens for sweep mode | `256` |
| `--repetitions` | Repetitions per sweep configuration or workload case | `3` |
| `--enable-thinking` | `true`, `false`, or `true,false` sweep | None |
| `--scrape-metrics` | Scrape `/metrics` before/after runs | `true` |
| `--include-content` | Include full generated content in workload JSON | False |
| `--request-timeout-s` | Workload HTTP transport timeout, `0` disables | `300` |
| `--cache-policy` | Workload cache handling: `preserve`, `before-run`, `before-case` | Workload default or `preserve` |
| `--output` | Output file | stdout |
| `--format` | Output format: `auto`, `table`, `json`, `csv`, `sql`, `sqlite` | `auto` = `table` for prompt sweeps, `json` for workloads |

In workload mode, `--request-timeout-s` is the HTTP transport ceiling for each
request. Product policy timeouts should live in the workload as
`policy_timeout_ms`. Workload `required_regex` and `forbidden_regex` values are
Python regex patterns, so a metacharacter-free literal string is a valid
pattern that matches verbatim. Workload JSON may spell
cache policy values with underscores, such as `before_case`; they normalize to
the hyphenated CLI values.
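The normalization amounts to a spelling rewrite; a sketch of the documented mapping (the function name and validation are illustrative, not the CLI's actual parsing code):

```python
def normalize_cache_policy(value: str) -> str:
    """Map JSON-style spellings like 'before_case' onto the hyphenated CLI values."""
    normalized = value.strip().lower().replace("_", "-")
    if normalized not in {"preserve", "before-run", "before-case"}:
        raise ValueError(f"unknown cache_policy: {value!r}")
    return normalized

assert normalize_cache_policy("before_case") == "before-case"
assert normalize_cache_policy("preserve") == "preserve"
```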

### Examples

```bash
# Prompt sweep
vllm-mlx bench-serve --url http://localhost:8000 \
--prompts short,long --concurrency 1,4 --format json --output bench.json

# Contract workload with quality checks and policy-timeout evidence
vllm-mlx bench-serve --url http://localhost:8000 \
--workload workload.json --repetitions 5 --output workload-results.json

# Append contract rows directly into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 \
--workload workload.json --repetitions 5 --format sqlite --output bench.db
```

## `vllm-mlx-bench`

Run performance benchmarks.
44 changes: 44 additions & 0 deletions examples/bench_serve_workload.json
@@ -0,0 +1,44 @@
{
  "name": "quality-contract-smoke",
  "description": "Small contract-style workload demonstrating bench-serve quality checks.",
  "defaults": {
    "max_tokens": 256,
    "enable_thinking": false,
    "policy_timeout_ms": 30000,
    "checks": {
      "finish_reason": "stop",
      "forbidden_regex": [
        "<think>",
        "I cannot"
      ],
      "min_chars": 40
    }
  },
  "cases": [
    {
      "id": "python-palindrome",
      "tags": [
        "code",
        "quality"
      ],
      "messages": [
        {
          "role": "user",
          "content": "Write a Python function with type hints that checks whether a string is a palindrome. Include one short example."
        }
      ],
      "checks": {
        "finish_reason": "stop",
        "required_regex": [
          "def ",
          "-> bool"
        ],
        "forbidden_regex": [
          "<think>",
          "TODO"
        ],
        "min_chars": 80
      }
    }
  ]
}