6 changes: 6 additions & 0 deletions README.md
@@ -188,6 +188,12 @@ python examples/tts_multilingual.py "Hola mundo" --lang es --play

```bash
vllm-mlx bench-serve --url http://localhost:8000 --concurrency 5 --prompts prompts.txt --output results.csv

# Product-style workload with quality checks and metrics deltas
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --output results.json

# Append workload rows into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --format sqlite --output bench.db
```

### Prometheus metrics
93 changes: 93 additions & 0 deletions docs/benchmarks/README.md
@@ -19,6 +19,99 @@ vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit

# Video benchmark
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit --video

# Running-server prompt sweep with Prometheus metric deltas
vllm-mlx bench-serve --url http://localhost:8000 --prompts short,long \
--concurrency 1,4 --output bench.json --format json

# Running-server product-style workload with quality checks
vllm-mlx bench-serve --url http://localhost:8000 \
--workload ./workload.json --output workload-results.json
```

## Contract Workloads

`vllm-mlx bench-serve --workload` runs declarative cases against an already
running OpenAI-compatible server. This is intended for model and feature-stack
qualification, where raw speed is not enough and every run needs provenance,
quality checks, Prometheus metric deltas, and policy-timeout evidence.
Use `--repetitions` to measure variance; workload summaries report per-case
sample counts, failure rates, and min/median/max latency and throughput.
`required_regex` and `forbidden_regex` entries are Python regular expressions;
a plain literal string (one containing no regex metacharacters) is itself a
valid pattern and matches verbatim. Workload `cache_policy` accepts
`preserve`, `before-run`, and `before-case`; JSON/YAML-style underscore
spellings such as `before_case` are normalized to the same values.
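As a minimal illustration of how such checks behave — a sketch using Python's `re` module directly, not the bench-serve implementation, with a made-up `output` string:

```python
import re

# Hypothetical model output being checked.
output = "def is_palindrome(s: str) -> bool: ..."

# A plain literal such as "-> bool" is a valid regex and matches verbatim,
# so required_regex entries can often be written as ordinary strings.
assert re.search(r"-> bool", output)

# Strings containing regex metacharacters are NOT matched literally:
# "(s:" alone would be a pattern syntax error, so escape it first.
assert re.search(re.escape("(s: str)"), output)

# A forbidden_regex-style check passes when the pattern is absent.
assert re.search(r"<think>", output) is None
```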

Example workload:

```json
{
  "name": "writing-contract",
  "description": "Representative long-form writing requests",
  "defaults": {
    "max_tokens": 32768,
    "enable_thinking": true,
    "policy_timeout_ms": 180000,
    "checks": {
      "finish_reason": "stop",
      "forbidden_regex": ["<think>", "prompt leakage"],
      "min_chars": 500
    }
  },
  "cases": [
    {
      "id": "resume-golden-1",
      "messages": [
        {"role": "user", "content": "Write the requested artifact..."}
      ],
      "tags": ["resume", "quality-floor"]
    }
  ]
}
```

Cases can also reference an existing OpenAI-compatible request JSON instead of
duplicating a large prompt body:

```json
{
  "name": "writing-contract",
  "cases": [
    {
      "id": "resume-golden-1",
      "request_path": "./fixtures/job543_resume_precise_request.json",
      "checks": {
        "finish_reason": "stop",
        "forbidden_regex": ["<think>"]
      }
    }
  ]
}
```

When `request_path` is used, `messages`, `max_tokens`, `enable_thinking`, and
extra request-body fields such as `thinking_token_budget` are read from that
file. Case-level `extra_body` values override request-file values.

`policy_timeout_ms` is recorded as comparison evidence. It is not treated as a
hardware capability claim. Use it to answer "would this run fit my product
policy?" after first measuring what the model and serving stack can actually do.
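That fit question can then be answered after the fact from recorded samples; a sketch with made-up latency numbers, not harness output:

```python
from statistics import median

policy_timeout_ms = 180_000                      # from the workload (evidence only)
latency_samples_ms = [92_340, 104_210, 176_905]  # hypothetical end-to-end latencies

# Samples are flagged against the policy, not failed by the harness.
within_policy = [s <= policy_timeout_ms for s in latency_samples_ms]
print(f"median={median(latency_samples_ms)} ms, "
      f"{sum(within_policy)}/{len(within_policy)} samples within policy")
```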

Workload output defaults to JSON for full provenance. Use `--format csv` for
flat per-case rows, `--format sql` to emit importable SQL, or
`--format sqlite --output bench.db` to append rows directly into a local
benchmark database.
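One use of the SQLite output is longitudinal queries across runs. The table and column names below are hypothetical — inspect the real file with `.schema` in the `sqlite3` shell — but the query shape is the point:

```python
import sqlite3

# In-memory stand-in for bench.db; the actual table/column names depend on
# bench-serve's sqlite writer, so treat this schema as an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (run_ts TEXT, case_id TEXT, median_latency_ms REAL)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("2024-05-01", "resume-golden-1", 1210.0),
        ("2024-05-08", "resume-golden-1", 1145.0),
    ],
)

# Longitudinal comparison: best and worst median latency per case across runs.
row = conn.execute(
    "SELECT case_id, MIN(median_latency_ms), MAX(median_latency_ms) "
    "FROM results GROUP BY case_id"
).fetchone()
print(row)  # → ('resume-golden-1', 1145.0, 1210.0)
```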

`--request-timeout-s` is the HTTP transport ceiling for each request in
workload mode. Product policy timeouts belong in the workload as
`policy_timeout_ms` and are recorded as comparison evidence.

```bash
vllm-mlx bench-serve --url http://localhost:8000 \
--workload ./workload.json --repetitions 5 --output workload-results.json

vllm-mlx bench-serve --url http://localhost:8000 \
--workload ./workload.json --repetitions 5 --format sqlite --output bench.db
```

## Standalone Test Defaults
57 changes: 57 additions & 0 deletions docs/reference/cli.md
@@ -5,6 +5,7 @@
| Command | Description |
|---------|-------------|
| `vllm-mlx serve` | Start OpenAI-compatible server |
| `vllm-mlx bench-serve` | Benchmark a running server with prompt sweeps or workload contracts |
| `vllm-mlx-bench` | Run performance benchmarks |
| `vllm-mlx-chat` | Start Gradio chat interface |

@@ -132,6 +133,62 @@ curl http://localhost:8000/v1/models \
-H "Authorization: Bearer your-secret-key"
```

## `vllm-mlx bench-serve`

Benchmark a running vllm-mlx server over HTTP. Prompt-sweep mode measures
TTFT, TPOT, throughput, cache deltas, and Metal memory. Workload mode adds
per-case quality checks, repeated samples for variance, and comparison-only
product policy timeouts. Workload cases can embed `messages` directly or point
`request_path` at an existing OpenAI-compatible request JSON.

### Usage

```bash
vllm-mlx bench-serve --url http://localhost:8000 [options]
```

### Options

| Option | Description | Default |
|--------|-------------|---------|
| `--url` | Running server base URL | `http://127.0.0.1:8080` |
| `--model` | API model id | Auto-detect |
| `--prompts` | Comma-separated prompt sets or files for sweep mode | `short,medium,long` |
| `--workload` | Declarative workload JSON for contract mode | None |
| `--concurrency` | Comma-separated concurrency levels for sweep mode | `1,4` |
| `--max-tokens` | Max tokens for sweep mode | `256` |
| `--repetitions` | Repetitions per sweep configuration or workload case | `3` |
| `--enable-thinking` | `true`, `false`, or `true,false` sweep | None |
| `--scrape-metrics` | Scrape `/metrics` before/after runs | `true` |
| `--include-content` | Include full generated content in workload JSON | False |
| `--request-timeout-s` | Workload HTTP transport timeout, `0` disables | `300` |
| `--cache-policy` | Workload cache handling: `preserve`, `before-run`, `before-case` | Workload default or `preserve` |
| `--output` | Output file | stdout |
| `--format` | Output format: `auto`, `table`, `json`, `csv`, `sql`, `sqlite` | `auto` = `table` for prompt sweeps, `json` for workloads |

In workload mode, `--request-timeout-s` is the HTTP transport ceiling for each
request. Product policy timeouts should live in the workload as
`policy_timeout_ms`. Workload `required_regex` and `forbidden_regex` values are
Python regex patterns, so a metacharacter-free literal string is a valid
pattern that matches verbatim. Workload JSON may spell
cache policy values with underscores, such as `before_case`; they normalize to
the hyphenated CLI values.
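The normalization amounts to a spelling rewrite; a sketch of the documented mapping (the function name and validation are illustrative, not the CLI's actual parsing code):

```python
def normalize_cache_policy(value: str) -> str:
    """Map JSON-style spellings like 'before_case' onto the hyphenated CLI values."""
    normalized = value.strip().lower().replace("_", "-")
    if normalized not in {"preserve", "before-run", "before-case"}:
        raise ValueError(f"unknown cache_policy: {value!r}")
    return normalized

assert normalize_cache_policy("before_case") == "before-case"
assert normalize_cache_policy("preserve") == "preserve"
```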

### Examples

```bash
# Prompt sweep
vllm-mlx bench-serve --url http://localhost:8000 \
--prompts short,long --concurrency 1,4 --format json --output bench.json

# Contract workload with quality checks and policy-timeout evidence
vllm-mlx bench-serve --url http://localhost:8000 \
--workload workload.json --repetitions 5 --output workload-results.json

# Append contract rows directly into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 \
--workload workload.json --repetitions 5 --format sqlite --output bench.db
```

## `vllm-mlx-bench`

Run performance benchmarks.
44 changes: 44 additions & 0 deletions examples/bench_serve_workload.json
@@ -0,0 +1,44 @@
{
  "name": "quality-contract-smoke",
  "description": "Small contract-style workload demonstrating bench-serve quality checks.",
  "defaults": {
    "max_tokens": 256,
    "enable_thinking": false,
    "policy_timeout_ms": 30000,
    "checks": {
      "finish_reason": "stop",
      "forbidden_regex": [
        "<think>",
        "I cannot"
      ],
      "min_chars": 40
    }
  },
  "cases": [
    {
      "id": "python-palindrome",
      "tags": [
        "code",
        "quality"
      ],
      "messages": [
        {
          "role": "user",
          "content": "Write a Python function with type hints that checks whether a string is a palindrome. Include one short example."
        }
      ],
      "checks": {
        "finish_reason": "stop",
        "required_regex": [
          "def ",
          "-> bool"
        ],
        "forbidden_regex": [
          "<think>",
          "TODO"
        ],
        "min_chars": 80
      }
    }
  ]
}