
Harden bench-serve workload runner with focused regression tests#515

Open
waybarrios wants to merge 1 commit into main from harden-bench-serve-workload

Conversation

@waybarrios
Owner

Addresses #499.

Summary

Adds 21 focused regression tests for the bench-serve workload runner —
the dense load/validate/run/report paths that issue #499 flagged as
the highest concentrated complexity in vllm_mlx/bench_serve.py. No
source changes — every test passes against existing behavior, locking
in the contract going forward.

Background

vllm_mlx/bench_serve.py is both a workload runner and an evidence
artifact producer for release qualification. The advisory complexity
scan from #499 highlighted four functions as the densest:
load_workload, validate_quality_checks, run_workload_case,
run_bench_serve_workload.

Existing coverage in test_bench_serve.py (1785 lines, 119 cases)
covers happy paths for workload loading, sweep expansion, formatters,
runner end-to-end, and basic quality checks. The corners that were
not pinned down: load_workload validation guard clauses, streaming
tool-call accumulation across chunk-boundary deltas, and the
diagnostic shape of validate_quality_checks for argument-validation
edge cases.

Tests added

tests/test_bench_serve_workload_hardening.py (21 cases):

TestLoadWorkloadValidation (8):

  • root must be a JSON object
  • empty cases list rejected
  • missing cases key rejected
  • defaults must be an object
  • case must be an object
  • extra_body invalid type rejected
  • tags string is normalised to a single-tuple
  • tags invalid type (int) rejected
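For orientation, the guard clauses the eight cases above exercise can be sketched as a minimal validator. This is a hypothetical helper, not the real load_workload in vllm_mlx/bench_serve.py; it only mirrors the rejection/normalisation contract the tests pin down:

```python
import json


def load_workload_sketch(text: str) -> dict:
    """Hypothetical sketch of the load_workload guard clauses under test."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("workload root must be a JSON object")
    cases = data.get("cases")
    if not isinstance(cases, list) or not cases:
        raise ValueError("workload must define a non-empty 'cases' list")
    defaults = data.get("defaults", {})
    if not isinstance(defaults, dict):
        raise ValueError("'defaults' must be an object")
    for case in cases:
        if not isinstance(case, dict):
            raise ValueError("each case must be an object")
        extra_body = case.get("extra_body", {})
        if not isinstance(extra_body, dict):
            raise ValueError("'extra_body' must be an object")
        tags = case.get("tags", ())
        if isinstance(tags, str):
            case["tags"] = (tags,)  # bare string normalised to a single-tuple
        elif isinstance(tags, (list, tuple)):
            case["tags"] = tuple(tags)
        else:
            raise ValueError("'tags' must be a string or a list of strings")
    return data
```

Each rejected shape maps to one test case; the string-tag branch covers the normalisation assertion.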

TestStreamingToolCallAccumulation (4):

  • name and arguments concatenate across chunk-boundary deltas
  • out-of-order indices (2, 0, 1) land sorted in the finalised list
  • id set on the first delta survives later deltas that omit it
  • omitted index defaults to 0 (mlx-lm style streams sometimes do this)
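The accumulation contract those four cases lock in can be sketched as follows. This is a simplified stand-in for accumulate_tool_calls / finalize_tool_calls (names and delta shape taken from the PR; the real implementation lives in vllm_mlx/bench_serve.py):

```python
def accumulate_sketch(state: dict, delta: dict) -> None:
    """Hypothetical sketch: fold one streaming tool-call delta into state."""
    index = delta.get("index", 0)  # omitted index defaults to 0
    entry = state.setdefault(index, {
        "id": None,
        "type": "function",
        "function": {"name": "", "arguments": ""},
    })
    if delta.get("id"):
        entry["id"] = delta["id"]  # id from the first delta survives later omissions
    fn = delta.get("function") or {}
    if fn.get("name"):
        entry["function"]["name"] += fn["name"]  # concatenate, never overwrite
    if fn.get("arguments"):
        entry["function"]["arguments"] += fn["arguments"]


def finalize_sketch(state: dict) -> list:
    """Out-of-order indices land sorted in the finalised list."""
    return [state[i] for i in sorted(state)]
```

The += on name and arguments is exactly the line the regression demo below mutates.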

TestQualityCheckDiagnostics (9):

  • tool_call_count mismatch reports the actual count
  • tool_call_args invalid JSON reports a parse error
  • tool_call_args non-object arguments are flagged
  • missing required keys lists what is missing
  • missing function name surfaces the function name in the issue
  • no_tool_calls combines with min_chars to surface both diagnostics
  • invalid required_regex surfaces as an issue rather than crashing
  • finish_reason list-form accepts any allowed member
  • finish_reason string-form rejects others
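The diagnostic shape these cases assert — issues as precise strings rather than exceptions — can be sketched for the required-keys check. This is a hypothetical helper mirroring one branch of validate_quality_checks, with issue wording modelled on the verification output below:

```python
import json


def check_required_keys_sketch(tool_call: dict, required: list) -> list:
    """Hypothetical sketch: return diagnostic strings, never raise."""
    issues = []
    fn = tool_call.get("function", {})
    name = fn.get("name", "<unnamed>")
    raw = fn.get("arguments", "")
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Invalid JSON becomes a parse-error diagnostic, not a crash
        issues.append(f"tool_call_args_required_keys: {name} arguments invalid JSON: {exc}")
        return issues
    if not isinstance(args, dict):
        # Non-object arguments (e.g. a bare list) are flagged
        issues.append(f"tool_call_args_required_keys: {name} arguments not an object")
        return issues
    missing = [k for k in required if k not in args]
    if missing:
        issues.append(f"tool_call_args_required_keys: {name} missing keys {missing}")
    return issues
```

Returning a list of issues is what lets checks like no_tool_calls and min_chars combine into multiple diagnostics on one response.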

Verification

The behaviors the tests pin down, exercised against the live runner:

======================================================================
1) Streaming tool-call accumulation across deltas
======================================================================
after delta : {0: {'id': 'call_1', 'type': 'function', 'function': {'name': 'get_', 'arguments': '{"city":'}}}
after delta : {0: {'id': 'call_1', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"city":"Tokyo"}'}}}
finalised   : [{'id': 'call_1', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"city":"Tokyo"}'}}]
parsed args : {'city': 'Tokyo'}

======================================================================
2) Out-of-order indices land sorted
======================================================================
names in order: ['first', 'second', 'third']

======================================================================
3) tool_call_args_required_keys diagnostic on missing keys
======================================================================
ok      : False
issue   : tool_call_args_required_keys: f missing keys ['b', 'c']

======================================================================
4) tool_call_args invalid-JSON diagnostic
======================================================================
issue   : tool_call_args_required_keys: f arguments invalid JSON: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

======================================================================
5) no_tool_calls + min_chars combine into multiple issues
======================================================================
issue   : content shorter than min_chars=100
issue   : no_tool_calls: expected 0 tool calls, got 1

======================================================================
6) invalid required_regex surfaces as an issue, not a crash
======================================================================
ok      : False
issue   : invalid required_regex '[unclosed': unterminated character set at position 0
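The surface-not-crash behaviour shown in case 6 can be sketched in a few lines. This is a hypothetical helper, not the real validate_quality_checks branch; it shows the try/except pattern that turns a bad pattern into a diagnostic:

```python
import re


def check_required_regex_sketch(content: str, pattern: str) -> list:
    """Hypothetical sketch: a bad regex becomes an issue, not an exception."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # e.g. '[unclosed' -> "unterminated character set at position 0"
        return [f"invalid required_regex {pattern!r}: {exc}"]
    if not compiled.search(content):
        return [f"required_regex {pattern!r} not found in content"]
    return []
```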

New regression suite:

======================== 21 passed in 0.15s ========================

The wider bench-serve suite is also green:
pytest tests/test_bench_serve.py tests/test_bench_serve_workload_hardening.py → 140 passed (119 + 21), 2 skipped (live-run integration).

Demo: tests catch a real regression

To prove the streaming-accumulation tests aren't vacuous, I temporarily
swapped the += concatenation in accumulate_tool_calls for a =
overwrite (the kind of regression #499 warns about as the dense state
machine gets touched):

if function_delta.get("name"):
    entry["function"]["name"] = function_delta["name"]   # was: +=
if function_delta.get("arguments"):
    entry["function"]["arguments"] = function_delta["arguments"]  # was: +=

The hardening test fired with a precise diagnostic:

FAILED test_concatenates_name_and_arguments_across_deltas
  AssertionError: assert 'weather' == 'get_weather'

  - get_weather
  ? ----
  + weather
========================= 1 failed, 20 passed in 0.23s =========================

After restoring += the full suite goes back to 21 passed.

Note on scope

This PR is test-only. No production logic changes. None of the new
tests surfaced a real bug against current main; the runner's
existing behavior matches what the new contract asserts. If a future
PR refactors any of the hot functions (load_workload,
accumulate_tool_calls, validate_quality_checks), this suite will
catch unintentional drift.

Pin behavior of the workload load/validate/run paths called out in #499:

- load_workload: every guard clause that rejects malformed JSON
  (root-not-object, empty/missing cases, defaults-not-object,
  case-not-object, extra_body wrong type, tags wrong type), plus the
  string-tag normalisation to a tuple.
- accumulate_tool_calls / finalize_tool_calls: name and arguments
  concatenate across chunk-boundary deltas, out-of-order indices land
  sorted, ids set on the first delta survive subsequent deltas that
  omit them, omitted index defaults to 0.
- validate_quality_checks: tool_call_count mismatch reports actual
  count, invalid-JSON arguments report a parse error, non-object
  arguments are flagged, missing required keys are listed, missing
  function names surface the function name, no_tool_calls combines
  with min_chars to produce both diagnostics, invalid required_regex
  surfaces as an issue rather than crashing, and the list-form of
  finish_reason accepts any allowed member while the string form
  rejects others.

All 21 tests pass against current main with no source changes.
The full bench-serve suite (119 + 21 = 140 tests) stays green.

The streaming-accumulation tests were demoed against an inline-injected
regression that swapped += with = on name/arguments concatenation; the
suite caught it with a precise diagnostic ("'weather' == 'get_weather'").
