
Harden bench-serve workload runner with focused regression tests#515

Open
waybarrios wants to merge 1 commit into main from harden-bench-serve-workload

Conversation

@waybarrios
Owner

Addresses #499.

Summary

Adds 21 focused regression tests for the bench-serve workload runner —
the dense load/validate/run/report paths that issue #499 flagged as
the highest concentrated complexity in vllm_mlx/bench_serve.py. No
source changes — every test passes against existing behavior, locking
in the contract going forward.

Background

vllm_mlx/bench_serve.py is both a workload runner and an evidence
artifact producer for release qualification. The advisory complexity
scan from #499 highlighted four functions as the densest:
load_workload, validate_quality_checks, run_workload_case,
run_bench_serve_workload.

Existing coverage in test_bench_serve.py (1785 lines, 119 cases)
covers happy paths for workload loading, sweep expansion, formatters,
runner end-to-end, and basic quality checks. The corners that were
not pinned down: load_workload validation guard clauses, streaming
tool-call accumulation across chunk-boundary deltas, and the
diagnostic shape of validate_quality_checks for argument-validation
edge cases.

Tests added

tests/test_bench_serve_workload_hardening.py (21 cases):

TestLoadWorkloadValidation (8):

  • root must be a JSON object
  • empty cases list rejected
  • missing cases key rejected
  • defaults must be an object
  • case must be an object
  • extra_body invalid type rejected
  • tags string is normalised to a single-tuple
  • tags invalid type (int) rejected
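For orientation, the guard clauses the eight cases above exercise can be sketched as a minimal validator. This is a hypothetical helper, not the real load_workload in vllm_mlx/bench_serve.py; it only mirrors the rejection/normalisation contract the tests pin down:

```python
import json


def load_workload_sketch(text: str) -> dict:
    """Hypothetical sketch of the load_workload guard clauses under test."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("workload root must be a JSON object")
    cases = data.get("cases")
    if not isinstance(cases, list) or not cases:
        raise ValueError("workload must define a non-empty 'cases' list")
    defaults = data.get("defaults", {})
    if not isinstance(defaults, dict):
        raise ValueError("'defaults' must be an object")
    for case in cases:
        if not isinstance(case, dict):
            raise ValueError("each case must be an object")
        extra_body = case.get("extra_body", {})
        if not isinstance(extra_body, dict):
            raise ValueError("'extra_body' must be an object")
        tags = case.get("tags", ())
        if isinstance(tags, str):
            case["tags"] = (tags,)  # bare string normalised to a single-tuple
        elif isinstance(tags, (list, tuple)):
            case["tags"] = tuple(tags)
        else:
            raise ValueError("'tags' must be a string or a list of strings")
    return data
```

Each rejected shape maps to one test case; the string-tag branch covers the normalisation assertion.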

TestStreamingToolCallAccumulation (4):

  • name and arguments concatenate across chunk-boundary deltas
  • out-of-order indices (2, 0, 1) land sorted in the finalised list
  • id set on the first delta survives later deltas that omit it
  • omitted index defaults to 0 (mlx-lm style streams sometimes do this)
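The accumulation contract those four cases lock in can be sketched as follows. This is a simplified stand-in for accumulate_tool_calls / finalize_tool_calls (names and delta shape taken from the PR; the real implementation lives in vllm_mlx/bench_serve.py):

```python
def accumulate_sketch(state: dict, delta: dict) -> None:
    """Hypothetical sketch: fold one streaming tool-call delta into state."""
    index = delta.get("index", 0)  # omitted index defaults to 0
    entry = state.setdefault(index, {
        "id": None,
        "type": "function",
        "function": {"name": "", "arguments": ""},
    })
    if delta.get("id"):
        entry["id"] = delta["id"]  # id from the first delta survives later omissions
    fn = delta.get("function") or {}
    if fn.get("name"):
        entry["function"]["name"] += fn["name"]  # concatenate, never overwrite
    if fn.get("arguments"):
        entry["function"]["arguments"] += fn["arguments"]


def finalize_sketch(state: dict) -> list:
    """Out-of-order indices land sorted in the finalised list."""
    return [state[i] for i in sorted(state)]
```

The += on name and arguments is exactly the line the regression demo below mutates.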

TestQualityCheckDiagnostics (9):

  • tool_call_count mismatch reports the actual count
  • tool_call_args invalid JSON reports a parse error
  • tool_call_args non-object arguments are flagged
  • missing required keys lists what is missing
  • missing function name surfaces the function name in the issue
  • no_tool_calls combines with min_chars to surface both diagnostics
  • invalid required_regex surfaces as an issue rather than crashing
  • finish_reason list-form accepts any allowed member
  • finish_reason string-form rejects others
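The diagnostic shape these cases assert — issues as precise strings rather than exceptions — can be sketched for the required-keys check. This is a hypothetical helper mirroring one branch of validate_quality_checks, with issue wording modelled on the verification output below:

```python
import json


def check_required_keys_sketch(tool_call: dict, required: list) -> list:
    """Hypothetical sketch: return diagnostic strings, never raise."""
    issues = []
    fn = tool_call.get("function", {})
    name = fn.get("name", "<unnamed>")
    raw = fn.get("arguments", "")
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Invalid JSON becomes a parse-error diagnostic, not a crash
        issues.append(f"tool_call_args_required_keys: {name} arguments invalid JSON: {exc}")
        return issues
    if not isinstance(args, dict):
        # Non-object arguments (e.g. a bare list) are flagged
        issues.append(f"tool_call_args_required_keys: {name} arguments not an object")
        return issues
    missing = [k for k in required if k not in args]
    if missing:
        issues.append(f"tool_call_args_required_keys: {name} missing keys {missing}")
    return issues
```

Returning a list of issues is what lets checks like no_tool_calls and min_chars combine into multiple diagnostics on one response.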

Verification

The behaviors the tests pin down, exercised against the live runner:

======================================================================
1) Streaming tool-call accumulation across deltas
======================================================================
after delta : {0: {'id': 'call_1', 'type': 'function', 'function': {'name': 'get_', 'arguments': '{"city":'}}}
after delta : {0: {'id': 'call_1', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"city":"Tokyo"}'}}}
finalised   : [{'id': 'call_1', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"city":"Tokyo"}'}}]
parsed args : {'city': 'Tokyo'}

======================================================================
2) Out-of-order indices land sorted
======================================================================
names in order: ['first', 'second', 'third']

======================================================================
3) tool_call_args_required_keys diagnostic on missing keys
======================================================================
ok      : False
issue   : tool_call_args_required_keys: f missing keys ['b', 'c']

======================================================================
4) tool_call_args invalid-JSON diagnostic
======================================================================
issue   : tool_call_args_required_keys: f arguments invalid JSON: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

======================================================================
5) no_tool_calls + min_chars combine into multiple issues
======================================================================
issue   : content shorter than min_chars=100
issue   : no_tool_calls: expected 0 tool calls, got 1

======================================================================
6) invalid required_regex surfaces as an issue, not a crash
======================================================================
ok      : False
issue   : invalid required_regex '[unclosed': unterminated character set at position 0
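The surface-not-crash behaviour shown in case 6 can be sketched in a few lines. This is a hypothetical helper, not the real validate_quality_checks branch; it shows the try/except pattern that turns a bad pattern into a diagnostic:

```python
import re


def check_required_regex_sketch(content: str, pattern: str) -> list:
    """Hypothetical sketch: a bad regex becomes an issue, not an exception."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # e.g. '[unclosed' -> "unterminated character set at position 0"
        return [f"invalid required_regex {pattern!r}: {exc}"]
    if not compiled.search(content):
        return [f"required_regex {pattern!r} not found in content"]
    return []
```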

New regression suite:

======================== 21 passed in 0.15s ========================

The wider bench-serve suite is also green:
pytest tests/test_bench_serve.py tests/test_bench_serve_workload_hardening.py → 140 passed (119 + 21), 2 skipped (live-run integration).

Demo: tests catch a real regression

To prove the streaming-accumulation tests aren't vacuous, I temporarily
swapped the += concatenation in accumulate_tool_calls for a =
overwrite (the kind of regression #499 warns about as the dense state
machine gets touched):

if function_delta.get("name"):
    entry["function"]["name"] = function_delta["name"]   # was: +=
if function_delta.get("arguments"):
    entry["function"]["arguments"] = function_delta["arguments"]  # was: +=

The hardening test fired with a precise diagnostic:

FAILED test_concatenates_name_and_arguments_across_deltas
  AssertionError: assert 'weather' == 'get_weather'

  - get_weather
  ? ----
  + weather
========================= 1 failed, 20 passed in 0.23s =========================

After restoring += the full suite goes back to 21 passed.

Note on scope

This PR is test-only. No production logic changes. None of the new
tests surfaced a real bug against current main; the runner's
existing behavior matches what the new contract asserts. If a future
PR refactors any of the hot functions (load_workload,
accumulate_tool_calls, validate_quality_checks), this suite will
catch unintentional drift.

Pin behavior of the workload load/validate/run paths called out in #499:

- load_workload: every guard clause that rejects malformed JSON
  (root-not-object, empty/missing cases, defaults-not-object,
  case-not-object, extra_body wrong type, tags wrong type), plus the
  string-tag normalisation to a tuple.
- accumulate_tool_calls / finalize_tool_calls: name and arguments
  concatenate across chunk-boundary deltas, out-of-order indices land
  sorted, ids set on the first delta survive subsequent deltas that
  omit them, omitted index defaults to 0.
- validate_quality_checks: tool_call_count mismatch reports actual
  count, invalid-JSON arguments report a parse error, non-object
  arguments are flagged, missing required keys are listed, missing
  function names surface the function name, no_tool_calls combines
  with min_chars to produce both diagnostics, invalid required_regex
  surfaces as an issue rather than crashing, and the list-form of
  finish_reason accepts any allowed member while the string form
  rejects others.

All 21 tests pass against current main with no source changes.
The full bench-serve suite (119 + 21 = 140 tests) stays green.

The streaming-accumulation tests were demoed against an inline-injected
regression that swapped += with = on name/arguments concatenation; the
suite caught it with a precise diagnostic ("'weather' == 'get_weather'").
