Harden bench-serve workload runner with focused regression tests #515
Open
waybarrios wants to merge 1 commit into main
Pin behavior of the workload load/validate/run paths called out in #499:

- load_workload: every guard clause that rejects malformed JSON (root-not-object, empty/missing cases, defaults-not-object, case-not-object, extra_body wrong type, tags wrong type), plus the string-tag normalisation to a tuple.
- accumulate_tool_calls / finalize_tool_calls: name and arguments concatenate across chunk-boundary deltas, out-of-order indices land sorted, ids set on the first delta survive subsequent deltas that omit them, omitted index defaults to 0.
- validate_quality_checks: tool_call_count mismatch reports actual count, invalid-JSON arguments report a parse error, non-object arguments are flagged, missing required keys are listed, missing function names surface the function name, no_tool_calls combines with min_chars to produce both diagnostics, invalid required_regex surfaces as an issue rather than crashing, and the list-form of finish_reason accepts any allowed member while the string form rejects others.

All 21 tests pass against current main with no source changes. The full bench-serve suite (119 + 21 = 140 tests) stays green. The streaming-accumulation tests were demoed against an inline-injected regression that swapped += with = on name/arguments concatenation; the suite caught it with a precise diagnostic ("'weather' == 'get_weather'").
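To make the load_workload guard clauses concrete, here is a minimal sketch of that style of validation. It is not the code in vllm_mlx/bench_serve.py: the function name load_workload_sketch, the use of ValueError, the error messages, and the merge of defaults into each case are all assumptions made for illustration. Only the field names (cases, defaults, extra_body, tags) come from the commit message.

```python
import json


def load_workload_sketch(text: str) -> dict:
    """Hypothetical stand-in for a workload loader (assumed to raise ValueError)."""
    root = json.loads(text)
    if not isinstance(root, dict):
        raise ValueError("workload root must be a JSON object")

    defaults = root.get("defaults", {})
    if not isinstance(defaults, dict):
        raise ValueError("'defaults' must be an object")

    cases = root.get("cases")
    if not isinstance(cases, list) or not cases:
        raise ValueError("'cases' must be a non-empty list")

    normalised = []
    for i, case in enumerate(cases):
        if not isinstance(case, dict):
            raise ValueError(f"case {i} must be an object")
        merged = {**defaults, **case}  # assumption: case values override defaults

        if not isinstance(merged.get("extra_body", {}), dict):
            raise ValueError(f"case {i}: 'extra_body' must be an object")

        tags = merged.get("tags", [])
        if isinstance(tags, str):
            tags = (tags,)  # a bare string becomes a one-element tuple
        elif isinstance(tags, list) and all(isinstance(t, str) for t in tags):
            tags = tuple(tags)
        else:
            raise ValueError(f"case {i}: 'tags' must be a string or list of strings")

        merged["tags"] = tags
        normalised.append(merged)

    return {"defaults": defaults, "cases": normalised}
```

Each guard corresponds to one rejected shape, so a regression test can feed exactly one malformed document per clause and assert on the resulting error.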
Addresses #499.
Summary
Adds 21 focused regression tests for the bench-serve workload runner:
the dense load/validate/run/report paths that issue #499 flagged as
the highest concentrated complexity in
vllm_mlx/bench_serve.py. No source changes: every test passes against
existing behavior, locking in the contract going forward.
Background
vllm_mlx/bench_serve.py is both a workload runner and an evidence
artifact producer for release qualification. The advisory complexity
scan from #499 highlighted four functions as the densest:
load_workload, validate_quality_checks, run_workload_case,
run_bench_serve_workload. Existing coverage in
test_bench_serve.py (1785 lines, 119 cases)
covers happy paths for workload loading, sweep expansion, formatters,
runner end-to-end, and basic quality checks. The corners that were
not pinned down: load_workload validation guard clauses, streaming
tool-call accumulation across chunk-boundary deltas, and the
diagnostic shape of validate_quality_checks for argument-validation
edge cases.
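As an illustration of what "accumulation across chunk-boundary deltas" means, the sketch below rebuilds tool calls from streamed fragments. It is a simplified stand-in rather than the runner's implementation: the names accumulate_sketch and finalize_sketch are hypothetical, and the delta shape (index, id, function.name, function.arguments) is assumed to follow the OpenAI-style streaming format.

```python
def accumulate_sketch(state: dict, deltas: list) -> None:
    """Fold streamed tool-call deltas into per-index slots (hypothetical helper)."""
    for delta in deltas:
        index = delta.get("index", 0)  # an omitted index is treated as 0
        slot = state.setdefault(index, {"id": None, "name": "", "arguments": ""})
        if delta.get("id"):
            slot["id"] = delta["id"]  # later deltas without an id leave it intact
        function = delta.get("function") or {}
        slot["name"] += function.get("name") or ""
        slot["arguments"] += function.get("arguments") or ""


def finalize_sketch(state: dict) -> list:
    """Return the accumulated calls ordered by index."""
    return [state[index] for index in sorted(state)]


state = {}
accumulate_sketch(state, [
    # index 1 arrives before index 0
    {"index": 1, "id": "call_b", "function": {"name": "get_time", "arguments": "{}"}},
    # no index given, so this lands in slot 0
    {"id": "call_a", "function": {"name": "get_", "arguments": '{"city": '}},
    # continuation of slot 0 without an id
    {"index": 0, "function": {"name": "weather", "arguments": '"Paris"}'}},
])
calls = finalize_sketch(state)
```

Here calls[0] ends up with id "call_a", name "get_weather", and arguments '{"city": "Paris"}', while calls[1] holds the call that arrived first.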
Tests added
tests/test_bench_serve_workload_hardening.py (21 cases):
- TestLoadWorkloadValidation (8)
- TestStreamingToolCallAccumulation (4)
- TestQualityCheckDiagnostics (9)

Verification
The behaviors the tests pin down, exercised against the live runner:
New regression suite:
The wider bench-serve suite is also green:
pytest tests/test_bench_serve.py tests/test_bench_serve_workload_hardening.py → 119 + 21 = 140/140 passed, plus 2 skipped (live-run integration).

Demo: tests catch a real regression
To prove the streaming-accumulation tests aren't vacuous, I temporarily
swapped the += concatenation in accumulate_tool_calls for a = overwrite
(the regression #499 worries about as the dense state machine gets
touched):
The hardening test fired with a precise diagnostic:
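The commit message quotes the failing comparison as 'weather' == 'get_weather'. A minimal sketch of why an overwrite produces that mismatch follows. The helper join_name is hypothetical, and the split of the name into the fragments "get_" and "weather" is an assumption chosen to match the quoted diagnostic.

```python
def join_name(fragments, overwrite=False):
    """Combine streamed name fragments; overwrite=True mimics the injected regression."""
    name = ""
    for fragment in fragments:
        if overwrite:
            name = fragment   # regression: only the last fragment survives
        else:
            name += fragment  # intended: fragments concatenate in arrival order
    return name


fragments = ["get_", "weather"]  # assumed chunk boundary inside the function name
assert join_name(fragments) == "get_weather"
assert join_name(fragments, overwrite=True) == "weather"
```

With the overwrite in place, an equality assertion against "get_weather" sees only "weather", which matches the comparison quoted above.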
After restoring += the full suite goes back to 21 passed.

Note on scope
This PR is test-only. No production logic changes. None of the new
tests surfaced a real bug against current main; the runner's
existing behavior matches what the new contract asserts. If a future
PR refactors any of the hot functions (load_workload,
accumulate_tool_calls, validate_quality_checks), this suite will
catch unintentional drift.