feat(benchmarks): Add Claude UI benchmark harness #427

6 issues

Medium

First-run prompt preflight declares success on initial app load before prompts appear - `src/benchmarks/claude-ui/first-run-preflight.ts:229-233`

When the app launches, any initial UI content (loading screen, app chrome) that doesn't contain a target label causes promptsDismissed = true to be set immediately, skipping all prompt dismissals even though the actual first-run prompts haven't appeared yet.

Also found at:

src/benchmarks/claude-ui/simulator-lifecycle.ts:256-259
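
One way to avoid the premature success is to count a prompt-free snapshot only once the app has given a positive readiness signal, and to require several such snapshots in a row. The sketch below is illustrative only: `UiSnapshot`, `PreflightDeps`, `isAppReady`, and `requiredQuietSnapshots` are hypothetical names, not taken from first-run-preflight.ts, and the real harness may prefer a different readiness check.

```typescript
// Hypothetical snapshot shape; the harness's real UI snapshot type is not shown in the review.
type UiSnapshot = { labels: string[] };

interface PreflightDeps {
  targetLabels: string[];                         // labels of first-run prompts to tap
  takeSnapshot: () => Promise<UiSnapshot>;        // injected UI reader
  tapLabel: (label: string) => Promise<void>;     // injected tap action
  isAppReady: (snapshot: UiSnapshot) => boolean;  // positive signal that launch has finished
  requiredQuietSnapshots: number;                 // consecutive ready, prompt-free snapshots needed
  maxSnapshots: number;                           // upper bound on polling
}

async function dismissFirstRunPrompts(
  deps: PreflightDeps,
): Promise<{ promptsDismissed: boolean; tapped: string[] }> {
  const tapped: string[] = [];
  let quiet = 0;
  for (let i = 0; i < deps.maxSnapshots; i++) {
    const snapshot = await deps.takeSnapshot();
    const prompt = deps.targetLabels.find((label) => snapshot.labels.includes(label));
    if (prompt !== undefined) {
      await deps.tapLabel(prompt);
      tapped.push(prompt);
      quiet = 0; // a prompt appeared, so restart the quiet window
      continue;
    }
    // A loading screen has no target label either, so it must not count as "quiet".
    quiet = deps.isAppReady(snapshot) ? quiet + 1 : 0;
    if (quiet >= deps.requiredQuietSnapshots) {
      return { promptsDismissed: true, tapped };
    }
  }
  return { promptsDismissed: false, tapped };
}
```

A quiet window alone only narrows the race; pairing it with a readiness predicate is what keeps a loading screen from being read as "no prompts left".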

Log write failure causes `runLoggedCommand` to reject instead of resolving with the command result - `src/benchmarks/claude-ui/simulator-lifecycle.ts:205-206`

If appendLifecycleLog fails (e.g. disk full, bad log path), .catch(reject) surfaces the I/O error instead of resolving with the already-complete result, silently dropping the command's exit code, stdout, and stderr.

runLoggedCommand rejects on log-write failure, masking a successful command result

In simulator-lifecycle.ts, the close handler calls resolve(result) only inside the .then() of the appendLifecycleLog promise. Any transient I/O error while writing the lifecycle log (e.g. disk full) therefore rejects the outer promise with the log error instead of resolving with the command result. prepareTemporarySimulator then throws as if xcrun simctl boot or bootstatus had failed when they actually succeeded.
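
A possible shape for the fix is to treat the log write as best-effort and always hand back the command result. This is a minimal sketch under assumptions: `CommandResult`, `finishLoggedCommand`, and `onLogError` are hypothetical names, and `appendLog` stands in for appendLifecycleLog, whose real signature is not shown here. Whether a log failure should be reported some other way is left to the author.

```typescript
// Hypothetical result shape; the real type in simulator-lifecycle.ts may differ.
type CommandResult = { exitCode: number | null; stdout: string; stderr: string };

// Try to log the finished command, but never let a log failure replace its result.
async function finishLoggedCommand(
  result: CommandResult,
  appendLog: (entry: string) => Promise<void>,      // stand-in for appendLifecycleLog
  onLogError: (error: unknown) => void = () => {},  // optional hook, e.g. to print a warning
): Promise<CommandResult> {
  try {
    await appendLog(JSON.stringify(result));
  } catch (error) {
    onLogError(error); // surface the I/O problem without masking the command outcome
  }
  return result;
}
```

Inside the close handler this could be wired as `finishLoggedCommand(result, appendLog).then(resolve)`, so the outer promise resolves with the exit code, stdout, and stderr regardless of the log write.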

Low

`writeMcpConfig` tests use real tmpdir filesystem I/O instead of injected dependencies - `src/benchmarks/claude-ui/__tests__/simulator-lifecycle.test.ts:359-464`

writeMcpConfig in harness.ts calls mkdir/writeFile directly from node:fs/promises with no filesystem injection, so the corresponding tests in simulator-lifecycle.test.ts resort to real mkdtemp(os.tmpdir(), ...) plus real readFile to verify output. This is sandboxed and deterministic (no xcodebuild/xcrun/AXe/device/simulator calls), but it does deviate from the skill's preference for injecting filesystem dependencies. Consider extracting the pure config-building logic (already mostly isolated in isolatedSessionDefaults) and testing that without I/O, or threading a writer abstraction into writeMcpConfig.
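
The writer-abstraction option could look roughly like the sketch below. Everything here is an assumption made for illustration: the `McpConfig` shape, the `ConfigWriter` interface, the `buildMcpConfig` helper, and the `mcp-config.json` file name are hypothetical and are not taken from harness.ts.

```typescript
// Hypothetical config shape; the real output of writeMcpConfig is not shown in the review.
interface McpServerEntry { command: string; args: string[]; env: Record<string, string>; }
interface McpConfig { mcpServers: Record<string, McpServerEntry>; }

// Minimal filesystem surface that tests can replace with an in-memory fake.
interface ConfigWriter {
  mkdir(dir: string): Promise<void>;
  writeFile(path: string, contents: string): Promise<void>;
}

// Pure config-building step: no I/O, so it can be asserted on directly.
function buildMcpConfig(
  serverName: string,
  command: string,
  args: string[],
  env: Record<string, string>,
): McpConfig {
  return { mcpServers: { [serverName]: { command, args: [...args], env: { ...env } } } };
}

// I/O step: all filesystem access goes through the injected writer.
async function writeMcpConfig(dir: string, config: McpConfig, writer: ConfigWriter): Promise<string> {
  const path = `${dir}/mcp-config.json`; // hypothetical file name
  await writer.mkdir(dir);
  await writer.writeFile(path, JSON.stringify(config, null, 2) + "\n");
  return path;
}
```

Production code would pass a writer backed by mkdir and writeFile from node:fs/promises, while tests pass a fake that records calls, which removes the need for mkdtemp and readFile.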

runCommand has no error handler on child.stdin, risking an unhandled EPIPE crash

If the spawned process (e.g. claude) exits before consuming all stdin, child.stdin will emit an 'error' event (EPIPE) with no registered listener, which becomes an uncaught exception and crashes the benchmark process.
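
A minimal sketch of the hardening, assuming a simplified stand-in for runCommand: `runWithInput` and its return shape are hypothetical, and the real function also captures stdout and stderr to files. The point it illustrates is registering an 'error' listener on child.stdin before writing to it.

```typescript
import { spawn } from "node:child_process";

// Simplified stand-in for runCommand: feed `input` to the child and report how it exited.
function runWithInput(
  command: string,
  args: string[],
  input: string,
): Promise<{ exitCode: number | null; stdinError?: Error }> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdinError: Error | undefined;

    child.on("error", reject); // spawn failures, e.g. command not found

    // Without this listener, an EPIPE from an early-exiting child is an uncaught exception.
    child.stdin.on("error", (error) => {
      stdinError = error;
    });

    // Drain the output pipes so the child cannot block on a full buffer.
    child.stdout.resume();
    child.stderr.resume();

    child.on("close", (exitCode) => resolve({ exitCode, stdinError }));
    child.stdin.end(input);
  });
}
```

Recording the stdin error instead of rejecting keeps the child's exit code as the primary outcome; rejecting on it would be an equally defensible choice if unconsumed input should fail the run.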

Write stream errors in benchmark `runCommand` are unhandled - `src/benchmarks/claude-ui/harness.ts:228-242`

In runCommand, the createWriteStream instances for stdout/stderr have no 'error' listener attached before being used as pipe destinations. If a write fails (e.g. disk full, permission error on the artifacts directory) before child.on('close', ...) fires, Node will emit 'error' on a stream with no listener and throw an uncaught exception, crashing the benchmark run. child.on('error', reject) only covers spawn errors, and finished(stdout)/finished(stderr) is only awaited inside the close handler. This is local-only benchmark code so user impact is limited to a noisy crash during a benchmark run rather than a clean rejection, but adding .on('error', reject) to both streams is a small, safe hardening.
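
The suggested hardening can be sketched in isolation as follows. `pipeWithErrorHandling` is a hypothetical helper, and it takes a generic Writable where the harness uses createWriteStream instances; the listener is attached before the sink is used as a pipe destination, so a failed write becomes a rejection.

```typescript
import { Readable, Writable } from "node:stream";

// Pipe `source` into `sink` and turn sink failures into a promise rejection.
function pipeWithErrorHandling(source: Readable, sink: Writable): Promise<void> {
  return new Promise((resolve, reject) => {
    // Attach the listener before piping so an early write error is never unhandled.
    sink.on("error", reject);
    sink.on("finish", () => resolve());
    source.pipe(sink);
  });
}
```

In runCommand the equivalent change is the `.on('error', reject)` call on both write streams; `pipeline` from node:stream/promises is another standard way to get the same propagation.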

13 skills analyzed

| Skill | Findings | Duration | Cost |
|---|---|---|---|
| security-review | 0 | 6m 5s | $1.15 |
| xcodebuildmcp-docs-release-review | 0 | 47.1s | $0.13 |
| xcodebuildmcp-docs-command-review | 0 | 4m 48s | $0.00 |
| xcodebuildmcp-packaging-resource-review | 0 | 4m 48s | $0.00 |
| xcodebuildmcp-test-boundary-review | 1 | 4m 58s | $2.52 |
| xcodebuildmcp-tool-contract-review | 0 | 4m 47s | $0.27 |
| wrdn-pii | 0 | 1m 6s | $0.15 |
| wrdn-authz | 0 | 47.9s | $0.07 |
| wrdn-code-execution | 0 | 47.4s | $0.13 |
| wrdn-data-exfil | 0 | 35.5s | $0.06 |
| find-bugs | 4 | 20m 5s | $9.03 |
| code-review | 1 | 17m 22s | $1.03 |
| code-simplifier | 0 | 21m 34s | $1.90 |

⏱ 88m 31s · 8.9M in / 436.0k out · $16.46


