PR #427 #1400

Triggered via dynamic May 23, 2026 19:50
Status Success
Total duration 1m 19s
Artifacts

codeql

on: dynamic
Matrix: analyze

Annotations

8 warnings
Tests use real filesystem I/O instead of injecting filesystem dependency: src/benchmarks/claude-ui/__tests__/first-run-preflight.test.ts#L1
The tests create real temp directories via `mkdtemp` and read real log files via `readFile` because `dismissFirstRunPrompts` calls `appendFile` directly rather than through an injectable `FileSystemExecutor`. Add a filesystem injection point (e.g. `appendLog?: (path, msg) => Promise<void>`) so tests can capture log writes without touching disk.
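A minimal sketch of the suggested injection point follows. Only the `appendLog` signature comes from the annotation; `PreflightLogOptions`, `logPreflightEvent`, and `createLogCapture` are hypothetical names used for illustration, and the real `dismissFirstRunPrompts` options may look different.

```typescript
import { appendFile } from 'node:fs/promises';

// Mirrors the signature suggested in the annotation.
type AppendLog = (path: string, msg: string) => Promise<void>;

// Hypothetical options shape; only `appendLog` is taken from the annotation.
interface PreflightLogOptions {
  logPath: string;
  appendLog?: AppendLog;
}

// Hypothetical stand-in for the logging step inside `dismissFirstRunPrompts`.
async function logPreflightEvent(options: PreflightLogOptions, msg: string): Promise<void> {
  // Fall back to the real filesystem so existing callers need no changes.
  const appendLog: AppendLog =
    options.appendLog ?? ((path, text) => appendFile(path, text));
  await appendLog(options.logPath, `${msg}\n`);
}

interface CapturedLog {
  path: string;
  msg: string;
}

// Test helper: records writes in memory instead of touching disk.
function createLogCapture(): { entries: CapturedLog[]; appendLog: AppendLog } {
  const entries: CapturedLog[] = [];
  const appendLog: AppendLog = async (path, msg) => {
    entries.push({ path, msg });
  };
  return { entries, appendLog };
}
```

A test would pass `createLogCapture().appendLog` in the options and assert on `entries`, which removes the need for `mkdtemp` and `readFile`.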
Non-matching UI elements on first poll cause premature 'dismissed' completion: src/benchmarks/claude-ui/first-run-preflight.ts#L186
If `describe-ui` returns non-empty elements (e.g. a loading screen) before the configured first-run prompt has appeared, `uiSeen` is set **and** `promptsDismissed = true` in the same iteration — the loop exits without ever tapping the label. Have you considered requiring at least one additional poll after `uiSeen` becomes true before treating the absence of a label as confirmation the prompts are gone?
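One way to express the "additional poll" idea is a small state transition that only reports dismissal after several consecutive polls show a populated UI with no prompt label. This is a sketch under assumptions: `PollResult`, `PreflightState`, `nextState`, and the threshold of two polls are hypothetical and not taken from `first-run-preflight.ts`.

```typescript
// Hypothetical summary of one `describe-ui` poll.
interface PollResult {
  labelFound: boolean;  // a configured first-run label is on screen
  hasElements: boolean; // the UI tree is non-empty
}

interface PreflightState {
  uiSeen: boolean;
  quietPolls: number;   // consecutive polls with elements but no label
  promptsDismissed: boolean;
}

// Assumed threshold; the right value depends on the poll interval.
const REQUIRED_QUIET_POLLS = 2;

const initialState: PreflightState = {
  uiSeen: false,
  quietPolls: 0,
  promptsDismissed: false,
};

function nextState(state: PreflightState, poll: PollResult): PreflightState {
  if (poll.labelFound) {
    // The caller taps the label here; another prompt may follow, so start over.
    return { uiSeen: true, quietPolls: 0, promptsDismissed: false };
  }
  if (!poll.hasElements) {
    // Empty UI (not rendered yet, or mid-transition) is not evidence of dismissal.
    return { ...state, quietPolls: 0 };
  }
  const quietPolls = state.quietPolls + 1;
  return {
    uiSeen: true,
    quietPolls,
    promptsDismissed: quietPolls >= REQUIRED_QUIET_POLLS,
  };
}
```

With this rule a single loading-screen poll no longer ends the loop, and an empty element list between prompts resets the count instead of completing it. A loading screen that outlasts the threshold would still end the loop early, so the threshold is a trade-off, not a guarantee.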
Unhandled EPIPE error on child.stdin may crash the process: src/benchmarks/claude-ui/harness.ts#L214
If the child process exits before consuming all stdin data, Node.js emits an `'error'` event on `child.stdin`. Without an error handler, this becomes an unhandled exception and can crash the benchmark harness. Add `child.stdin.on('error', () => {})` or a proper handler before calling `.end()`.
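The sketch below shows the handler attached before `.end()`. It records the stdin error rather than discarding it, so the caller can decide whether it matters. `runWithInput` and `StdinWriteOutcome` are hypothetical names, not the harness's actual API.

```typescript
import { spawn } from 'node:child_process';

interface StdinWriteOutcome {
  code: number | null;      // exit code reported by the 'close' event
  stdinError: Error | null; // e.g. EPIPE if the child exited before reading
}

// Hypothetical helper, not the harness's real function.
function runWithInput(
  command: string,
  args: string[],
  input: string,
): Promise<StdinWriteOutcome> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdinError: Error | null = null;

    // Drain output so the child cannot block on a full pipe.
    child.stdout.resume();
    child.stderr.resume();

    // Without this listener, an 'error' event on stdin is thrown as an
    // uncaught exception.
    child.stdin.on('error', (error: Error) => {
      stdinError = error;
    });

    child.on('error', reject); // spawn failures such as ENOENT
    child.on('close', (code) => resolve({ code, stdinError }));

    child.stdin.end(input);
  });
}
```

Recording the error instead of using an empty handler keeps a genuine write failure visible in the result while still preventing the crash.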
Log write failure after `simctl create` rejects with I/O error instead of resolving, orphaning the simulator: src/benchmarks/claude-ui/simulator-lifecycle.ts#L201
If `appendLifecycleLog` fails (e.g. disk full) after a subprocess completes, `reject` is called with the file-write error rather than resolving with the command result. A successful `xcrun simctl create` then appears to have failed, `simulatorId` is never returned to `prepareTemporarySimulator`, and the newly created simulator is leaked with no cleanup path.
One-sided variance check allows zero-work runs to pass all metric thresholds: src/benchmarks/claude-ui/compare.ts#L12
All numeric metrics use only an upper-bound check (`actual <= expected + allowedVariance`), so a run where the agent completes nothing (0 tool calls, 0s elapsed) passes every metric regardless of the configured baseline. Combined with the default `sequenceMode: 'warn'`, a silent failure where Claude exits 0 without calling any tools reports an overall PASS.
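The difference between the one-sided check and a two-sided band can be sketched as follows. `MetricCheck` and both function names are hypothetical; only the expression `actual <= expected + allowedVariance` is taken from the annotation.

```typescript
// Hypothetical metric shape for illustration.
interface MetricCheck {
  actual: number;
  expected: number;
  allowedVariance: number;
}

// The check described in the annotation: only an upper bound.
function passesUpperBoundOnly(metric: MetricCheck): boolean {
  return metric.actual <= metric.expected + metric.allowedVariance;
}

// Two-sided alternative: the value must stay within the band around the baseline.
function passesTwoSided(metric: MetricCheck): boolean {
  return Math.abs(metric.actual - metric.expected) <= metric.allowedVariance;
}

// A zero-work run against a baseline of 12 tool calls with variance 3.
const zeroWorkRun: MetricCheck = { actual: 0, expected: 12, allowedVariance: 3 };

const upperBoundVerdict = passesUpperBoundOnly(zeroWorkRun); // true: 0 <= 15
const twoSidedVerdict = passesTwoSided(zeroWorkRun);         // false: |0 - 12| > 3
```

A symmetric band would also flag runs that are legitimately much faster than the baseline. If that is undesirable, a separate minimum floor per metric (for example, at least one tool call) is another option.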
Loop breaks prematurely when UI has zero elements after a successful prompt tap: src/benchmarks/claude-ui/first-run-preflight.ts#L186
After a button tap sets `uiSeen = true` (line 198), any subsequent `not-found` result with `hasElements: false` (which occurs during UI transitions between prompts) immediately sets `promptsDismissed = true` and breaks the loop, so only the first of multiple sequential first-run prompts is ever dismissed.
Failure patterns tested against raw JSONL lines cause double-counting when a tool failure also matches a configured pattern: src/benchmarks/claude-ui/render.ts#L155
In `analyzeClaudeJsonl`, each configured failure pattern is tested against the full raw JSON text of every JSONL line before any type-specific handling. A tool result that triggers `resultDidError` (contributing to `audit.failures`) and whose raw JSON also contains a configured pattern string (contributing to `audit.patternFailures`) is therefore counted twice in `failureCount`, inflating the reported failure total and potentially flipping a passing benchmark to failing.
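One possible fix is to derive `failureCount` from the set of failing lines, so a line that trips both checks counts once. This is a sketch, not the actual `analyzeClaudeJsonl`: `countFailures`, the `FailureAudit` shape, and the `resultDidError` parameter signature are assumptions.

```typescript
// Hypothetical audit shape; the arrays hold indexes of failing lines.
interface FailureAudit {
  failures: number[];
  patternFailures: number[];
  failureCount: number;
}

function countFailures(
  lines: string[],
  patterns: string[],
  resultDidError: (entry: unknown) => boolean, // assumed signature
): FailureAudit {
  const failures: number[] = [];
  const patternFailures: number[] = [];
  const failedLines = new Set<number>();

  lines.forEach((line, index) => {
    // Pattern check against the raw text, as in the annotated code.
    if (patterns.some((pattern) => line.includes(pattern))) {
      patternFailures.push(index);
      failedLines.add(index);
    }

    let entry: unknown;
    try {
      entry = JSON.parse(line);
    } catch {
      return; // not valid JSON; only the pattern check applies
    }

    if (resultDidError(entry)) {
      failures.push(index);
      failedLines.add(index);
    }
  });

  // Each line contributes at most one failure, however many checks it trips.
  return { failures, patternFailures, failureCount: failedLines.size };
}
```

Both detail arrays are kept so a report can still show why a line failed; only the total is deduplicated.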
Log write failure in `close` handler rejects the promise instead of resolving with the command result: src/benchmarks/claude-ui/simulator-lifecycle.ts#L196
If `appendLifecycleLog` fails after a successful command (e.g. disk full, bad log path), `.catch(reject)` settles the promise with the I/O error, discarding the actual command result. This makes a successful `simctl create` or `simctl boot` look like a failure. The `error` handler on line 173 correctly uses `.finally(() => reject(error))` to always deliver the original error — apply the same pattern here: `.finally(() => resolve(result))`.
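The suggested pattern can be sketched as below. `settleAfterLogging` and `CommandResult` are hypothetical names, and the logger is passed in as a parameter only to keep the example self-contained. The extra `.catch` is an addition to the annotation's suggestion: the promise returned by `.finally` still carries the logger's rejection, so without it a failed log write would surface as an unhandled rejection.

```typescript
// Hypothetical result shape for illustration.
interface CommandResult {
  code: number | null;
  stdout: string;
}

// Hypothetical helper showing how a `close` handler could settle.
function settleAfterLogging(
  result: CommandResult,
  appendLifecycleLog: (msg: string) => Promise<void>,
): Promise<CommandResult> {
  return new Promise((resolve) => {
    appendLifecycleLog(`command exited with code ${result.code}`)
      // Logging is best effort: swallow the I/O error so it neither replaces
      // the command result nor becomes an unhandled rejection.
      .catch(() => undefined)
      // Always deliver the command result, whether or not the log was written.
      .finally(() => resolve(result));
  });
}
```

With this shape, a disk-full error during logging leaves the caller with the real `simctl` outcome, so a created simulator's identifier still reaches the code responsible for cleanup.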