fix+feat+test: HangBuster review follow-ups + capture-path bugs + raw NDJSON mode (closes #79 #80 #81 #82 #83 #84 #85)#90
Merged
Conversation
Closes #81. The worker's auto-sample stash previously wrote auto_samples.json as a read-modify-write of an aggregate dict, with no locking. Under concurrent stashes (busy worker, many distinct fingerprints) one stash could clobber another's writes. Switch to auto_samples.jsonl: append-only, one record per line, fsync on each write. Mirrors the events.jsonl discipline. - SessionStore.stash_auto_sample(session_id, fingerprint, sample) — appends - SessionStore.read_auto_samples(session_id) — reduces last-write-wins per fp - HangBuster._stash_auto_sample now delegates to the store; no local I/O Tests: 5 new in tests/test_hang_sessions.py covering basic append, last-write-wins, missing-file, 50-thread concurrent stash, and corrupt-line skip. Verification: ruff/black clean; 93 passed (88 → 93).
Closes #79. The divide-by-zero guard in diff_sessions() used a bare `continue`, silently dropping clusters whose A-side max_duration_ms was 0. A real regression where a previously-silent cluster started hanging (0 → N ms) was lost from the diff entirely. Replace with explicit 4-way branching: - both zero → stable - A=0, B>0 → drift, delta_pct = +inf (treated as worsened) - A>0, B=0 → drift, delta_pct = -100.0 (fully improved) - both nonzero → existing percentage logic, unchanged format_diff() renders the inf delta as `new` instead of `+inf%` for readability. Tests: 5 new in tests/test_diff.py covering each branch + a regression guard that the standard nonzero path still uses the drift threshold, plus format_diff render assertion for the inf case. Verification: ruff/black clean; 93 passed (88 → 93).
…to 2 Closes #80. compute_fingerprint() previously returned the raw `sym:<symbol>` or `msg:<message_prefix>` string. Upstream normalisation (normalise_message collapsing tokens to <addr>/<pid>/<n> placeholders) shrinks the effective namespace dramatically — distinct hangs with overlapping normalised prefixes silently clustered together. Switch to sha256[:16]: key = f'sym:{symbol}' if symbol else f'msg:{message_prefix}' return f'fp:{hashlib.sha256(key.encode()).hexdigest()[:16]}' Cluster.symbol_or_prefix already carries the human-readable label — fingerprint is purely an identity key, so opacity is fine. ## Compat FINGERPRINT_VERSION bumps from 1 → 2. `diff_sessions` already warns + skips structural compare on version mismatch, so v1 summaries on disk are gracefully handled. ## Tests - test_compute_fingerprint_prefers_symbol: updated for new `fp:` prefix - test_compute_fingerprint_falls_back_to_prefix: updated, asserts opacity - test_compute_fingerprint_is_stable_across_calls: new (deterministic guard) - test_compute_fingerprint_avoids_overlap_collision: new (the issue's motivating case — distinct messages sharing a 40-char head must not collide) - test_build_normalised_event_full: `sym:` → `fp:` prefix update Verification: ruff/black clean; 90 passed.
Closes #82. Three coverage gaps from the PR #78 review, now closed. The crashed-worker recovery path had no test coverage. Adds: - test_mark_crashed_sets_status — direct call after create - test_mark_crashed_after_claim_overrides_running — overrides running state - test_mark_crashed_silent_on_missing_session — no-op on dead session ID - test_load_summary_returns_none_when_missing — crashed session has no summary.json The 50-event fixture in tests/fixtures/sample_session.py only ever produced CRITICAL events (all duration ≥ 600 ms). L1 hit ~36% of its 200-token budget; L2 only ~8% of 2000. Token-budget tests trivially passed. Hardened make_events(): - 12 clusters spanning all 4 Severity tiers (MINOR/WARN/CRITICAL/FROZEN) - One hot cluster at 18 events (near-max count); others spread across cold tail - Two processes (MyApp + Helper) → process_distribution populates - First-cluster events packed tightly → detect_temporal_bursts populates - 7s gap inserted mid-window → detect_quiet_periods populates - Long symbol names → per-line formatter stress After hardening: L0=28/30 (93%), L1=91/200 (46%), L2=298/2000 (15%). L1/L2 are naturally sparse at default top_n=3; the real win is that all four L2 aggregate branches (Severity / Bursts / Quiet periods / Processes) now actually render. Adds qualitative assertions in test_token_budgets.py: - test_fixture_covers_all_severity_tiers — guard against fixture regression - test_fixture_has_at_least_ten_clusters - test_l2_renders_all_aggregate_branches — every L2 branch must populate - test_l1_floor_is_meaningful — floor at 60 tokens - test_l2_floor_renders_full_summary — floor at 200 tokens diff_sessions() produces 5 verdict strings; only 3 were tested. format_diff was only rendered for 'no change' and version mismatch — the new/resolved/drift render paths were never asserted. Adds 6 new in tests/test_diff.py: - test_diff_verdict_regression_new_minor - test_diff_verdict_improvement_resolved - test_diff_verdict_drift_mixed_signs (the bidirectional drift case) - test_format_diff_renders_new_block - test_format_diff_renders_resolved_block - test_format_diff_renders_drift_block Small helper _summary_with_clusters() builds SessionSummary instances with controlled per-cluster max_duration_ms — keeps verdict tests isolated from the event pipeline. test_diff_detects_resolved_cluster filtered by an old symbol name from the soft fixture ([ImageDecoder decode:]). Updated to use the new equivalent. - ruff check: All checks passed - black --check (scripts): clean - python -m pytest tests/: 103 passed (was 88) - pytest tests/: 103 passed
This was referenced May 23, 2026
…tor Discovery sections Documents the new scripts that landed since the last doc pass: appearance.py, location.py, model_inspector.py, container.py, localization_audit.py, sim_list.py, simulator_selector.py. - CLAUDE.md: script count 21/27 → 29; add Device State (2) and Simulator Discovery (2) categories; reorganise Testing & Analysis (6 → 9). - SKILL.md: full 29-script listing with renumbered Lifecycle scripts (23-29). - marketplace.json, plugin.json: description bumped to 29. - validate-version.yml: fix SKILL.md path after directory restructure. - .gitignore: fix local config.json path to match new directory layout.
Pre-fix: passing --bundle-id appended (process == <app>) AND to the os_log predicate. The default hang predicate matches RunningBoard/SpringBoard/watchdog events that originate outside the target app's process, so the AND silently dropped the bulk of useful hang signal. Verified on Grapla: a 4-minute session with --bundle-id com.conorluddy.grapla captured 0 events while the simulator was actively logging hangs. Fix: predicate stays simulator-global; bundle filtering applied post-parse against the event's process field. Mirrors the existing _matches_bundle path in the legacy --watch/--since modes; promoted to module-level matches_bundle() so the HangBuster worker reuses it. - _resolve_predicate signature simplified: drops bundle_id arg entirely. Adding it back would resurrect the bug — locked down by test. - Worker (run_worker) applies matches_bundle() after parse, before matched++. - --bundle-id help text updated to clarify it's a post-parse filter. - HangWatcher._matches_bundle becomes a thin delegator (legacy callers). Tests: 8 new in tests/test_predicate.py — predicate composition, env var precedence, signature lockdown, post-parse match behaviour. 115 → 123 passing.
…re duration (closes #84) Pre-fix: log stream EOF after 30s of silence broke the read loop, wrote a stream_ended event, and exited. SessionStore.persist_worker_counters then overwrote the status back to 'running' — sessions appeared live forever in --list-sessions even though the worker had been dead for hours. Verified on Grapla: a 19s session showed counters.total=1, status='running' indefinitely. Fix split across three concerns: 1. Crash detection (run_worker rewrite): - Removed the 30s silence bail. A quiet log stream is not a dead one; bailing on silence was a false-positive death signal. - EOF on stdin or proc.poll() != None now triggers a stream_died event with exit_code and attempt number. - try/finally wrapping ensures mark_crashed runs whenever the worker exits involuntarily (subprocess died, restart budget exhausted). - Inner read loop extracted into _read_stream_into_events for clarity and direct testability. 2. Bounded auto-restart: - New env var IOS_SIM_HANG_MAX_RESTARTS (default 3). Each restart logs a stream_restart event with attempt number and a 2s backoff prevents tight respawn loops. - counters now tracks stream_restarts; surfaces in summary + --list-sessions. - SIGTERM during backoff or read loop exits cleanly without restart. 3. Capture-duration tracking (hang_sessions.py): - mark_crashed now also records stopped_at/stopped_at_ms — so crashed sessions have a real end-of-capture timestamp, not just a status flip. - persist_worker_counters preserves CRASHED status (alongside STOPPED) and no longer clobbers it back to RUNNING. - build_summary prefers meta.stopped_at_ms over now() so summaries reflect the capture window, not the time of inspection. - --list-sessions surfaces 'capture=Xs' and 'restarts=N' columns. Tests: 8 new (5 in tests/test_worker_loop.py, 3 in tests/test_hang_sessions.py). Fake subprocess.Popen drives EOF/restart/exhaustion scenarios deterministically. 123 → 131 passing.
…loses #85) The summarised pipeline (parse → bucket → cluster) is the right shape when you trust the bucketing, but sometimes you want full-fidelity log data and ad-hoc jq exploration. --raw-capture spawns 'log stream --style ndjson' and dumps lines verbatim to raw.ndjson, bypassing the clustering pipeline. Garbage collection is automatic and layered so a year of daily use ≈ a day: - Per-session size cap (--max-size-mb, default 10) — worker stops cleanly on cap hit, writes size_cap_hit marker, sets extras.truncated. - Gzip on stop (skippable with --no-gzip) — NDJSON compresses ~15-19x in practice. Verified live: 0.96 MB raw → 0.05 MB gzipped on Grapla today. - TTL prune (existing IOS_SIM_HANG_SESSION_TTL_HOURS=24) runs on every --start. New aggregate cap IOS_SIM_HANG_TOTAL_CAP_MB (default 100) runs alongside, evicting oldest sessions first when total skill state exceeds. Verified end-to-end on Grapla: a 12s session captured 737 raw NDJSON lines. 'jq -s group_by(.processImagePath)' shows runningboardd=544, SpringBoard=83, mediaremoted=39, locationd=39, grapla=28. The grapla-process share (4%) is exactly why the --bundle-id AND clause in #83 was wrong: it filtered out 96% of useful hang signal originating from system daemons. Implementation: - New args: --raw-capture, --max-size-mb, --no-gzip; new env var IOS_SIM_HANG_TOTAL_CAP_MB. Plumbed through start_args into meta.args. - run_worker grew a raw-capture branch that reuses the #84 restart/crash machinery: spawn log stream --style ndjson, feed to _read_stream_into_raw, which dumps lines verbatim and bails on size cap. - Non-JSON banners (log stream's 'Filtering the log data ...' first line, pwuid_r warnings) are skipped so consumers get strict NDJSON for jq. - _stop_raw_session gzips raw.ndjson → raw.ndjson.gz (via stdlib gzip + shutil.copyfileobj), removes the uncompressed file, records bytes in meta.extras. No summary.json for raw sessions. - _mark_truncated sets extras.truncated when the size cap fires. - SessionStore.session_total_bytes + prune_to_aggregate_cap: sum sizes per session, evict oldest first until total ≤ cap. - --list-sessions surfaces raw size, truncated flag, capture duration. - --get-details on a raw session points to the file with a zcat | jq hint. Tests: 7 new (in tests/test_worker_loop.py) — raw line writing, size-cap trigger, gzip round-trip + content preservation, --no-gzip preserves uncompressed, aggregate-cap eviction order, session_total_bytes. 131 → 138 passing.
…v vars (#83 #84 #85) Documents the three new capabilities landed in this PR across all three contributor entry points: SKILL.md: - Expanded hang_watcher entry: raw-capture sub-section (size cap, gzip, jq exploration), resilience section (auto-restart + capture duration + list-sessions surface), automatic cleanup section (TTL + aggregate cap), bundle-id post-parse note. - Two new env-var rows: IOS_SIM_HANG_MAX_RESTARTS, IOS_SIM_HANG_TOTAL_CAP_MB. - Updated IOS_SIM_HANG_PREDICATE description to clarify why predicate stays simulator-global (so users don't try to AND a process clause). - Two quick-start blocks: summarised mode (--start/--stop/--get-details) and raw-capture mode (jq cookbook with three production-tested queries). README.md: - hang_watcher row updated with raw-capture flags + auto-restart + cleanup. - Predicate env-var description hardened to explain post-parse bundle filter. - Two new env-var rows for restarts + total cap. CLAUDE.md (project guidance for contributors): - hang_watcher.py one-liner expanded with new modes + env vars. - hang_sessions.py blurb: status state machine, mark_crashed timestamps, aggregate cap + raw_path/session_total_bytes accessors, file layout includes raw.ndjson.gz + auto_samples.jsonl. No code changes. 138 tests still pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Combined PR for the four HangBuster review follow-ups plus three live-test discoveries (#83/#84/#85). Each was originally a separate PR (#86, #87, #88, #89); merged into one branch on request. The seven logical commits remain atomic — easy to revert individually if needed.
Closes #79, #80, #81, #82, #83, #84, #85.
Commits in this PR
fix: append-only JSONL for auto_samples to remove read-modify-write racefix: handle zero-duration clusters in diff_sessions()feat: hash compute_fingerprint with sha256, bump FINGERPRINT_VERSION to 2test: harden HangBuster coverage — crashed-worker, fixture, verdictschore: doc sweep — bump script count to 29, add Device State + Simulator Discovery sectionsfix: --bundle-id no longer narrows hang predicatefix+feat: HangBuster worker EOF detect + bounded auto-restart + capture durationfeat: --raw-capture NDJSON mode + size cap + gzip + aggregate prune#81 — auto_samples JSONL (race fix)
HangBuster._stash_auto_sampledid a read-modify-write onauto_samples.jsonwith no locking; concurrent stashes from a busy worker could drop entries. Switch to append-onlyauto_samples.jsonlmirroring the existingevents.jsonldiscipline (line-buffered +fsync).New
SessionStore.stash_auto_sample()/read_auto_samples()(last-write-wins per fp). Worker delegates — no local I/O.#79 — zero-duration drift handling
diff_sessions()had a divide-by-zero guard that did a barecontinue, silently dropping zero-duration clusters from the diff. Worst case: a 0 → 800 ms regression vanished.Replaced with explicit 4-way branching:
A == 0, B == 0stable++A == 0, B > 0delta_pct = inf(rendered asnew)A > 0, B == 0delta_pct = -100.0A > 0, B > 0format_diffspecial-cases theinfdelta so it prints asnewrather than+inf%.#80 — sha256 fingerprint,
FINGERPRINT_VERSION=2compute_fingerprint()returned the rawsym:<symbol>/msg:<message_prefix>string. Upstream normalisation collapses tokens to<addr>/<pid>/<n>placeholders, so distinct messages with overlapping normalised prefixes collided. Now hashed with sha256[:16] (fp:<16hex>).Cluster.symbol_or_prefixalready serves as the human-readable label — fingerprint is purely identity. No newdisplay_prefixfield needed.FINGERPRINT_VERSIONbumps 1 → 2.diff_sessions()already warns + skips on version mismatch (existing test coverage), so old v1 summaries degrade cleanly.#82 — test coverage hardening
Gap 1:
mark_crashed+load_summary→None(4 new tests) — the crashed-worker recovery path was entirely untested.Gap 2: fixture too soft (5 new tests + fixture rewrite) — the soft fixture only produced CRITICAL events. L1 hit ~36% of budget, L2 only ~8%. New fixture spans all 4 severities, 12 clusters, 2 processes, populates bursts + quiet periods + process_distribution. All four L2 aggregate branches now actually render.
Gap 3: verdict +
format_diffrender coverage (6 new tests) — covers the 3 remaining verdict strings (new minor/improvement: resolved/drift: mixed signs) and the previously-unrendered new / resolved / drift output blocks.Combined branch unified two near-identical helper functions (
_zero_dur_summaryfrom #79 and_summary_with_clustersfrom #82) into a single shared_summary_with_clusters((fp, dur, sev) tuple)— kept the more flexible form.#83 —
--bundle-idno longer narrows the predicateLive-test surfaced an unfiled bug:
_resolve_predicate(predicate, bundle_id)appendedAND (process == <app>)to the os_log predicate. The default hang predicate matches RunningBoard / SpringBoard / watchdog events that originate outside the target app's process, so the AND silently dropped the bulk of useful hang signal.Verified on Grapla: a 4-minute session with
--bundle-id com.conorluddy.graplacaptured 0 events while the simulator was actively logging hangs. After removing the AND, an identical workload captures hundreds of RunningBoard events per minute.Fix: predicate stays simulator-global; bundle filtering applied post-parse against the event's process field. Promoted the existing
_matches_bundlehelper to a module-levelmatches_bundle()so the HangBuster worker reuses it._resolve_predicatesignature simplified — droppingbundle_idarg entirely and locking down viainspect.signatureregression test so the bug cannot return.#84 — Worker EOF detection + bounded auto-restart + capture duration
Live-test surfaced a second bug:
log streamEOF (or 30s of silence under the old false-positive bail) broke the read loop, wrote astream_endedevent, and exited.SessionStore.persist_worker_countersthen overwrote the status back torunning— sessions appeared live indefinitely even though the worker had been dead for hours.Three-part fix:
Crash detection (run_worker rewrite). Removed the 30s silence bail (a quiet log stream is not a dead one). EOF or
proc.poll() != Nonenow emits astream_diedevent with exit_code.try/finallywrapping ensuresmark_crashedruns whenever the worker exits involuntarily. Inner read loop extracted into_read_stream_into_eventsfor direct testability.Bounded auto-restart. New env var
IOS_SIM_HANG_MAX_RESTARTS(default 3). Each restart logs astream_restartevent with attempt number and a 2s backoff prevents tight respawn loops.counterstracksstream_restarts; surfaces in summary +--list-sessions. SIGTERM during backoff or read loop exits cleanly without restart.Capture-duration tracking.
mark_crashedrecordsstopped_at/stopped_at_ms.persist_worker_counterspreserves both CRASHED and STOPPED status.build_summaryprefersmeta.stopped_at_msovernow()so summaries reflect the capture window.--list-sessionssurfacescapture=Xsandrestarts=N.5 new worker-loop tests + 3 new session-store tests use a scripted fake
subprocess.Popento drive EOF / mid-stream death / SIGTERM / xcrun-missing scenarios deterministically.#85 —
--raw-captureNDJSON mode + size cap + gzip + aggregate pruneThe summarised pipeline is the right shape when you trust the bucketing, but sometimes you want full-fidelity log data and ad-hoc
jqexploration.--raw-capturespawnslog stream --style ndjsonand dumps lines verbatim toraw.ndjson, bypassing clustering.Garbage collection is automatic and layered:
--max-size-mb, default 10) — worker stops cleanly on cap hit, writessize_cap_hitmarker, setsextras.truncated.--no-gzip) — NDJSON compresses ~15-19× in practice.IOS_SIM_HANG_SESSION_TTL_HOURS=24) runs on every--start. New aggregate capIOS_SIM_HANG_TOTAL_CAP_MB(default 100) runs alongside, evicting oldest sessions first when total skill state exceeds.Verified end-to-end on Grapla: a 12-second session captured 737 raw NDJSON lines, 0.96 MB → 0.05 MB gzipped (~19×).
jqqueries work directly on the gzipped file:That process breakdown is the smoking gun for #83: only 4% of useful hang signal comes from the app process itself.
Implementation notes:
--raw-capture,--max-size-mb,--no-gzip. New env:IOS_SIM_HANG_TOTAL_CAP_MB.run_workergrew a raw-capture branch that reuses the bug: HangBuster worker silently dies after log stream EOF, session left in stale 'running' state #84 restart/crash machinery. Spawnslog stream --style ndjson, feeds to_read_stream_into_raw, which dumps lines verbatim and bails on size cap.log stream's "Filtering the log data ..." first line,pwuid_rwarnings) are skipped sojqconsumers get strict NDJSON._stop_raw_sessiongzips via stdlibgzip+shutil.copyfileobj, removes uncompressed file, records bytes inmeta.extras. Nosummary.jsonfor raw sessions;--get-detailspoints to the file with azcat | jqhint.SessionStore.session_total_bytes+prune_to_aggregate_cap— sum sizes per session, evict oldest first until total ≤ cap. Called fromHangBuster.start()alongsideprune_expired().--list-sessionssurfaces raw size (gzipped if present), truncated flag, capture duration.7 new tests in
tests/test_worker_loop.py— raw line writing, size-cap trigger, gzip round-trip + content preservation,--no-gzippreserves uncompressed, aggregate-cap eviction order,session_total_bytes.Test count
Main baseline: 88. After all seven: 138 (+50).
Verification
Live end-to-end on Grapla simulator:
--bundle-idno longer narrows predicate (5060+ matched lines vs pre-fix 0).--raw-capture --max-size-mb 1produced 737 lines / 0.05 MB gzipped, jq-queryable.Compat notes
FINGERPRINT_VERSION = 2writes into newsummary.jsonfiles. Old v1 summaries on disk still load;--diffbetween v1 and v2 fires the existing version-mismatch warning.auto_samples.json→auto_samples.jsonlfilename change. No external readers of the old file existed. Old files become inert and TTL-prune naturally._resolve_predicate(predicate)signature change (droppedbundle_idarg) — internal API only.--bundle-idflag still accepted; now filters post-parse.Out of scope (deferred from original review)
common/hang_pipeline.pyto a genericcommon/log_filters.py— AHA, wait for a second consumer._BARE_INTredaction narrowing (n4) — partially mitigated by enhancement: hash compute_fingerprint() and bump fingerprint_version to 2 #80's hashing absorbing remaining collision risk.--stop.