Skip to content

fix+feat+test: HangBuster review follow-ups + capture-path bugs + raw NDJSON mode (closes #79 #80 #81 #82 #83 #84 #85)#90

Merged
conorluddy merged 9 commits into
mainfrom
feat/hangbuster-review-followups
May 23, 2026
Merged

fix+feat+test: HangBuster review follow-ups + capture-path bugs + raw NDJSON mode (closes #79 #80 #81 #82 #83 #84 #85)#90
conorluddy merged 9 commits into
mainfrom
feat/hangbuster-review-followups

Conversation

@conorluddy

@conorluddy conorluddy commented May 23, 2026

Copy link
Copy Markdown
Owner

Combined PR for the four HangBuster review follow-ups plus three live-test discoveries (#83/#84/#85). Each was originally a separate PR (#86, #87, #88, #89); merged into one branch on request. The seven logical commits remain atomic — easy to revert individually if needed.

Closes #79, #80, #81, #82, #83, #84, #85.

Commits in this PR

Commit Closes Diff
fix: append-only JSONL for auto_samples to remove read-modify-write race #81 +109/-13
fix: handle zero-duration clusters in diff_sessions() #79 +100/-4
feat: hash compute_fingerprint with sha256, bump FINGERPRINT_VERSION to 2 #80 +39/-11
test: harden HangBuster coverage — crashed-worker, fixture, verdicts #82 +209/-64
chore: doc sweep — bump script count to 29, add Device State + Simulator Discovery sections +44/-16
fix: --bundle-id no longer narrows hang predicate #83 +107/-16
fix+feat: HangBuster worker EOF detect + bounded auto-restart + capture duration #84 +498/-74
feat: --raw-capture NDJSON mode + size cap + gzip + aggregate prune #85 +451/-14

#81 — auto_samples JSONL (race fix)

HangBuster._stash_auto_sample did a read-modify-write on auto_samples.json with no locking; concurrent stashes from a busy worker could drop entries. Switch to append-only auto_samples.jsonl mirroring the existing events.jsonl discipline (line-buffered + fsync).

New SessionStore.stash_auto_sample() / read_auto_samples() (last-write-wins per fp). Worker delegates — no local I/O.

#79 — zero-duration drift handling

diff_sessions() had a divide-by-zero guard that did a bare continue, silently dropping zero-duration clusters from the diff. Worst case: a 0 → 800 ms regression vanished.

Replaced with explicit 4-way branching:

Case Result
A == 0, B == 0 stable++
A == 0, B > 0 drift, delta_pct = inf (rendered as new)
A > 0, B == 0 drift, delta_pct = -100.0
A > 0, B > 0 existing percentage logic

format_diff special-cases the inf delta so it prints as new rather than +inf%.

#80 — sha256 fingerprint, FINGERPRINT_VERSION=2

compute_fingerprint() returned the raw sym:<symbol> / msg:<message_prefix> string. Upstream normalisation collapses tokens to <addr> / <pid> / <n> placeholders, so distinct messages with overlapping normalised prefixes collided. Now hashed with sha256[:16] (fp:<16hex>).

Cluster.symbol_or_prefix already serves as the human-readable label — fingerprint is purely identity. No new display_prefix field needed.

FINGERPRINT_VERSION bumps 1 → 2. diff_sessions() already warns + skips on version mismatch (existing test coverage), so old v1 summaries degrade cleanly.

#82 — test coverage hardening

Gap 1: mark_crashed + load_summary→None (4 new tests) — the crashed-worker recovery path was entirely untested.

Gap 2: fixture too soft (5 new tests + fixture rewrite) — the soft fixture only produced CRITICAL events. L1 hit ~36% of budget, L2 only ~8%. New fixture spans all 4 severities, 12 clusters, 2 processes, populates bursts + quiet periods + process_distribution. All four L2 aggregate branches now actually render.

Gap 3: verdict + format_diff render coverage (6 new tests) — covers the 3 remaining verdict strings (new minor / improvement: resolved / drift: mixed signs) and the previously-unrendered new / resolved / drift output blocks.

Combined branch unified two near-identical helper functions (_zero_dur_summary from #79 and _summary_with_clusters from #82) into a single shared _summary_with_clusters((fp, dur, sev) tuple) — kept the more flexible form.

#83--bundle-id no longer narrows the predicate

Live-test surfaced an unfiled bug: _resolve_predicate(predicate, bundle_id) appended AND (process == <app>) to the os_log predicate. The default hang predicate matches RunningBoard / SpringBoard / watchdog events that originate outside the target app's process, so the AND silently dropped the bulk of useful hang signal.

Verified on Grapla: a 4-minute session with --bundle-id com.conorluddy.grapla captured 0 events while the simulator was actively logging hangs. After removing the AND, an identical workload captures hundreds of RunningBoard events per minute.

Fix: predicate stays simulator-global; bundle filtering applied post-parse against the event's process field. Promoted the existing _matches_bundle helper to a module-level matches_bundle() so the HangBuster worker reuses it. _resolve_predicate signature simplified — dropping bundle_id arg entirely and locking down via inspect.signature regression test so the bug cannot return.

#84 — Worker EOF detection + bounded auto-restart + capture duration

Live-test surfaced a second bug: log stream EOF (or 30s of silence under the old false-positive bail) broke the read loop, wrote a stream_ended event, and exited. SessionStore.persist_worker_counters then overwrote the status back to running — sessions appeared live indefinitely even though the worker had been dead for hours.

Three-part fix:

  1. Crash detection (run_worker rewrite). Removed the 30s silence bail (a quiet log stream is not a dead one). EOF or proc.poll() != None now emits a stream_died event with exit_code. try/finally wrapping ensures mark_crashed runs whenever the worker exits involuntarily. Inner read loop extracted into _read_stream_into_events for direct testability.

  2. Bounded auto-restart. New env var IOS_SIM_HANG_MAX_RESTARTS (default 3). Each restart logs a stream_restart event with attempt number and a 2s backoff prevents tight respawn loops. counters tracks stream_restarts; surfaces in summary + --list-sessions. SIGTERM during backoff or read loop exits cleanly without restart.

  3. Capture-duration tracking. mark_crashed records stopped_at/stopped_at_ms. persist_worker_counters preserves both CRASHED and STOPPED status. build_summary prefers meta.stopped_at_ms over now() so summaries reflect the capture window. --list-sessions surfaces capture=Xs and restarts=N.

5 new worker-loop tests + 3 new session-store tests use a scripted fake subprocess.Popen to drive EOF / mid-stream death / SIGTERM / xcrun-missing scenarios deterministically.

#85--raw-capture NDJSON mode + size cap + gzip + aggregate prune

The summarised pipeline is the right shape when you trust the bucketing, but sometimes you want full-fidelity log data and ad-hoc jq exploration. --raw-capture spawns log stream --style ndjson and dumps lines verbatim to raw.ndjson, bypassing clustering.

Garbage collection is automatic and layered:

  • Per-session size cap (--max-size-mb, default 10) — worker stops cleanly on cap hit, writes size_cap_hit marker, sets extras.truncated.
  • Gzip on stop (skippable with --no-gzip) — NDJSON compresses ~15-19× in practice.
  • TTL prune (existing IOS_SIM_HANG_SESSION_TTL_HOURS=24) runs on every --start. New aggregate cap IOS_SIM_HANG_TOTAL_CAP_MB (default 100) runs alongside, evicting oldest sessions first when total skill state exceeds.

Verified end-to-end on Grapla: a 12-second session captured 737 raw NDJSON lines, 0.96 MB → 0.05 MB gzipped (~19×). jq queries work directly on the gzipped file:

$ zcat raw.ndjson.gz | jq -s 'group_by(.processImagePath) | map({proc: (.[0].processImagePath | split("/") | last), n: length}) | sort_by(-.n) | .[:5]'
[
  {"proc": "runningboardd", "n": 544},
  {"proc": "SpringBoard",   "n":  83},
  {"proc": "mediaremoted",  "n":  39},
  {"proc": "locationd",     "n":  39},
  {"proc": "grapla",        "n":  28}
]

That process breakdown is the smoking gun for #83: only 4% of useful hang signal comes from the app process itself.

Implementation notes:

  • New CLI: --raw-capture, --max-size-mb, --no-gzip. New env: IOS_SIM_HANG_TOTAL_CAP_MB.
  • run_worker grew a raw-capture branch that reuses the bug: HangBuster worker silently dies after log stream EOF, session left in stale 'running' state #84 restart/crash machinery. Spawns log stream --style ndjson, feeds to _read_stream_into_raw, which dumps lines verbatim and bails on size cap.
  • Non-JSON banners (log stream's "Filtering the log data ..." first line, pwuid_r warnings) are skipped so jq consumers get strict NDJSON.
  • _stop_raw_session gzips via stdlib gzip + shutil.copyfileobj, removes uncompressed file, records bytes in meta.extras. No summary.json for raw sessions; --get-details points to the file with a zcat | jq hint.
  • SessionStore.session_total_bytes + prune_to_aggregate_cap — sum sizes per session, evict oldest first until total ≤ cap. Called from HangBuster.start() alongside prune_expired().
  • --list-sessions surfaces raw size (gzipped if present), truncated flag, capture duration.

7 new tests in tests/test_worker_loop.py — raw line writing, size-cap trigger, gzip round-trip + content preservation, --no-gzip preserves uncompressed, aggregate-cap eviction order, session_total_bytes.

Test count

Main baseline: 88. After all seven: 138 (+50).

Verification

ruff check: All checks passed
black --check (scripts): clean
python -m pytest tests/: 138 passed in 0.47s

Live end-to-end on Grapla simulator:

  • --bundle-id no longer narrows predicate (5060+ matched lines vs pre-fix 0).
  • --raw-capture --max-size-mb 1 produced 737 lines / 0.05 MB gzipped, jq-queryable.
  • Bounded auto-restart confirmed by fake-subprocess unit tests.

Compat notes

  • FINGERPRINT_VERSION = 2 writes into new summary.json files. Old v1 summaries on disk still load; --diff between v1 and v2 fires the existing version-mismatch warning.
  • auto_samples.jsonauto_samples.jsonl filename change. No external readers of the old file existed. Old files become inert and TTL-prune naturally.
  • _resolve_predicate(predicate) signature change (dropped bundle_id arg) — internal API only. --bundle-id flag still accepted; now filters post-parse.

Out of scope (deferred from original review)

  • Promoting common/hang_pipeline.py to a generic common/log_filters.py — AHA, wait for a second consumer.
  • Temporal-burst overlapping-windows fix (m1 from original review) — defer until someone trips it.
  • _BARE_INT redaction narrowing (n4) — partially mitigated by enhancement: hash compute_fingerprint() and bump fingerprint_version to 2 #80's hashing absorbing remaining collision risk.
  • Streaming summary while session is live — current model: summary only on --stop.
  • Multi-simulator capture — one session = one UDID.

Closes #81.

The worker's auto-sample stash previously wrote auto_samples.json as a
read-modify-write of an aggregate dict, with no locking. Under
concurrent stashes (busy worker, many distinct fingerprints) one stash
could clobber another's writes.

Switch to auto_samples.jsonl: append-only, one record per line, fsync on
each write. Mirrors the events.jsonl discipline.

- SessionStore.stash_auto_sample(session_id, fingerprint, sample) — appends
- SessionStore.read_auto_samples(session_id) — reduces last-write-wins per fp
- HangBuster._stash_auto_sample now delegates to the store; no local I/O

Tests: 5 new in tests/test_hang_sessions.py covering basic append, last-write-wins,
missing-file, 50-thread concurrent stash, and corrupt-line skip.

Verification: ruff/black clean; 93 passed (88 → 93).
Closes #79.

The divide-by-zero guard in diff_sessions() used a bare `continue`,
silently dropping clusters whose A-side max_duration_ms was 0. A real
regression where a previously-silent cluster started hanging (0 → N ms)
was lost from the diff entirely.

Replace with explicit 4-way branching:

- both zero      → stable
- A=0, B>0       → drift, delta_pct = +inf (treated as worsened)
- A>0, B=0       → drift, delta_pct = -100.0 (fully improved)
- both nonzero   → existing percentage logic, unchanged

format_diff() renders the inf delta as `new` instead of `+inf%` for
readability.

Tests: 5 new in tests/test_diff.py covering each branch + a regression
guard that the standard nonzero path still uses the drift threshold,
plus format_diff render assertion for the inf case.

Verification: ruff/black clean; 93 passed (88 → 93).
…to 2

Closes #80.

compute_fingerprint() previously returned the raw `sym:<symbol>` or
`msg:<message_prefix>` string. Upstream normalisation (normalise_message
collapsing tokens to <addr>/<pid>/<n> placeholders) shrinks the effective
namespace dramatically — distinct hangs with overlapping normalised
prefixes silently clustered together.

Switch to sha256[:16]:

  key = f'sym:{symbol}' if symbol else f'msg:{message_prefix}'
  return f'fp:{hashlib.sha256(key.encode()).hexdigest()[:16]}'

Cluster.symbol_or_prefix already carries the human-readable label —
fingerprint is purely an identity key, so opacity is fine.

## Compat

FINGERPRINT_VERSION bumps from 1 → 2. `diff_sessions` already warns +
skips structural compare on version mismatch, so v1 summaries on disk
are gracefully handled.

## Tests

- test_compute_fingerprint_prefers_symbol: updated for new `fp:` prefix
- test_compute_fingerprint_falls_back_to_prefix: updated, asserts opacity
- test_compute_fingerprint_is_stable_across_calls: new (deterministic guard)
- test_compute_fingerprint_avoids_overlap_collision: new (the issue's
  motivating case — distinct messages sharing a 40-char head must not collide)
- test_build_normalised_event_full: `sym:` → `fp:` prefix update

Verification: ruff/black clean; 90 passed.
Closes #82.

Three coverage gaps from the PR #78 review, now closed.

The crashed-worker recovery path had no test coverage. Adds:
- test_mark_crashed_sets_status — direct call after create
- test_mark_crashed_after_claim_overrides_running — overrides running state
- test_mark_crashed_silent_on_missing_session — no-op on dead session ID
- test_load_summary_returns_none_when_missing — crashed session has no summary.json

The 50-event fixture in tests/fixtures/sample_session.py only ever produced CRITICAL
events (all duration ≥ 600 ms). L1 hit ~36% of its 200-token budget; L2 only ~8%
of 2000. Token-budget tests trivially passed.

Hardened make_events():
- 12 clusters spanning all 4 Severity tiers (MINOR/WARN/CRITICAL/FROZEN)
- One hot cluster at 18 events (near-max count); others spread across cold tail
- Two processes (MyApp + Helper) → process_distribution populates
- First-cluster events packed tightly → detect_temporal_bursts populates
- 7s gap inserted mid-window → detect_quiet_periods populates
- Long symbol names → per-line formatter stress

After hardening: L0=28/30 (93%), L1=91/200 (46%), L2=298/2000 (15%). L1/L2 are
naturally sparse at default top_n=3; the real win is that all four L2 aggregate
branches (Severity / Bursts / Quiet periods / Processes) now actually render.

Adds qualitative assertions in test_token_budgets.py:
- test_fixture_covers_all_severity_tiers — guard against fixture regression
- test_fixture_has_at_least_ten_clusters
- test_l2_renders_all_aggregate_branches — every L2 branch must populate
- test_l1_floor_is_meaningful — floor at 60 tokens
- test_l2_floor_renders_full_summary — floor at 200 tokens

diff_sessions() produces 5 verdict strings; only 3 were tested. format_diff
was only rendered for 'no change' and version mismatch — the new/resolved/drift
render paths were never asserted.

Adds 6 new in tests/test_diff.py:
- test_diff_verdict_regression_new_minor
- test_diff_verdict_improvement_resolved
- test_diff_verdict_drift_mixed_signs (the bidirectional drift case)
- test_format_diff_renders_new_block
- test_format_diff_renders_resolved_block
- test_format_diff_renders_drift_block

Small helper _summary_with_clusters() builds SessionSummary instances with
controlled per-cluster max_duration_ms — keeps verdict tests isolated from the
event pipeline.

test_diff_detects_resolved_cluster filtered by an old symbol name from the
soft fixture ([ImageDecoder decode:]). Updated to use the new equivalent.

- ruff check: All checks passed
- black --check (scripts): clean
- python -m pytest tests/: 103 passed (was 88)
- pytest tests/: 103 passed
…tor Discovery sections

Documents the new scripts that landed since the last doc pass: appearance.py,
location.py, model_inspector.py, container.py, localization_audit.py,
sim_list.py, simulator_selector.py.

- CLAUDE.md: script count 21/27 → 29; add Device State (2) and Simulator
  Discovery (2) categories; reorganise Testing & Analysis (6 → 9).
- SKILL.md: full 29-script listing with renumbered Lifecycle scripts (23-29).
- marketplace.json, plugin.json: description bumped to 29.
- validate-version.yml: fix SKILL.md path after directory restructure.
- .gitignore: fix local config.json path to match new directory layout.
Pre-fix: passing --bundle-id appended (process == <app>) AND to the os_log
predicate. The default hang predicate matches RunningBoard/SpringBoard/watchdog
events that originate outside the target app's process, so the AND silently
dropped the bulk of useful hang signal. Verified on Grapla: a 4-minute session
with --bundle-id com.conorluddy.grapla captured 0 events while the simulator
was actively logging hangs.

Fix: predicate stays simulator-global; bundle filtering applied post-parse
against the event's process field. Mirrors the existing _matches_bundle path
in the legacy --watch/--since modes; promoted to module-level matches_bundle()
so the HangBuster worker reuses it.

- _resolve_predicate signature simplified: drops bundle_id arg entirely.
  Adding it back would resurrect the bug — locked down by test.
- Worker (run_worker) applies matches_bundle() after parse, before matched++.
- --bundle-id help text updated to clarify it's a post-parse filter.
- HangWatcher._matches_bundle becomes a thin delegator (legacy callers).

Tests: 8 new in tests/test_predicate.py — predicate composition, env var
precedence, signature lockdown, post-parse match behaviour. 115 → 123 passing.
…re duration (closes #84)

Pre-fix: log stream EOF after 30s of silence broke the read loop, wrote a
stream_ended event, and exited. SessionStore.persist_worker_counters then
overwrote the status back to 'running' — sessions appeared live forever in
--list-sessions even though the worker had been dead for hours. Verified on
Grapla: a 19s session showed counters.total=1, status='running' indefinitely.

Fix split across three concerns:

1. Crash detection (run_worker rewrite):
   - Removed the 30s silence bail. A quiet log stream is not a dead one;
     bailing on silence was a false-positive death signal.
   - EOF on stdin or proc.poll() != None now triggers a stream_died event
     with exit_code and attempt number.
   - try/finally wrapping ensures mark_crashed runs whenever the worker
     exits involuntarily (subprocess died, restart budget exhausted).
   - Inner read loop extracted into _read_stream_into_events for clarity
     and direct testability.

2. Bounded auto-restart:
   - New env var IOS_SIM_HANG_MAX_RESTARTS (default 3). Each restart logs
     a stream_restart event with attempt number and a 2s backoff prevents
     tight respawn loops.
   - counters now tracks stream_restarts; surfaces in summary + --list-sessions.
   - SIGTERM during backoff or read loop exits cleanly without restart.

3. Capture-duration tracking (hang_sessions.py):
   - mark_crashed now also records stopped_at/stopped_at_ms — so crashed
     sessions have a real end-of-capture timestamp, not just a status flip.
   - persist_worker_counters preserves CRASHED status (alongside STOPPED)
     and no longer clobbers it back to RUNNING.
   - build_summary prefers meta.stopped_at_ms over now() so summaries
     reflect the capture window, not the time of inspection.
   - --list-sessions surfaces 'capture=Xs' and 'restarts=N' columns.

Tests: 8 new (5 in tests/test_worker_loop.py, 3 in tests/test_hang_sessions.py).
Fake subprocess.Popen drives EOF/restart/exhaustion scenarios deterministically.
123 → 131 passing.
…loses #85)

The summarised pipeline (parse → bucket → cluster) is the right shape when
you trust the bucketing, but sometimes you want full-fidelity log data and
ad-hoc jq exploration. --raw-capture spawns 'log stream --style ndjson' and
dumps lines verbatim to raw.ndjson, bypassing the clustering pipeline.

Garbage collection is automatic and layered so a year of daily use ≈ a day:
- Per-session size cap (--max-size-mb, default 10) — worker stops cleanly on
  cap hit, writes size_cap_hit marker, sets extras.truncated.
- Gzip on stop (skippable with --no-gzip) — NDJSON compresses ~15-19x in
  practice. Verified live: 0.96 MB raw → 0.05 MB gzipped on Grapla today.
- TTL prune (existing IOS_SIM_HANG_SESSION_TTL_HOURS=24) runs on every
  --start. New aggregate cap IOS_SIM_HANG_TOTAL_CAP_MB (default 100) runs
  alongside, evicting oldest sessions first when total skill state exceeds.

Verified end-to-end on Grapla: a 12s session captured 737 raw NDJSON lines.
'jq -s group_by(.processImagePath)' shows runningboardd=544, SpringBoard=83,
mediaremoted=39, locationd=39, grapla=28. The grapla-process share (4%) is
exactly why the --bundle-id AND clause in #83 was wrong: it filtered out
96% of useful hang signal originating from system daemons.

Implementation:
- New args: --raw-capture, --max-size-mb, --no-gzip; new env var
  IOS_SIM_HANG_TOTAL_CAP_MB. Plumbed through start_args into meta.args.
- run_worker grew a raw-capture branch that reuses the #84 restart/crash
  machinery: spawn log stream --style ndjson, feed to _read_stream_into_raw,
  which dumps lines verbatim and bails on size cap.
- Non-JSON banners (log stream's 'Filtering the log data ...' first line,
  pwuid_r warnings) are skipped so consumers get strict NDJSON for jq.
- _stop_raw_session gzips raw.ndjson → raw.ndjson.gz (via stdlib gzip +
  shutil.copyfileobj), removes the uncompressed file, records bytes in
  meta.extras. No summary.json for raw sessions.
- _mark_truncated sets extras.truncated when the size cap fires.
- SessionStore.session_total_bytes + prune_to_aggregate_cap: sum sizes
  per session, evict oldest first until total ≤ cap.
- --list-sessions surfaces raw size, truncated flag, capture duration.
- --get-details on a raw session points to the file with a zcat | jq hint.

Tests: 7 new (in tests/test_worker_loop.py) — raw line writing, size-cap
trigger, gzip round-trip + content preservation, --no-gzip preserves uncompressed,
aggregate-cap eviction order, session_total_bytes. 131 → 138 passing.
@conorluddy conorluddy changed the title fix+feat+test: HangBuster PR #78 review follow-ups (closes #79 #80 #81 #82) fix+feat+test: HangBuster review follow-ups + capture-path bugs + raw NDJSON mode (closes #79 #80 #81 #82 #83 #84 #85) May 23, 2026
…v vars (#83 #84 #85)

Documents the three new capabilities landed in this PR across all three
contributor entry points:

SKILL.md:
- Expanded hang_watcher entry: raw-capture sub-section (size cap, gzip,
  jq exploration), resilience section (auto-restart + capture duration +
  list-sessions surface), automatic cleanup section (TTL + aggregate cap),
  bundle-id post-parse note.
- Two new env-var rows: IOS_SIM_HANG_MAX_RESTARTS, IOS_SIM_HANG_TOTAL_CAP_MB.
- Updated IOS_SIM_HANG_PREDICATE description to clarify why predicate stays
  simulator-global (so users don't try to AND a process clause).
- Two quick-start blocks: summarised mode (--start/--stop/--get-details) and
  raw-capture mode (jq cookbook with three production-tested queries).

README.md:
- hang_watcher row updated with raw-capture flags + auto-restart + cleanup.
- Predicate env-var description hardened to explain post-parse bundle filter.
- Two new env-var rows for restarts + total cap.

CLAUDE.md (project guidance for contributors):
- hang_watcher.py one-liner expanded with new modes + env vars.
- hang_sessions.py blurb: status state machine, mark_crashed timestamps,
  aggregate cap + raw_path/session_total_bytes accessors, file layout
  includes raw.ndjson.gz + auto_samples.jsonl.

No code changes. 138 tests still pass.
@conorluddy conorluddy merged commit e8f55dc into main May 23, 2026
1 check passed
@conorluddy conorluddy deleted the feat/hangbuster-review-followups branch May 23, 2026 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: diff_sessions() silently drops zero-duration clusters

1 participant