Skip to content

Fix case-arm shadowing UB, tail-loop preemption, and OOM diagnostics#1

Merged
skyblanket merged 29 commits into
mainfrom
claude/optimistic-allen-cfl8xg
Jun 10, 2026
Merged

Fix case-arm shadowing UB, tail-loop preemption, and OOM diagnostics#1
skyblanket merged 29 commits into
mainfrom
claude/optimistic-allen-cfl8xg

Conversation

@skyblanket

@skyblanket skyblanket commented Jun 9, 2026

Copy link
Copy Markdown
Owner

What this changes

This PR fixes three bugs found in the Round-7 full-stack audit (REVIEW_FABLE_2026-06.md):

  1. Case-arm self-shadow codegen UB (F1, P0): When a case arm re-binds the subject's own name — idiomatic Erlang like case n { n when n > 0 -> ... } — the codegen emitted sw_val_t *n = n;, a C self-initialization that read indeterminate memory. This caused wrong guard results on calm stacks and SIGSEGV inside _op_cmp on dirty ones. The interpreter was always correct, making this also a compiled/interpreter divergence. Fix: snapshot the subject into a fresh compiler temp before any arm binds (one pointer copy per case). Tuple, cons, and map sub-binds are covered by the same change.

  2. Compiled tail loops had no yield points (F2, P0): Generated code never called sw_check_reds() / sw_yield(), so a self-tail-call loop was unpreemptible. On a single scheduler, an 80K spawn storm kept all children queued behind the running parent, with every child's 128 KB stack live simultaneously (~160K VMAs needed, over Linux's default vm.max_map_count of 65530). This caused mmap and calloc to fail, leading to SIGSEGV at nil. Result on this host: single-sched stress 0/50 before the fix. Why CI never caught it: Ubuntu 24.04 defaults vm.max_map_count to ~1M. Fix: emit if (sw_check_reds()) sw_yield(); at the self-tail-call backedge. One TLS decrement per turn, yields every 2000 turns. Result: single-sched stress 50/50 (was 0/50); full make stress passes both variants.

  3. Allocation failure SIGSEGV'd instead of failing loud (F3, P1): val_alloc/val_strdup fell back to calloc/strdup with no NULL check, and every sw_val_* constructor writes through the result unchecked. Under OOM or VMA exhaustion, the death was a bare segfault with no diagnosis. Fix: check for NULL and panic with the allocation size and likely cause (e.g., vm.max_map_count / live process count) before abort.

Related issue

n/a — findings from Round-7 external audit (REVIEW_FABLE_2026-06.md)

Checklist

  • make test-sw passes (added tests/sw/test_case_shadow_scrutinee.sw with 7 assertions covering all shadow patterns)
  • make stress completes all runs (single-sched variant now 50/50, was 0/50; full variant passes both)
  • Scheduler/arena path touched (tail-loop preemption, OOM handling) — ordering: reduction check at backedge is a no-op on interpreter path; OOM panic is fail-fast before any writes
  • Compiler ↔ interpreter parity: interpreter was already correct on all three; compiled now matches
  • New runtime behaviour noted in KNOWN_ISSUES.md (try/catch divergence O1, mutual tail recursion O2)

Notes for reviewers

F1 (case-shadow UB): The fix snapshots subj0 (the subject expression) into a fresh temp subj before any arm pattern binds. This is a single pointer copy per case statement, zero overhead. The with statement desugars to case, so it is covered by the same change.

F2 (tail-loop preemption): The reduction check if (sw_check_reds()) sw_yield(); is emitted only at self-tail-call backedges (the language's only unbounded loop construct). Both calls are no-ops outside a

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW

Summary by CodeRabbit

  • Bug Fixes

    • Unified try/error/panic behavior (panics now bypass catch; error() remains catchable); expect() fails on nil or 'false'; improved scheduler fairness to avoid starvation; louder spawn/out-of-memory/stack-overflow diagnostics.
  • New Features

    • UTF-8-aware string_chars; tuple-destructuring assignments; expanded receive/after and with/else forms; Std: map/filter/join auto-imports; REPL clears prior panic; dual-path conformance gate.
  • Documentation

    • Expanded language reference, known-issues, and roadmap updates.
  • Tests / CI

    • New conformance tests, LSan/TSan gates, timeouts, scheduler-aware test skips, and CI scheduler matrix.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Coordinated edits align interpreter and compiled error/panic semantics, add tuple-destructuring and with/else arms, introduce UTF‑8 codepoint string_chars, register real source paths for codegen, improve scheduler/crash handling and spawn/tail behavior, expand Std, and add a dual-path conformance gate with many tests and CI wiring.

Changes

Full-stack fixes and conformance gate

Layer / File(s) Summary
Native runtime and crash handling
src/swarmrt_native.c, src/swarmrt_native.h
Adds per-process compiled try_chain, altstack-based crash handling with stack-overflow reporting, watchdog condvar shutdown, scheduler idle spin/repick/lost-wakeup guard, tracing hooks, and atomic liveness flags with a sw_self_try_chain TLS accessor.
Codegen and compiled try/catch
src/swarmrt_codegen.c, src/swarmrt_codegen.h
Emit setjmp/longjmp try frames (_sw_try_frame_t), snapshot case subjects, yield at self-tail backedges, fail-fast spawn on exhaustion, validate qualified Module.func calls via a module→function table, and add sw_codegen_register_source for real-path diagnostics.
Builtins (compiled & interpreter)
src/swarmrt_builtins_studio.h, src/swarmrt_lang.c
Add UTF‑8-aware string_chars, make expect treat nil and 'false' as falsy (panic), and wire error()/expect()/panic() semantics between interpreter and compiled paths.
Parser language features
src/swarmrt_lang.c, docs/SW_LANGUAGE.md
Support multi-arm with ... else arms, receive after MS -> body arrow form, statement-level tuple-destructuring assignment (desugaring into temp + elem checks), and correct f-string nested-line diagnostics.
Interpreter semantics & dispatch parity
src/swarmrt_lang.c, src/swarmrt_lang.h, src/swarmrt_repl.c
Introduce panicking and try_depth interpreter state; make panics uncatchable while error() remains catchable when inside try; align pipe/call dispatch and reduce argument orders; update assert_raises and reset REPL panic state per evaluation.
Std module additions
lib/Std.sw
Add map, filter, and join to Std exports and provide order-tolerant implementations for map/filter and join as alias to string_join.
Conformance harness & tests
tests/sw/run_conform.sh, tests/sw/conform/*, tests/sw/*
Add dual-path conformance runner that diffs swc run vs swc build outputs and exit codes; add many conformance/regression tests (errors/panic, case, strings, collections, with/json, destructuring, std, tco depth, after-arrow, shadowed scrutinee).
LSan/TSan gates & stress probes
Makefile, tests/gc/lsan_lifecycle.sw, tests/gc/lsan.supp, tests/stress/*, scripts/gc_slope.sh
Add lsan-gate and tsan-gate Make targets, LSan lifecycle probe + suppressions, TSAN suppressions, spin-wedge hunt stress script, and timeouts/safety guards for long-running probes.
CI wiring and timeouts
.github/workflows/linux-quickstart.yml, tests/sw/run_tests.sh
CI now runs lsan-gate (continue-on-error) and a scheduler-count matrix running tests/conformance under SW_SCHEDULERS=1 and 8; test runners bound long-running tests with 180s timeouts and add low-scheduler skips for blocking curl-backed tests.
Docs and roadmap
docs/SW_LANGUAGE.md, docs/notes/KNOWN_ISSUES.md, docs/notes/REVIEW_FABLE_2026-06.md, docs/PRODUCTION_ROADMAP.md, README.md
Document recursion/TCO guidance, tuple-destructuring, with/else semantics, receive arrow form, string_chars builtin, unified error/panic semantics; update known issues, review notes, roadmap, and README test/std summaries.

Sequence Diagram (compiled try/catch routing)

sequenceDiagram
  participant GeneratedCode
  participant sw_self_try_chain
  participant _builtin_error
  participant CatchArm
  GeneratedCode->>sw_self_try_chain: push _sw_try_frame_t
  GeneratedCode->>GeneratedCode: execute try body
  _builtin_error->>sw_self_try_chain: longjmp to active try frame
  sw_self_try_chain->>CatchArm: route into catch path
  CatchArm->>sw_self_try_chain: pop frame and continue
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

A rabbit nibbles through the stack and wire,
Patches shadows, yields, and panic-fire,
Splits strings by codepoint, binds tuples with care,
Runs both paths until their outputs match — fair,
Hops off happy — docs, tests, and runtime repaired 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and concisely summarizes the three main bug fixes: case-arm shadowing undefined behavior, tail-loop preemption, and OOM diagnostics.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/optimistic-allen-cfl8xg

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/sw/test_case_shadow_scrutinee.sw (1)

1-95: 💤 Low value

Test coverage is good; consider adding a map pattern case.

The test thoroughly covers ident patterns with guards, tuple sub-binds, and cons head extractions. The PR objectives mention map sub-binds are also covered by the fix—adding a small test like this would complete the regression coverage:

# ---- map value that re-uses the subject name ----

fun value_of(m) {
    case m {
        %{m: m} -> m
        _ -> 'no_key'
    }
}

fun test_shadow_map_value() {
    assert_eq("shadow_map_value", value_of(%{m: 99}), 99)
}

Not blocking since the fix mechanism is shared and already verified by tuple/cons tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/sw/test_case_shadow_scrutinee.sw` around lines 1 - 95, Add a
map-pattern regression test to cover map sub-binds: implement the function
value_of(m) that matches case m { %{m: m} -> m _ -> 'no_key' } and a
test_shadow_map_value() that calls assert_eq("shadow_map_value", value_of(%{m:
99}), 99); then call test_shadow_map_value() from main and add its return to the
failures accumulator so the new test runs during the existing test suite (refer
to the existing functions value_of, test_shadow_map_value, and main to locate
where to insert).
docs/notes/REVIEW_FABLE_2026-06.md (1)

132-132: ⚡ Quick win

Clarify the cross-reference.

The notation "see L6" is ambiguous without specifying which file. Based on context, this likely refers to a line in docs/SW_LANGUAGE.md, but readers shouldn't have to guess.

📝 Suggested clarification
-| `string_length` = byte length | ✓ documented honestly (see L6) |
+| `string_length` = byte length | ✓ documented honestly (see SW_LANGUAGE.md line 6) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/notes/REVIEW_FABLE_2026-06.md` at line 132, Replace the ambiguous "see
L6" reference with an explicit cross-reference to the intended file and location
(for example "see docs/SW_LANGUAGE.md, line 6" or better, a section heading or
anchor such as "see docs/SW_LANGUAGE.md — Section: <heading> (L6)") so readers
know which file/section is meant; update the string "see L6" in
REVIEW_FABLE_2026-06.md to include the filename and, if possible, a section
title or link to the exact location.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 39-45: Update the "### Mutual tail recursion is not TCO'd; deep
chains overflow silently" entry to match the section’s non-crash framing by
removing or reclassifying the raw crash language: either reword the sentence
that describes a "raw SIGSEGV" to a non-crash severity (e.g., "can silently
overflow the fiber stack leading to undefined behavior") and note the
handler/sigaltstack detail as an implementation note, or split this into a
separate "Crash / Data-loss" subsection that explicitly references
REVIEW_FABLE_2026-06.md and the need for sigaltstack; edit the header and the
paragraph body accordingly so severity language is consistent with the rest of
the section.

In `@docs/SW_LANGUAGE.md`:
- Line 106: Update the sentence that currently reads "Recursion is the loop
construct (no `for`/`while`)" to accurately reflect supported syntax by removing
`for` from the negation—e.g. change it to "Recursion is the loop construct (no
`while`)"—and keep the rest of the paragraph about self-tail-call optimisation
and mutual tail-recursion unchanged; target the sentence containing "Recursion
is the loop construct (no `for`/`while`)" so the docs no longer incorrectly
state that `for` is unsupported.

---

Nitpick comments:
In `@docs/notes/REVIEW_FABLE_2026-06.md`:
- Line 132: Replace the ambiguous "see L6" reference with an explicit
cross-reference to the intended file and location (for example "see
docs/SW_LANGUAGE.md, line 6" or better, a section heading or anchor such as "see
docs/SW_LANGUAGE.md — Section: <heading> (L6)") so readers know which
file/section is meant; update the string "see L6" in REVIEW_FABLE_2026-06.md to
include the filename and, if possible, a section title or link to the exact
location.

In `@tests/sw/test_case_shadow_scrutinee.sw`:
- Around line 1-95: Add a map-pattern regression test to cover map sub-binds:
implement the function value_of(m) that matches case m { %{m: m} -> m _ ->
'no_key' } and a test_shadow_map_value() that calls
assert_eq("shadow_map_value", value_of(%{m: 99}), 99); then call
test_shadow_map_value() from main and add its return to the failures accumulator
so the new test runs during the existing test suite (refer to the existing
functions value_of, test_shadow_map_value, and main to locate where to insert).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 707de52f-77b4-4a3c-9457-db12f004a6d4

📥 Commits

Reviewing files that changed from the base of the PR and between 1a4324a and f11fb08.

📒 Files selected for processing (6)
  • docs/SW_LANGUAGE.md
  • docs/notes/KNOWN_ISSUES.md
  • docs/notes/REVIEW_FABLE_2026-06.md
  • src/swarmrt_codegen.c
  • src/swarmrt_lang.c
  • tests/sw/test_case_shadow_scrutinee.sw

Comment thread docs/notes/KNOWN_ISSUES.md
Comment thread docs/SW_LANGUAGE.md
- Parameters are positional; default values: `fun greet(name = "world") { ... }`.
- No explicit `return` — Erlang-style trailing-expression-is-the-value.
- Recursion is the loop construct (no `for`/`while`). Tail calls are detected and optimised by the codegen, so unbounded tail recursion doesn't blow the stack.
- Recursion is the loop construct (no `for`/`while`). **Self**-tail-calls are detected and optimised by the codegen, so unbounded `f -> f` tail recursion doesn't blow the stack. **Mutual** tail recursion (`a -> b -> a`, e.g. two state functions calling each other) is NOT yet optimised — each hop costs a C stack frame and deep chains overflow the 128KB process stack. Keep the recursive call in the same function (dispatch on an argument instead of bouncing between functions) for unbounded loops.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix loop wording to avoid contradicting supported syntax.

Line 106 says there is no for/while, but for is supported and documented in this reference. Please narrow this to “no while” (or equivalent) so users don’t get incorrect guidance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/SW_LANGUAGE.md` at line 106, Update the sentence that currently reads
"Recursion is the loop construct (no `for`/`while`)" to accurately reflect
supported syntax by removing `for` from the negation—e.g. change it to
"Recursion is the loop construct (no `while`)"—and keep the rest of the
paragraph about self-tail-call optimisation and mutual tail-recursion unchanged;
target the sentence containing "Recursion is the loop construct (no
`for`/`while`)" so the docs no longer incorrectly state that `for` is
unsupported.

claude added 9 commits June 9, 2026 22:03
…self-init UB

case n { n when n > 0 -> ... } bound the arm-local straight off the
subject identifier, generating 'sw_val_t *n = n;' — the new n shadows at
its own declarator and reads indeterminate memory. Guards silently fell
through to _ on a calm stack and SIGSEGV'd inside _op_cmp on a dirty
one; the interpreter was always correct (compiled/interp divergence).
Tuple/cons/map sub-binds ('n->v.tuple.items[0]') carried the same
hazard.

Fix: snapshot the case subject into a fresh compiler temp before any
arm binds (one pointer copy per case). 'with' desugars to case, so it
is covered by the same change.

Test: tests/sw/test_case_shadow_scrutinee.sw — ident+guard, tuple item,
and cons head all reusing the subject name; 7 assertions, passes
compiled and interpreted.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…ed by VMA exhaustion

The runtime has reduction accounting (sw_check_reds) and a 1ms SIGALRM
force-yield, but generated code never called either — a compiled
self-tail-call loop was unpreemptible. On one scheduler the 80K spawn
storm's parent therefore never let its children run: all 80K 128KB
fiber stacks stayed live at once, each its own mmap+guard (~2 VMAs), so
~160K VMAs were needed against Linux's default vm.max_map_count of
65530. Once maps ran out, mmap and then glibc calloc returned NULL and
the storm died as 'SIGSEGV at (nil)'.

This is why 'make stress' single-sched read 0/50 on a default-tuned
Linux box while CI stayed green: GitHub's Azure runners and Ubuntu
24.04 (the May-29 clearance host) raise vm.max_map_count, so the gate's
verdict silently depended on a host sysctl.

Fix: emit 'if (sw_check_reds()) sw_yield();' at the self-tail-call
backedge — the language's only unbounded loop. One TLS decrement per
turn, a yield every SWARM_CONTEXT_REDS (2000) turns; both calls are
no-ops on the interpreter path. This also makes the documented
'reduction-counted preemption' true for compiled CPU-bound loops.

Result on the failing host: single-sched stress 0/50 -> 50/50; full
'make stress' passes both variants; all 54 sw tests, phases 2-10,
gc-stress, and gc-slope stay green.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…EGV at nil

val_alloc/val_strdup fell back to calloc/strdup with no NULL check, and
every sw_val_* constructor writes through the result unchecked — so
heap/VMA exhaustion surfaced as a bare segfault. Now an OOM prints a
panic naming the allocation size and the most likely cause (VMA
exhaustion under spawn storms: each live fiber stack is an mmap; check
vm.max_map_count) before abort. Still fatal — but diagnosable, per the
loud-failure philosophy.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…+ TCO doc fix

- docs/notes/REVIEW_FABLE_2026-06.md: full audit — verified-claims
  table, three fixed findings (case-shadow UB, tail-loop preemption,
  loud OOM), eight open findings with repros, language-design
  assessment, and an ordered upgrade path.
- KNOWN_ISSUES.md: two new open entries with repros and workarounds:
  (1) try/catch catches builtin panics in the interpreter but not in
  compiled code; (2) mutual tail recursion is not TCO'd and overflows
  the fiber stack silently.
- SW_LANGUAGE.md: the TCO claim now says self-tail-calls (mutual
  recursion is not optimised), with the dispatch-on-argument idiom.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
One review pass, eight fixes — each was a place where the same program
meant different things under swc run vs swc build, or where a failure
was silent. All gated by the new dual-path conformance runner (next
commit) plus the existing suites.

Error model (the big one):
- error() now unwinds the full dynamic extent to the nearest try/catch
  on BOTH paths. Compiled: a per-process setjmp/longjmp chain
  (sw_self_try_chain; frame is a C local on the fiber stack, so a
  mid-try yield + resume on another scheduler thread is safe). The old
  codegen ran the whole try body and checked the sentinel once at the
  end — the statement AFTER an error() still executed, and could panic
  the process before the catch was consulted.
- error() outside any try is the documented silent continue-with-nil on
  both paths (the interpreter previously unwound the entire program).
- Panics are uncatchable on both paths (interpreter try/catch used to
  absorb them): interp->panicking bypasses N_TRY, surfaces at top level
  as exit 1. assert_raises remains the sanctioned interceptor; a failed
  expect() is now a real panic in the interpreter too (was a hard
  exit(1) that bypassed assert_raises).
- expect(): compiled treated only nil as falsy, so expect(a == b, msg)
  passed straight through on a 'false' — interpreter parity now (nil or
  'false' panic).
- The REPL clears the panicking flag per line so one panic doesn't
  poison the session.

Interpreter correctness:
- N_CALL aborts when an ARGUMENT eval raised — a panic inside an
  f-string interpolation was laundered into the string "nil" by the
  desugared to_string() and execution sailed on.
- N_PIPE now uses the same dispatch chain as N_CALL (module functions,
  primary builtins, the extra-builtin bridge, lambda-valued variables).
  The old whitelist covered 6 names; xs |> filter(f) was a silent NO-OP
  under swc run.
- f-string interpolations parse with the f-string's real line — every
  node in an interpolated expression carried line 1, so diagnostics
  pointed at the top of the file.

Builtin parity:
- reduce() accepts (fn, list, acc) or the piped (list, fn, acc) in both
  the codegen prelude and the interpreter — map/filter had the
  type-directed tolerance, reduce didn't, so xs |> reduce(f, 0)
  returned 0 compiled and the unfiltered list interpreted (both wrong).

Loud failures:
- spawn failure panics ("process table full — raise SW_MAX_PROCS")
  instead of continuing with a pid wrapping NULL whose sends silently
  vanished.
- Stack overflow on a fiber is now diagnosed: SA_ONSTACK + a per-thread
  sigaltstack (the handler used to fault on the overflowed fiber stack
  and the process died with NO output), and a guard-page hit prints
  "stack overflow in process #N — mutual tail calls are not TCO'd"
  instead of a bare SIGSEGV.

Language ergonomics (parse-time desugar, identical on both paths):
- Tuple-destructuring assignment: {a, b} = pair; {'ok', v} = fetch()
  (idents bind, _ skips, literals assert-match with a panic).
- with/else accepts MULTIPLE else arms with optional when guards, like
  case (Elixir shape; the second arm was a parse error).
- receive's after accepts the Erlang-reflex arrow form after MS -> body
  alongside after MS { body }.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…cross swc run + swc build

The language's core promise is that a program means ONE thing, but it
has two execution paths — and every semantic bug this review found
(case-guard UB, try/catch-vs-panic, pipe no-op, reduce arg order,
expect falsy) was exactly a divergence between them.

tests/sw/run_conform.sh runs every tests/sw/conform/*.sw through BOTH
paths and fails on any difference in stdout or exit code (stderr is
deliberately not compared — panic banners may differ in trace detail).
Wired into make test-sw, so CI enforces it on every push. The corpus
(t01–t08) covers: the error model incl. cross-function unwind and
uncatchable panics, case patterns + guards, strings + f-strings,
collections + map/filter/reduce, arithmetic + closures + pipes,
with/else (multi-arm, guards), JSON round-trips, destructuring
assignment, and expect semantics. On its first run the gate caught four
real divergences; all are fixed in the previous commit and held closed
here.

Also: tests/sw/test_tco_depth.sw (compiled-path 2M-deep self-tail-call
flatness + backedge-yield safety) and tests/sw/test_after_arrow.sw
(after MS -> body fires; needs the real scheduler so it lives in the
compiled suite).

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…atch divergence, add interpreter depth

SW_LANGUAGE.md: error() documented as unwinding the full dynamic extent
to the nearest catch (identical on both paths); expect() panics on nil
OR 'false'; tuple-destructuring assignment section (binds, _ skips,
literal match-asserts, statement position only); with/else multi-arm
shape; after MS -> arrow form.

KNOWN_ISSUES.md: the try/catch-vs-panic divergence moves to Recently
cleared (with the new semantics spelled out); new open entry for the
interpreter's bounded recursion depth (C-stack tree-walker, no TCO —
compile for depth); the stale 'swc run compiles via cc' description
corrected (it interprets).

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…uess work

Std.map, Std.filter (self-contained, order-tolerant like the global
builtins — NOT delegating to the same-named builtin, which would be a
resolution hazard) and Std.join (alias for string_join, the name people
guess). No Std.reduce: Std's own sum/product call the global reduce
bare, and a same-named module fun would shadow it — the codegen
denylist hint stays for Std.reduce/Std.pmap and is removed for
map/filter (they're real module functions now).

Conformance: tests/sw/conform/t09_std.sw runs the Std surface through
both paths, both argument orders.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…Round-7 report continuation status

README: test-sw bullet now describes the dual-path conformance gate and
the measured counts (56 files / 493 assertions); Std row lists the new
map/filter/join aliases. Round-7 report gains a continuation section:
O1/O2b/O3 fixed, O6 partial (f-string lines), O7 mostly fixed
(destructuring, multi-arm else, after-arrow, Std aliases), and the four
additional divergences the conformance gate caught on its first run.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
@skyblanket skyblanket force-pushed the claude/optimistic-allen-cfl8xg branch from f11fb08 to fda588a Compare June 9, 2026 22:05

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/swarmrt_lang.c (2)

2544-2548: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the uint64_t format mismatch.

CI is already warning here on Linux: pid is a uint64_t, but %llu expects unsigned long long. Use PRIu64 or an explicit cast so quickstart stays warning-free.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_lang.c` around lines 2544 - 2548, The fprintf calls in the
SW_VAL_PID and SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t
(v->v.pid->pid and v->v.rpid.id); replace the format specifier with PRIu64 from
<inttypes.h> (e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ?
v->v.pid->pid : 0)); and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ?
v->v.rpid.node : "?", (uint64_t)v->v.rpid.id)); or alternately cast to (unsigned
long long) if you prefer; ensure inttypes.h is included so PRIu64 is available.

Source: Pipeline failures


4563-4616: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bare val |> func still no-ops.

This new dispatch only runs when n->v.pipe.func is N_CALL. A plain value |> to_string or xs |> f still falls through to return val;, so the pipe remains a no-op for the exact val |> func form described in the comment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_lang.c` around lines 4563 - 4616, The pipe handler only
dispatches when n->v.pipe.func->type == N_CALL, so plain forms like `val |>
fname` or `val |> var` fall through and return val; update the N_PIPE case to
also handle non-call function heads: when n->v.pipe.func is a name/identifier
(e.g., N_NAME or the AST node type used for bare identifiers), extract the
symbol (n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same
dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname)
and call the module function, test primary builtins (builtin_print,
builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname,
args, nargs, n->line), and finally check env_get(env, fname) and call
interp_apply_fn for lambda-valued variables; ensure interp->error and recursion
checks mirror the N_CALL path and return the resulting sw_val_t instead of
falling through.
🧹 Nitpick comments (3)
lib/Std.sw (2)

323-333: ⚡ Quick win

Consider explicit predicate check for consistency.

The filter implementation uses if (pred(hd(lst))) (line 330), which relies on truthy/falsy evaluation. However, all other predicate-taking functions in this file explicitly check == 'true':

  • take_while (line 78): if (pred(h) == 'true')
  • partition (line 112): if (pred(h) == 'true')
  • any (line 190): if (pred(hd(lst)) == 'true')
  • all (line 196): if (pred(hd(lst)) == 'false')
  • count (line 205): if (pred(h) == 'true')

While the current implementation appears to work (conformance tests pass), aligning with the established pattern would improve maintainability and make predicate expectations consistent across the library.

♻️ Suggested change for consistency
 fun _filter_acc(lst, pred, acc) {
     if (length(lst) == 0) { reverse(acc) }
     else {
-        if (pred(hd(lst))) { _filter_acc(tl(lst), pred, [hd(lst) | acc]) }
+        if (pred(hd(lst)) == 'true') { _filter_acc(tl(lst), pred, [hd(lst) | acc]) }
         else { _filter_acc(tl(lst), pred, acc) }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/Std.sw` around lines 323 - 333, The filter implementation uses a truthy
check for predicates; update it to match the library's explicit predicate
pattern by checking equality to the string 'true'. Specifically, in function
_filter_acc replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) ==
'true') so predicates behave consistently with other functions (e.g.,
take_while, partition, any, all, count) while leaving filter and _filter_acc
signatures unchanged.

309-337: ⚡ Quick win

Consider moving list operations to the LIST OPS section.

The map and filter implementations operate on lists but are currently placed in the STRING OPS section (which starts at line 305). While join is a string operation alias, map and filter would be more discoverable in the LIST OPS section alongside other list helpers.

📁 Suggested organization

Move lines 309-334 (the map/filter implementations and their comment block) to appear after line 280 (after product) and before line 282 (the MAP OPS section header). Leave the join alias at lines 335-336 since it's a string operation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/Std.sw` around lines 309 - 337, Move the list-based implementations map,
_map_acc, filter, and _filter_acc out of the STRING OPS block into the LIST OPS
section so they are colocated with other list helpers (place them after the
product function and before the MAP OPS header); keep the join function
(string_join alias) in the STRING OPS area. Ensure you relocate the comment
block that documents map/filter along with those functions and update
surrounding spacing/headers so MAP OPS still follows the LIST OPS section.
tests/sw/conform/t05_collections.sw (1)

20-21: ⚡ Quick win

Document the intentional argument order difference.

Lines 20-21 use different argument orders for map (function-first) vs filter (list-first), which tests the documented order detection feature for map/filter builtins. Adding a brief comment would prevent future maintainers from misinterpreting this as an inconsistency.

📝 Suggested documentation
+    # Intentionally using different argument orders to validate order detection
     print(f"map={map(fun(x) { x * 2 }, xs)}")
     print(f"filter={filter(xs, fun(x) { x % 2 == 1 })}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/sw/conform/t05_collections.sw` around lines 20 - 21, Add a brief inline
comment above the two test calls explaining the intentional difference in
argument order to exercise the documented order-detection feature: note that
map(fun, xs) uses function-first while filter(xs, fun) uses list-first here on
the map and filter calls (symbols: map, filter, fun, xs) so future maintainers
don’t treat it as a bug.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 34-44: Update the KNOWN_ISSUES.md entry to stop saying mutual tail
recursion "overflows … as a raw SIGSEGV without even the crash banner" and
"needs `sigaltstack`" — note that `src/swarmrt_native.c` now installs a
per-thread `sigaltstack` and registers the SIGSEGV handler with `SA_ONSTACK`, so
stack-guard faults are detected and produce the diagnostic panic banner (e.g.
`panic: stack overflow in process `#N` (128KB fiber stack)`); keep the rest of the
explanation (only self-tail calls are TCO'd, impact on mutually recursive FSMs,
and the workaround to consolidate the loop into one function) but change the
wording to reflect that the runtime now prints a stack-overflow diagnostic
instead of silently crashing.

In `@src/swarmrt_builtins_studio.h`:
- Around line 5241-5246: The code currently treats any wake from
sw_receive_any(c->ms, &tag) as a timeout and fires the callback (sw_val_apply)
even when a real message arrived; change the logic so that when sw_receive_any
returns a non-NULL m (i.e. an actual message), you free m and do NOT call
sw_val_apply but instead continue waiting for the real timeout event, only
invoking sw_val_apply when sw_receive_any indicates a timeout (m == NULL) and
the process isn't cancelled (check self = sw_self() and self->kill_flag); apply
the same change to the equivalent block around lines 5271–5278 that uses c->ms,
tag, sw_receive_any and sw_val_apply.

In `@src/swarmrt_codegen.c`:
- Around line 2056-2068: In emit_pipe(), replicate the same qualified-builtin
denylist logic used in the direct-call path: when lowering a piped qualified
call (the branch that transforms xs |> Std.reduce(...) or xs |> Std.pmap(...))
detect that the module is "Std" and the function name is "reduce" or "pmap" and
emit the same diagnostic (the fprintf message that references gf/fname and
ctx->mod_name/n->line) and set ctx->had_unknown_fn so codegen fails early;
basically copy the check that uses gf/mod/strcmp from the direct-call branch
into the pipe-lowering branch so piped Std.reduce/Std.pmap produce the identical
hint and early exit.

In `@src/swarmrt_lang.c`:
- Around line 1773-1784: The current desugaring creates a user-visible temp
named "__destruct_%d" (dt_ctr/tmp) and emits a normal N_ASSIGN into the current
scope (bind->v.assign.name), which can collide with user bindings; instead
generate and use a non-user-visible temporary and mark it so both interpreter
and codegen treat it as internal. Change the tmp generation to produce an
internal-only identifier (e.g., use a reserved/internal prefix or a
hidden-marker byte) or create a dedicated internal bind kind, set bind so its
name is the internal temp and ensure the interpreter/codegen treat that name as
non-exported/hidden (or handle the new internal bind kind) so the temp cannot
collide with user code and is not visible to the user.

In `@src/swarmrt_native.c`:
- Around line 1179-1186: The _sw_install_altstack function currently ignores the
return value of sigaltstack; modify _sw_install_altstack to check sigaltstack's
return, and on failure capture errno and surface the error (e.g., call
fprintf(stderr, ...) or process logger and/or abort/exit or return an error
code) so SA_ONSTACK paths don't silently degrade; update callers or related
installs (the places handling alternate stack usage around _sw_install_altstack
and the SA_ONSTACK handling referenced at the earlier uses) to handle the
failure if you choose to return an error instead of aborting. Ensure you include
errno or strerror(errno) in the reported message to aid debugging and keep
function name _sw_install_altstack as the change location.
- Around line 1206-1208: The guard-window check hardcodes 8192 bytes; change it
to compute the window using the system page size so stack overflow detection
works on non-4KB systems: compute page_size (e.g., via sysconf(_SC_PAGESIZE) or
getpagesize()) and replace the constant with something like page_size * 2 (or
the same intended margin) when calculating guard_bot from guard_top; update the
check around the symbols guard_top, guard_bot, cur->stack_mem, and fault to use
the runtime-derived page_size.

---

Outside diff comments:
In `@src/swarmrt_lang.c`:
- Around line 2544-2548: The fprintf calls in the SW_VAL_PID and
SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t (v->v.pid->pid
and v->v.rpid.id); replace the format specifier with PRIu64 from <inttypes.h>
(e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ? v->v.pid->pid : 0));
and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ? v->v.rpid.node : "?",
(uint64_t)v->v.rpid.id)); or alternately cast to (unsigned long long) if you
prefer; ensure inttypes.h is included so PRIu64 is available.
- Around line 4563-4616: The pipe handler only dispatches when
n->v.pipe.func->type == N_CALL, so plain forms like `val |> fname` or `val |>
var` fall through and return val; update the N_PIPE case to also handle non-call
function heads: when n->v.pipe.func is a name/identifier (e.g., N_NAME or the
AST node type used for bare identifiers), extract the symbol
(n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same
dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname)
and call the module function, test primary builtins (builtin_print,
builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname,
args, nargs, n->line), and finally check env_get(env, fname) and call
interp_apply_fn for lambda-valued variables; ensure interp->error and recursion
checks mirror the N_CALL path and return the resulting sw_val_t instead of
falling through.

---

Nitpick comments:
In `@lib/Std.sw`:
- Around line 323-333: The filter implementation uses a truthy check for
predicates; update it to match the library's explicit predicate pattern by
checking equality to the string 'true'. Specifically, in function _filter_acc
replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) == 'true') so
predicates behave consistently with other functions (e.g., take_while,
partition, any, all, count) while leaving filter and _filter_acc signatures
unchanged.
- Around line 309-337: Move the list-based implementations map, _map_acc,
filter, and _filter_acc out of the STRING OPS block into the LIST OPS section so
they are colocated with other list helpers (place them after the product
function and before the MAP OPS header); keep the join function (string_join
alias) in the STRING OPS area. Ensure you relocate the comment block that
documents map/filter along with those functions and update surrounding
spacing/headers so MAP OPS still follows the LIST OPS section.

In `@tests/sw/conform/t05_collections.sw`:
- Around line 20-21: Add a brief inline comment above the two test calls
explaining the intentional difference in argument order to exercise the
documented order-detection feature: note that map(fun, xs) uses function-first
while filter(xs, fun) uses list-first here on the map and filter calls (symbols:
map, filter, fun, xs) so future maintainers don’t treat it as a bug.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 663b5f13-af94-4f8f-994f-39d1a57b29c7

📥 Commits

Reviewing files that changed from the base of the PR and between f11fb08 and fda588a.

📒 Files selected for processing (26)
  • Makefile
  • README.md
  • docs/SW_LANGUAGE.md
  • docs/notes/KNOWN_ISSUES.md
  • docs/notes/REVIEW_FABLE_2026-06.md
  • lib/Std.sw
  • src/swarmrt_builtins_studio.h
  • src/swarmrt_codegen.c
  • src/swarmrt_lang.c
  • src/swarmrt_lang.h
  • src/swarmrt_native.c
  • src/swarmrt_native.h
  • src/swarmrt_repl.c
  • tests/sw/conform/t01_errors.sw
  • tests/sw/conform/t02_panic_uncatchable.sw
  • tests/sw/conform/t03_case.sw
  • tests/sw/conform/t04_strings.sw
  • tests/sw/conform/t05_collections.sw
  • tests/sw/conform/t06_arith_lambdas.sw
  • tests/sw/conform/t07_with_json.sw
  • tests/sw/conform/t08_destructure_with.sw
  • tests/sw/conform/t09_std.sw
  • tests/sw/run_conform.sh
  • tests/sw/test_after_arrow.sw
  • tests/sw/test_case_shadow_scrutinee.sw
  • tests/sw/test_tco_depth.sw
✅ Files skipped from review due to trivial changes (3)
  • tests/sw/conform/t04_strings.sw
  • tests/sw/test_after_arrow.sw
  • README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/sw/test_case_shadow_scrutinee.sw

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/swarmrt_lang.c (2)

2544-2548: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the uint64_t format mismatch.

CI is already warning here on Linux: pid is a uint64_t, but %llu expects unsigned long long. Use PRIu64 or an explicit cast so quickstart stays warning-free.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_lang.c` around lines 2544 - 2548, The fprintf calls in the
SW_VAL_PID and SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t
(v->v.pid->pid and v->v.rpid.id); replace the format specifier with PRIu64 from
<inttypes.h> (e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ?
v->v.pid->pid : 0)); and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ?
v->v.rpid.node : "?", (uint64_t)v->v.rpid.id)); or alternately cast to (unsigned
long long) if you prefer; ensure inttypes.h is included so PRIu64 is available.

Source: Pipeline failures


4563-4616: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bare val |> func still no-ops.

This new dispatch only runs when n->v.pipe.func is N_CALL. A plain value |> to_string or xs |> f still falls through to return val;, so the pipe remains a no-op for the exact val |> func form described in the comment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_lang.c` around lines 4563 - 4616, The pipe handler only
dispatches when n->v.pipe.func->type == N_CALL, so plain forms like `val |>
fname` or `val |> var` fall through and return val; update the N_PIPE case to
also handle non-call function heads: when n->v.pipe.func is a name/identifier
(e.g., N_NAME or the AST node type used for bare identifiers), extract the
symbol (n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same
dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname)
and call the module function, test primary builtins (builtin_print,
builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname,
args, nargs, n->line), and finally check env_get(env, fname) and call
interp_apply_fn for lambda-valued variables; ensure interp->error and recursion
checks mirror the N_CALL path and return the resulting sw_val_t instead of
falling through.
🧹 Nitpick comments (3)
lib/Std.sw (2)

323-333: ⚡ Quick win

Consider explicit predicate check for consistency.

The filter implementation uses if (pred(hd(lst))) (line 330), which relies on truthy/falsy evaluation. However, all other predicate-taking functions in this file explicitly check == 'true':

  • take_while (line 78): if (pred(h) == 'true')
  • partition (line 112): if (pred(h) == 'true')
  • any (line 190): if (pred(hd(lst)) == 'true')
  • all (line 196): if (pred(hd(lst)) == 'false')
  • count (line 205): if (pred(h) == 'true')

While the current implementation appears to work (conformance tests pass), aligning with the established pattern would improve maintainability and make predicate expectations consistent across the library.

♻️ Suggested change for consistency
 fun _filter_acc(lst, pred, acc) {
     if (length(lst) == 0) { reverse(acc) }
     else {
-        if (pred(hd(lst))) { _filter_acc(tl(lst), pred, [hd(lst) | acc]) }
+        if (pred(hd(lst)) == 'true') { _filter_acc(tl(lst), pred, [hd(lst) | acc]) }
         else { _filter_acc(tl(lst), pred, acc) }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/Std.sw` around lines 323 - 333, The filter implementation uses a truthy
check for predicates; update it to match the library's explicit predicate
pattern by checking equality to the string 'true'. Specifically, in function
_filter_acc replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) ==
'true') so predicates behave consistently with other functions (e.g.,
take_while, partition, any, all, count) while leaving filter and _filter_acc
signatures unchanged.

309-337: ⚡ Quick win

Consider moving list operations to the LIST OPS section.

The map and filter implementations operate on lists but are currently placed in the STRING OPS section (which starts at line 305). While join is a string operation alias, map and filter would be more discoverable in the LIST OPS section alongside other list helpers.

📁 Suggested organization

Move lines 309-334 (the map/filter implementations and their comment block) to appear after line 280 (after product) and before line 282 (the MAP OPS section header). Leave the join alias at lines 335-336 since it's a string operation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/Std.sw` around lines 309 - 337, Move the list-based implementations map,
_map_acc, filter, and _filter_acc out of the STRING OPS block into the LIST OPS
section so they are colocated with other list helpers (place them after the
product function and before the MAP OPS header); keep the join function
(string_join alias) in the STRING OPS area. Ensure you relocate the comment
block that documents map/filter along with those functions and update
surrounding spacing/headers so MAP OPS still follows the LIST OPS section.
tests/sw/conform/t05_collections.sw (1)

20-21: ⚡ Quick win

Document the intentional argument order difference.

Lines 20-21 use different argument orders for map (function-first) vs filter (list-first), which tests the documented order detection feature for map/filter builtins. Adding a brief comment would prevent future maintainers from misinterpreting this as an inconsistency.

📝 Suggested documentation
+    # Intentionally using different argument orders to validate order detection
     print(f"map={map(fun(x) { x * 2 }, xs)}")
     print(f"filter={filter(xs, fun(x) { x % 2 == 1 })}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/sw/conform/t05_collections.sw` around lines 20 - 21, Add a brief inline
comment above the two test calls explaining the intentional difference in
argument order to exercise the documented order-detection feature: note that
map(fun, xs) uses function-first while filter(xs, fun) uses list-first here on
the map and filter calls (symbols: map, filter, fun, xs) so future maintainers
don’t treat it as a bug.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 34-44: Update the KNOWN_ISSUES.md entry to stop saying mutual tail
recursion "overflows … as a raw SIGSEGV without even the crash banner" and
"needs `sigaltstack`" — note that `src/swarmrt_native.c` now installs a
per-thread `sigaltstack` and registers the SIGSEGV handler with `SA_ONSTACK`, so
stack-guard faults are detected and produce the diagnostic panic banner (e.g.
`panic: stack overflow in process `#N` (128KB fiber stack)`); keep the rest of the
explanation (only self-tail calls are TCO'd, impact on mutually recursive FSMs,
and the workaround to consolidate the loop into one function) but change the
wording to reflect that the runtime now prints a stack-overflow diagnostic
instead of silently crashing.

In `@src/swarmrt_builtins_studio.h`:
- Around line 5241-5246: The code currently treats any wake from
sw_receive_any(c->ms, &tag) as a timeout and fires the callback (sw_val_apply)
even when a real message arrived; change the logic so that when sw_receive_any
returns a non-NULL m (i.e. an actual message), you free m and do NOT call
sw_val_apply but instead continue waiting for the real timeout event, only
invoking sw_val_apply when sw_receive_any indicates a timeout (m == NULL) and
the process isn't cancelled (check self = sw_self() and self->kill_flag); apply
the same change to the equivalent block around lines 5271–5278 that uses c->ms,
tag, sw_receive_any and sw_val_apply.

In `@src/swarmrt_codegen.c`:
- Around line 2056-2068: In emit_pipe(), replicate the same qualified-builtin
denylist logic used in the direct-call path: when lowering a piped qualified
call (the branch that transforms xs |> Std.reduce(...) or xs |> Std.pmap(...))
detect that the module is "Std" and the function name is "reduce" or "pmap" and
emit the same diagnostic (the fprintf message that references gf/fname and
ctx->mod_name/n->line) and set ctx->had_unknown_fn so codegen fails early;
basically copy the check that uses gf/mod/strcmp from the direct-call branch
into the pipe-lowering branch so piped Std.reduce/Std.pmap produce the identical
hint and early exit.

In `@src/swarmrt_lang.c`:
- Around line 1773-1784: The current desugaring creates a user-visible temp
named "__destruct_%d" (dt_ctr/tmp) and emits a normal N_ASSIGN into the current
scope (bind->v.assign.name), which can collide with user bindings; instead
generate and use a non-user-visible temporary and mark it so both interpreter
and codegen treat it as internal. Change the tmp generation to produce an
internal-only identifier (e.g., use a reserved/internal prefix or a
hidden-marker byte) or create a dedicated internal bind kind, set bind so its
name is the internal temp and ensure the interpreter/codegen treat that name as
non-exported/hidden (or handle the new internal bind kind) so the temp cannot
collide with user code and is not visible to the user.

In `@src/swarmrt_native.c`:
- Around line 1179-1186: The _sw_install_altstack function currently ignores the
return value of sigaltstack; modify _sw_install_altstack to check sigaltstack's
return, and on failure capture errno and surface the error (e.g., call
fprintf(stderr, ...) or process logger and/or abort/exit or return an error
code) so SA_ONSTACK paths don't silently degrade; update callers or related
installs (the places handling alternate stack usage around _sw_install_altstack
and the SA_ONSTACK handling referenced at the earlier uses) to handle the
failure if you choose to return an error instead of aborting. Ensure you include
errno or strerror(errno) in the reported message to aid debugging and keep
function name _sw_install_altstack as the change location.
- Around line 1206-1208: The guard-window check hardcodes 8192 bytes; change it
to compute the window using the system page size so stack overflow detection
works on non-4KB systems: compute page_size (e.g., via sysconf(_SC_PAGESIZE) or
getpagesize()) and replace the constant with something like page_size * 2 (or
the same intended margin) when calculating guard_bot from guard_top; update the
check around the symbols guard_top, guard_bot, cur->stack_mem, and fault to use
the runtime-derived page_size.

---

Outside diff comments:
In `@src/swarmrt_lang.c`:
- Around line 2544-2548: The fprintf calls in the SW_VAL_PID and
SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t (v->v.pid->pid
and v->v.rpid.id); replace the format specifier with PRIu64 from <inttypes.h>
(e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ? v->v.pid->pid : 0));
and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ? v->v.rpid.node : "?",
(uint64_t)v->v.rpid.id)); or alternately cast to (unsigned long long) if you
prefer; ensure inttypes.h is included so PRIu64 is available.
- Around line 4563-4616: The pipe handler only dispatches when
n->v.pipe.func->type == N_CALL, so plain forms like `val |> fname` or `val |>
var` fall through and return val; update the N_PIPE case to also handle non-call
function heads: when n->v.pipe.func is a name/identifier (e.g., N_NAME or the
AST node type used for bare identifiers), extract the symbol
(n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same
dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname)
and call the module function, test primary builtins (builtin_print,
builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname,
args, nargs, n->line), and finally check env_get(env, fname) and call
interp_apply_fn for lambda-valued variables; ensure interp->error and recursion
checks mirror the N_CALL path and return the resulting sw_val_t instead of
falling through.

---

Nitpick comments:
In `@lib/Std.sw`:
- Around line 323-333: The filter implementation uses a truthy check for
predicates; update it to match the library's explicit predicate pattern by
checking equality to the string 'true'. Specifically, in function _filter_acc
replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) == 'true') so
predicates behave consistently with other functions (e.g., take_while,
partition, any, all, count) while leaving filter and _filter_acc signatures
unchanged.
- Around line 309-337: Move the list-based implementations map, _map_acc,
filter, and _filter_acc out of the STRING OPS block into the LIST OPS section so
they are colocated with other list helpers (place them after the product
function and before the MAP OPS header); keep the join function (string_join
alias) in the STRING OPS area. Ensure you relocate the comment block that
documents map/filter along with those functions and update surrounding
spacing/headers so MAP OPS still follows the LIST OPS section.

In `@tests/sw/conform/t05_collections.sw`:
- Around line 20-21: Add a brief inline comment above the two test calls
explaining the intentional difference in argument order to exercise the
documented order-detection feature: note that map(fun, xs) uses function-first
while filter(xs, fun) uses list-first here on the map and filter calls (symbols:
map, filter, fun, xs) so future maintainers don’t treat it as a bug.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 663b5f13-af94-4f8f-994f-39d1a57b29c7

📥 Commits

Reviewing files that changed from the base of the PR and between f11fb08 and fda588a.

📒 Files selected for processing (26)
  • Makefile
  • README.md
  • docs/SW_LANGUAGE.md
  • docs/notes/KNOWN_ISSUES.md
  • docs/notes/REVIEW_FABLE_2026-06.md
  • lib/Std.sw
  • src/swarmrt_builtins_studio.h
  • src/swarmrt_codegen.c
  • src/swarmrt_lang.c
  • src/swarmrt_lang.h
  • src/swarmrt_native.c
  • src/swarmrt_native.h
  • src/swarmrt_repl.c
  • tests/sw/conform/t01_errors.sw
  • tests/sw/conform/t02_panic_uncatchable.sw
  • tests/sw/conform/t03_case.sw
  • tests/sw/conform/t04_strings.sw
  • tests/sw/conform/t05_collections.sw
  • tests/sw/conform/t06_arith_lambdas.sw
  • tests/sw/conform/t07_with_json.sw
  • tests/sw/conform/t08_destructure_with.sw
  • tests/sw/conform/t09_std.sw
  • tests/sw/run_conform.sh
  • tests/sw/test_after_arrow.sw
  • tests/sw/test_case_shadow_scrutinee.sw
  • tests/sw/test_tco_depth.sw
✅ Files skipped from review due to trivial changes (3)
  • tests/sw/conform/t04_strings.sw
  • tests/sw/test_after_arrow.sw
  • README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/sw/test_case_shadow_scrutinee.sw
🛑 Comments failed to post (6)
docs/notes/KNOWN_ISSUES.md (1)

34-44: ⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if mutual-recursion overflow now prints a diagnostic

# Create a mutual-recursion test
cat > /tmp/mutual_recurse.sw << 'EOF'
module MutualTest
export [main]

fun is_even(n) { if (n == 0) { 'true' } else { is_odd(n - 1) } }
fun is_odd(n)  { if (n == 0) { 'false' } else { is_even(n - 1) } }

fun main() {
    print(is_even(100000))
}
EOF

# Compile and run, capture stderr
./bin/swc build /tmp/mutual_recurse.sw -o /tmp/mutual_test 2>&1
/tmp/mutual_test 2>&1 | head -20

# Look for "stack overflow" message
if /tmp/mutual_test 2>&1 | grep -q "stack overflow"; then
  echo "FIXED: stack overflow diagnostic present"
else
  echo "NOT FIXED: no diagnostic, still raw SIGSEGV"
fi

Repository: skyblanket/swarmrt

Length of output: 227


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Show the current KNOWN_ISSUES entry (lines ~34-44)
echo "---- docs/notes/KNOWN_ISSUES.md:34-70 ----"
nl -ba docs/notes/KNOWN_ISSUES.md | sed -n '30,80p' || true

# 2) Locate the compiler executable used for building/swift-like sources
echo "---- repo root: top-level dirs/files ----"
ls -la

echo "---- list ./bin if present ----"
ls -la bin 2>/dev/null || echo "no ./bin"

echo "---- find executables named swc/similar ----"
# common candidates
find . -maxdepth 4 -type f \( -name 'swc' -o -name 'swc*' -o -name '*compiler*' -o -name '*fable*' \) 2>/dev/null | head -200 || true

echo "---- fd for swc in repo ----"
command -v fd >/dev/null 2>&1 && fd -a -t f 'swc' . | head -50 || true

# 3) Search for the promised stack-overflow diagnostics implementation
echo "---- rg sigaltstack/SA_ONSTACK/stack overflow messages ----"
rg -n --hidden --no-ignore-vcs -S "sigaltstack|SA_ONSTACK|stack overflow|overflow in process" . || true

# 4) If we find an executable, re-run the mutual recursion test using it.
# We'll try a few likely paths derived from what we find above.
# (If nothing executable is found, we’ll stop with a helpful message.)
EXEC=""
if [ -x ./bin/swc ]; then EXEC="./bin/swc"; fi
if [ -z "$EXEC" ]; then
  # try first match from find output
  CAND="$(find . -maxdepth 5 -type f \( -name 'swc' -o -name 'swc*' \) -perm -111 2>/dev/null | head -1 || true)"
  if [ -n "${CAND:-}" ]; then EXEC="$CAND"; fi
fi

if [ -z "$EXEC" ]; then
  echo "Could not locate an executable 'swc' (or similar) in this repo sandbox; cannot run the mutual-recursion test yet."
  exit 0
fi

echo "---- using compiler executable: $EXEC ----"

cat > /tmp/mutual_recurse.sw << 'EOF'
module MutualTest
export [main]

fun is_even(n) { if (n == 0) { 'true' } else { is_odd(n - 1) } }
fun is_odd(n)  { if (n == 0) { 'false' } else { is_even(n - 1) } }

fun main() {
    print(is_even(100000))
}
EOF

# Try common CLI shapes: build -o, or compile -o, etc. We’ll try a few without assuming.
set +e
OUTLOG=/tmp/mutual_test.log
rm -f /tmp/mutual_test /tmp/mutual_test.log

"$EXEC" build /tmp/mutual_recurse.sw -o /tmp/mutual_test 2>&1 | tee "$OUTLOG"
rc=$?
if [ $rc -ne 0 ]; then
  echo "Compile step failed with exit code $rc. Log tail:"
  tail -200 "$OUTLOG" || true
  exit 0
fi

set -e
/tmp/mutual_test 2>&1 | head -80 | tee /tmp/mutual_test.run.log

if ( /tmp/mutual_test 2>&1 | rg -qi "stack overflow|overflow in process" ); then
  echo "FIXED/DIAGNOSTIC: stack overflow diagnostic present"
else
  echo "NOT FIXED: still appears as raw crash without 'stack overflow' wording"
fi

Repository: skyblanket/swarmrt

Length of output: 4196


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- KNOWN_ISSUES.md lines 30-60 (no nl) ----"
sed -n '30,60p' docs/notes/KNOWN_ISSUES.md || true

echo "---- REVIEW_FABLE_2026-06.md around O2(b) (160-210) ----"
sed -n '150,210p' docs/notes/REVIEW_FABLE_2026-06.md || true

echo "---- REVIEW_FABLE_2026-06.md around table row (270-350) ----"
sed -n '260,340p' docs/notes/REVIEW_FABLE_2026-06.md || true

echo "---- src/swarmrt_native.c around altstack/panic (~1130-1235) ----"
sed -n '1125,1235p' src/swarmrt_native.c || true

echo "---- src/swarmrt_native.c around signal handler setup (~1360-1420) ----"
sed -n '1355,1425p' src/swarmrt_native.c || true

Repository: skyblanket/swarmrt

Length of output: 16989


Update KNOWN_ISSUES.md stack-overflow wording to match current sigaltstack handling
docs/notes/KNOWN_ISSUES.md (lines 34-44) still says mutual recursion “overflows … as a raw SIGSEGV without even the crash banner” and “needs sigaltstack”. src/swarmrt_native.c installs a per-thread sigaltstack and registers the SIGSEGV handler with SA_ONSTACK, which detects guard-page faults and prints panic: stack overflow in process #N (128KB fiber stack) (with the mutual tail-call explanation). Update the entry to reflect the diagnostic-on-overflow behavior instead of silent SIGSEGV.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/notes/KNOWN_ISSUES.md` around lines 34 - 44, Update the KNOWN_ISSUES.md
entry to stop saying mutual tail recursion "overflows … as a raw SIGSEGV without
even the crash banner" and "needs `sigaltstack`" — note that
`src/swarmrt_native.c` now installs a per-thread `sigaltstack` and registers the
SIGSEGV handler with `SA_ONSTACK`, so stack-guard faults are detected and
produce the diagnostic panic banner (e.g. `panic: stack overflow in process `#N`
(128KB fiber stack)`); keep the rest of the explanation (only self-tail calls
are TCO'd, impact on mutually recursive FSMs, and the workaround to consolidate
the loop into one function) but change the wording to reflect that the runtime
now prints a stack-overflow diagnostic instead of silently crashing.
src/swarmrt_builtins_studio.h (1)

5241-5246: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep waiting after spurious mailbox wakeups.

sw_receive_any(c->ms, ...) wakes on either timeout or any incoming message. These handlers discard the message but still run the callback immediately, so sending anything to the timer pid shortens delay/interval and can make delay() fire almost instantly.

💡 Suggested fix
 static void _after_entry(void *raw) {
     _timer_closure_t *c = (_timer_closure_t *)raw;
@@
-    uint64_t tag;
-    void *m = sw_receive_any(c->ms, &tag);
-    if (m) free(m);                       /* discard any spurious message */
-    sw_process_t *self = sw_self();
-    if (!(self && self->kill_flag))       /* fire only if not cancelled mid-wait */
-        sw_val_apply(c->fn, NULL, 0);
+    uint64_t deadline = _sw_now_ms() + c->ms;
+    for (;;) {
+        uint64_t now = _sw_now_ms();
+        if (now >= deadline) break;
+        uint64_t tag;
+        void *m = sw_receive_any(deadline - now, &tag);
+        if (m) free(m);                   /* ignore foreign messages, keep waiting */
+        sw_process_t *self = sw_self();
+        if (self && self->kill_flag) return;
+    }
+    sw_val_apply(c->fn, NULL, 0);
 }
@@
 static void _every_entry(void *raw) {
     _timer_closure_t *c = (_timer_closure_t *)raw;
@@
     for (;;) {
-        uint64_t tag;
-        void *m = sw_receive_any(c->ms, &tag);
-        if (m) free(m);                        /* discard any spurious message */
-        sw_process_t *self = sw_self();
-        if (self && self->kill_flag) break;    /* cancelled during the wait */
+        uint64_t deadline = _sw_now_ms() + c->ms;
+        for (;;) {
+            uint64_t now = _sw_now_ms();
+            if (now >= deadline) break;
+            uint64_t tag;
+            void *m = sw_receive_any(deadline - now, &tag);
+            if (m) free(m);
+            sw_process_t *self = sw_self();
+            if (self && self->kill_flag) return;
+        }
+        sw_process_t *self = sw_self();
+        if (self && self->kill_flag) break;
         sw_val_apply(c->fn, NULL, 0);
         self = sw_self();
         if (self && self->kill_flag) break;    /* cancelled during/after the tick */
     }
 }

Also applies to: 5271-5278

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_builtins_studio.h` around lines 5241 - 5246, The code currently
treats any wake from sw_receive_any(c->ms, &tag) as a timeout and fires the
callback (sw_val_apply) even when a real message arrived; change the logic so
that when sw_receive_any returns a non-NULL m (i.e. an actual message), you free
m and do NOT call sw_val_apply but instead continue waiting for the real timeout
event, only invoking sw_val_apply when sw_receive_any indicates a timeout (m ==
NULL) and the process isn't cancelled (check self = sw_self() and
self->kill_flag); apply the same change to the equivalent block around lines
5271–5278 that uses c->ms, tag, sw_receive_any and sw_val_apply.
src/swarmrt_codegen.c (1)

2056-2068: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mirror the qualified-builtin denylist in emit_pipe().

Line 2064 only catches direct Std.reduce(...) / Std.pmap(...) calls. The piped qualified-call branches still lower xs |> Std.reduce(f, 0) to Std_reduce(...), so this path falls through to a raw generated-C unknown-function failure instead of the new hint and early codegen failure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_codegen.c` around lines 2056 - 2068, In emit_pipe(), replicate
the same qualified-builtin denylist logic used in the direct-call path: when
lowering a piped qualified call (the branch that transforms xs |>
Std.reduce(...) or xs |> Std.pmap(...)) detect that the module is "Std" and the
function name is "reduce" or "pmap" and emit the same diagnostic (the fprintf
message that references gf/fname and ctx->mod_name/n->line) and set
ctx->had_unknown_fn so codegen fails early; basically copy the check that uses
gf/mod/strcmp from the direct-call branch into the pipe-lowering branch so piped
Std.reduce/Std.pmap produce the identical hint and early exit.
src/swarmrt_lang.c (1)

1773-1784: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Destructuring introduces a user-visible temp binding.

__destruct_N is lowered as a normal assignment in the current scope, so destructuring leaks an extra variable and can overwrite a user-defined __destruct_* binding. Because this desugar feeds both the interpreter and codegen, the collision changes program behavior on both paths. Please lower through a non-user-visible temporary instead of a regular identifier.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_lang.c` around lines 1773 - 1784, The current desugaring creates
a user-visible temp named "__destruct_%d" (dt_ctr/tmp) and emits a normal
N_ASSIGN into the current scope (bind->v.assign.name), which can collide with
user bindings; instead generate and use a non-user-visible temporary and mark it
so both interpreter and codegen treat it as internal. Change the tmp generation
to produce an internal-only identifier (e.g., use a reserved/internal prefix or
a hidden-marker byte) or create a dedicated internal bind kind, set bind so its
name is the internal temp and ensure the interpreter/codegen treat that name as
non-exported/hidden (or handle the new internal bind kind) so the temp cannot
collide with user code and is not visible to the user.
src/swarmrt_native.c (2)

1179-1186: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Check sigaltstack installation result and surface failures.

Line 1185 ignores sigaltstack errors; if installation fails, the SA_ONSTACK path silently degrades on Line 1161/Line 1393.

Suggested patch
-static void _sw_install_altstack(void);   /* defined with the crash handler below */
+static int _sw_install_altstack(void);   /* defined with the crash handler below */

 static void *scheduler_main(void *arg) {
     sw_scheduler_t *sched = (sw_scheduler_t *)arg;
-    _sw_install_altstack();   /* fiber-stack overflows fault on THIS thread */
+    if (_sw_install_altstack() != 0) {
+        perror("[SwarmRT] sigaltstack (scheduler thread)");
+    }
     sched->active = 1;
     scheduler_loop(sched);
     return NULL;
 }

-static void _sw_install_altstack(void) {
+static int _sw_install_altstack(void) {
     static __thread char altstack_mem[64 * 1024];
     stack_t ss;
     ss.ss_sp = altstack_mem;
     ss.ss_size = sizeof(altstack_mem);
     ss.ss_flags = 0;
-    sigaltstack(&ss, NULL);
+    return sigaltstack(&ss, NULL);
 }
...
-        _sw_install_altstack();   /* this thread; schedulers install their own */
+        if (_sw_install_altstack() != 0) {
+            perror("[SwarmRT] sigaltstack (init thread)");
+        }

Also applies to: 1161-1161, 1393-1393

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_native.c` around lines 1179 - 1186, The _sw_install_altstack
function currently ignores the return value of sigaltstack; modify
_sw_install_altstack to check sigaltstack's return, and on failure capture errno
and surface the error (e.g., call fprintf(stderr, ...) or process logger and/or
abort/exit or return an error code) so SA_ONSTACK paths don't silently degrade;
update callers or related installs (the places handling alternate stack usage
around _sw_install_altstack and the SA_ONSTACK handling referenced at the
earlier uses) to handle the failure if you choose to return an error instead of
aborting. Ensure you include errno or strerror(errno) in the reported message to
aid debugging and keep function name _sw_install_altstack as the change
location.

1206-1208: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use runtime page size for the guard-window check.

Line 1207 hardcodes 8192; on non-4KB page systems this can miss real guard-page overflows and skip the dedicated overflow banner.

Suggested patch
+static size_t g_page_size = 4096;
...
 int sw_init(const char *name, uint32_t num_schedulers) {
+    long ps = sysconf(_SC_PAGESIZE);
+    if (ps > 0) g_page_size = (size_t)ps;
...
     if (sig == SIGSEGV && cur && cur->stack_mem && info->si_addr) {
         uintptr_t fault = (uintptr_t)info->si_addr;
         uintptr_t guard_top = (uintptr_t)cur->stack_mem;       /* guard page is just below */
-        uintptr_t guard_bot = guard_top - 8192;                /* guard page + one page slack */
+        uintptr_t guard_bot = guard_top - (uintptr_t)(2 * g_page_size); /* guard + one-page slack */
         if (fault < guard_top && fault >= guard_bot) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_native.c` around lines 1206 - 1208, The guard-window check
hardcodes 8192 bytes; change it to compute the window using the system page size
so stack overflow detection works on non-4KB systems: compute page_size (e.g.,
via sysconf(_SC_PAGESIZE) or getpagesize()) and replace the constant with
something like page_size * 2 (or the same intended margin) when calculating
guard_bot from guard_top; update the check around the symbols guard_top,
guard_bot, cur->stack_mem, and fault to use the runtime-derived page_size.

claude added 5 commits June 9, 2026 22:31
…advisory CI leg)

The slope gates catch UNBOUNDED growth but are blind to bounded leaks
under their RSS noise floor, and every ASAN run to date used
detect_leaks=0 (macOS Apple-clang has no LSan) — the leak-blindness
PRODUCTION_ROADMAP.md 2.1 calls the highest-leverage gap.

make lsan-gate (Linux-only): tests/gc/lsan_lifecycle.sw churns every
lifecycle owner per round — one-shot timer fired, interval cancelled
(incl. the pre-trampoline kill window), dynamic + static supervisors
killed, ETS put/replace/delete, plain spawn/exit, cross-process compound
messages, ~64KB captures — exits cleanly, and LSan asserts zero
definitely-lost blocks at exit. Exit-time is the right checkpoint:
parked-fiber stacks (invisible to the reachability scan) are all gone,
and globals-reachable singletons are 'still reachable', which LSan does
not report. Suppressions in tests/gc/lsan.supp, each one tied to a
documented accepted-minor (currently just _builtin_supervise's caller
specs[]).

Proven BIDIRECTIONAL per the gate philosophy: an injected unreachable
31337-byte block through the full runtime link fails the gate; the
lifecycle probe passes. (Methodology note for future leak canaries: at
-O1 clang legally elides an unused malloc — escape the pointer through
a volatile global or the canary tests nothing. Cost a debugging hour.)

Result: Phase-1's ownership work holds under real LSan — 40 rounds of
full lifecycle churn, zero unsuppressed leaks.

Two adjacent fixes/findings:
- _sw_install_altstack is skipped under ASAN: ASAN installs and at
  thread exit munmaps its OWN per-thread altstack; replacing it made
  UnsetAlternateSignalStack CHECK-fail. The crash handler runs on
  ASAN's altstack under sanitizers.
- setup_preemption() is DEAD CODE — defined, never called: the 1ms
  preemption timer never existed at runtime; the only preemption is the
  reduction check at compiled tail-loop backedges. Annotated as
  deliberately-off with the wiring considerations (storm re-verification
  + SIGALRM-vs-LSan interaction) rather than silently deleted.

CI: advisory leg (continue-on-error) after the slope gate; promote to
blocking once green for a week.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…ring_chars

Closes the Round-7 O6 diagnostics remainder + the last O7 stdlib gap:

- Diagnostics and #line directives now reference the REAL source file:
  swc registers each parsed file's path (entry inputs + resolved
  imports) with codegen via sw_codegen_register_source, and
  guess_sw_path consults the registry before falling back to the
  src/<ModName>.sw convention. 'swc build /tmp/foo.sw' errors no longer
  point at a nonexistent src/Foo.sw.

- A qualified call into a module that IS being compiled now validates
  at codegen time with a proper suggestion ('module Std has no function
  mpa — did you mean Std.map?') instead of leaking a raw cc
  implicit-declaration warning + linker undefined-reference — the worst
  diagnostic in the compiler. Modules not compiled in are left to the
  linker as before.

- string_chars(s): codepoint-aware split (list of single-codepoint
  strings, UTF-8 lead-byte walk, invalid leads degrade to one byte) in
  BOTH paths — string_length is documented byte-length, and agent code
  slicing UTF-8 by bytes corrupts runes. length(string_chars(s)) is the
  codepoint count; rejoin with Std.join(cs, ""). Conform t04 holds the
  paths identical.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
… for blocking-loopback, CI legs

The matrix (suite + conformance under SW_SCHEDULERS=1/2/4/8) found a
real architecture issue: the curl-backed HTTP client builtins block
their scheduler OS THREAD, not just the calling fiber. A program with an
in-process server fiber that calls itself over HTTP therefore deadlocks
under SW_SCHEDULERS=1 (the blocked client owns the only scheduler — the
suite sat on test_http_request FOREVER), and under SW_SCHEDULERS=2
whenever client and server land on the same scheduler: a runnable fiber
in a blocked scheduler's LOCAL queue is unstealable (work stealing
covers only the global overflow queue, not peer local queues). The
deadlock watchdog cannot see this shape — the thread is busy inside
libcurl, not parked.

- run_tests.sh: every test bounded by timeout 180 — a hang is a loud
  RUN FAIL, never a wedged suite.
- The three self-loopback tests SKIP below 3 schedulers, printing the
  reason; full assertion counts unchanged at S>=3.
- KNOWN_ISSUES: new entry with the mechanism, impact, and the Phase-3
  fix direction (blocking transports -> I/O pool with fiber park/wake,
  like wsc_*).
- CI: matrix legs for S=1 and S=8 (oversubscribed) after the default
  run; matrix verified green locally at S=1/2/3/4/8.

(Also a self-caught language-trap data point for the eval suite: this
patch's first version wrote 'a or b' — sw only has '||' — and the
matrix caught the COMPILE FAIL immediately.)

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
… 58.4 -> 4.5us/rt (13x)

Every message to a process on another scheduler found that scheduler
PARKED on its idle condvar, so each cross-scheduler send paid a
mutex+signal+futex wake and each round trip paid two — measured 58.4us
per ping-pong round trip vs 0.92us same-scheduler (Round-7 audit, O4).

Fix: before parking, the idle loop spin-polls local pick + (every ~64
iterations) the steal queue for a bounded window. While spinning,
rq->idle stays 0, so the producer's sw_add_to_runq pays NOTHING — no
lock, no signal — and the consumer picks the work up within ns. The
window is per-idle-transition and env-tunable: SW_SPIN_US (default 30,
0 disables, capped 1000); the existing 0.5ms park is unchanged, as is
single-scheduler behavior.

Measured (4 schedulers, 50K ping-pong round trips, same host):
  spin 30us (default): 4.46us/rt   | 100K spawn/exit cycles: 6.4us each
  spin 0 (old path):   58.4us/rt   | 100K spawn/exit cycles: 8.7us each
  SW_SCHEDULERS=1:     0.92us/rt (unchanged)

Verified: full battery green — suite + conformance, stress both
variants, gc-slope, gc-stress, lsan-gate, phases 2-10, and the
scheduler matrix at S=1/2/3/4/8.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…4/O6/O7 fixed

Phase 2.1 (LSan gate) and 2.2 (scheduler matrix) marked done with their
findings inline: the bidirectional-proof requirement, the
-O1-elides-unused-malloc canary lesson, the blocking-builtin scheduler
deadlock, and the spin-before-park numbers (58.4 -> 4.5us/rt). Round-7
continuation table updated: O4 (wake cost), O6 (diagnostics), O7
(codepoint strings) all FIXED — remaining open: O5 startup time, O8
gate-portability docs, O2c general TCO.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/swarmrt_builtins_studio.h`:
- Around line 8489-8491: The temporary buffer "items" is allocated with raw
malloc in the body of string_chars/1 (cap, items, sw_val_t **) and can be
NULL-dereferenced on OOM; defer or move this allocation onto the checked-OOM
path: do not call raw malloc inline — instead either (a) allocate using the
project's checked allocator API (e.g. the same checked/aborting allocator used
elsewhere) when you know allocation must succeed, or (b) delay allocation until
after the function has performed its OOM/VMA checks and then call a checked
allocation routine (or check the malloc return and call the existing sw OOM
abort helper). Update the code around the "cap"/"items" use so items is only
allocated via the checked allocator (or null-checked and aborts with the
diagnostic) before any write occurs.
- Around line 8470-8495: The code trusts the UTF-8 lead byte in _sw_utf8_seq_len
and then blindly consumes sl bytes in _builtin_string_chars, so malformed leads
or bad continuation bytes get treated as multi-byte codepoints; update the logic
in _builtin_string_chars to validate each continuation byte (each must satisfy
(byte & 0xC0) == 0x80) before accepting sl, and also detect and reject
overlong/illegal sequences (e.g., 0xC0/0xC1 leads, surrogates, codepoints >
0x10FFFF, and invalid lead ranges 0xF5-0xFF) so that any invalid or truncated
sequence falls back to treating the first byte as a single 1-byte character; you
can still use _sw_utf8_seq_len to propose sl but perform the full validation
against s, len, i and adjust sl to 1 on failure before creating the char entry
in items.

In `@tests/gc/lsan_lifecycle.sw`:
- Around line 43-45: The test currently schedules the one-shot timer with
delay(0, fn() { cap }) but then uses sleep(1) which can pass without the
callback running; change the test to explicitly wait for the callback to run
instead of sleeping — e.g., have the callback signal a synchronization primitive
or set a boolean/flag that the test blocks on and asserts before proceeding
(reference the delay(...) call, the fn() { cap } callback, and the current
sleep(...)/PROBE_OK probe), ensuring the probe only advances once the
timer-fired signal is observed to preserve fired-timer leak coverage.

In `@tests/sw/run_tests.sh`:
- Around line 57-62: Replace the direct non-portable invocation `timeout 180
"$bin" >"$log" 2>&1` with the existing cross-platform timeout wrapper used
elsewhere in the script (use a helper such as a `timeout_wrapper` or
`run_with_alarm` that tries GNU `timeout` first and falls back to a Perl `alarm`
implementation), so that running "$bin" with a 180s bound logs to "$log" and
behaves correctly on macOS and Linux; update the `if` condition to call that
wrapper instead of `timeout 180 "$bin"` and ensure the wrapper preserves exit
codes and redirects stdout/stderr the same way.

In `@tests/sw/test_http_headers.sw`:
- Around line 64-69: The comment in the self-loopback test block is inconsistent
with the skip condition: the code skips schedulers "1" and "2" but the comment
says "Needs >=2 schedulers"; change the comment text to match the actual
condition (e.g., "Needs >=3 schedulers") in tests/sw/test_http_headers.sw and
make the identical edits in the copied blocks in
tests/sw/test_http_post_stream.sw and tests/sw/test_http_request.sw so the
printed message and the skip logic are consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 62405b32-6202-44fb-b4f5-30fc3d0288ca

📥 Commits

Reviewing files that changed from the base of the PR and between fda588a and df57d3c.

📒 Files selected for processing (19)
  • .github/workflows/linux-quickstart.yml
  • Makefile
  • docs/PRODUCTION_ROADMAP.md
  • docs/SW_LANGUAGE.md
  • docs/notes/KNOWN_ISSUES.md
  • docs/notes/REVIEW_FABLE_2026-06.md
  • src/swarmrt_builtins_studio.h
  • src/swarmrt_codegen.c
  • src/swarmrt_codegen.h
  • src/swarmrt_lang.c
  • src/swarmrt_native.c
  • src/swc.c
  • tests/gc/lsan.supp
  • tests/gc/lsan_lifecycle.sw
  • tests/sw/conform/t04_strings.sw
  • tests/sw/run_tests.sh
  • tests/sw/test_http_headers.sw
  • tests/sw/test_http_post_stream.sw
  • tests/sw/test_http_request.sw
✅ Files skipped from review due to trivial changes (5)
  • tests/gc/lsan.supp
  • docs/PRODUCTION_ROADMAP.md
  • docs/SW_LANGUAGE.md
  • docs/notes/KNOWN_ISSUES.md
  • docs/notes/REVIEW_FABLE_2026-06.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/sw/conform/t04_strings.sw
  • src/swarmrt_lang.c

Comment on lines +8470 to +8495
static int _sw_utf8_seq_len(unsigned char c) {
if (c < 0x80) return 1;
if ((c & 0xE0) == 0xC0) return 2;
if ((c & 0xF0) == 0xE0) return 3;
if ((c & 0xF8) == 0xF0) return 4;
return 1;
}

/* string_chars(s) → list of single-CODEPOINT strings. The codepoint-aware
* complement to byte-oriented string_length/string_sub: slicing UTF-8 text
* by bytes corrupts multi-byte runes (Round-7 audit O7), and agent code
* handles non-ASCII text constantly. length(string_chars(s)) is the
* codepoint length; reassemble slices with Std.join(chars, ""). Walks raw
* UTF-8 sequence lengths (no grapheme clustering — combining marks stay
* separate codepoints, documented). */
static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) {
if (n < 1 || !a[0] || a[0]->type != SW_VAL_STRING) return sw_val_list(NULL, 0);
const char *s = a[0]->v.str ? a[0]->v.str : "";
size_t len = strlen(s);
int cap = (int)len + 1;
sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap);
int cnt = 0;
size_t i = 0;
while (i < len) {
int sl = _sw_utf8_seq_len((unsigned char)s[i]);
if (i + (size_t)sl > len) sl = 1; /* truncated tail sequence */

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate UTF-8 continuation bytes before consuming a full sequence.

Line 8494 only trusts the lead byte. Inputs with invalid leads (0xC0/0xC1, 0xF5-0xFF) or bad continuations (for example "\xE2A") are still consumed as multi-byte chars, so string_chars/1 mis-splits malformed strings even though the helper comment says malformed bytes should degrade to a single char.

Suggested fix
 static int _sw_utf8_seq_len(unsigned char c) {
     if (c < 0x80) return 1;
-    if ((c & 0xE0) == 0xC0) return 2;
-    if ((c & 0xF0) == 0xE0) return 3;
-    if ((c & 0xF8) == 0xF0) return 4;
+    if (c >= 0xC2 && c <= 0xDF) return 2;
+    if (c >= 0xE0 && c <= 0xEF) return 3;
+    if (c >= 0xF0 && c <= 0xF4) return 4;
     return 1;
 }
+
+static int _sw_utf8_is_cont(unsigned char c) {
+    return (c & 0xC0) == 0x80;
+}
 
 static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) {
@@
     while (i < len) {
         int sl = _sw_utf8_seq_len((unsigned char)s[i]);
         if (i + (size_t)sl > len) sl = 1;   /* truncated tail sequence */
+        for (int j = 1; j < sl; j++) {
+            if (!_sw_utf8_is_cont((unsigned char)s[i + (size_t)j])) {
+                sl = 1;
+                break;
+            }
+        }
         char buf[8];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
static int _sw_utf8_seq_len(unsigned char c) {
if (c < 0x80) return 1;
if ((c & 0xE0) == 0xC0) return 2;
if ((c & 0xF0) == 0xE0) return 3;
if ((c & 0xF8) == 0xF0) return 4;
return 1;
}
/* string_chars(s) → list of single-CODEPOINT strings. The codepoint-aware
* complement to byte-oriented string_length/string_sub: slicing UTF-8 text
* by bytes corrupts multi-byte runes (Round-7 audit O7), and agent code
* handles non-ASCII text constantly. length(string_chars(s)) is the
* codepoint length; reassemble slices with Std.join(chars, ""). Walks raw
* UTF-8 sequence lengths (no grapheme clusteringcombining marks stay
* separate codepoints, documented). */
static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) {
if (n < 1 || !a[0] || a[0]->type != SW_VAL_STRING) return sw_val_list(NULL, 0);
const char *s = a[0]->v.str ? a[0]->v.str : "";
size_t len = strlen(s);
int cap = (int)len + 1;
sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap);
int cnt = 0;
size_t i = 0;
while (i < len) {
int sl = _sw_utf8_seq_len((unsigned char)s[i]);
if (i + (size_t)sl > len) sl = 1; /* truncated tail sequence */
static int _sw_utf8_seq_len(unsigned char c) {
if (c < 0x80) return 1;
if (c >= 0xC2 && c <= 0xDF) return 2;
if (c >= 0xE0 && c <= 0xEF) return 3;
if (c >= 0xF0 && c <= 0xF4) return 4;
return 1;
}
static int _sw_utf8_is_cont(unsigned char c) {
return (c & 0xC0) == 0x80;
}
/* string_chars(s) → list of single-CODEPOINT strings. The codepoint-aware
* complement to byte-oriented string_length/string_sub: slicing UTF-8 text
* by bytes corrupts multi-byte runes (Round-7 audit O7), and agent code
* handles non-ASCII text constantly. length(string_chars(s)) is the
* codepoint length; reassemble slices with Std.join(chars, ""). Walks raw
* UTF-8 sequence lengths (no grapheme clusteringcombining marks stay
* separate codepoints, documented). */
static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) {
if (n < 1 || !a[0] || a[0]->type != SW_VAL_STRING) return sw_val_list(NULL, 0);
const char *s = a[0]->v.str ? a[0]->v.str : "";
size_t len = strlen(s);
int cap = (int)len + 1;
sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap);
int cnt = 0;
size_t i = 0;
while (i < len) {
int sl = _sw_utf8_seq_len((unsigned char)s[i]);
if (i + (size_t)sl > len) sl = 1; /* truncated tail sequence */
for (int j = 1; j < sl; j++) {
if (!_sw_utf8_is_cont((unsigned char)s[i + (size_t)j])) {
sl = 1;
break;
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_builtins_studio.h` around lines 8470 - 8495, The code trusts the
UTF-8 lead byte in _sw_utf8_seq_len and then blindly consumes sl bytes in
_builtin_string_chars, so malformed leads or bad continuation bytes get treated
as multi-byte codepoints; update the logic in _builtin_string_chars to validate
each continuation byte (each must satisfy (byte & 0xC0) == 0x80) before
accepting sl, and also detect and reject overlong/illegal sequences (e.g.,
0xC0/0xC1 leads, surrogates, codepoints > 0x10FFFF, and invalid lead ranges
0xF5-0xFF) so that any invalid or truncated sequence falls back to treating the
first byte as a single 1-byte character; you can still use _sw_utf8_seq_len to
propose sl but perform the full validation against s, len, i and adjust sl to 1
on failure before creating the char entry in items.

Comment on lines +8489 to +8491
int cap = (int)len + 1;
sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap);
int cnt = 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep this temporary buffer on the checked OOM path.

Line 8490 allocates items with raw malloc() and Line 8499 writes through it immediately. Under the same OOM/VMA-exhaustion conditions this PR is hardening, string_chars/1 still falls back to a bare NULL-deref instead of the new diagnostic abort path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_builtins_studio.h` around lines 8489 - 8491, The temporary buffer
"items" is allocated with raw malloc in the body of string_chars/1 (cap, items,
sw_val_t **) and can be NULL-dereferenced on OOM; defer or move this allocation
onto the checked-OOM path: do not call raw malloc inline — instead either (a)
allocate using the project's checked allocator API (e.g. the same
checked/aborting allocator used elsewhere) when you know allocation must
succeed, or (b) delay allocation until after the function has performed its
OOM/VMA checks and then call a checked allocation routine (or check the malloc
return and call the existing sw OOM abort helper). Update the code around the
"cap"/"items" use so items is only allocated via the checked allocator (or
null-checked and aborts with the diagnostic) before any write occurs.

Comment on lines +43 to +45
delay(0, fn() { cap })
sleep(1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert timer fire explicitly instead of relying on fixed sleep.

Line 43 schedules the one-shot delay, but Line 44 only sleeps; this does not guarantee the callback actually ran. If it doesn’t fire, the probe can still reach PROBE_OK, weakening fired-timer leak coverage.

Suggested fix
 fun one_round(cap) {
     # one-shot delay: fires, sends, closure must be freed after firing
-    delay(0, fn() { cap })
-    sleep(1)
+    parent = self()
+    delay(0, fn() { send(parent, 'delay_fired') ; cap })
+    receive { 'delay_fired' -> 0 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gc/lsan_lifecycle.sw` around lines 43 - 45, The test currently
schedules the one-shot timer with delay(0, fn() { cap }) but then uses sleep(1)
which can pass without the callback running; change the test to explicitly wait
for the callback to run instead of sleeping — e.g., have the callback signal a
synchronization primitive or set a boolean/flag that the test blocks on and
asserts before proceeding (reference the delay(...) call, the fn() { cap }
callback, and the current sleep(...)/PROBE_OK probe), ensuring the probe only
advances once the timer-fired signal is observed to preserve fired-timer leak
coverage.

Comment thread tests/sw/run_tests.sh
Comment on lines +57 to +62
# Run. Bounded: a hung test must FAIL the suite loudly, not wedge it
# forever — the Phase-2.2 scheduler matrix found test_http_request
# deadlocking under SW_SCHEDULERS=1 (blocking curl client starves the
# in-process server fiber on the only scheduler) and the suite sat on
# it indefinitely. 180s is ~20x the slowest healthy test.
if timeout 180 "$bin" >"$log" 2>&1; then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -eu
echo "Platform: $(uname -s)"
echo "timeout:  $(command -v timeout || echo MISSING)"
echo "gtimeout: $(command -v gtimeout || echo MISSING)"
nl -ba tests/sw/run_tests.sh | sed -n '54,68p'

Repository: skyblanket/swarmrt

Length of output: 173


🏁 Script executed:

#!/bin/bash
set -eu

f="tests/sw/run_tests.sh"
echo "== timeout usage ==" 
rg -n --hidden --no-mmap '\btimeout\b' "$f" || true

echo
echo "== gtimeout usage =="
rg -n --hidden --no-mmap '\bgtimeout\b' "$f" || true

echo
echo "== perl alarm fallback =="
rg -n --hidden --no-mmap 'alarm\s+\$t|perl.*alarm|perl -e' "$f" || true

echo
echo "== context around lines 50-75 =="
nl -ba "$f" | sed -n '45,85p'

Repository: skyblanket/swarmrt

Length of output: 715


🏁 Script executed:

#!/bin/bash
set -eu

f="tests/sw/run_tests.sh"
echo "== timeout/gtimeout usage =="
rg -n --no-mmap '\btimeout\b|\bgtimeout\b' "$f" || true

echo
echo "== lines 45-90 (no nl available) =="
sed -n '45,90p' "$f"

echo
echo "== locate existing perl alarm usage block(s) =="
rg -n --no-mmap 'perl -e.*alarm|alarm\s+\$t|macOS has no `timeout`' "$f" || true

Repository: skyblanket/swarmrt

Length of output: 2458


🏁 Script executed:

#!/bin/bash
set -eu
f="tests/sw/run_tests.sh"

echo "== function definitions containing timeout/perl/alarm =="
rg -n --no-mmap '^\s*(run_with_timeout|timeout|alarm|perl -e|gtimeout)\b|^\s*[a-zA-Z0-9_]+\s*\(\)\s*\{' "$f" || true

echo
echo "== search for 'run_with_' wrappers =="
rg -n --no-mmap 'run_with_|\btimeout/|fallback|wrapper' "$f" || true

echo
echo "== print surrounding perl-alarm blocks (150-235) =="
sed -n '130,240p' "$f"

Repository: skyblanket/swarmrt

Length of output: 5432


🏁 Script executed:

#!/bin/bash
set -eu
f="tests/sw/run_tests.sh"
echo "== top of file =="
sed -n '1,120p' "$f"

Repository: skyblanket/swarmrt

Length of output: 4334


Replace the non-portable timeout call in the main run loop with a timeout/Perl fallback wrapper.

tests/sw/run_tests.sh invokes timeout 180 "$bin" directly, with no macOS fallback, while other parts of the script already use Perl alarm explicitly because macOS lacks timeout. On platforms without GNU timeout, this will fail the suite (e.g., exit 127) instead of properly bounding hangs.

Proposed fix
@@
 set -u
 SWARMRT_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
 SWC="$SWARMRT_ROOT/bin/swc"
 TESTS_DIR="$SWARMRT_ROOT/tests/sw"
 BUILD_DIR="$SWARMRT_ROOT/tests/sw/.build"
+
+run_with_timeout() {
+    secs="$1"; shift
+    if command -v timeout >/dev/null 2>&1; then
+        timeout "$secs" "$@"
+    else
+        perl -e 'my $t=shift `@ARGV`; alarm $t; exec `@ARGV` or die' "$secs" "$@"
+    fi
+}
@@
-    if timeout 180 "$bin" >"$log" 2>&1; then
+    if run_with_timeout 180 "$bin" >"$log" 2>&1; then
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Run. Bounded: a hung test must FAIL the suite loudly, not wedge it
# forever — the Phase-2.2 scheduler matrix found test_http_request
# deadlocking under SW_SCHEDULERS=1 (blocking curl client starves the
# in-process server fiber on the only scheduler) and the suite sat on
# it indefinitely. 180s is ~20x the slowest healthy test.
if timeout 180 "$bin" >"$log" 2>&1; then
# Run. Bounded: a hung test must FAIL the suite loudly, not wedge it
# forever — the Phase-2.2 scheduler matrix found test_http_request
# deadlocking under SW_SCHEDULERS=1 (blocking curl client starves the
# in-process server fiber on the only scheduler) and the suite sat on
# it indefinitely. 180s is ~20x the slowest healthy test.
if run_with_timeout 180 "$bin" >"$log" 2>&1; then
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/sw/run_tests.sh` around lines 57 - 62, Replace the direct non-portable
invocation `timeout 180 "$bin" >"$log" 2>&1` with the existing cross-platform
timeout wrapper used elsewhere in the script (use a helper such as a
`timeout_wrapper` or `run_with_alarm` that tries GNU `timeout` first and falls
back to a Perl `alarm` implementation), so that running "$bin" with a 180s bound
logs to "$log" and behaves correctly on macOS and Linux; update the `if`
condition to call that wrapper instead of `timeout 180 "$bin"` and ensure the
wrapper preserves exit codes and redirects stdout/stderr the same way.

Comment on lines +64 to +69
# Self-loopback test: an in-process HTTP server fiber + a BLOCKING
# curl-backed client on the same runtime. The client call occupies its
# scheduler OS THREAD (not just the fiber), so with one scheduler the
# server can never run -> deadlock/timeout (found by the Phase-2.2
# scheduler matrix). Needs >=2 schedulers until blocking builtins are
# moved off the scheduler thread (see KNOWN_ISSUES).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align threshold wording with the actual skip condition.

Line 68 says Needs >=2 schedulers, but Lines 71-73 skip both "1" and "2" and print needs >=3 schedulers. Please make the comment consistent (same copied block also appears in tests/sw/test_http_post_stream.sw and tests/sw/test_http_request.sw).

Proposed wording fix
-    # scheduler matrix). Needs >=2 schedulers until blocking builtins are
+    # scheduler matrix). Needs >=3 schedulers until blocking builtins are
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Self-loopback test: an in-process HTTP server fiber + a BLOCKING
# curl-backed client on the same runtime. The client call occupies its
# scheduler OS THREAD (not just the fiber), so with one scheduler the
# server can never run -> deadlock/timeout (found by the Phase-2.2
# scheduler matrix). Needs >=2 schedulers until blocking builtins are
# moved off the scheduler thread (see KNOWN_ISSUES).
# Self-loopback test: an in-process HTTP server fiber + a BLOCKING
# curl-backed client on the same runtime. The client call occupies its
# scheduler OS THREAD (not just the fiber), so with one scheduler the
# server can never run -> deadlock/timeout (found by the Phase-2.2
# scheduler matrix). Needs >=3 schedulers until blocking builtins are
# moved off the scheduler thread (see KNOWN_ISSUES).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/sw/test_http_headers.sw` around lines 64 - 69, The comment in the
self-loopback test block is inconsistent with the skip condition: the code skips
schedulers "1" and "2" but the comment says "Needs >=2 schedulers"; change the
comment text to match the actual condition (e.g., "Needs >=3 schedulers") in
tests/sw/test_http_headers.sw and make the identical edits in the copied blocks
in tests/sw/test_http_post_stream.sw and tests/sw/test_http_request.sw so the
printed message and the skip logic are consistent.

claude added 3 commits June 10, 2026 00:52
…45ms -> ~40ms (6ms small-arena)

The 'startup time' mystery (Round-7 O5: hello measured 110-145ms wall
vs a 3ms C baseline) was mostly SHUTDOWN: the deadlock watchdog slept
its 5s interval in 100ms nanosleep chunks polling the stop flag, and
sw_shutdown joins it — so EVERY binary paid avg ~50ms (worst 100ms) at
exit. strace showed the ~99ms gap on the watchdog thread-stack munmap.

Now the watchdog waits on a condvar with the interval as deadline;
shutdown signals it under the lock and the join returns immediately.
Also removes the 10/s idle wakeups the chunked sleep cost.

Measured (Linux x86_64, hello binary): 110-145ms -> 38-42ms wall at the
default 100K-slot arena (the remaining ~35ms is the documented arena
init), and ~5-6ms with SW_MAX_PROCS=1024 — the README's <10ms claim is
now reachable with the documented knob.

Bundled in this commit (same file, same verification battery):

scheduler idle-path hardening + an honest P1 finding:
- publish-idle-then-repoll guard before parking (standard sleeper
  protocol): a producer enqueueing between the consumer's last pick and
  the idle=1 store skipped the signal and the work ate the full 0.5ms
  park timeout.
- O4 spin-before-park lands OPT-IN, DEFAULT OFF (SW_SPIN_US, 0
  disables): it measures 58.4 -> 4.5us per cross-scheduler ping-pong
  round trip (13x), but GATES A LATENT DEADLOCK — depth-1 ping-pong
  wedges ~15% of runs with spin on (0/60 off, 9/60 on): both fibers
  parked WAITING, all runqs empty, zero enqueues forever. The repoll
  guard does not close it, so the loss is upstream of the park (runq
  push vs spinning pick, or the receive waiting-flag handoff). Filed as
  P1 in KNOWN_ISSUES with the reproducer; root-cause is the Phase
  2.3/2.4 TSan + atomics work, then the default flips on.
- SW_SCHED_TRACE=1: 1Hz scheduler counter telemetry on stderr (enq /
  sig / idle_seen / park / park_sig / park_to / spin_hit / guard_hit) —
  the instrument that proved the wedge was a total enqueue stall
  (enq=0, park_to~6600/s) rather than slow progress.

Verified: suite + conformance, storms (both scheduler modes), gc-slope,
gc-stress, lsan-gate, phases 2-10, spin_wedge_hunt 25/25 with defaults,
startup numbers above.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
tests/stress/spin_wedge_hunt.sh: regression gate for the spin-gated
scheduler deadlock (KNOWN_ISSUES P1) — N runs of the depth-1 ping-pong
probe under SW_SPIN_US, first wedge exits 1. Wire into the stress gate
with spin on once the race is fixed; passes 25/25 with today's
defaults (spin off).

scripts/gc_slope.sh: every probe run is bounded by timeout 600 — the
wedge ground the message probe from ~100ms to 25+ minutes and the gate
just sat on it; now a hang lands in the existing FAIL:exit path.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…up numbers measured

KNOWN_ISSUES: new P1 entry for the spin-gated scheduler deadlock with
the full evidence trail (incidence, telemetry signature, reproducer,
fix path). Roadmap 2.2 amended: spin ships opt-in/default-off until
2.3/2.4 close the race. README: SW_MAX_PROCS row now carries measured
boot numbers (~6ms small-arena / ~40ms default on Linux) after the
watchdog shutdown fix.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/gc_slope.sh`:
- Around line 40-41: The script currently sets TO="" when the `timeout` command
is absent, which silently removes the runtime bound; instead, update the `TO`
logic so the script fails fast or falls back to GNU coreutils `gtimeout`: detect
`timeout` first, then check for `gtimeout` and set TO accordingly (e.g.,
`TO="timeout 600"` or `TO="gtimeout 600"`), and if neither exists, exit with a
clear error rather than leaving TO empty; change the code that assigns `TO` (the
lines referencing the `TO` variable and the `command -v timeout` check) to
implement this fallback-or-fail behavior.

In `@tests/stress/spin_wedge_hunt.sh`:
- Around line 36-38: The current check treats any non-zero exit from the probe
as a timeout wedge; modify the block that runs SW_QUIET=1 timeout "$TIMEOUT_S"
"$BIN" 4000 to capture the actual exit status (e.g., save $? into a variable)
and distinguish timeout exit code 124 from other failures, then log different
messages: for 124 print the existing "WEDGE … exceeded ${TIMEOUT_S}s" message,
for other non-zero statuses print a failure/crash message including the numeric
exit code and context (e.g., that the binary crashed or exited abnormally).
Update the messages in the spin_wedge_hunt reporting lines so they reference the
captured status instead of always saying "exceeded timeout".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0af5a114-c45c-461c-8bcc-8d983a158d67

📥 Commits

Reviewing files that changed from the base of the PR and between df57d3c and 78931d0.

📒 Files selected for processing (6)
  • README.md
  • docs/PRODUCTION_ROADMAP.md
  • docs/notes/KNOWN_ISSUES.md
  • scripts/gc_slope.sh
  • src/swarmrt_native.c
  • tests/stress/spin_wedge_hunt.sh
✅ Files skipped from review due to trivial changes (2)
  • docs/PRODUCTION_ROADMAP.md
  • README.md

Comment thread scripts/gc_slope.sh
Comment on lines +40 to +41
local TO="timeout 600"
command -v timeout >/dev/null 2>&1 || TO=""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast (or fallback to gtimeout) instead of silently disabling bounds.

At Line 41, missing timeout sets TO="", which reintroduces unbounded hangs in the GC slope gate. That breaks the bounded-run contract this script is enforcing.

Proposed fix
-  local TO="timeout 600"
-  command -v timeout >/dev/null 2>&1 || TO=""
+  local -a TO
+  if command -v timeout >/dev/null 2>&1; then
+    TO=(timeout 600)
+  elif command -v gtimeout >/dev/null 2>&1; then
+    TO=(gtimeout 600)
+  else
+    echo "FAIL:no-timeout-tool"
+    return
+  fi
@@
-    gnu)   out=$(SW_RUNTIME_QUIET=1 $TO /usr/bin/time -v "$bin" "$count" 2>&1); rc=$?
+    gnu)   out=$(SW_RUNTIME_QUIET=1 "${TO[@]}" /usr/bin/time -v "$bin" "$count" 2>&1); rc=$?
@@
-    bsd)   out=$(SW_RUNTIME_QUIET=1 $TO /usr/bin/time -l "$bin" "$count" 2>&1); rc=$?
+    bsd)   out=$(SW_RUNTIME_QUIET=1 "${TO[@]}" /usr/bin/time -l "$bin" "$count" 2>&1); rc=$?
@@
-    gtime) out=$(SW_RUNTIME_QUIET=1 $TO gtime -v "$bin" "$count" 2>&1); rc=$?
+    gtime) out=$(SW_RUNTIME_QUIET=1 "${TO[@]}" gtime -v "$bin" "$count" 2>&1); rc=$?
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local TO="timeout 600"
command -v timeout >/dev/null 2>&1 || TO=""
local -a TO
if command -v timeout >/dev/null 2>&1; then
TO=(timeout 600)
elif command -v gtimeout >/dev/null 2>&1; then
TO=(gtimeout 600)
else
echo "FAIL:no-timeout-tool"
return
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/gc_slope.sh` around lines 40 - 41, The script currently sets TO=""
when the `timeout` command is absent, which silently removes the runtime bound;
instead, update the `TO` logic so the script fails fast or falls back to GNU
coreutils `gtimeout`: detect `timeout` first, then check for `gtimeout` and set
TO accordingly (e.g., `TO="timeout 600"` or `TO="gtimeout 600"`), and if neither
exists, exit with a clear error rather than leaving TO empty; change the code
that assigns `TO` (the lines referencing the `TO` variable and the `command -v
timeout` check) to implement this fallback-or-fail behavior.

Comment on lines +36 to +38
if ! SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1; then
echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)"
echo "spin_wedge_hunt: FAIL"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Differentiate timeout wedges from other probe failures.

At Line 36, all non-zero exits are reported as “WEDGE … exceeded timeout”. This conflates timeout (124) with ordinary failures (e.g., crash/exec error), which makes stress diagnostics misleading.

Proposed fix
-for i in $(seq 1 "$RUNS"); do
-    if ! SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1; then
-        echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)"
-        echo "spin_wedge_hunt: FAIL"
-        exit 1
-    fi
+for i in $(seq 1 "$RUNS"); do
+    rc=0
+    SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1 || rc=$?
+    if [ "$rc" -eq 124 ]; then
+        echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)"
+        echo "spin_wedge_hunt: FAIL"
+        exit 1
+    elif [ "$rc" -ne 0 ]; then
+        echo "probe failed on run $i (exit=$rc)"
+        echo "spin_wedge_hunt: FAIL"
+        exit 2
+    fi
 done
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if ! SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1; then
echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)"
echo "spin_wedge_hunt: FAIL"
for i in $(seq 1 "$RUNS"); do
rc=0
SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1 || rc=$?
if [ "$rc" -eq 124 ]; then
echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)"
echo "spin_wedge_hunt: FAIL"
exit 1
elif [ "$rc" -ne 0 ]; then
echo "probe failed on run $i (exit=$rc)"
echo "spin_wedge_hunt: FAIL"
exit 2
fi
done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/stress/spin_wedge_hunt.sh` around lines 36 - 38, The current check
treats any non-zero exit from the probe as a timeout wedge; modify the block
that runs SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 to capture the actual exit
status (e.g., save $? into a variable) and distinguish timeout exit code 124
from other failures, then log different messages: for 124 print the existing
"WEDGE … exceeded ${TIMEOUT_S}s" message, for other non-zero statuses print a
failure/crash message including the numeric exit code and context (e.g., that
the binary crashed or exited abnormally). Update the messages in the
spin_wedge_hunt reporting lines so they reference the captured status instead of
always saying "exceeded timeout".

claude added 6 commits June 10, 2026 01:05
…ecome _Atomic; trace init ordered

ThreadSanitizer on the message ping-pong + spawn storm flagged exactly
three races (the feared fiber false-positive storm did not materialize —
cross-thread fiber migration synchronizes through the runq's C11
atomics, so TSan sees the happens-before edges):

- sched->active (main's init readiness spin vs the scheduler thread's
  store), sched->should_exit and g_swarm->running (sw_shutdown's store
  vs every scheduler loop's read) were 'volatile int' — volatile is not
  synchronization in C11. Now _Atomic int: sw_scheduler/sw_swarm are
  not asm-offset-pinned (only sw_process is), and these are cold flags.
- sched_trace_maybe_start ran AFTER scheduler threads were created, so
  the g_sched_trace flag write raced their reads on every enqueue — now
  initialized before pthread_create (ordered by thread creation).

Verified: full battery green (suite + conformance, stress both
variants, gc-slope, gc-stress, lsan-gate, phases 2-10) and
'make tsan-gate' reports no unsuppressed races on msg, msg+spin, and
the full 80k storm.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…under TSan)

Builds the depth-1 message probe and the spawn storm with
-fsanitize=thread and fails on any unsuppressed race. Suppressions
(tests/stress/tsan.supp) cover ONLY the documented warn-only lock-free
watchdog scanner (watchdog_thread_fn + sw_io_active_port_count) — by
design it reads slab fields unsynchronized and tolerates stale values;
making sw_process.state _Atomic for a scanner-clean runtime is the
Phase-2.4 decision (state is written on every context switch).

Note for the P1 spin-gated deadlock: 40/40 wedge-free under TSan with
spin on — the interleave is timing-suppressed and shows NO C11 race,
confirming it is a protocol bug (waiting/idle flag handshake), not a
data race. tests/stress/spin_wedge_hunt.sh remains its gate.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…esults + worklist

run_stress.sh prints vm.max_map_count / nproc / ulimits on Linux — the
gate's verdict used to depend SILENTLY on a host sysctl (Round-7 O8).

Phase 2.4 audit (TSan over msg, storm, and phase 2/4/5 binaries):
mailbox/runq clean; control flags fixed in 2.3; remaining worklist
recorded in the roadmap — proc->state relaxed-atomicization (~50
mechanical sites, zero-cost), the timer-list unlocked reader, the
registry name-lifetime question, and two racy test-code counters.
sw_stats joins the watchdog in tsan.supp as a documented warn-only
printer.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…eive waiting-flag handshake; seq_cst; spin default ON

The ~15%-incidence total deadlock under depth-1 cross-scheduler
ping-pong (both fibers parked WAITING, zero enqueues forever) was ONE
LINE PAIR in sw_receive_any's fast path:

    atomic_store_explicit(&waiting, 1, memory_order_release);
    sig  = atomic_load_explicit(&sig_head, memory_order_acquire);

Dekker's pattern: store one variable, load a DIFFERENT one, with the
sender doing the mirror image (push sig_head, then exchange waiting).
Release->acquire across different objects orders NOTHING — the
receiver's waiting=1 sat in its store buffer past the sig_head load
(x86 StoreLoad reordering), so the receiver saw a pre-push sig_head and
parked, while the sender's wake-exchange read waiting==0 from memory
and skipped the enqueue. Both sides lost; the message sat in the
mailbox forever. The three sibling receive paths survived on x86 only
because mailbox_drain's locked xchg happens to be a full fence — which
arm64 (the macOS daily driver) does NOT promise, so they were latent
there too. TSan was silent throughout: every access was atomic; the
bug was pure ordering.

Fix: every waiting-flag participant is seq_cst — all four
about-to-sleep stores, the sig_head probe load, and mailbox_wake's
exchange. Total-order correctness on every architecture; the cost is an
xchg/fence on the COLD about-to-sleep and wake paths only.

Diagnosis tooling that found it (stays in the runtime, off by default):
- SW_SCHED_TRACE=1: 1Hz counters + a stall detector that fires a wedge
  AUTOPSY — every live process's state/waiting/mailbox pointers + runq
  internals — after 3 flatlined seconds. The corpse read: parked,
  waiting=1, sig_head!=NULL. That pair is impossible under a correct
  protocol and pointed straight at the handshake.
- SW_SCHED_TRACE=2: + a lock-free protocol event ring (push / wake /
  wset / drain / park / enq / pick) dumped with the autopsy.

Verification: 0/300 wedge-hunt runs at the fix commit (~45 expected at
the pre-fix incidence), then the full battery with spin DEFAULT ON
(SW_SPIN_US=30): tsan-gate, suite + conformance, stress both variants,
gc-slope, gc-stress, lsan-gate, phases 2-10, scheduler matrix
S=1/2/4/8, spin_wedge_hunt 40/40. Cross-scheduler ping-pong:
58.4us/rt at the audit baseline -> 3.0us/rt now (19x).

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
src/swarmrt_native.c (4)

924-1004: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tie the scheduler-trace thread to runtime shutdown.

pthread_detach() lets sched_trace_fn() outlive g_swarm, but it keeps dereferencing g_swarm, the slab, and the scheduler array on every tick and during autopsy dumps. With SW_SCHED_TRACE enabled, sw_shutdown() can therefore free those objects while the trace thread is still using them.

Also applies to: 1667-1770

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_native.c` around lines 924 - 1004, The sched-trace thread is
detached so it can outlive and dereference freed runtime state; fix by making it
joinable and joined at shutdown: in sched_trace_maybe_start(), replace the
pthread_detach() call with storing the pthread_t returned by pthread_create into
a global/thread-static (e.g. add pthread_t g_sched_trace_thread and a bool
g_sched_trace_started) and set that flag when creation succeeds; in
sw_shutdown() (or the runtime teardown function) check g_sched_trace_started and
call pthread_join(g_sched_trace_thread, NULL) before freeing g_swarm; keep
sched_trace_fn using g_swarm->running as its exit condition so the join will
complete cleanly.

1453-1474: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Derive the overflow window from the real guard-page size.

The stack allocator uses the runtime page size, but this detector hardcodes 8192. On systems with larger pages, many real guard-page faults will miss this branch and lose the new “stack overflow” diagnostic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_native.c` around lines 1453 - 1474, The overflow detector
hardcodes 8192 bytes for "guard page + one page slack" which breaks on systems
with different page sizes; replace that constant with the runtime page size
(e.g. call sysconf(_SC_PAGESIZE) or getpagesize()) and compute guard_bot as
guard_top - (page_size * 2) (or whatever count of pages the allocator actually
uses for guard+slack). Update the SIGSEGV handler block that reads
cur->stack_mem and computes guard_top/guard_bot to use the computed page_size so
faults in the real guard page range are detected correctly.

107-118: ⚠️ Potential issue | 🟠 Major

Make g_watchdog_stop race-free (atomic or fully mutex-guarded)

watchdog_thread_fn() reads g_watchdog_stop outside g_watchdog_lock (while (!g_watchdog_stop) at ~151 and if (g_watchdog_stop) break at ~165), while sw_shutdown() writes it under the lock (~1678-1681). With a non-atomic volatile int, this is still a data race/UB; use atomic_int/C11 atomics or ensure every access is protected by g_watchdog_lock (including the outer stop checks).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_native.c` around lines 107 - 118, The watcher-stop variable
g_watchdog_stop is accessed without synchronization in watchdog_thread_fn (reads
in the while/if) while sw_shutdown writes it under g_watchdog_lock, causing a
data race; fix by making g_watchdog_stop an atomic_int (C11) and replace its
reads/writes in watchdog_thread_fn and sw_shutdown with atomic loads/stores
using appropriate memory orders (e.g., atomic_load_explicit(...,
memory_order_acquire) and atomic_store_explicit(..., memory_order_release)), or
alternatively ensure every access (including the outer while(!g_watchdog_stop)
check and the in-loop if) is performed while holding g_watchdog_lock and use
pthread_cond_signal in sw_shutdown — reference g_watchdog_stop,
watchdog_thread_fn, sw_shutdown, and g_watchdog_lock when making the change.

1255-1265: ⚠️ Potential issue | 🟡 Minor

Fix data race on spin_iters in scheduler_loop()
scheduler_loop() contains static int spin_iters = -1;, and it’s executed by every scheduler thread (pthread_create(..., scheduler_main, ...) -> scheduler_loop). The first-time spin_iters initialization (if (spin_iters < 0) { ... }) is unsynchronized, so concurrent threads can race on that write (UB / TSAN noise). Guard the initialization (e.g., pthread_once) or move it to a single pre-thread init step (or use an atomic once pattern).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/swarmrt_native.c` around lines 1255 - 1265, scheduler_loop currently uses
a function-local static spin_iters initialized lazily (static int spin_iters =
-1) which races when multiple scheduler threads run scheduler_loop via
scheduler_main; fix by performing one-time initialization of spin_iters before
threads start or by guarding the lazy init with a pthread_once (or C11 atomic
once) helper so only one thread computes and stores the value derived from
getenv("SW_SPIN_US") and sets spin_iters; update references to spin_iters in
scheduler_loop to rely on the safely-initialized value and remove the
unsynchronized if (spin_iters < 0) block or replace it with a
pthread_once-invoked initializer function.
♻️ Duplicate comments (1)
docs/notes/KNOWN_ISSUES.md (1)

58-64: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unresolved framing inconsistency from prior review.

This entry still contains crash language ("raw SIGSEGV," Line 62) that conflicts with the "Open" section's statement that issues are "genuine limitations, not crashes" (Line 8-9). A previous review flagged this inconsistency and it remains unaddressed.

📝 Suggested alignment

Either:

  1. Move this entry to a separate "Crash / Data-loss" subsection within "Open"
  2. Reword to emphasize the limitation (no mutual TCO) and note the crash as an implementation detail: "...consume a C stack frame per hop and overflow the 128KB fiber stack at depth ~10⁴–10⁵. The overflow currently manifests as undefined behavior (see REVIEW_FABLE_2026-06.md, O2, regarding sigaltstack handling)."

This preserves the technical detail while aligning severity signals with the section's framing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/notes/KNOWN_ISSUES.md` around lines 58 - 64, The "Mutual tail recursion
is not TCO'd; deep chains overflow silently" entry contains crash wording ("raw
SIGSEGV") that conflicts with the "Open" section's framing; either move this
entry into a new "Crash / Data-loss" subsection under "Open" or reword the
paragraph to present it as a limitation and implementation detail (e.g., state
that mutual tail-calls consume a C stack frame per hop and can overflow the
128KB fiber stack at depth ~10⁴–10⁵, and note that the overflow currently
manifests as undefined behavior documented in REVIEW_FABLE_2026-06.md with
sigaltstack details) so the severity language matches the rest of the document.
🧹 Nitpick comments (1)
Makefile (1)

510-539: ⚡ Quick win

Consider adding PROBE_OK validation for robustness.

The tsan-gate target checks for TSan warnings and captures exit codes, but unlike lsan-gate (Line 415), it doesn't verify that the test programs actually completed their work by checking for a PROBE_OK marker in the output. A program could crash early or produce incomplete results without emitting TSan warnings, and the current grep-only check would miss this.

🛡️ Suggested enhancement

Add PROBE_OK checks similar to lsan-gate:

  `@rc`=0; \
   out1=$$(SW_QUIET=1 TSAN_OPTIONS="suppressions=tests/stress/tsan.supp" \
           timeout 300 $(BIN_DIR)/tsan_msg 4000 2>&1) || rc=1; \
+  echo "$$out1" | grep -q "PROBE_OK" || { echo "tsan-gate: msg test incomplete"; rc=1; }; \
   echo "$$out1" | grep -q "WARNING: ThreadSanitizer" && { echo "$$out1" | head -40; rc=1; }; \
   out2=$$(SW_QUIET=1 SW_SPIN_US=30 TSAN_OPTIONS="suppressions=tests/stress/tsan.supp" \
           timeout 300 $(BIN_DIR)/tsan_msg 4000 2>&1) || rc=1; \
+  echo "$$out2" | grep -q "PROBE_OK" || { echo "tsan-gate: msg+spin test incomplete"; rc=1; }; \
   echo "$$out2" | grep -q "WARNING: ThreadSanitizer" && { echo "$$out2" | head -40; rc=1; }; \
   out3=$$(SW_QUIET=1 TSAN_OPTIONS="suppressions=tests/stress/tsan.supp" \
           timeout 300 $(BIN_DIR)/tsan_storm 2>&1) || rc=1; \
+  echo "$$out3" | grep -q "PROBE_OK" || { echo "tsan-gate: storm test incomplete"; rc=1; }; \
   echo "$$out3" | grep -q "WARNING: ThreadSanitizer" && { echo "$$out3" | head -40; rc=1; }; \

This assumes the test programs print PROBE_OK on successful completion, matching the pattern used in other sanitizer gates.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 510 - 539, The tsan-gate target currently only greps
for "WARNING: ThreadSanitizer" and exit codes but does not verify successful
completion; update the tsan-gate recipe (target tsan-gate, the invocations
producing $(BIN_DIR)/tsan_msg and $(BIN_DIR)/tsan_storm and the out1/out2/out3
checks) to also assert each program emitted the PROBE_OK marker in their
captured output (out1/out2/out3) similar to lsan-gate: after running each test
capture its stdout/stderr into outN, check for PROBE_OK (e.g., grep -q
"PROBE_OK") and set rc=1 if missing, in addition to the existing ThreadSanitizer
checks, so failures that produce no TSan warnings but didn't complete are
detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/swarmrt_native.c`:
- Around line 924-1004: The sched-trace thread is detached so it can outlive and
dereference freed runtime state; fix by making it joinable and joined at
shutdown: in sched_trace_maybe_start(), replace the pthread_detach() call with
storing the pthread_t returned by pthread_create into a global/thread-static
(e.g. add pthread_t g_sched_trace_thread and a bool g_sched_trace_started) and
set that flag when creation succeeds; in sw_shutdown() (or the runtime teardown
function) check g_sched_trace_started and call
pthread_join(g_sched_trace_thread, NULL) before freeing g_swarm; keep
sched_trace_fn using g_swarm->running as its exit condition so the join will
complete cleanly.
- Around line 1453-1474: The overflow detector hardcodes 8192 bytes for "guard
page + one page slack" which breaks on systems with different page sizes;
replace that constant with the runtime page size (e.g. call
sysconf(_SC_PAGESIZE) or getpagesize()) and compute guard_bot as guard_top -
(page_size * 2) (or whatever count of pages the allocator actually uses for
guard+slack). Update the SIGSEGV handler block that reads cur->stack_mem and
computes guard_top/guard_bot to use the computed page_size so faults in the real
guard page range are detected correctly.
- Around line 107-118: The watcher-stop variable g_watchdog_stop is accessed
without synchronization in watchdog_thread_fn (reads in the while/if) while
sw_shutdown writes it under g_watchdog_lock, causing a data race; fix by making
g_watchdog_stop an atomic_int (C11) and replace its reads/writes in
watchdog_thread_fn and sw_shutdown with atomic loads/stores using appropriate
memory orders (e.g., atomic_load_explicit(..., memory_order_acquire) and
atomic_store_explicit(..., memory_order_release)), or alternatively ensure every
access (including the outer while(!g_watchdog_stop) check and the in-loop if) is
performed while holding g_watchdog_lock and use pthread_cond_signal in
sw_shutdown — reference g_watchdog_stop, watchdog_thread_fn, sw_shutdown, and
g_watchdog_lock when making the change.
- Around line 1255-1265: scheduler_loop currently uses a function-local static
spin_iters initialized lazily (static int spin_iters = -1) which races when
multiple scheduler threads run scheduler_loop via scheduler_main; fix by
performing one-time initialization of spin_iters before threads start or by
guarding the lazy init with a pthread_once (or C11 atomic once) helper so only
one thread computes and stores the value derived from getenv("SW_SPIN_US") and
sets spin_iters; update references to spin_iters in scheduler_loop to rely on
the safely-initialized value and remove the unsynchronized if (spin_iters < 0)
block or replace it with a pthread_once-invoked initializer function.

---

Duplicate comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 58-64: The "Mutual tail recursion is not TCO'd; deep chains
overflow silently" entry contains crash wording ("raw SIGSEGV") that conflicts
with the "Open" section's framing; either move this entry into a new "Crash /
Data-loss" subsection under "Open" or reword the paragraph to present it as a
limitation and implementation detail (e.g., state that mutual tail-calls consume
a C stack frame per hop and can overflow the 128KB fiber stack at depth
~10⁴–10⁵, and note that the overflow currently manifests as undefined behavior
documented in REVIEW_FABLE_2026-06.md with sigaltstack details) so the severity
language matches the rest of the document.

---

Nitpick comments:
In `@Makefile`:
- Around line 510-539: The tsan-gate target currently only greps for "WARNING:
ThreadSanitizer" and exit codes but does not verify successful completion;
update the tsan-gate recipe (target tsan-gate, the invocations producing
$(BIN_DIR)/tsan_msg and $(BIN_DIR)/tsan_storm and the out1/out2/out3 checks) to
also assert each program emitted the PROBE_OK marker in their captured output
(out1/out2/out3) similar to lsan-gate: after running each test capture its
stdout/stderr into outN, check for PROBE_OK (e.g., grep -q "PROBE_OK") and set
rc=1 if missing, in addition to the existing ThreadSanitizer checks, so failures
that produce no TSan warnings but didn't complete are detected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3c322608-9466-49f7-8581-23c226c66a0f

📥 Commits

Reviewing files that changed from the base of the PR and between 78931d0 and 4d2a692.

📒 Files selected for processing (7)
  • Makefile
  • docs/PRODUCTION_ROADMAP.md
  • docs/notes/KNOWN_ISSUES.md
  • src/swarmrt_native.c
  • src/swarmrt_native.h
  • tests/stress/run_stress.sh
  • tests/stress/tsan.supp
✅ Files skipped from review due to trivial changes (2)
  • tests/stress/tsan.supp
  • docs/PRODUCTION_ROADMAP.md

claude added 6 commits June 10, 2026 13:41
…n + nesting/node guards (Phase 3 fuzz)

make fuzz drove a single mutated input to ~16GB and OOM-killed swc. Three
parser DoS holes, all reachable from a malformed source file:

1. The OUTER receive-clause loop did not gate on !p->err (par_case's
   clause loop already did). A stray token in pattern position made a
   clause fail to parse; the cursor then stopped advancing
   (par_primary/par_expect short-circuit on err) while sitting on a token
   that is not RBRACE/AFTER/EOF, so the loop spun forever appending empty
   clauses — unbounded mknode + realloc. Repro:
   tests/fuzz/corpus/parse/receive_stray_token.sw (a stray 'D' before a
   receive pattern). Added .

2. Unbounded expression-nesting recursion: ((((... blew the C stack
   (ASAN stack-overflow in par_primary). par_primary is split into a
   guarded wrapper + par_primary_inner; SW_PARSE_MAX_DEPTH (128) plus the
   real stack-headroom probe (interp_stack_near_limit, parser frames are
   huge — tok_t carries an 8KB text buffer copied per frame) bound it.
   f-string interpolations inherit the parent's depth (a fresh sub-parser
   was depth-0 and recursed unbounded through nested f-strings).

3. Per-parse AST-node budget (SW_PARSE_MAX_NODES, 4M ≈ a 100k+-line
   program): every mknode + node_dup is counted, par_init resets, and
   par_primary aborts with a clean 'program too large' BEFORE its err
   short-circuit — the O(1) backstop for any future non-advancing or
   unbounded loop, regardless of where it lives.

Plus the fuzz harness leaked the AST each iteration (detect_leaks=0 hid
it): sw_lang_free_ast() is now public (node_free was file-static) and the
harness reclaims per parse.

Verified: make fuzz (parse+marshal+http, 20k mutations each) clean; the
catch-driver runs 200k mutations of the corpus with a 3s/parse alarm and
reports no slow parse; full battery green (suite+conformance, stress,
gc-slope, gc-stress, lsan, tsan, phases 2-10, scheduler matrix S=1/8);
all 28 doc programs still compile.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…at-death; TSan gate covers phase 2/5

TSan-driven shared-state hardening. All changes verified perf-neutral and
TSan-clean (make tsan-gate now covers msg / msg+spin / 80k storm / phase 2
/ phase 5; full battery green).

Runtime:
- sw_process.state is now _Atomic sw_proc_state_t. The 30 hot writes are
  explicit atomic_store relaxed (free codegen — plain mov on x86/arm64);
  reads stay implicit (free on x86, seq_cst loads are plain mov). Same
  size/alignment, so the asm-pinned ctx offset is unchanged (the
  _Static_asserts pass). Closes the scheduler-swap-in write vs
  sw_monitor / sw_send_after cross-thread reads that TSan flagged.
  Measured perf unchanged: spawn 4.2 us/cycle, ping-pong 2.6 us/rt.
- Timer list: fire_timers' unlocked head-peek (`if (!tl->head) return`)
  now reads under tl->lock — it raced the lock-protected insert in
  sw_send_after. Uncontended in the common no-timer case, so free in
  practice.
- Monitor-at-death: sw_monitor's already-dead fast path reads
  exit_reason/panic_msg AND registers the monitor all under link_lock,
  and process_exit finalizes exit_reason under the same lock. Closes the
  data race (monitoring a process at its exact death instant) and makes
  DOWN delivery exactly-once (register-before-exit-locks => delivered by
  process_exit; already-EXITING-when-we-lock => we deliver, don't
  register).

Test harness (fix, don't suppress):
- static volatile int flags in test_phase2/4/5 -> _Atomic int (volatile
  is not synchronization — the same class of bug as the runtime fix);
  __sync_fetch_and_add -> C11 atomic_fetch_add.

tsan.supp: documents the two by-design warn-only readers (watchdog
scanner, sw_stats debug printer).

Remaining 2.4 worklist (documented in PRODUCTION_ROADMAP.md): the
registry by-name read path (sw_register) and the dynamic-supervisor
child list (dynsup_entry / sup_start_child) still race — they surfaced
once the test-timing shift perturbed the interleaving. phase 4 joins the
tsan-gate once those close.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…closed; runtime TSan-clean (phase 2/4/5)

The supervisor crash-restart races that surfaced once the test-timing
shift perturbed the interleaving are fixed; the runtime is now TSan-clean
on GenServer/Supervisor/Agent/DynSup/StateMachine/ProcessGroup plus
message-passing and the spawn storm. make tsan-gate now gates phase
2/4/5 alongside msg/msg+spin/storm.

Fixes (all verified perf-neutral — ping-pong 2.8us, spawn 5.6us):
- exit_reason -> _Atomic: user/runtime code writes sw_self()->exit_reason
  = N directly (the documented "set reason then return" idiom, exercised
  by b3_crasher) while sw_monitor and process_exit read it.
- reg_entry -> _Atomic: a child's process_exit reads it while sw_register
  (from sup_start_child on another scheduler) writes it under the
  registry rwlock.
- kill_flag -> _Atomic (was volatile): scheduler-loop read vs the
  cross-thread exit-signal write. (x86 seq_cst loads are plain mov, and
  every write is a cold path, so free.)
- registry_remove_proc hashes entry->name UNDER the registry wrlock — it
  was computed before the lock, racing a concurrent reuse that rewrote
  the entry's name storage.
- USE-AFTER-FREE closed: process_destroy frees panic_msg under link_lock
  and NULLs it there; sw_monitor's already-dead path strdup-copies it
  under the same lock before delivering. A late monitor on a process
  mid-teardown previously read a panic_msg the teardown was freeing.

All _Atomic fields keep their size/alignment, so the asm-pinned ctx
offset is unchanged (the _Static_asserts pass).

Phase 2 status: 2.1 (LSan) / 2.2 (sched matrix) / 2.3 (TSan) / 2.4
(atomics) all DONE. Remaining: 2.5 alloc-failure injection, 2.6 soak.

Verified: full battery green — test-sw + conformance, stress both
variants, gc-slope, gc-stress, lsan-gate, tsan-gate (incl. phase 2/4/5),
phases 2-10 functional, all 28 doc programs compile.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…ault); 120 fail-points, 0 memory errors

Closes the Phase 2.5 exit criterion: an injected malloc/region failure on
any path either returns an error or terminates only the affected process,
and a partial ownership transfer (spawn region adopted but child died)
never leaks or double-frees.

- sw_alloc_fault_tick() (swarmrt_native.c, #ifdef SW_ALLOC_FAULT only):
  atomic countdown armed by SW_FAIL_ALLOC_AT=N, fires once. Wired into
  val_alloc's global-heap fallback and every varena chunk_new (the value
  + region choke points where the GC ownership-transfer logic lives).
  Zero cost in shipping builds — entirely behind the ifdef.
- tests/gc/alloc_fault.sw: spawn fan-out + message round-trips +
  supervised child + ETS churn — the ownership-transfer sites.
- make alloc-fault (Linux/ASAN): sweeps N=1..120, each run must end clean
  or with a loud OOM/spawn panic; FAILS only on an ASAN report
  (use-after-free / double-free / heap overflow).

Result: 120 fail-points, 100 clean degradations (region alloc fails ->
spawn falls back to a global-heap arg copy or errors cleanly), 20
loud-OOM aborts on the value path, 0 memory errors. The documented
spawn-region-adopted-but-child-died hazard does not corrupt memory under
injected failure.

CI: tsan-gate (was local-only) + alloc-fault added as blocking legs to
linux-quickstart.yml alongside the advisory lsan-gate.

Phase 2 now 2.1-2.5 done; only 2.6 (soak) remains.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…e 2 COMPLETE

The last Phase 2 item. tests/soak/soak.sw runs a production-shaped mix —
actor fan-out with ~8KB message round-trips + supervisor crash/restart +
ETS put/replace/delete + one-shot & interval timers + a long tail loop —
in a tight loop for SOAK_SECONDS, and tests/soak/run_soak.sh samples RSS
and asserts clean exit (PROBE_OK, no watchdog/crash on stderr) + peak RSS
under SOAK_RSS_BUDGET_MB. Proves the WHOLE mix stays flat under sustained
churn, not just the per-owner slope probes.

Local results (this host):
  20s  -> 2364 rounds,  peak RSS 53 MB
  180s -> 19978 rounds, peak RSS 58 MB
Arena base is ~48 MB, so ~5-10 MB working set over the entire mix, flat
across a 9x duration / 8x round increase — bounded.

make soak runs the 60s CI smoke (advisory/blocking leg added to
linux-quickstart.yml). The full 24h soak is the SAME binary on a
dedicated host:
  SOAK_SECONDS=86400 SOAK_RSS_BUDGET_MB=512 ./tests/soak/run_soak.sh
That sustained run is the one sign-off that needs infra rather than code;
the harness is ready and documented in PRODUCTION_ROADMAP.md 2.6.

Phase 2 (runtime hardening) is now complete: 2.1 LSan, 2.2 scheduler
matrix, 2.3 TSan, 2.4 atomics/race-free, 2.5 alloc-failure, 2.6 soak.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…a map-free leak

make fuzz-json (the agent-facing input boundary: every LLM tool-call
response / parsed HTTP body flows through sw_lang_json_decode) found two
real heap-buffer-overflows on its first run, both OOB reads on malformed
JSON:

1. _jd_parse number path: when a value was expected but the char was
   unparseable, it force-advanced (*pp)++ to guarantee progress WITHOUT
   checking for the NUL terminator — at end-of-input (e.g. `{"a":` then
   EOF) it stepped past `\0`, and the next _jd_skip_ws read past the
   buffer. Now it never steps past NUL; the enclosing array/object loops
   gate on **pp and terminate.
2. _jd_parse_string escape path: a string whose last byte is a backslash
   (`\` then NUL) consumed the backslash, then the switch's default case
   copied the NUL and (*pp)++ stepped past the terminator. Now a
   trailing backslash at EOF stops the loop.

Both are OOB READS on untrusted input — exactly the class Phase-3 fuzzing
exists to catch. The recursion-depth guards (SW_JD_MAX_DEPTH etc.) were
already correct; these were pointer-advance-past-NUL bugs.

Also fixed a latent leak surfaced while making the fuzz harness free its
output: sw_val_free's MAP case freed the key/val arrays but not the
element nodes (the tuple/list case already recursed). Now it recurses —
safe (all callers free fully-owned trees; none alias map sub-values) and
it plugs a real runtime leak anywhere maps are sw_val_free'd.

- tests/fuzz/fuzz_json.c + corpus (incl. a 5000-deep array): 20k
  mutations, clean. Wired into make fuzz and the CI fuzz leg (parser /
  JSON / distribution-unmarshal / HTTP, all under ASAN+UBSAN).

Verified: all 4 fuzz targets clean, test-sw + conformance, gc-stress,
lsan-gate, 28 doc programs — green.

https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
@skyblanket skyblanket merged commit ec1ded6 into main Jun 10, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants