Fix case-arm shadowing UB, tail-loop preemption, and OOM diagnostics#1
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughCoordinated edits align interpreter and compiled error/panic semantics, add tuple-destructuring and with/else arms, introduce UTF‑8 codepoint string_chars, register real source paths for codegen, improve scheduler/crash handling and spawn/tail behavior, expand Std, and add a dual-path conformance gate with many tests and CI wiring. ChangesFull-stack fixes and conformance gate
Sequence Diagram (compiled try/catch routing)sequenceDiagram
participant GeneratedCode
participant sw_self_try_chain
participant _builtin_error
participant CatchArm
GeneratedCode->>sw_self_try_chain: push _sw_try_frame_t
GeneratedCode->>GeneratedCode: execute try body
_builtin_error->>sw_self_try_chain: longjmp to active try frame
sw_self_try_chain->>CatchArm: route into catch path
CatchArm->>sw_self_try_chain: pop frame and continue
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/sw/test_case_shadow_scrutinee.sw (1)
1-95: 💤 Low valueTest coverage is good; consider adding a map pattern case.
The test thoroughly covers ident patterns with guards, tuple sub-binds, and cons head extractions. The PR objectives mention map sub-binds are also covered by the fix—adding a small test like this would complete the regression coverage:
# ---- map value that re-uses the subject name ---- fun value_of(m) { case m { %{m: m} -> m _ -> 'no_key' } } fun test_shadow_map_value() { assert_eq("shadow_map_value", value_of(%{m: 99}), 99) }Not blocking since the fix mechanism is shared and already verified by tuple/cons tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/sw/test_case_shadow_scrutinee.sw` around lines 1 - 95, Add a map-pattern regression test to cover map sub-binds: implement the function value_of(m) that matches case m { %{m: m} -> m _ -> 'no_key' } and a test_shadow_map_value() that calls assert_eq("shadow_map_value", value_of(%{m: 99}), 99); then call test_shadow_map_value() from main and add its return to the failures accumulator so the new test runs during the existing test suite (refer to the existing functions value_of, test_shadow_map_value, and main to locate where to insert).docs/notes/REVIEW_FABLE_2026-06.md (1)
132-132: ⚡ Quick winClarify the cross-reference.
The notation "see L6" is ambiguous without specifying which file. Based on context, this likely refers to a line in
docs/SW_LANGUAGE.md, but readers shouldn't have to guess.📝 Suggested clarification
-| `string_length` = byte length | ✓ documented honestly (see L6) | +| `string_length` = byte length | ✓ documented honestly (see SW_LANGUAGE.md line 6) |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/notes/REVIEW_FABLE_2026-06.md` at line 132, Replace the ambiguous "see L6" reference with an explicit cross-reference to the intended file and location (for example "see docs/SW_LANGUAGE.md, line 6" or better, a section heading or anchor such as "see docs/SW_LANGUAGE.md — Section: <heading> (L6)") so readers know which file/section is meant; update the string "see L6" in REVIEW_FABLE_2026-06.md to include the filename and, if possible, a section title or link to the exact location.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 39-45: Update the "### Mutual tail recursion is not TCO'd; deep
chains overflow silently" entry to match the section’s non-crash framing by
removing or reclassifying the raw crash language: either reword the sentence
that describes a "raw SIGSEGV" to a non-crash severity (e.g., "can silently
overflow the fiber stack leading to undefined behavior") and note the
handler/sigaltstack detail as an implementation note, or split this into a
separate "Crash / Data-loss" subsection that explicitly references
REVIEW_FABLE_2026-06.md and the need for sigaltstack; edit the header and the
paragraph body accordingly so severity language is consistent with the rest of
the section.
In `@docs/SW_LANGUAGE.md`:
- Line 106: Update the sentence that currently reads "Recursion is the loop
construct (no `for`/`while`)" to accurately reflect supported syntax by removing
`for` from the negation—e.g. change it to "Recursion is the loop construct (no
`while`)"—and keep the rest of the paragraph about self-tail-call optimisation
and mutual tail-recursion unchanged; target the sentence containing "Recursion
is the loop construct (no `for`/`while`)" so the docs no longer incorrectly
state that `for` is unsupported.
---
Nitpick comments:
In `@docs/notes/REVIEW_FABLE_2026-06.md`:
- Line 132: Replace the ambiguous "see L6" reference with an explicit
cross-reference to the intended file and location (for example "see
docs/SW_LANGUAGE.md, line 6" or better, a section heading or anchor such as "see
docs/SW_LANGUAGE.md — Section: <heading> (L6)") so readers know which
file/section is meant; update the string "see L6" in REVIEW_FABLE_2026-06.md to
include the filename and, if possible, a section title or link to the exact
location.
In `@tests/sw/test_case_shadow_scrutinee.sw`:
- Around line 1-95: Add a map-pattern regression test to cover map sub-binds:
implement the function value_of(m) that matches case m { %{m: m} -> m _ ->
'no_key' } and a test_shadow_map_value() that calls
assert_eq("shadow_map_value", value_of(%{m: 99}), 99); then call
test_shadow_map_value() from main and add its return to the failures accumulator
so the new test runs during the existing test suite (refer to the existing
functions value_of, test_shadow_map_value, and main to locate where to insert).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 707de52f-77b4-4a3c-9457-db12f004a6d4
📒 Files selected for processing (6)
docs/SW_LANGUAGE.mddocs/notes/KNOWN_ISSUES.mddocs/notes/REVIEW_FABLE_2026-06.mdsrc/swarmrt_codegen.csrc/swarmrt_lang.ctests/sw/test_case_shadow_scrutinee.sw
| - Parameters are positional; default values: `fun greet(name = "world") { ... }`. | ||
| - No explicit `return` — Erlang-style trailing-expression-is-the-value. | ||
| - Recursion is the loop construct (no `for`/`while`). Tail calls are detected and optimised by the codegen, so unbounded tail recursion doesn't blow the stack. | ||
| - Recursion is the loop construct (no `for`/`while`). **Self**-tail-calls are detected and optimised by the codegen, so unbounded `f -> f` tail recursion doesn't blow the stack. **Mutual** tail recursion (`a -> b -> a`, e.g. two state functions calling each other) is NOT yet optimised — each hop costs a C stack frame and deep chains overflow the 128KB process stack. Keep the recursive call in the same function (dispatch on an argument instead of bouncing between functions) for unbounded loops. |
There was a problem hiding this comment.
Fix loop wording to avoid contradicting supported syntax.
Line 106 says there is no for/while, but for is supported and documented in this reference. Please narrow this to “no while” (or equivalent) so users don’t get incorrect guidance.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/SW_LANGUAGE.md` at line 106, Update the sentence that currently reads
"Recursion is the loop construct (no `for`/`while`)" to accurately reflect
supported syntax by removing `for` from the negation—e.g. change it to
"Recursion is the loop construct (no `while`)"—and keep the rest of the
paragraph about self-tail-call optimisation and mutual tail-recursion unchanged;
target the sentence containing "Recursion is the loop construct (no
`for`/`while`)" so the docs no longer incorrectly state that `for` is
unsupported.
…self-init UB
case n { n when n > 0 -> ... } bound the arm-local straight off the
subject identifier, generating 'sw_val_t *n = n;' — the new n shadows at
its own declarator and reads indeterminate memory. Guards silently fell
through to _ on a calm stack and SIGSEGV'd inside _op_cmp on a dirty
one; the interpreter was always correct (compiled/interp divergence).
Tuple/cons/map sub-binds ('n->v.tuple.items[0]') carried the same
hazard.
Fix: snapshot the case subject into a fresh compiler temp before any
arm binds (one pointer copy per case). 'with' desugars to case, so it
is covered by the same change.
Test: tests/sw/test_case_shadow_scrutinee.sw — ident+guard, tuple item,
and cons head all reusing the subject name; 7 assertions, passes
compiled and interpreted.
https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…ed by VMA exhaustion The runtime has reduction accounting (sw_check_reds) and a 1ms SIGALRM force-yield, but generated code never called either — a compiled self-tail-call loop was unpreemptible. On one scheduler the 80K spawn storm's parent therefore never let its children run: all 80K 128KB fiber stacks stayed live at once, each its own mmap+guard (~2 VMAs), so ~160K VMAs were needed against Linux's default vm.max_map_count of 65530. Once maps ran out, mmap and then glibc calloc returned NULL and the storm died as 'SIGSEGV at (nil)'. This is why 'make stress' single-sched read 0/50 on a default-tuned Linux box while CI stayed green: GitHub's Azure runners and Ubuntu 24.04 (the May-29 clearance host) raise vm.max_map_count, so the gate's verdict silently depended on a host sysctl. Fix: emit 'if (sw_check_reds()) sw_yield();' at the self-tail-call backedge — the language's only unbounded loop. One TLS decrement per turn, a yield every SWARM_CONTEXT_REDS (2000) turns; both calls are no-ops on the interpreter path. This also makes the documented 'reduction-counted preemption' true for compiled CPU-bound loops. Result on the failing host: single-sched stress 0/50 -> 50/50; full 'make stress' passes both variants; all 54 sw tests, phases 2-10, gc-stress, and gc-slope stay green. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…EGV at nil val_alloc/val_strdup fell back to calloc/strdup with no NULL check, and every sw_val_* constructor writes through the result unchecked — so heap/VMA exhaustion surfaced as a bare segfault. Now an OOM prints a panic naming the allocation size and the most likely cause (VMA exhaustion under spawn storms: each live fiber stack is an mmap; check vm.max_map_count) before abort. Still fatal — but diagnosable, per the loud-failure philosophy. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…+ TCO doc fix - docs/notes/REVIEW_FABLE_2026-06.md: full audit — verified-claims table, three fixed findings (case-shadow UB, tail-loop preemption, loud OOM), eight open findings with repros, language-design assessment, and an ordered upgrade path. - KNOWN_ISSUES.md: two new open entries with repros and workarounds: (1) try/catch catches builtin panics in the interpreter but not in compiled code; (2) mutual tail recursion is not TCO'd and overflows the fiber stack silently. - SW_LANGUAGE.md: the TCO claim now says self-tail-calls (mutual recursion is not optimised), with the dispatch-on-argument idiom. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
One review pass, eight fixes — each was a place where the same program
meant different things under swc run vs swc build, or where a failure
was silent. All gated by the new dual-path conformance runner (next
commit) plus the existing suites.
Error model (the big one):
- error() now unwinds the full dynamic extent to the nearest try/catch
on BOTH paths. Compiled: a per-process setjmp/longjmp chain
(sw_self_try_chain; frame is a C local on the fiber stack, so a
mid-try yield + resume on another scheduler thread is safe). The old
codegen ran the whole try body and checked the sentinel once at the
end — the statement AFTER an error() still executed, and could panic
the process before the catch was consulted.
- error() outside any try is the documented silent continue-with-nil on
both paths (the interpreter previously unwound the entire program).
- Panics are uncatchable on both paths (interpreter try/catch used to
absorb them): interp->panicking bypasses N_TRY, surfaces at top level
as exit 1. assert_raises remains the sanctioned interceptor; a failed
expect() is now a real panic in the interpreter too (was a hard
exit(1) that bypassed assert_raises).
- expect(): compiled treated only nil as falsy, so expect(a == b, msg)
passed straight through on a 'false' — interpreter parity now (nil or
'false' panic).
- The REPL clears the panicking flag per line so one panic doesn't
poison the session.
Interpreter correctness:
- N_CALL aborts when an ARGUMENT eval raised — a panic inside an
f-string interpolation was laundered into the string "nil" by the
desugared to_string() and execution sailed on.
- N_PIPE now uses the same dispatch chain as N_CALL (module functions,
primary builtins, the extra-builtin bridge, lambda-valued variables).
The old whitelist covered 6 names; xs |> filter(f) was a silent NO-OP
under swc run.
- f-string interpolations parse with the f-string's real line — every
node in an interpolated expression carried line 1, so diagnostics
pointed at the top of the file.
Builtin parity:
- reduce() accepts (fn, list, acc) or the piped (list, fn, acc) in both
the codegen prelude and the interpreter — map/filter had the
type-directed tolerance, reduce didn't, so xs |> reduce(f, 0)
returned 0 compiled and the unfiltered list interpreted (both wrong).
Loud failures:
- spawn failure panics ("process table full — raise SW_MAX_PROCS")
instead of continuing with a pid wrapping NULL whose sends silently
vanished.
- Stack overflow on a fiber is now diagnosed: SA_ONSTACK + a per-thread
sigaltstack (the handler used to fault on the overflowed fiber stack
and the process died with NO output), and a guard-page hit prints
"stack overflow in process #N — mutual tail calls are not TCO'd"
instead of a bare SIGSEGV.
Language ergonomics (parse-time desugar, identical on both paths):
- Tuple-destructuring assignment: {a, b} = pair; {'ok', v} = fetch()
(idents bind, _ skips, literals assert-match with a panic).
- with/else accepts MULTIPLE else arms with optional when guards, like
case (Elixir shape; the second arm was a parse error).
- receive's after accepts the Erlang-reflex arrow form after MS -> body
alongside after MS { body }.
https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…cross swc run + swc build The language's core promise is that a program means ONE thing, but it has two execution paths — and every semantic bug this review found (case-guard UB, try/catch-vs-panic, pipe no-op, reduce arg order, expect falsy) was exactly a divergence between them. tests/sw/run_conform.sh runs every tests/sw/conform/*.sw through BOTH paths and fails on any difference in stdout or exit code (stderr is deliberately not compared — panic banners may differ in trace detail). Wired into make test-sw, so CI enforces it on every push. The corpus (t01–t08) covers: the error model incl. cross-function unwind and uncatchable panics, case patterns + guards, strings + f-strings, collections + map/filter/reduce, arithmetic + closures + pipes, with/else (multi-arm, guards), JSON round-trips, destructuring assignment, and expect semantics. On its first run the gate caught four real divergences; all are fixed in the previous commit and held closed here. Also: tests/sw/test_tco_depth.sw (compiled-path 2M-deep self-tail-call flatness + backedge-yield safety) and tests/sw/test_after_arrow.sw (after MS -> body fires; needs the real scheduler so it lives in the compiled suite). https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…atch divergence, add interpreter depth SW_LANGUAGE.md: error() documented as unwinding the full dynamic extent to the nearest catch (identical on both paths); expect() panics on nil OR 'false'; tuple-destructuring assignment section (binds, _ skips, literal match-asserts, statement position only); with/else multi-arm shape; after MS -> arrow form. KNOWN_ISSUES.md: the try/catch-vs-panic divergence moves to Recently cleared (with the new semantics spelled out); new open entry for the interpreter's bounded recursion depth (C-stack tree-walker, no TCO — compile for depth); the stale 'swc run compiles via cc' description corrected (it interprets). https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…uess work Std.map, Std.filter (self-contained, order-tolerant like the global builtins — NOT delegating to the same-named builtin, which would be a resolution hazard) and Std.join (alias for string_join, the name people guess). No Std.reduce: Std's own sum/product call the global reduce bare, and a same-named module fun would shadow it — the codegen denylist hint stays for Std.reduce/Std.pmap and is removed for map/filter (they're real module functions now). Conformance: tests/sw/conform/t09_std.sw runs the Std surface through both paths, both argument orders. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…Round-7 report continuation status README: test-sw bullet now describes the dual-path conformance gate and the measured counts (56 files / 493 assertions); Std row lists the new map/filter/join aliases. Round-7 report gains a continuation section: O1/O2b/O3 fixed, O6 partial (f-string lines), O7 mostly fixed (destructuring, multi-arm else, after-arrow, Std aliases), and the four additional divergences the conformance gate caught on its first run. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
f11fb08 to
fda588a
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/swarmrt_lang.c (2)
2544-2548:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix the
uint64_tformat mismatch.CI is already warning here on Linux:
pidis auint64_t, but%lluexpectsunsigned long long. UsePRIu64or an explicit cast so quickstart stays warning-free.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_lang.c` around lines 2544 - 2548, The fprintf calls in the SW_VAL_PID and SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t (v->v.pid->pid and v->v.rpid.id); replace the format specifier with PRIu64 from <inttypes.h> (e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ? v->v.pid->pid : 0)); and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ? v->v.rpid.node : "?", (uint64_t)v->v.rpid.id)); or alternately cast to (unsigned long long) if you prefer; ensure inttypes.h is included so PRIu64 is available.Source: Pipeline failures
4563-4616:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBare
val |> funcstill no-ops.This new dispatch only runs when
n->v.pipe.funcisN_CALL. A plainvalue |> to_stringorxs |> fstill falls through toreturn val;, so the pipe remains a no-op for the exactval |> funcform described in the comment.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_lang.c` around lines 4563 - 4616, The pipe handler only dispatches when n->v.pipe.func->type == N_CALL, so plain forms like `val |> fname` or `val |> var` fall through and return val; update the N_PIPE case to also handle non-call function heads: when n->v.pipe.func is a name/identifier (e.g., N_NAME or the AST node type used for bare identifiers), extract the symbol (n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname) and call the module function, test primary builtins (builtin_print, builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname, args, nargs, n->line), and finally check env_get(env, fname) and call interp_apply_fn for lambda-valued variables; ensure interp->error and recursion checks mirror the N_CALL path and return the resulting sw_val_t instead of falling through.
🧹 Nitpick comments (3)
lib/Std.sw (2)
323-333: ⚡ Quick winConsider explicit predicate check for consistency.
The
filterimplementation usesif (pred(hd(lst)))(line 330), which relies on truthy/falsy evaluation. However, all other predicate-taking functions in this file explicitly check== 'true':
take_while(line 78):if (pred(h) == 'true')partition(line 112):if (pred(h) == 'true')any(line 190):if (pred(hd(lst)) == 'true')all(line 196):if (pred(hd(lst)) == 'false')count(line 205):if (pred(h) == 'true')While the current implementation appears to work (conformance tests pass), aligning with the established pattern would improve maintainability and make predicate expectations consistent across the library.
♻️ Suggested change for consistency
fun _filter_acc(lst, pred, acc) { if (length(lst) == 0) { reverse(acc) } else { - if (pred(hd(lst))) { _filter_acc(tl(lst), pred, [hd(lst) | acc]) } + if (pred(hd(lst)) == 'true') { _filter_acc(tl(lst), pred, [hd(lst) | acc]) } else { _filter_acc(tl(lst), pred, acc) } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/Std.sw` around lines 323 - 333, The filter implementation uses a truthy check for predicates; update it to match the library's explicit predicate pattern by checking equality to the string 'true'. Specifically, in function _filter_acc replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) == 'true') so predicates behave consistently with other functions (e.g., take_while, partition, any, all, count) while leaving filter and _filter_acc signatures unchanged.
309-337: ⚡ Quick winConsider moving list operations to the LIST OPS section.
The
mapandfilterimplementations operate on lists but are currently placed in the STRING OPS section (which starts at line 305). Whilejoinis a string operation alias,mapandfilterwould be more discoverable in the LIST OPS section alongside other list helpers.📁 Suggested organization
Move lines 309-334 (the map/filter implementations and their comment block) to appear after line 280 (after
product) and before line 282 (the MAP OPS section header). Leave thejoinalias at lines 335-336 since it's a string operation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/Std.sw` around lines 309 - 337, Move the list-based implementations map, _map_acc, filter, and _filter_acc out of the STRING OPS block into the LIST OPS section so they are colocated with other list helpers (place them after the product function and before the MAP OPS header); keep the join function (string_join alias) in the STRING OPS area. Ensure you relocate the comment block that documents map/filter along with those functions and update surrounding spacing/headers so MAP OPS still follows the LIST OPS section.tests/sw/conform/t05_collections.sw (1)
20-21: ⚡ Quick winDocument the intentional argument order difference.
Lines 20-21 use different argument orders for
map(function-first) vsfilter(list-first), which tests the documented order detection feature for map/filter builtins. Adding a brief comment would prevent future maintainers from misinterpreting this as an inconsistency.📝 Suggested documentation
+ # Intentionally using different argument orders to validate order detection print(f"map={map(fun(x) { x * 2 }, xs)}") print(f"filter={filter(xs, fun(x) { x % 2 == 1 })}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/sw/conform/t05_collections.sw` around lines 20 - 21, Add a brief inline comment above the two test calls explaining the intentional difference in argument order to exercise the documented order-detection feature: note that map(fun, xs) uses function-first while filter(xs, fun) uses list-first here on the map and filter calls (symbols: map, filter, fun, xs) so future maintainers don’t treat it as a bug.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 34-44: Update the KNOWN_ISSUES.md entry to stop saying mutual tail
recursion "overflows … as a raw SIGSEGV without even the crash banner" and
"needs `sigaltstack`" — note that `src/swarmrt_native.c` now installs a
per-thread `sigaltstack` and registers the SIGSEGV handler with `SA_ONSTACK`, so
stack-guard faults are detected and produce the diagnostic panic banner (e.g.
`panic: stack overflow in process `#N` (128KB fiber stack)`); keep the rest of the
explanation (only self-tail calls are TCO'd, impact on mutually recursive FSMs,
and the workaround to consolidate the loop into one function) but change the
wording to reflect that the runtime now prints a stack-overflow diagnostic
instead of silently crashing.
In `@src/swarmrt_builtins_studio.h`:
- Around line 5241-5246: The code currently treats any wake from
sw_receive_any(c->ms, &tag) as a timeout and fires the callback (sw_val_apply)
even when a real message arrived; change the logic so that when sw_receive_any
returns a non-NULL m (i.e. an actual message), you free m and do NOT call
sw_val_apply but instead continue waiting for the real timeout event, only
invoking sw_val_apply when sw_receive_any indicates a timeout (m == NULL) and
the process isn't cancelled (check self = sw_self() and self->kill_flag); apply
the same change to the equivalent block around lines 5271–5278 that uses c->ms,
tag, sw_receive_any and sw_val_apply.
In `@src/swarmrt_codegen.c`:
- Around line 2056-2068: In emit_pipe(), replicate the same qualified-builtin
denylist logic used in the direct-call path: when lowering a piped qualified
call (the branch that transforms xs |> Std.reduce(...) or xs |> Std.pmap(...))
detect that the module is "Std" and the function name is "reduce" or "pmap" and
emit the same diagnostic (the fprintf message that references gf/fname and
ctx->mod_name/n->line) and set ctx->had_unknown_fn so codegen fails early;
basically copy the check that uses gf/mod/strcmp from the direct-call branch
into the pipe-lowering branch so piped Std.reduce/Std.pmap produce the identical
hint and early exit.
In `@src/swarmrt_lang.c`:
- Around line 1773-1784: The current desugaring creates a user-visible temp
named "__destruct_%d" (dt_ctr/tmp) and emits a normal N_ASSIGN into the current
scope (bind->v.assign.name), which can collide with user bindings; instead
generate and use a non-user-visible temporary and mark it so both interpreter
and codegen treat it as internal. Change the tmp generation to produce an
internal-only identifier (e.g., use a reserved/internal prefix or a
hidden-marker byte) or create a dedicated internal bind kind, set bind so its
name is the internal temp and ensure the interpreter/codegen treat that name as
non-exported/hidden (or handle the new internal bind kind) so the temp cannot
collide with user code and is not visible to the user.
In `@src/swarmrt_native.c`:
- Around line 1179-1186: The _sw_install_altstack function currently ignores the
return value of sigaltstack; modify _sw_install_altstack to check sigaltstack's
return, and on failure capture errno and surface the error (e.g., call
fprintf(stderr, ...) or process logger and/or abort/exit or return an error
code) so SA_ONSTACK paths don't silently degrade; update callers or related
installs (the places handling alternate stack usage around _sw_install_altstack
and the SA_ONSTACK handling referenced at the earlier uses) to handle the
failure if you choose to return an error instead of aborting. Ensure you include
errno or strerror(errno) in the reported message to aid debugging and keep
function name _sw_install_altstack as the change location.
- Around line 1206-1208: The guard-window check hardcodes 8192 bytes; change it
to compute the window using the system page size so stack overflow detection
works on non-4KB systems: compute page_size (e.g., via sysconf(_SC_PAGESIZE) or
getpagesize()) and replace the constant with something like page_size * 2 (or
the same intended margin) when calculating guard_bot from guard_top; update the
check around the symbols guard_top, guard_bot, cur->stack_mem, and fault to use
the runtime-derived page_size.
---
Outside diff comments:
In `@src/swarmrt_lang.c`:
- Around line 2544-2548: The fprintf calls in the SW_VAL_PID and
SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t (v->v.pid->pid
and v->v.rpid.id); replace the format specifier with PRIu64 from <inttypes.h>
(e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ? v->v.pid->pid : 0));
and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ? v->v.rpid.node : "?",
(uint64_t)v->v.rpid.id)); or alternately cast to (unsigned long long) if you
prefer; ensure inttypes.h is included so PRIu64 is available.
- Around line 4563-4616: The pipe handler only dispatches when
n->v.pipe.func->type == N_CALL, so plain forms like `val |> fname` or `val |>
var` fall through and return val; update the N_PIPE case to also handle non-call
function heads: when n->v.pipe.func is a name/identifier (e.g., N_NAME or the
AST node type used for bare identifiers), extract the symbol
(n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same
dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname)
and call the module function, test primary builtins (builtin_print,
builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname,
args, nargs, n->line), and finally check env_get(env, fname) and call
interp_apply_fn for lambda-valued variables; ensure interp->error and recursion
checks mirror the N_CALL path and return the resulting sw_val_t instead of
falling through.
---
Nitpick comments:
In `@lib/Std.sw`:
- Around line 323-333: The filter implementation uses a truthy check for
predicates; update it to match the library's explicit predicate pattern by
checking equality to the string 'true'. Specifically, in function _filter_acc
replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) == 'true') so
predicates behave consistently with other functions (e.g., take_while,
partition, any, all, count) while leaving filter and _filter_acc signatures
unchanged.
- Around line 309-337: Move the list-based implementations map, _map_acc,
filter, and _filter_acc out of the STRING OPS block into the LIST OPS section so
they are colocated with other list helpers (place them after the product
function and before the MAP OPS header); keep the join function (string_join
alias) in the STRING OPS area. Ensure you relocate the comment block that
documents map/filter along with those functions and update surrounding
spacing/headers so MAP OPS still follows the LIST OPS section.
In `@tests/sw/conform/t05_collections.sw`:
- Around line 20-21: Add a brief inline comment above the two test calls
explaining the intentional difference in argument order to exercise the
documented order-detection feature: note that map(fun, xs) uses function-first
while filter(xs, fun) uses list-first here on the map and filter calls (symbols:
map, filter, fun, xs) so future maintainers don’t treat it as a bug.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 663b5f13-af94-4f8f-994f-39d1a57b29c7
📒 Files selected for processing (26)
MakefileREADME.mddocs/SW_LANGUAGE.mddocs/notes/KNOWN_ISSUES.mddocs/notes/REVIEW_FABLE_2026-06.mdlib/Std.swsrc/swarmrt_builtins_studio.hsrc/swarmrt_codegen.csrc/swarmrt_lang.csrc/swarmrt_lang.hsrc/swarmrt_native.csrc/swarmrt_native.hsrc/swarmrt_repl.ctests/sw/conform/t01_errors.swtests/sw/conform/t02_panic_uncatchable.swtests/sw/conform/t03_case.swtests/sw/conform/t04_strings.swtests/sw/conform/t05_collections.swtests/sw/conform/t06_arith_lambdas.swtests/sw/conform/t07_with_json.swtests/sw/conform/t08_destructure_with.swtests/sw/conform/t09_std.swtests/sw/run_conform.shtests/sw/test_after_arrow.swtests/sw/test_case_shadow_scrutinee.swtests/sw/test_tco_depth.sw
✅ Files skipped from review due to trivial changes (3)
- tests/sw/conform/t04_strings.sw
- tests/sw/test_after_arrow.sw
- README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/sw/test_case_shadow_scrutinee.sw
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/swarmrt_lang.c (2)
2544-2548:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix the
uint64_tformat mismatch.CI is already warning here on Linux:
pidis auint64_t, but%lluexpectsunsigned long long. UsePRIu64or an explicit cast so quickstart stays warning-free.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_lang.c` around lines 2544 - 2548, The fprintf calls in the SW_VAL_PID and SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t (v->v.pid->pid and v->v.rpid.id); replace the format specifier with PRIu64 from <inttypes.h> (e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ? v->v.pid->pid : 0)); and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ? v->v.rpid.node : "?", (uint64_t)v->v.rpid.id)); or alternately cast to (unsigned long long) if you prefer; ensure inttypes.h is included so PRIu64 is available.Source: Pipeline failures
4563-4616:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBare
val |> funcstill no-ops.This new dispatch only runs when
n->v.pipe.funcisN_CALL. A plainvalue |> to_stringorxs |> fstill falls through toreturn val;, so the pipe remains a no-op for the exactval |> funcform described in the comment.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_lang.c` around lines 4563 - 4616, The pipe handler only dispatches when n->v.pipe.func->type == N_CALL, so plain forms like `val |> fname` or `val |> var` fall through and return val; update the N_PIPE case to also handle non-call function heads: when n->v.pipe.func is a name/identifier (e.g., N_NAME or the AST node type used for bare identifiers), extract the symbol (n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname) and call the module function, test primary builtins (builtin_print, builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname, args, nargs, n->line), and finally check env_get(env, fname) and call interp_apply_fn for lambda-valued variables; ensure interp->error and recursion checks mirror the N_CALL path and return the resulting sw_val_t instead of falling through.
🧹 Nitpick comments (3)
lib/Std.sw (2)
323-333: ⚡ Quick winConsider explicit predicate check for consistency.
The
filterimplementation usesif (pred(hd(lst)))(line 330), which relies on truthy/falsy evaluation. However, all other predicate-taking functions in this file explicitly check== 'true':
take_while(line 78):if (pred(h) == 'true')partition(line 112):if (pred(h) == 'true')any(line 190):if (pred(hd(lst)) == 'true')all(line 196):if (pred(hd(lst)) == 'false')count(line 205):if (pred(h) == 'true')While the current implementation appears to work (conformance tests pass), aligning with the established pattern would improve maintainability and make predicate expectations consistent across the library.
♻️ Suggested change for consistency
fun _filter_acc(lst, pred, acc) { if (length(lst) == 0) { reverse(acc) } else { - if (pred(hd(lst))) { _filter_acc(tl(lst), pred, [hd(lst) | acc]) } + if (pred(hd(lst)) == 'true') { _filter_acc(tl(lst), pred, [hd(lst) | acc]) } else { _filter_acc(tl(lst), pred, acc) } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/Std.sw` around lines 323 - 333, The filter implementation uses a truthy check for predicates; update it to match the library's explicit predicate pattern by checking equality to the string 'true'. Specifically, in function _filter_acc replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) == 'true') so predicates behave consistently with other functions (e.g., take_while, partition, any, all, count) while leaving filter and _filter_acc signatures unchanged.
309-337: ⚡ Quick winConsider moving list operations to the LIST OPS section.
The
mapandfilterimplementations operate on lists but are currently placed in the STRING OPS section (which starts at line 305). Whilejoinis a string operation alias,mapandfilterwould be more discoverable in the LIST OPS section alongside other list helpers.📁 Suggested organization
Move lines 309-334 (the map/filter implementations and their comment block) to appear after line 280 (after
product) and before line 282 (the MAP OPS section header). Leave thejoinalias at lines 335-336 since it's a string operation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/Std.sw` around lines 309 - 337, Move the list-based implementations map, _map_acc, filter, and _filter_acc out of the STRING OPS block into the LIST OPS section so they are colocated with other list helpers (place them after the product function and before the MAP OPS header); keep the join function (string_join alias) in the STRING OPS area. Ensure you relocate the comment block that documents map/filter along with those functions and update surrounding spacing/headers so MAP OPS still follows the LIST OPS section.tests/sw/conform/t05_collections.sw (1)
20-21: ⚡ Quick winDocument the intentional argument order difference.
Lines 20-21 use different argument orders for
map(function-first) vsfilter(list-first), which tests the documented order detection feature for map/filter builtins. Adding a brief comment would prevent future maintainers from misinterpreting this as an inconsistency.📝 Suggested documentation
+ # Intentionally using different argument orders to validate order detection print(f"map={map(fun(x) { x * 2 }, xs)}") print(f"filter={filter(xs, fun(x) { x % 2 == 1 })}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/sw/conform/t05_collections.sw` around lines 20 - 21, Add a brief inline comment above the two test calls explaining the intentional difference in argument order to exercise the documented order-detection feature: note that map(fun, xs) uses function-first while filter(xs, fun) uses list-first here on the map and filter calls (symbols: map, filter, fun, xs) so future maintainers don’t treat it as a bug.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 34-44: Update the KNOWN_ISSUES.md entry to stop saying mutual tail
recursion "overflows … as a raw SIGSEGV without even the crash banner" and
"needs `sigaltstack`" — note that `src/swarmrt_native.c` now installs a
per-thread `sigaltstack` and registers the SIGSEGV handler with `SA_ONSTACK`, so
stack-guard faults are detected and produce the diagnostic panic banner (e.g.
`panic: stack overflow in process `#N` (128KB fiber stack)`); keep the rest of the
explanation (only self-tail calls are TCO'd, impact on mutually recursive FSMs,
and the workaround to consolidate the loop into one function) but change the
wording to reflect that the runtime now prints a stack-overflow diagnostic
instead of silently crashing.
In `@src/swarmrt_builtins_studio.h`:
- Around line 5241-5246: The code currently treats any wake from
sw_receive_any(c->ms, &tag) as a timeout and fires the callback (sw_val_apply)
even when a real message arrived; change the logic so that when sw_receive_any
returns a non-NULL m (i.e. an actual message), you free m and do NOT call
sw_val_apply but instead continue waiting for the real timeout event, only
invoking sw_val_apply when sw_receive_any indicates a timeout (m == NULL) and
the process isn't cancelled (check self = sw_self() and self->kill_flag); apply
the same change to the equivalent block around lines 5271–5278 that uses c->ms,
tag, sw_receive_any and sw_val_apply.
In `@src/swarmrt_codegen.c`:
- Around line 2056-2068: In emit_pipe(), replicate the same qualified-builtin
denylist logic used in the direct-call path: when lowering a piped qualified
call (the branch that transforms xs |> Std.reduce(...) or xs |> Std.pmap(...))
detect that the module is "Std" and the function name is "reduce" or "pmap" and
emit the same diagnostic (the fprintf message that references gf/fname and
ctx->mod_name/n->line) and set ctx->had_unknown_fn so codegen fails early;
basically copy the check that uses gf/mod/strcmp from the direct-call branch
into the pipe-lowering branch so piped Std.reduce/Std.pmap produce the identical
hint and early exit.
In `@src/swarmrt_lang.c`:
- Around line 1773-1784: The current desugaring creates a user-visible temp
named "__destruct_%d" (dt_ctr/tmp) and emits a normal N_ASSIGN into the current
scope (bind->v.assign.name), which can collide with user bindings; instead
generate and use a non-user-visible temporary and mark it so both interpreter
and codegen treat it as internal. Change the tmp generation to produce an
internal-only identifier (e.g., use a reserved/internal prefix or a
hidden-marker byte) or create a dedicated internal bind kind, set bind so its
name is the internal temp and ensure the interpreter/codegen treat that name as
non-exported/hidden (or handle the new internal bind kind) so the temp cannot
collide with user code and is not visible to the user.
In `@src/swarmrt_native.c`:
- Around line 1179-1186: The _sw_install_altstack function currently ignores the
return value of sigaltstack; modify _sw_install_altstack to check sigaltstack's
return, and on failure capture errno and surface the error (e.g., call
fprintf(stderr, ...) or process logger and/or abort/exit or return an error
code) so SA_ONSTACK paths don't silently degrade; update callers or related
installs (the places handling alternate stack usage around _sw_install_altstack
and the SA_ONSTACK handling referenced at the earlier uses) to handle the
failure if you choose to return an error instead of aborting. Ensure you include
errno or strerror(errno) in the reported message to aid debugging and keep
function name _sw_install_altstack as the change location.
- Around line 1206-1208: The guard-window check hardcodes 8192 bytes; change it
to compute the window using the system page size so stack overflow detection
works on non-4KB systems: compute page_size (e.g., via sysconf(_SC_PAGESIZE) or
getpagesize()) and replace the constant with something like page_size * 2 (or
the same intended margin) when calculating guard_bot from guard_top; update the
check around the symbols guard_top, guard_bot, cur->stack_mem, and fault to use
the runtime-derived page_size.
---
Outside diff comments:
In `@src/swarmrt_lang.c`:
- Around line 2544-2548: The fprintf calls in the SW_VAL_PID and
SW_VAL_REMOTE_PID cases use "%llu" for values that are uint64_t (v->v.pid->pid
and v->v.rpid.id); replace the format specifier with PRIu64 from <inttypes.h>
(e.g. fprintf(f, "<pid:%" PRIu64 ">", (uint64_t)(v->v.pid ? v->v.pid->pid : 0));
and fprintf(f, "<rpid:%s:%" PRIu64 ">", v->v.rpid.node ? v->v.rpid.node : "?",
(uint64_t)v->v.rpid.id)); or alternately cast to (unsigned long long) if you
prefer; ensure inttypes.h is included so PRIu64 is available.
- Around line 4563-4616: The pipe handler only dispatches when
n->v.pipe.func->type == N_CALL, so plain forms like `val |> fname` or `val |>
var` fall through and return val; update the N_PIPE case to also handle non-call
function heads: when n->v.pipe.func is a name/identifier (e.g., N_NAME or the
AST node type used for bare identifiers), extract the symbol
(n->v.pipe.func->v.sval), build args[] with args[0]=val, and run the same
dispatch sequence used for N_CALL — check find_fun(interp->module_ast, fname)
and call the module function, test primary builtins (builtin_print,
builtin_length, builtin_assert, etc.), call interp_extra_builtin(interp, fname,
args, nargs, n->line), and finally check env_get(env, fname) and call
interp_apply_fn for lambda-valued variables; ensure interp->error and recursion
checks mirror the N_CALL path and return the resulting sw_val_t instead of
falling through.
---
Nitpick comments:
In `@lib/Std.sw`:
- Around line 323-333: The filter implementation uses a truthy check for
predicates; update it to match the library's explicit predicate pattern by
checking equality to the string 'true'. Specifically, in function _filter_acc
replace the condition if (pred(hd(lst))) with if (pred(hd(lst)) == 'true') so
predicates behave consistently with other functions (e.g., take_while,
partition, any, all, count) while leaving filter and _filter_acc signatures
unchanged.
- Around line 309-337: Move the list-based implementations map, _map_acc,
filter, and _filter_acc out of the STRING OPS block into the LIST OPS section so
they are colocated with other list helpers (place them after the product
function and before the MAP OPS header); keep the join function (string_join
alias) in the STRING OPS area. Ensure you relocate the comment block that
documents map/filter along with those functions and update surrounding
spacing/headers so MAP OPS still follows the LIST OPS section.
In `@tests/sw/conform/t05_collections.sw`:
- Around line 20-21: Add a brief inline comment above the two test calls
explaining the intentional difference in argument order to exercise the
documented order-detection feature: note that map(fun, xs) uses function-first
while filter(xs, fun) uses list-first here on the map and filter calls (symbols:
map, filter, fun, xs) so future maintainers don’t treat it as a bug.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 663b5f13-af94-4f8f-994f-39d1a57b29c7
📒 Files selected for processing (26)
MakefileREADME.mddocs/SW_LANGUAGE.mddocs/notes/KNOWN_ISSUES.mddocs/notes/REVIEW_FABLE_2026-06.mdlib/Std.swsrc/swarmrt_builtins_studio.hsrc/swarmrt_codegen.csrc/swarmrt_lang.csrc/swarmrt_lang.hsrc/swarmrt_native.csrc/swarmrt_native.hsrc/swarmrt_repl.ctests/sw/conform/t01_errors.swtests/sw/conform/t02_panic_uncatchable.swtests/sw/conform/t03_case.swtests/sw/conform/t04_strings.swtests/sw/conform/t05_collections.swtests/sw/conform/t06_arith_lambdas.swtests/sw/conform/t07_with_json.swtests/sw/conform/t08_destructure_with.swtests/sw/conform/t09_std.swtests/sw/run_conform.shtests/sw/test_after_arrow.swtests/sw/test_case_shadow_scrutinee.swtests/sw/test_tco_depth.sw
✅ Files skipped from review due to trivial changes (3)
- tests/sw/conform/t04_strings.sw
- tests/sw/test_after_arrow.sw
- README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/sw/test_case_shadow_scrutinee.sw
🛑 Comments failed to post (6)
docs/notes/KNOWN_ISSUES.md (1)
34-44:
⚠️ Potential issue | 🟡 Minor🧩 Analysis chain
🏁 Script executed:
#!/bin/bash # Description: Check if mutual-recursion overflow now prints a diagnostic # Create a mutual-recursion test cat > /tmp/mutual_recurse.sw << 'EOF' module MutualTest export [main] fun is_even(n) { if (n == 0) { 'true' } else { is_odd(n - 1) } } fun is_odd(n) { if (n == 0) { 'false' } else { is_even(n - 1) } } fun main() { print(is_even(100000)) } EOF # Compile and run, capture stderr ./bin/swc build /tmp/mutual_recurse.sw -o /tmp/mutual_test 2>&1 /tmp/mutual_test 2>&1 | head -20 # Look for "stack overflow" message if /tmp/mutual_test 2>&1 | grep -q "stack overflow"; then echo "FIXED: stack overflow diagnostic present" else echo "NOT FIXED: no diagnostic, still raw SIGSEGV" fiRepository: skyblanket/swarmrt
Length of output: 227
🏁 Script executed:
#!/bin/bash set -euo pipefail # 1) Show the current KNOWN_ISSUES entry (lines ~34-44) echo "---- docs/notes/KNOWN_ISSUES.md:34-70 ----" nl -ba docs/notes/KNOWN_ISSUES.md | sed -n '30,80p' || true # 2) Locate the compiler executable used for building/swift-like sources echo "---- repo root: top-level dirs/files ----" ls -la echo "---- list ./bin if present ----" ls -la bin 2>/dev/null || echo "no ./bin" echo "---- find executables named swc/similar ----" # common candidates find . -maxdepth 4 -type f \( -name 'swc' -o -name 'swc*' -o -name '*compiler*' -o -name '*fable*' \) 2>/dev/null | head -200 || true echo "---- fd for swc in repo ----" command -v fd >/dev/null 2>&1 && fd -a -t f 'swc' . | head -50 || true # 3) Search for the promised stack-overflow diagnostics implementation echo "---- rg sigaltstack/SA_ONSTACK/stack overflow messages ----" rg -n --hidden --no-ignore-vcs -S "sigaltstack|SA_ONSTACK|stack overflow|overflow in process" . || true # 4) If we find an executable, re-run the mutual recursion test using it. # We'll try a few likely paths derived from what we find above. # (If nothing executable is found, we’ll stop with a helpful message.) EXEC="" if [ -x ./bin/swc ]; then EXEC="./bin/swc"; fi if [ -z "$EXEC" ]; then # try first match from find output CAND="$(find . -maxdepth 5 -type f \( -name 'swc' -o -name 'swc*' \) -perm -111 2>/dev/null | head -1 || true)" if [ -n "${CAND:-}" ]; then EXEC="$CAND"; fi fi if [ -z "$EXEC" ]; then echo "Could not locate an executable 'swc' (or similar) in this repo sandbox; cannot run the mutual-recursion test yet." exit 0 fi echo "---- using compiler executable: $EXEC ----" cat > /tmp/mutual_recurse.sw << 'EOF' module MutualTest export [main] fun is_even(n) { if (n == 0) { 'true' } else { is_odd(n - 1) } } fun is_odd(n) { if (n == 0) { 'false' } else { is_even(n - 1) } } fun main() { print(is_even(100000)) } EOF # Try common CLI shapes: build -o, or compile -o, etc. We’ll try a few without assuming. set +e OUTLOG=/tmp/mutual_test.log rm -f /tmp/mutual_test /tmp/mutual_test.log "$EXEC" build /tmp/mutual_recurse.sw -o /tmp/mutual_test 2>&1 | tee "$OUTLOG" rc=$? if [ $rc -ne 0 ]; then echo "Compile step failed with exit code $rc. Log tail:" tail -200 "$OUTLOG" || true exit 0 fi set -e /tmp/mutual_test 2>&1 | head -80 | tee /tmp/mutual_test.run.log if ( /tmp/mutual_test 2>&1 | rg -qi "stack overflow|overflow in process" ); then echo "FIXED/DIAGNOSTIC: stack overflow diagnostic present" else echo "NOT FIXED: still appears as raw crash without 'stack overflow' wording" fiRepository: skyblanket/swarmrt
Length of output: 4196
🏁 Script executed:
#!/bin/bash set -euo pipefail echo "---- KNOWN_ISSUES.md lines 30-60 (no nl) ----" sed -n '30,60p' docs/notes/KNOWN_ISSUES.md || true echo "---- REVIEW_FABLE_2026-06.md around O2(b) (160-210) ----" sed -n '150,210p' docs/notes/REVIEW_FABLE_2026-06.md || true echo "---- REVIEW_FABLE_2026-06.md around table row (270-350) ----" sed -n '260,340p' docs/notes/REVIEW_FABLE_2026-06.md || true echo "---- src/swarmrt_native.c around altstack/panic (~1130-1235) ----" sed -n '1125,1235p' src/swarmrt_native.c || true echo "---- src/swarmrt_native.c around signal handler setup (~1360-1420) ----" sed -n '1355,1425p' src/swarmrt_native.c || trueRepository: skyblanket/swarmrt
Length of output: 16989
Update KNOWN_ISSUES.md stack-overflow wording to match current sigaltstack handling
docs/notes/KNOWN_ISSUES.md(lines 34-44) still says mutual recursion “overflows … as a raw SIGSEGV without even the crash banner” and “needssigaltstack”.src/swarmrt_native.cinstalls a per-threadsigaltstackand registers the SIGSEGV handler withSA_ONSTACK, which detects guard-page faults and printspanic: stack overflow in process#N(128KB fiber stack)(with the mutual tail-call explanation). Update the entry to reflect the diagnostic-on-overflow behavior instead of silent SIGSEGV.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/notes/KNOWN_ISSUES.md` around lines 34 - 44, Update the KNOWN_ISSUES.md entry to stop saying mutual tail recursion "overflows … as a raw SIGSEGV without even the crash banner" and "needs `sigaltstack`" — note that `src/swarmrt_native.c` now installs a per-thread `sigaltstack` and registers the SIGSEGV handler with `SA_ONSTACK`, so stack-guard faults are detected and produce the diagnostic panic banner (e.g. `panic: stack overflow in process `#N` (128KB fiber stack)`); keep the rest of the explanation (only self-tail calls are TCO'd, impact on mutually recursive FSMs, and the workaround to consolidate the loop into one function) but change the wording to reflect that the runtime now prints a stack-overflow diagnostic instead of silently crashing.src/swarmrt_builtins_studio.h (1)
5241-5246:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep waiting after spurious mailbox wakeups.
sw_receive_any(c->ms, ...)wakes on either timeout or any incoming message. These handlers discard the message but still run the callback immediately, so sending anything to the timer pid shortensdelay/intervaland can makedelay()fire almost instantly.💡 Suggested fix
static void _after_entry(void *raw) { _timer_closure_t *c = (_timer_closure_t *)raw; @@ - uint64_t tag; - void *m = sw_receive_any(c->ms, &tag); - if (m) free(m); /* discard any spurious message */ - sw_process_t *self = sw_self(); - if (!(self && self->kill_flag)) /* fire only if not cancelled mid-wait */ - sw_val_apply(c->fn, NULL, 0); + uint64_t deadline = _sw_now_ms() + c->ms; + for (;;) { + uint64_t now = _sw_now_ms(); + if (now >= deadline) break; + uint64_t tag; + void *m = sw_receive_any(deadline - now, &tag); + if (m) free(m); /* ignore foreign messages, keep waiting */ + sw_process_t *self = sw_self(); + if (self && self->kill_flag) return; + } + sw_val_apply(c->fn, NULL, 0); } @@ static void _every_entry(void *raw) { _timer_closure_t *c = (_timer_closure_t *)raw; @@ for (;;) { - uint64_t tag; - void *m = sw_receive_any(c->ms, &tag); - if (m) free(m); /* discard any spurious message */ - sw_process_t *self = sw_self(); - if (self && self->kill_flag) break; /* cancelled during the wait */ + uint64_t deadline = _sw_now_ms() + c->ms; + for (;;) { + uint64_t now = _sw_now_ms(); + if (now >= deadline) break; + uint64_t tag; + void *m = sw_receive_any(deadline - now, &tag); + if (m) free(m); + sw_process_t *self = sw_self(); + if (self && self->kill_flag) return; + } + sw_process_t *self = sw_self(); + if (self && self->kill_flag) break; sw_val_apply(c->fn, NULL, 0); self = sw_self(); if (self && self->kill_flag) break; /* cancelled during/after the tick */ } }Also applies to: 5271-5278
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_builtins_studio.h` around lines 5241 - 5246, The code currently treats any wake from sw_receive_any(c->ms, &tag) as a timeout and fires the callback (sw_val_apply) even when a real message arrived; change the logic so that when sw_receive_any returns a non-NULL m (i.e. an actual message), you free m and do NOT call sw_val_apply but instead continue waiting for the real timeout event, only invoking sw_val_apply when sw_receive_any indicates a timeout (m == NULL) and the process isn't cancelled (check self = sw_self() and self->kill_flag); apply the same change to the equivalent block around lines 5271–5278 that uses c->ms, tag, sw_receive_any and sw_val_apply.src/swarmrt_codegen.c (1)
2056-2068:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winMirror the qualified-builtin denylist in
emit_pipe().Line 2064 only catches direct
Std.reduce(...)/Std.pmap(...)calls. The piped qualified-call branches still lowerxs |> Std.reduce(f, 0)toStd_reduce(...), so this path falls through to a raw generated-C unknown-function failure instead of the new hint and early codegen failure.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_codegen.c` around lines 2056 - 2068, In emit_pipe(), replicate the same qualified-builtin denylist logic used in the direct-call path: when lowering a piped qualified call (the branch that transforms xs |> Std.reduce(...) or xs |> Std.pmap(...)) detect that the module is "Std" and the function name is "reduce" or "pmap" and emit the same diagnostic (the fprintf message that references gf/fname and ctx->mod_name/n->line) and set ctx->had_unknown_fn so codegen fails early; basically copy the check that uses gf/mod/strcmp from the direct-call branch into the pipe-lowering branch so piped Std.reduce/Std.pmap produce the identical hint and early exit.src/swarmrt_lang.c (1)
1773-1784:
⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDestructuring introduces a user-visible temp binding.
__destruct_Nis lowered as a normal assignment in the current scope, so destructuring leaks an extra variable and can overwrite a user-defined__destruct_*binding. Because this desugar feeds both the interpreter and codegen, the collision changes program behavior on both paths. Please lower through a non-user-visible temporary instead of a regular identifier.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_lang.c` around lines 1773 - 1784, The current desugaring creates a user-visible temp named "__destruct_%d" (dt_ctr/tmp) and emits a normal N_ASSIGN into the current scope (bind->v.assign.name), which can collide with user bindings; instead generate and use a non-user-visible temporary and mark it so both interpreter and codegen treat it as internal. Change the tmp generation to produce an internal-only identifier (e.g., use a reserved/internal prefix or a hidden-marker byte) or create a dedicated internal bind kind, set bind so its name is the internal temp and ensure the interpreter/codegen treat that name as non-exported/hidden (or handle the new internal bind kind) so the temp cannot collide with user code and is not visible to the user.src/swarmrt_native.c (2)
1179-1186:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCheck
sigaltstackinstallation result and surface failures.Line 1185 ignores
sigaltstackerrors; if installation fails, the SA_ONSTACK path silently degrades on Line 1161/Line 1393.Suggested patch
-static void _sw_install_altstack(void); /* defined with the crash handler below */ +static int _sw_install_altstack(void); /* defined with the crash handler below */ static void *scheduler_main(void *arg) { sw_scheduler_t *sched = (sw_scheduler_t *)arg; - _sw_install_altstack(); /* fiber-stack overflows fault on THIS thread */ + if (_sw_install_altstack() != 0) { + perror("[SwarmRT] sigaltstack (scheduler thread)"); + } sched->active = 1; scheduler_loop(sched); return NULL; } -static void _sw_install_altstack(void) { +static int _sw_install_altstack(void) { static __thread char altstack_mem[64 * 1024]; stack_t ss; ss.ss_sp = altstack_mem; ss.ss_size = sizeof(altstack_mem); ss.ss_flags = 0; - sigaltstack(&ss, NULL); + return sigaltstack(&ss, NULL); } ... - _sw_install_altstack(); /* this thread; schedulers install their own */ + if (_sw_install_altstack() != 0) { + perror("[SwarmRT] sigaltstack (init thread)"); + }Also applies to: 1161-1161, 1393-1393
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_native.c` around lines 1179 - 1186, The _sw_install_altstack function currently ignores the return value of sigaltstack; modify _sw_install_altstack to check sigaltstack's return, and on failure capture errno and surface the error (e.g., call fprintf(stderr, ...) or process logger and/or abort/exit or return an error code) so SA_ONSTACK paths don't silently degrade; update callers or related installs (the places handling alternate stack usage around _sw_install_altstack and the SA_ONSTACK handling referenced at the earlier uses) to handle the failure if you choose to return an error instead of aborting. Ensure you include errno or strerror(errno) in the reported message to aid debugging and keep function name _sw_install_altstack as the change location.
1206-1208:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse runtime page size for the guard-window check.
Line 1207 hardcodes
8192; on non-4KB page systems this can miss real guard-page overflows and skip the dedicated overflow banner.Suggested patch
+static size_t g_page_size = 4096; ... int sw_init(const char *name, uint32_t num_schedulers) { + long ps = sysconf(_SC_PAGESIZE); + if (ps > 0) g_page_size = (size_t)ps; ... if (sig == SIGSEGV && cur && cur->stack_mem && info->si_addr) { uintptr_t fault = (uintptr_t)info->si_addr; uintptr_t guard_top = (uintptr_t)cur->stack_mem; /* guard page is just below */ - uintptr_t guard_bot = guard_top - 8192; /* guard page + one page slack */ + uintptr_t guard_bot = guard_top - (uintptr_t)(2 * g_page_size); /* guard + one-page slack */ if (fault < guard_top && fault >= guard_bot) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_native.c` around lines 1206 - 1208, The guard-window check hardcodes 8192 bytes; change it to compute the window using the system page size so stack overflow detection works on non-4KB systems: compute page_size (e.g., via sysconf(_SC_PAGESIZE) or getpagesize()) and replace the constant with something like page_size * 2 (or the same intended margin) when calculating guard_bot from guard_top; update the check around the symbols guard_top, guard_bot, cur->stack_mem, and fault to use the runtime-derived page_size.
…advisory CI leg) The slope gates catch UNBOUNDED growth but are blind to bounded leaks under their RSS noise floor, and every ASAN run to date used detect_leaks=0 (macOS Apple-clang has no LSan) — the leak-blindness PRODUCTION_ROADMAP.md 2.1 calls the highest-leverage gap. make lsan-gate (Linux-only): tests/gc/lsan_lifecycle.sw churns every lifecycle owner per round — one-shot timer fired, interval cancelled (incl. the pre-trampoline kill window), dynamic + static supervisors killed, ETS put/replace/delete, plain spawn/exit, cross-process compound messages, ~64KB captures — exits cleanly, and LSan asserts zero definitely-lost blocks at exit. Exit-time is the right checkpoint: parked-fiber stacks (invisible to the reachability scan) are all gone, and globals-reachable singletons are 'still reachable', which LSan does not report. Suppressions in tests/gc/lsan.supp, each one tied to a documented accepted-minor (currently just _builtin_supervise's caller specs[]). Proven BIDIRECTIONAL per the gate philosophy: an injected unreachable 31337-byte block through the full runtime link fails the gate; the lifecycle probe passes. (Methodology note for future leak canaries: at -O1 clang legally elides an unused malloc — escape the pointer through a volatile global or the canary tests nothing. Cost a debugging hour.) Result: Phase-1's ownership work holds under real LSan — 40 rounds of full lifecycle churn, zero unsuppressed leaks. Two adjacent fixes/findings: - _sw_install_altstack is skipped under ASAN: ASAN installs and at thread exit munmaps its OWN per-thread altstack; replacing it made UnsetAlternateSignalStack CHECK-fail. The crash handler runs on ASAN's altstack under sanitizers. - setup_preemption() is DEAD CODE — defined, never called: the 1ms preemption timer never existed at runtime; the only preemption is the reduction check at compiled tail-loop backedges. Annotated as deliberately-off with the wiring considerations (storm re-verification + SIGALRM-vs-LSan interaction) rather than silently deleted. CI: advisory leg (continue-on-error) after the slope gate; promote to blocking once green for a week. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…ring_chars
Closes the Round-7 O6 diagnostics remainder + the last O7 stdlib gap:
- Diagnostics and #line directives now reference the REAL source file:
swc registers each parsed file's path (entry inputs + resolved
imports) with codegen via sw_codegen_register_source, and
guess_sw_path consults the registry before falling back to the
src/<ModName>.sw convention. 'swc build /tmp/foo.sw' errors no longer
point at a nonexistent src/Foo.sw.
- A qualified call into a module that IS being compiled now validates
at codegen time with a proper suggestion ('module Std has no function
mpa — did you mean Std.map?') instead of leaking a raw cc
implicit-declaration warning + linker undefined-reference — the worst
diagnostic in the compiler. Modules not compiled in are left to the
linker as before.
- string_chars(s): codepoint-aware split (list of single-codepoint
strings, UTF-8 lead-byte walk, invalid leads degrade to one byte) in
BOTH paths — string_length is documented byte-length, and agent code
slicing UTF-8 by bytes corrupts runes. length(string_chars(s)) is the
codepoint count; rejoin with Std.join(cs, ""). Conform t04 holds the
paths identical.
https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
… for blocking-loopback, CI legs The matrix (suite + conformance under SW_SCHEDULERS=1/2/4/8) found a real architecture issue: the curl-backed HTTP client builtins block their scheduler OS THREAD, not just the calling fiber. A program with an in-process server fiber that calls itself over HTTP therefore deadlocks under SW_SCHEDULERS=1 (the blocked client owns the only scheduler — the suite sat on test_http_request FOREVER), and under SW_SCHEDULERS=2 whenever client and server land on the same scheduler: a runnable fiber in a blocked scheduler's LOCAL queue is unstealable (work stealing covers only the global overflow queue, not peer local queues). The deadlock watchdog cannot see this shape — the thread is busy inside libcurl, not parked. - run_tests.sh: every test bounded by timeout 180 — a hang is a loud RUN FAIL, never a wedged suite. - The three self-loopback tests SKIP below 3 schedulers, printing the reason; full assertion counts unchanged at S>=3. - KNOWN_ISSUES: new entry with the mechanism, impact, and the Phase-3 fix direction (blocking transports -> I/O pool with fiber park/wake, like wsc_*). - CI: matrix legs for S=1 and S=8 (oversubscribed) after the default run; matrix verified green locally at S=1/2/3/4/8. (Also a self-caught language-trap data point for the eval suite: this patch's first version wrote 'a or b' — sw only has '||' — and the matrix caught the COMPILE FAIL immediately.) https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
… 58.4 -> 4.5us/rt (13x) Every message to a process on another scheduler found that scheduler PARKED on its idle condvar, so each cross-scheduler send paid a mutex+signal+futex wake and each round trip paid two — measured 58.4us per ping-pong round trip vs 0.92us same-scheduler (Round-7 audit, O4). Fix: before parking, the idle loop spin-polls local pick + (every ~64 iterations) the steal queue for a bounded window. While spinning, rq->idle stays 0, so the producer's sw_add_to_runq pays NOTHING — no lock, no signal — and the consumer picks the work up within ns. The window is per-idle-transition and env-tunable: SW_SPIN_US (default 30, 0 disables, capped 1000); the existing 0.5ms park is unchanged, as is single-scheduler behavior. Measured (4 schedulers, 50K ping-pong round trips, same host): spin 30us (default): 4.46us/rt | 100K spawn/exit cycles: 6.4us each spin 0 (old path): 58.4us/rt | 100K spawn/exit cycles: 8.7us each SW_SCHEDULERS=1: 0.92us/rt (unchanged) Verified: full battery green — suite + conformance, stress both variants, gc-slope, gc-stress, lsan-gate, phases 2-10, and the scheduler matrix at S=1/2/3/4/8. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…4/O6/O7 fixed Phase 2.1 (LSan gate) and 2.2 (scheduler matrix) marked done with their findings inline: the bidirectional-proof requirement, the -O1-elides-unused-malloc canary lesson, the blocking-builtin scheduler deadlock, and the spin-before-park numbers (58.4 -> 4.5us/rt). Round-7 continuation table updated: O4 (wake cost), O6 (diagnostics), O7 (codepoint strings) all FIXED — remaining open: O5 startup time, O8 gate-portability docs, O2c general TCO. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/swarmrt_builtins_studio.h`:
- Around line 8489-8491: The temporary buffer "items" is allocated with raw
malloc in the body of string_chars/1 (cap, items, sw_val_t **) and can be
NULL-dereferenced on OOM; defer or move this allocation onto the checked-OOM
path: do not call raw malloc inline — instead either (a) allocate using the
project's checked allocator API (e.g. the same checked/aborting allocator used
elsewhere) when you know allocation must succeed, or (b) delay allocation until
after the function has performed its OOM/VMA checks and then call a checked
allocation routine (or check the malloc return and call the existing sw OOM
abort helper). Update the code around the "cap"/"items" use so items is only
allocated via the checked allocator (or null-checked and aborts with the
diagnostic) before any write occurs.
- Around line 8470-8495: The code trusts the UTF-8 lead byte in _sw_utf8_seq_len
and then blindly consumes sl bytes in _builtin_string_chars, so malformed leads
or bad continuation bytes get treated as multi-byte codepoints; update the logic
in _builtin_string_chars to validate each continuation byte (each must satisfy
(byte & 0xC0) == 0x80) before accepting sl, and also detect and reject
overlong/illegal sequences (e.g., 0xC0/0xC1 leads, surrogates, codepoints >
0x10FFFF, and invalid lead ranges 0xF5-0xFF) so that any invalid or truncated
sequence falls back to treating the first byte as a single 1-byte character; you
can still use _sw_utf8_seq_len to propose sl but perform the full validation
against s, len, i and adjust sl to 1 on failure before creating the char entry
in items.
In `@tests/gc/lsan_lifecycle.sw`:
- Around line 43-45: The test currently schedules the one-shot timer with
delay(0, fn() { cap }) but then uses sleep(1) which can pass without the
callback running; change the test to explicitly wait for the callback to run
instead of sleeping — e.g., have the callback signal a synchronization primitive
or set a boolean/flag that the test blocks on and asserts before proceeding
(reference the delay(...) call, the fn() { cap } callback, and the current
sleep(...)/PROBE_OK probe), ensuring the probe only advances once the
timer-fired signal is observed to preserve fired-timer leak coverage.
In `@tests/sw/run_tests.sh`:
- Around line 57-62: Replace the direct non-portable invocation `timeout 180
"$bin" >"$log" 2>&1` with the existing cross-platform timeout wrapper used
elsewhere in the script (use a helper such as a `timeout_wrapper` or
`run_with_alarm` that tries GNU `timeout` first and falls back to a Perl `alarm`
implementation), so that running "$bin" with a 180s bound logs to "$log" and
behaves correctly on macOS and Linux; update the `if` condition to call that
wrapper instead of `timeout 180 "$bin"` and ensure the wrapper preserves exit
codes and redirects stdout/stderr the same way.
In `@tests/sw/test_http_headers.sw`:
- Around line 64-69: The comment in the self-loopback test block is inconsistent
with the skip condition: the code skips schedulers "1" and "2" but the comment
says "Needs >=2 schedulers"; change the comment text to match the actual
condition (e.g., "Needs >=3 schedulers") in tests/sw/test_http_headers.sw and
make the identical edits in the copied blocks in
tests/sw/test_http_post_stream.sw and tests/sw/test_http_request.sw so the
printed message and the skip logic are consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 62405b32-6202-44fb-b4f5-30fc3d0288ca
📒 Files selected for processing (19)
.github/workflows/linux-quickstart.ymlMakefiledocs/PRODUCTION_ROADMAP.mddocs/SW_LANGUAGE.mddocs/notes/KNOWN_ISSUES.mddocs/notes/REVIEW_FABLE_2026-06.mdsrc/swarmrt_builtins_studio.hsrc/swarmrt_codegen.csrc/swarmrt_codegen.hsrc/swarmrt_lang.csrc/swarmrt_native.csrc/swc.ctests/gc/lsan.supptests/gc/lsan_lifecycle.swtests/sw/conform/t04_strings.swtests/sw/run_tests.shtests/sw/test_http_headers.swtests/sw/test_http_post_stream.swtests/sw/test_http_request.sw
✅ Files skipped from review due to trivial changes (5)
- tests/gc/lsan.supp
- docs/PRODUCTION_ROADMAP.md
- docs/SW_LANGUAGE.md
- docs/notes/KNOWN_ISSUES.md
- docs/notes/REVIEW_FABLE_2026-06.md
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/sw/conform/t04_strings.sw
- src/swarmrt_lang.c
| static int _sw_utf8_seq_len(unsigned char c) { | ||
| if (c < 0x80) return 1; | ||
| if ((c & 0xE0) == 0xC0) return 2; | ||
| if ((c & 0xF0) == 0xE0) return 3; | ||
| if ((c & 0xF8) == 0xF0) return 4; | ||
| return 1; | ||
| } | ||
|
|
||
| /* string_chars(s) → list of single-CODEPOINT strings. The codepoint-aware | ||
| * complement to byte-oriented string_length/string_sub: slicing UTF-8 text | ||
| * by bytes corrupts multi-byte runes (Round-7 audit O7), and agent code | ||
| * handles non-ASCII text constantly. length(string_chars(s)) is the | ||
| * codepoint length; reassemble slices with Std.join(chars, ""). Walks raw | ||
| * UTF-8 sequence lengths (no grapheme clustering — combining marks stay | ||
| * separate codepoints, documented). */ | ||
| static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) { | ||
| if (n < 1 || !a[0] || a[0]->type != SW_VAL_STRING) return sw_val_list(NULL, 0); | ||
| const char *s = a[0]->v.str ? a[0]->v.str : ""; | ||
| size_t len = strlen(s); | ||
| int cap = (int)len + 1; | ||
| sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap); | ||
| int cnt = 0; | ||
| size_t i = 0; | ||
| while (i < len) { | ||
| int sl = _sw_utf8_seq_len((unsigned char)s[i]); | ||
| if (i + (size_t)sl > len) sl = 1; /* truncated tail sequence */ |
There was a problem hiding this comment.
Validate UTF-8 continuation bytes before consuming a full sequence.
Line 8494 only trusts the lead byte. Inputs with invalid leads (0xC0/0xC1, 0xF5-0xFF) or bad continuations (for example "\xE2A") are still consumed as multi-byte chars, so string_chars/1 mis-splits malformed strings even though the helper comment says malformed bytes should degrade to a single char.
Suggested fix
static int _sw_utf8_seq_len(unsigned char c) {
if (c < 0x80) return 1;
- if ((c & 0xE0) == 0xC0) return 2;
- if ((c & 0xF0) == 0xE0) return 3;
- if ((c & 0xF8) == 0xF0) return 4;
+ if (c >= 0xC2 && c <= 0xDF) return 2;
+ if (c >= 0xE0 && c <= 0xEF) return 3;
+ if (c >= 0xF0 && c <= 0xF4) return 4;
return 1;
}
+
+static int _sw_utf8_is_cont(unsigned char c) {
+ return (c & 0xC0) == 0x80;
+}
static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) {
@@
while (i < len) {
int sl = _sw_utf8_seq_len((unsigned char)s[i]);
if (i + (size_t)sl > len) sl = 1; /* truncated tail sequence */
+ for (int j = 1; j < sl; j++) {
+ if (!_sw_utf8_is_cont((unsigned char)s[i + (size_t)j])) {
+ sl = 1;
+ break;
+ }
+ }
char buf[8];📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| static int _sw_utf8_seq_len(unsigned char c) { | |
| if (c < 0x80) return 1; | |
| if ((c & 0xE0) == 0xC0) return 2; | |
| if ((c & 0xF0) == 0xE0) return 3; | |
| if ((c & 0xF8) == 0xF0) return 4; | |
| return 1; | |
| } | |
| /* string_chars(s) → list of single-CODEPOINT strings. The codepoint-aware | |
| * complement to byte-oriented string_length/string_sub: slicing UTF-8 text | |
| * by bytes corrupts multi-byte runes (Round-7 audit O7), and agent code | |
| * handles non-ASCII text constantly. length(string_chars(s)) is the | |
| * codepoint length; reassemble slices with Std.join(chars, ""). Walks raw | |
| * UTF-8 sequence lengths (no grapheme clustering — combining marks stay | |
| * separate codepoints, documented). */ | |
| static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) { | |
| if (n < 1 || !a[0] || a[0]->type != SW_VAL_STRING) return sw_val_list(NULL, 0); | |
| const char *s = a[0]->v.str ? a[0]->v.str : ""; | |
| size_t len = strlen(s); | |
| int cap = (int)len + 1; | |
| sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap); | |
| int cnt = 0; | |
| size_t i = 0; | |
| while (i < len) { | |
| int sl = _sw_utf8_seq_len((unsigned char)s[i]); | |
| if (i + (size_t)sl > len) sl = 1; /* truncated tail sequence */ | |
| static int _sw_utf8_seq_len(unsigned char c) { | |
| if (c < 0x80) return 1; | |
| if (c >= 0xC2 && c <= 0xDF) return 2; | |
| if (c >= 0xE0 && c <= 0xEF) return 3; | |
| if (c >= 0xF0 && c <= 0xF4) return 4; | |
| return 1; | |
| } | |
| static int _sw_utf8_is_cont(unsigned char c) { | |
| return (c & 0xC0) == 0x80; | |
| } | |
| /* string_chars(s) → list of single-CODEPOINT strings. The codepoint-aware | |
| * complement to byte-oriented string_length/string_sub: slicing UTF-8 text | |
| * by bytes corrupts multi-byte runes (Round-7 audit O7), and agent code | |
| * handles non-ASCII text constantly. length(string_chars(s)) is the | |
| * codepoint length; reassemble slices with Std.join(chars, ""). Walks raw | |
| * UTF-8 sequence lengths (no grapheme clustering — combining marks stay | |
| * separate codepoints, documented). */ | |
| static sw_val_t *_builtin_string_chars(sw_val_t **a, int n) { | |
| if (n < 1 || !a[0] || a[0]->type != SW_VAL_STRING) return sw_val_list(NULL, 0); | |
| const char *s = a[0]->v.str ? a[0]->v.str : ""; | |
| size_t len = strlen(s); | |
| int cap = (int)len + 1; | |
| sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap); | |
| int cnt = 0; | |
| size_t i = 0; | |
| while (i < len) { | |
| int sl = _sw_utf8_seq_len((unsigned char)s[i]); | |
| if (i + (size_t)sl > len) sl = 1; /* truncated tail sequence */ | |
| for (int j = 1; j < sl; j++) { | |
| if (!_sw_utf8_is_cont((unsigned char)s[i + (size_t)j])) { | |
| sl = 1; | |
| break; | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/swarmrt_builtins_studio.h` around lines 8470 - 8495, The code trusts the
UTF-8 lead byte in _sw_utf8_seq_len and then blindly consumes sl bytes in
_builtin_string_chars, so malformed leads or bad continuation bytes get treated
as multi-byte codepoints; update the logic in _builtin_string_chars to validate
each continuation byte (each must satisfy (byte & 0xC0) == 0x80) before
accepting sl, and also detect and reject overlong/illegal sequences (e.g.,
0xC0/0xC1 leads, surrogates, codepoints > 0x10FFFF, and invalid lead ranges
0xF5-0xFF) so that any invalid or truncated sequence falls back to treating the
first byte as a single 1-byte character; you can still use _sw_utf8_seq_len to
propose sl but perform the full validation against s, len, i and adjust sl to 1
on failure before creating the char entry in items.
| int cap = (int)len + 1; | ||
| sw_val_t **items = (sw_val_t **)malloc(sizeof(sw_val_t *) * cap); | ||
| int cnt = 0; |
There was a problem hiding this comment.
Keep this temporary buffer on the checked OOM path.
Line 8490 allocates items with raw malloc() and Line 8499 writes through it immediately. Under the same OOM/VMA-exhaustion conditions this PR is hardening, string_chars/1 still falls back to a bare NULL-deref instead of the new diagnostic abort path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/swarmrt_builtins_studio.h` around lines 8489 - 8491, The temporary buffer
"items" is allocated with raw malloc in the body of string_chars/1 (cap, items,
sw_val_t **) and can be NULL-dereferenced on OOM; defer or move this allocation
onto the checked-OOM path: do not call raw malloc inline — instead either (a)
allocate using the project's checked allocator API (e.g. the same
checked/aborting allocator used elsewhere) when you know allocation must
succeed, or (b) delay allocation until after the function has performed its
OOM/VMA checks and then call a checked allocation routine (or check the malloc
return and call the existing sw OOM abort helper). Update the code around the
"cap"/"items" use so items is only allocated via the checked allocator (or
null-checked and aborts with the diagnostic) before any write occurs.
| delay(0, fn() { cap }) | ||
| sleep(1) | ||
|
|
There was a problem hiding this comment.
Assert timer fire explicitly instead of relying on fixed sleep.
Line 43 schedules the one-shot delay, but Line 44 only sleeps; this does not guarantee the callback actually ran. If it doesn’t fire, the probe can still reach PROBE_OK, weakening fired-timer leak coverage.
Suggested fix
fun one_round(cap) {
# one-shot delay: fires, sends, closure must be freed after firing
- delay(0, fn() { cap })
- sleep(1)
+ parent = self()
+ delay(0, fn() { send(parent, 'delay_fired') ; cap })
+ receive { 'delay_fired' -> 0 }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/gc/lsan_lifecycle.sw` around lines 43 - 45, The test currently
schedules the one-shot timer with delay(0, fn() { cap }) but then uses sleep(1)
which can pass without the callback running; change the test to explicitly wait
for the callback to run instead of sleeping — e.g., have the callback signal a
synchronization primitive or set a boolean/flag that the test blocks on and
asserts before proceeding (reference the delay(...) call, the fn() { cap }
callback, and the current sleep(...)/PROBE_OK probe), ensuring the probe only
advances once the timer-fired signal is observed to preserve fired-timer leak
coverage.
| # Run. Bounded: a hung test must FAIL the suite loudly, not wedge it | ||
| # forever — the Phase-2.2 scheduler matrix found test_http_request | ||
| # deadlocking under SW_SCHEDULERS=1 (blocking curl client starves the | ||
| # in-process server fiber on the only scheduler) and the suite sat on | ||
| # it indefinitely. 180s is ~20x the slowest healthy test. | ||
| if timeout 180 "$bin" >"$log" 2>&1; then |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -eu
echo "Platform: $(uname -s)"
echo "timeout: $(command -v timeout || echo MISSING)"
echo "gtimeout: $(command -v gtimeout || echo MISSING)"
nl -ba tests/sw/run_tests.sh | sed -n '54,68p'Repository: skyblanket/swarmrt
Length of output: 173
🏁 Script executed:
#!/bin/bash
set -eu
f="tests/sw/run_tests.sh"
echo "== timeout usage =="
rg -n --hidden --no-mmap '\btimeout\b' "$f" || true
echo
echo "== gtimeout usage =="
rg -n --hidden --no-mmap '\bgtimeout\b' "$f" || true
echo
echo "== perl alarm fallback =="
rg -n --hidden --no-mmap 'alarm\s+\$t|perl.*alarm|perl -e' "$f" || true
echo
echo "== context around lines 50-75 =="
nl -ba "$f" | sed -n '45,85p'Repository: skyblanket/swarmrt
Length of output: 715
🏁 Script executed:
#!/bin/bash
set -eu
f="tests/sw/run_tests.sh"
echo "== timeout/gtimeout usage =="
rg -n --no-mmap '\btimeout\b|\bgtimeout\b' "$f" || true
echo
echo "== lines 45-90 (no nl available) =="
sed -n '45,90p' "$f"
echo
echo "== locate existing perl alarm usage block(s) =="
rg -n --no-mmap 'perl -e.*alarm|alarm\s+\$t|macOS has no `timeout`' "$f" || trueRepository: skyblanket/swarmrt
Length of output: 2458
🏁 Script executed:
#!/bin/bash
set -eu
f="tests/sw/run_tests.sh"
echo "== function definitions containing timeout/perl/alarm =="
rg -n --no-mmap '^\s*(run_with_timeout|timeout|alarm|perl -e|gtimeout)\b|^\s*[a-zA-Z0-9_]+\s*\(\)\s*\{' "$f" || true
echo
echo "== search for 'run_with_' wrappers =="
rg -n --no-mmap 'run_with_|\btimeout/|fallback|wrapper' "$f" || true
echo
echo "== print surrounding perl-alarm blocks (150-235) =="
sed -n '130,240p' "$f"Repository: skyblanket/swarmrt
Length of output: 5432
🏁 Script executed:
#!/bin/bash
set -eu
f="tests/sw/run_tests.sh"
echo "== top of file =="
sed -n '1,120p' "$f"Repository: skyblanket/swarmrt
Length of output: 4334
Replace the non-portable timeout call in the main run loop with a timeout/Perl fallback wrapper.
tests/sw/run_tests.sh invokes timeout 180 "$bin" directly, with no macOS fallback, while other parts of the script already use Perl alarm explicitly because macOS lacks timeout. On platforms without GNU timeout, this will fail the suite (e.g., exit 127) instead of properly bounding hangs.
Proposed fix
@@
set -u
SWARMRT_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
SWC="$SWARMRT_ROOT/bin/swc"
TESTS_DIR="$SWARMRT_ROOT/tests/sw"
BUILD_DIR="$SWARMRT_ROOT/tests/sw/.build"
+
+run_with_timeout() {
+ secs="$1"; shift
+ if command -v timeout >/dev/null 2>&1; then
+ timeout "$secs" "$@"
+ else
+ perl -e 'my $t=shift `@ARGV`; alarm $t; exec `@ARGV` or die' "$secs" "$@"
+ fi
+}
@@
- if timeout 180 "$bin" >"$log" 2>&1; then
+ if run_with_timeout 180 "$bin" >"$log" 2>&1; then📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Run. Bounded: a hung test must FAIL the suite loudly, not wedge it | |
| # forever — the Phase-2.2 scheduler matrix found test_http_request | |
| # deadlocking under SW_SCHEDULERS=1 (blocking curl client starves the | |
| # in-process server fiber on the only scheduler) and the suite sat on | |
| # it indefinitely. 180s is ~20x the slowest healthy test. | |
| if timeout 180 "$bin" >"$log" 2>&1; then | |
| # Run. Bounded: a hung test must FAIL the suite loudly, not wedge it | |
| # forever — the Phase-2.2 scheduler matrix found test_http_request | |
| # deadlocking under SW_SCHEDULERS=1 (blocking curl client starves the | |
| # in-process server fiber on the only scheduler) and the suite sat on | |
| # it indefinitely. 180s is ~20x the slowest healthy test. | |
| if run_with_timeout 180 "$bin" >"$log" 2>&1; then |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/sw/run_tests.sh` around lines 57 - 62, Replace the direct non-portable
invocation `timeout 180 "$bin" >"$log" 2>&1` with the existing cross-platform
timeout wrapper used elsewhere in the script (use a helper such as a
`timeout_wrapper` or `run_with_alarm` that tries GNU `timeout` first and falls
back to a Perl `alarm` implementation), so that running "$bin" with a 180s bound
logs to "$log" and behaves correctly on macOS and Linux; update the `if`
condition to call that wrapper instead of `timeout 180 "$bin"` and ensure the
wrapper preserves exit codes and redirects stdout/stderr the same way.
| # Self-loopback test: an in-process HTTP server fiber + a BLOCKING | ||
| # curl-backed client on the same runtime. The client call occupies its | ||
| # scheduler OS THREAD (not just the fiber), so with one scheduler the | ||
| # server can never run -> deadlock/timeout (found by the Phase-2.2 | ||
| # scheduler matrix). Needs >=2 schedulers until blocking builtins are | ||
| # moved off the scheduler thread (see KNOWN_ISSUES). |
There was a problem hiding this comment.
Align threshold wording with the actual skip condition.
Line 68 says Needs >=2 schedulers, but Lines 71-73 skip both "1" and "2" and print needs >=3 schedulers. Please make the comment consistent (same copied block also appears in tests/sw/test_http_post_stream.sw and tests/sw/test_http_request.sw).
Proposed wording fix
- # scheduler matrix). Needs >=2 schedulers until blocking builtins are
+ # scheduler matrix). Needs >=3 schedulers until blocking builtins are📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Self-loopback test: an in-process HTTP server fiber + a BLOCKING | |
| # curl-backed client on the same runtime. The client call occupies its | |
| # scheduler OS THREAD (not just the fiber), so with one scheduler the | |
| # server can never run -> deadlock/timeout (found by the Phase-2.2 | |
| # scheduler matrix). Needs >=2 schedulers until blocking builtins are | |
| # moved off the scheduler thread (see KNOWN_ISSUES). | |
| # Self-loopback test: an in-process HTTP server fiber + a BLOCKING | |
| # curl-backed client on the same runtime. The client call occupies its | |
| # scheduler OS THREAD (not just the fiber), so with one scheduler the | |
| # server can never run -> deadlock/timeout (found by the Phase-2.2 | |
| # scheduler matrix). Needs >=3 schedulers until blocking builtins are | |
| # moved off the scheduler thread (see KNOWN_ISSUES). |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/sw/test_http_headers.sw` around lines 64 - 69, The comment in the
self-loopback test block is inconsistent with the skip condition: the code skips
schedulers "1" and "2" but the comment says "Needs >=2 schedulers"; change the
comment text to match the actual condition (e.g., "Needs >=3 schedulers") in
tests/sw/test_http_headers.sw and make the identical edits in the copied blocks
in tests/sw/test_http_post_stream.sw and tests/sw/test_http_request.sw so the
printed message and the skip logic are consistent.
…45ms -> ~40ms (6ms small-arena) The 'startup time' mystery (Round-7 O5: hello measured 110-145ms wall vs a 3ms C baseline) was mostly SHUTDOWN: the deadlock watchdog slept its 5s interval in 100ms nanosleep chunks polling the stop flag, and sw_shutdown joins it — so EVERY binary paid avg ~50ms (worst 100ms) at exit. strace showed the ~99ms gap on the watchdog thread-stack munmap. Now the watchdog waits on a condvar with the interval as deadline; shutdown signals it under the lock and the join returns immediately. Also removes the 10/s idle wakeups the chunked sleep cost. Measured (Linux x86_64, hello binary): 110-145ms -> 38-42ms wall at the default 100K-slot arena (the remaining ~35ms is the documented arena init), and ~5-6ms with SW_MAX_PROCS=1024 — the README's <10ms claim is now reachable with the documented knob. Bundled in this commit (same file, same verification battery): scheduler idle-path hardening + an honest P1 finding: - publish-idle-then-repoll guard before parking (standard sleeper protocol): a producer enqueueing between the consumer's last pick and the idle=1 store skipped the signal and the work ate the full 0.5ms park timeout. - O4 spin-before-park lands OPT-IN, DEFAULT OFF (SW_SPIN_US, 0 disables): it measures 58.4 -> 4.5us per cross-scheduler ping-pong round trip (13x), but GATES A LATENT DEADLOCK — depth-1 ping-pong wedges ~15% of runs with spin on (0/60 off, 9/60 on): both fibers parked WAITING, all runqs empty, zero enqueues forever. The repoll guard does not close it, so the loss is upstream of the park (runq push vs spinning pick, or the receive waiting-flag handoff). Filed as P1 in KNOWN_ISSUES with the reproducer; root-cause is the Phase 2.3/2.4 TSan + atomics work, then the default flips on. - SW_SCHED_TRACE=1: 1Hz scheduler counter telemetry on stderr (enq / sig / idle_seen / park / park_sig / park_to / spin_hit / guard_hit) — the instrument that proved the wedge was a total enqueue stall (enq=0, park_to~6600/s) rather than slow progress. Verified: suite + conformance, storms (both scheduler modes), gc-slope, gc-stress, lsan-gate, phases 2-10, spin_wedge_hunt 25/25 with defaults, startup numbers above. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
tests/stress/spin_wedge_hunt.sh: regression gate for the spin-gated scheduler deadlock (KNOWN_ISSUES P1) — N runs of the depth-1 ping-pong probe under SW_SPIN_US, first wedge exits 1. Wire into the stress gate with spin on once the race is fixed; passes 25/25 with today's defaults (spin off). scripts/gc_slope.sh: every probe run is bounded by timeout 600 — the wedge ground the message probe from ~100ms to 25+ minutes and the gate just sat on it; now a hang lands in the existing FAIL:exit path. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…up numbers measured KNOWN_ISSUES: new P1 entry for the spin-gated scheduler deadlock with the full evidence trail (incidence, telemetry signature, reproducer, fix path). Roadmap 2.2 amended: spin ships opt-in/default-off until 2.3/2.4 close the race. README: SW_MAX_PROCS row now carries measured boot numbers (~6ms small-arena / ~40ms default on Linux) after the watchdog shutdown fix. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/gc_slope.sh`:
- Around line 40-41: The script currently sets TO="" when the `timeout` command
is absent, which silently removes the runtime bound; instead, update the `TO`
logic so the script fails fast or falls back to GNU coreutils `gtimeout`: detect
`timeout` first, then check for `gtimeout` and set TO accordingly (e.g.,
`TO="timeout 600"` or `TO="gtimeout 600"`), and if neither exists, exit with a
clear error rather than leaving TO empty; change the code that assigns `TO` (the
lines referencing the `TO` variable and the `command -v timeout` check) to
implement this fallback-or-fail behavior.
In `@tests/stress/spin_wedge_hunt.sh`:
- Around line 36-38: The current check treats any non-zero exit from the probe
as a timeout wedge; modify the block that runs SW_QUIET=1 timeout "$TIMEOUT_S"
"$BIN" 4000 to capture the actual exit status (e.g., save $? into a variable)
and distinguish timeout exit code 124 from other failures, then log different
messages: for 124 print the existing "WEDGE … exceeded ${TIMEOUT_S}s" message,
for other non-zero statuses print a failure/crash message including the numeric
exit code and context (e.g., that the binary crashed or exited abnormally).
Update the messages in the spin_wedge_hunt reporting lines so they reference the
captured status instead of always saying "exceeded timeout".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0af5a114-c45c-461c-8bcc-8d983a158d67
📒 Files selected for processing (6)
README.mddocs/PRODUCTION_ROADMAP.mddocs/notes/KNOWN_ISSUES.mdscripts/gc_slope.shsrc/swarmrt_native.ctests/stress/spin_wedge_hunt.sh
✅ Files skipped from review due to trivial changes (2)
- docs/PRODUCTION_ROADMAP.md
- README.md
| local TO="timeout 600" | ||
| command -v timeout >/dev/null 2>&1 || TO="" |
There was a problem hiding this comment.
Fail fast (or fallback to gtimeout) instead of silently disabling bounds.
At Line 41, missing timeout sets TO="", which reintroduces unbounded hangs in the GC slope gate. That breaks the bounded-run contract this script is enforcing.
Proposed fix
- local TO="timeout 600"
- command -v timeout >/dev/null 2>&1 || TO=""
+ local -a TO
+ if command -v timeout >/dev/null 2>&1; then
+ TO=(timeout 600)
+ elif command -v gtimeout >/dev/null 2>&1; then
+ TO=(gtimeout 600)
+ else
+ echo "FAIL:no-timeout-tool"
+ return
+ fi
@@
- gnu) out=$(SW_RUNTIME_QUIET=1 $TO /usr/bin/time -v "$bin" "$count" 2>&1); rc=$?
+ gnu) out=$(SW_RUNTIME_QUIET=1 "${TO[@]}" /usr/bin/time -v "$bin" "$count" 2>&1); rc=$?
@@
- bsd) out=$(SW_RUNTIME_QUIET=1 $TO /usr/bin/time -l "$bin" "$count" 2>&1); rc=$?
+ bsd) out=$(SW_RUNTIME_QUIET=1 "${TO[@]}" /usr/bin/time -l "$bin" "$count" 2>&1); rc=$?
@@
- gtime) out=$(SW_RUNTIME_QUIET=1 $TO gtime -v "$bin" "$count" 2>&1); rc=$?
+ gtime) out=$(SW_RUNTIME_QUIET=1 "${TO[@]}" gtime -v "$bin" "$count" 2>&1); rc=$?📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| local TO="timeout 600" | |
| command -v timeout >/dev/null 2>&1 || TO="" | |
| local -a TO | |
| if command -v timeout >/dev/null 2>&1; then | |
| TO=(timeout 600) | |
| elif command -v gtimeout >/dev/null 2>&1; then | |
| TO=(gtimeout 600) | |
| else | |
| echo "FAIL:no-timeout-tool" | |
| return | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/gc_slope.sh` around lines 40 - 41, The script currently sets TO=""
when the `timeout` command is absent, which silently removes the runtime bound;
instead, update the `TO` logic so the script fails fast or falls back to GNU
coreutils `gtimeout`: detect `timeout` first, then check for `gtimeout` and set
TO accordingly (e.g., `TO="timeout 600"` or `TO="gtimeout 600"`), and if neither
exists, exit with a clear error rather than leaving TO empty; change the code
that assigns `TO` (the lines referencing the `TO` variable and the `command -v
timeout` check) to implement this fallback-or-fail behavior.
| if ! SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1; then | ||
| echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)" | ||
| echo "spin_wedge_hunt: FAIL" |
There was a problem hiding this comment.
Differentiate timeout wedges from other probe failures.
At Line 36, all non-zero exits are reported as “WEDGE … exceeded timeout”. This conflates timeout (124) with ordinary failures (e.g., crash/exec error), which makes stress diagnostics misleading.
Proposed fix
-for i in $(seq 1 "$RUNS"); do
- if ! SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1; then
- echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)"
- echo "spin_wedge_hunt: FAIL"
- exit 1
- fi
+for i in $(seq 1 "$RUNS"); do
+ rc=0
+ SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1 || rc=$?
+ if [ "$rc" -eq 124 ]; then
+ echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)"
+ echo "spin_wedge_hunt: FAIL"
+ exit 1
+ elif [ "$rc" -ne 0 ]; then
+ echo "probe failed on run $i (exit=$rc)"
+ echo "spin_wedge_hunt: FAIL"
+ exit 2
+ fi
done📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ! SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1; then | |
| echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)" | |
| echo "spin_wedge_hunt: FAIL" | |
| for i in $(seq 1 "$RUNS"); do | |
| rc=0 | |
| SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 >/dev/null 2>&1 || rc=$? | |
| if [ "$rc" -eq 124 ]; then | |
| echo "WEDGE on run $i (exceeded ${TIMEOUT_S}s — healthy is ~0.1-0.5s)" | |
| echo "spin_wedge_hunt: FAIL" | |
| exit 1 | |
| elif [ "$rc" -ne 0 ]; then | |
| echo "probe failed on run $i (exit=$rc)" | |
| echo "spin_wedge_hunt: FAIL" | |
| exit 2 | |
| fi | |
| done |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/stress/spin_wedge_hunt.sh` around lines 36 - 38, The current check
treats any non-zero exit from the probe as a timeout wedge; modify the block
that runs SW_QUIET=1 timeout "$TIMEOUT_S" "$BIN" 4000 to capture the actual exit
status (e.g., save $? into a variable) and distinguish timeout exit code 124
from other failures, then log different messages: for 124 print the existing
"WEDGE … exceeded ${TIMEOUT_S}s" message, for other non-zero statuses print a
failure/crash message including the numeric exit code and context (e.g., that
the binary crashed or exited abnormally). Update the messages in the
spin_wedge_hunt reporting lines so they reference the captured status instead of
always saying "exceeded timeout".
…ecome _Atomic; trace init ordered ThreadSanitizer on the message ping-pong + spawn storm flagged exactly three races (the feared fiber false-positive storm did not materialize — cross-thread fiber migration synchronizes through the runq's C11 atomics, so TSan sees the happens-before edges): - sched->active (main's init readiness spin vs the scheduler thread's store), sched->should_exit and g_swarm->running (sw_shutdown's store vs every scheduler loop's read) were 'volatile int' — volatile is not synchronization in C11. Now _Atomic int: sw_scheduler/sw_swarm are not asm-offset-pinned (only sw_process is), and these are cold flags. - sched_trace_maybe_start ran AFTER scheduler threads were created, so the g_sched_trace flag write raced their reads on every enqueue — now initialized before pthread_create (ordered by thread creation). Verified: full battery green (suite + conformance, stress both variants, gc-slope, gc-stress, lsan-gate, phases 2-10) and 'make tsan-gate' reports no unsuppressed races on msg, msg+spin, and the full 80k storm. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…under TSan) Builds the depth-1 message probe and the spawn storm with -fsanitize=thread and fails on any unsuppressed race. Suppressions (tests/stress/tsan.supp) cover ONLY the documented warn-only lock-free watchdog scanner (watchdog_thread_fn + sw_io_active_port_count) — by design it reads slab fields unsynchronized and tolerates stale values; making sw_process.state _Atomic for a scanner-clean runtime is the Phase-2.4 decision (state is written on every context switch). Note for the P1 spin-gated deadlock: 40/40 wedge-free under TSan with spin on — the interleave is timing-suppressed and shows NO C11 race, confirming it is a protocol bug (waiting/idle flag handshake), not a data race. tests/stress/spin_wedge_hunt.sh remains its gate. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…esults + worklist run_stress.sh prints vm.max_map_count / nproc / ulimits on Linux — the gate's verdict used to depend SILENTLY on a host sysctl (Round-7 O8). Phase 2.4 audit (TSan over msg, storm, and phase 2/4/5 binaries): mailbox/runq clean; control flags fixed in 2.3; remaining worklist recorded in the roadmap — proc->state relaxed-atomicization (~50 mechanical sites, zero-cost), the timer-list unlocked reader, the registry name-lifetime question, and two racy test-code counters. sw_stats joins the watchdog in tsan.supp as a documented warn-only printer. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…eive waiting-flag handshake; seq_cst; spin default ON
The ~15%-incidence total deadlock under depth-1 cross-scheduler
ping-pong (both fibers parked WAITING, zero enqueues forever) was ONE
LINE PAIR in sw_receive_any's fast path:
atomic_store_explicit(&waiting, 1, memory_order_release);
sig = atomic_load_explicit(&sig_head, memory_order_acquire);
Dekker's pattern: store one variable, load a DIFFERENT one, with the
sender doing the mirror image (push sig_head, then exchange waiting).
Release->acquire across different objects orders NOTHING — the
receiver's waiting=1 sat in its store buffer past the sig_head load
(x86 StoreLoad reordering), so the receiver saw a pre-push sig_head and
parked, while the sender's wake-exchange read waiting==0 from memory
and skipped the enqueue. Both sides lost; the message sat in the
mailbox forever. The three sibling receive paths survived on x86 only
because mailbox_drain's locked xchg happens to be a full fence — which
arm64 (the macOS daily driver) does NOT promise, so they were latent
there too. TSan was silent throughout: every access was atomic; the
bug was pure ordering.
Fix: every waiting-flag participant is seq_cst — all four
about-to-sleep stores, the sig_head probe load, and mailbox_wake's
exchange. Total-order correctness on every architecture; the cost is an
xchg/fence on the COLD about-to-sleep and wake paths only.
Diagnosis tooling that found it (stays in the runtime, off by default):
- SW_SCHED_TRACE=1: 1Hz counters + a stall detector that fires a wedge
AUTOPSY — every live process's state/waiting/mailbox pointers + runq
internals — after 3 flatlined seconds. The corpse read: parked,
waiting=1, sig_head!=NULL. That pair is impossible under a correct
protocol and pointed straight at the handshake.
- SW_SCHED_TRACE=2: + a lock-free protocol event ring (push / wake /
wset / drain / park / enq / pick) dumped with the autopsy.
Verification: 0/300 wedge-hunt runs at the fix commit (~45 expected at
the pre-fix incidence), then the full battery with spin DEFAULT ON
(SW_SPIN_US=30): tsan-gate, suite + conformance, stress both variants,
gc-slope, gc-stress, lsan-gate, phases 2-10, scheduler matrix
S=1/2/4/8, spin_wedge_hunt 40/40. Cross-scheduler ping-pong:
58.4us/rt at the audit baseline -> 3.0us/rt now (19x).
https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/swarmrt_native.c (4)
924-1004:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTie the scheduler-trace thread to runtime shutdown.
pthread_detach()letssched_trace_fn()outliveg_swarm, but it keeps dereferencingg_swarm, the slab, and the scheduler array on every tick and during autopsy dumps. WithSW_SCHED_TRACEenabled,sw_shutdown()can therefore free those objects while the trace thread is still using them.Also applies to: 1667-1770
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_native.c` around lines 924 - 1004, The sched-trace thread is detached so it can outlive and dereference freed runtime state; fix by making it joinable and joined at shutdown: in sched_trace_maybe_start(), replace the pthread_detach() call with storing the pthread_t returned by pthread_create into a global/thread-static (e.g. add pthread_t g_sched_trace_thread and a bool g_sched_trace_started) and set that flag when creation succeeds; in sw_shutdown() (or the runtime teardown function) check g_sched_trace_started and call pthread_join(g_sched_trace_thread, NULL) before freeing g_swarm; keep sched_trace_fn using g_swarm->running as its exit condition so the join will complete cleanly.
1453-1474:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDerive the overflow window from the real guard-page size.
The stack allocator uses the runtime page size, but this detector hardcodes
8192. On systems with larger pages, many real guard-page faults will miss this branch and lose the new “stack overflow” diagnostic.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_native.c` around lines 1453 - 1474, The overflow detector hardcodes 8192 bytes for "guard page + one page slack" which breaks on systems with different page sizes; replace that constant with the runtime page size (e.g. call sysconf(_SC_PAGESIZE) or getpagesize()) and compute guard_bot as guard_top - (page_size * 2) (or whatever count of pages the allocator actually uses for guard+slack). Update the SIGSEGV handler block that reads cur->stack_mem and computes guard_top/guard_bot to use the computed page_size so faults in the real guard page range are detected correctly.
107-118:⚠️ Potential issue | 🟠 MajorMake
g_watchdog_stoprace-free (atomic or fully mutex-guarded)
watchdog_thread_fn()readsg_watchdog_stopoutsideg_watchdog_lock(while (!g_watchdog_stop)at ~151 andif (g_watchdog_stop) breakat ~165), whilesw_shutdown()writes it under the lock (~1678-1681). With a non-atomicvolatile int, this is still a data race/UB; useatomic_int/C11 atomics or ensure every access is protected byg_watchdog_lock(including the outer stop checks).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_native.c` around lines 107 - 118, The watcher-stop variable g_watchdog_stop is accessed without synchronization in watchdog_thread_fn (reads in the while/if) while sw_shutdown writes it under g_watchdog_lock, causing a data race; fix by making g_watchdog_stop an atomic_int (C11) and replace its reads/writes in watchdog_thread_fn and sw_shutdown with atomic loads/stores using appropriate memory orders (e.g., atomic_load_explicit(..., memory_order_acquire) and atomic_store_explicit(..., memory_order_release)), or alternatively ensure every access (including the outer while(!g_watchdog_stop) check and the in-loop if) is performed while holding g_watchdog_lock and use pthread_cond_signal in sw_shutdown — reference g_watchdog_stop, watchdog_thread_fn, sw_shutdown, and g_watchdog_lock when making the change.
1255-1265:⚠️ Potential issue | 🟡 MinorFix data race on
spin_itersinscheduler_loop()
scheduler_loop()containsstatic int spin_iters = -1;, and it’s executed by every scheduler thread (pthread_create(..., scheduler_main, ...)->scheduler_loop). The first-timespin_itersinitialization (if (spin_iters < 0) { ... }) is unsynchronized, so concurrent threads can race on that write (UB / TSAN noise). Guard the initialization (e.g.,pthread_once) or move it to a single pre-thread init step (or use an atomic once pattern).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/swarmrt_native.c` around lines 1255 - 1265, scheduler_loop currently uses a function-local static spin_iters initialized lazily (static int spin_iters = -1) which races when multiple scheduler threads run scheduler_loop via scheduler_main; fix by performing one-time initialization of spin_iters before threads start or by guarding the lazy init with a pthread_once (or C11 atomic once) helper so only one thread computes and stores the value derived from getenv("SW_SPIN_US") and sets spin_iters; update references to spin_iters in scheduler_loop to rely on the safely-initialized value and remove the unsynchronized if (spin_iters < 0) block or replace it with a pthread_once-invoked initializer function.
♻️ Duplicate comments (1)
docs/notes/KNOWN_ISSUES.md (1)
58-64:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUnresolved framing inconsistency from prior review.
This entry still contains crash language ("raw SIGSEGV," Line 62) that conflicts with the "Open" section's statement that issues are "genuine limitations, not crashes" (Line 8-9). A previous review flagged this inconsistency and it remains unaddressed.
📝 Suggested alignment
Either:
- Move this entry to a separate "Crash / Data-loss" subsection within "Open"
- Reword to emphasize the limitation (no mutual TCO) and note the crash as an implementation detail: "...consume a C stack frame per hop and overflow the 128KB fiber stack at depth ~10⁴–10⁵. The overflow currently manifests as undefined behavior (see REVIEW_FABLE_2026-06.md, O2, regarding sigaltstack handling)."
This preserves the technical detail while aligning severity signals with the section's framing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/notes/KNOWN_ISSUES.md` around lines 58 - 64, The "Mutual tail recursion is not TCO'd; deep chains overflow silently" entry contains crash wording ("raw SIGSEGV") that conflicts with the "Open" section's framing; either move this entry into a new "Crash / Data-loss" subsection under "Open" or reword the paragraph to present it as a limitation and implementation detail (e.g., state that mutual tail-calls consume a C stack frame per hop and can overflow the 128KB fiber stack at depth ~10⁴–10⁵, and note that the overflow currently manifests as undefined behavior documented in REVIEW_FABLE_2026-06.md with sigaltstack details) so the severity language matches the rest of the document.
🧹 Nitpick comments (1)
Makefile (1)
510-539: ⚡ Quick winConsider adding PROBE_OK validation for robustness.
The
tsan-gatetarget checks for TSan warnings and captures exit codes, but unlikelsan-gate(Line 415), it doesn't verify that the test programs actually completed their work by checking for aPROBE_OKmarker in the output. A program could crash early or produce incomplete results without emitting TSan warnings, and the current grep-only check would miss this.🛡️ Suggested enhancement
Add PROBE_OK checks similar to lsan-gate:
`@rc`=0; \ out1=$$(SW_QUIET=1 TSAN_OPTIONS="suppressions=tests/stress/tsan.supp" \ timeout 300 $(BIN_DIR)/tsan_msg 4000 2>&1) || rc=1; \ + echo "$$out1" | grep -q "PROBE_OK" || { echo "tsan-gate: msg test incomplete"; rc=1; }; \ echo "$$out1" | grep -q "WARNING: ThreadSanitizer" && { echo "$$out1" | head -40; rc=1; }; \ out2=$$(SW_QUIET=1 SW_SPIN_US=30 TSAN_OPTIONS="suppressions=tests/stress/tsan.supp" \ timeout 300 $(BIN_DIR)/tsan_msg 4000 2>&1) || rc=1; \ + echo "$$out2" | grep -q "PROBE_OK" || { echo "tsan-gate: msg+spin test incomplete"; rc=1; }; \ echo "$$out2" | grep -q "WARNING: ThreadSanitizer" && { echo "$$out2" | head -40; rc=1; }; \ out3=$$(SW_QUIET=1 TSAN_OPTIONS="suppressions=tests/stress/tsan.supp" \ timeout 300 $(BIN_DIR)/tsan_storm 2>&1) || rc=1; \ + echo "$$out3" | grep -q "PROBE_OK" || { echo "tsan-gate: storm test incomplete"; rc=1; }; \ echo "$$out3" | grep -q "WARNING: ThreadSanitizer" && { echo "$$out3" | head -40; rc=1; }; \This assumes the test programs print
PROBE_OKon successful completion, matching the pattern used in other sanitizer gates.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 510 - 539, The tsan-gate target currently only greps for "WARNING: ThreadSanitizer" and exit codes but does not verify successful completion; update the tsan-gate recipe (target tsan-gate, the invocations producing $(BIN_DIR)/tsan_msg and $(BIN_DIR)/tsan_storm and the out1/out2/out3 checks) to also assert each program emitted the PROBE_OK marker in their captured output (out1/out2/out3) similar to lsan-gate: after running each test capture its stdout/stderr into outN, check for PROBE_OK (e.g., grep -q "PROBE_OK") and set rc=1 if missing, in addition to the existing ThreadSanitizer checks, so failures that produce no TSan warnings but didn't complete are detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/swarmrt_native.c`:
- Around line 924-1004: The sched-trace thread is detached so it can outlive and
dereference freed runtime state; fix by making it joinable and joined at
shutdown: in sched_trace_maybe_start(), replace the pthread_detach() call with
storing the pthread_t returned by pthread_create into a global/thread-static
(e.g. add pthread_t g_sched_trace_thread and a bool g_sched_trace_started) and
set that flag when creation succeeds; in sw_shutdown() (or the runtime teardown
function) check g_sched_trace_started and call
pthread_join(g_sched_trace_thread, NULL) before freeing g_swarm; keep
sched_trace_fn using g_swarm->running as its exit condition so the join will
complete cleanly.
- Around line 1453-1474: The overflow detector hardcodes 8192 bytes for "guard
page + one page slack" which breaks on systems with different page sizes;
replace that constant with the runtime page size (e.g. call
sysconf(_SC_PAGESIZE) or getpagesize()) and compute guard_bot as guard_top -
(page_size * 2) (or whatever count of pages the allocator actually uses for
guard+slack). Update the SIGSEGV handler block that reads cur->stack_mem and
computes guard_top/guard_bot to use the computed page_size so faults in the real
guard page range are detected correctly.
- Around line 107-118: The watcher-stop variable g_watchdog_stop is accessed
without synchronization in watchdog_thread_fn (reads in the while/if) while
sw_shutdown writes it under g_watchdog_lock, causing a data race; fix by making
g_watchdog_stop an atomic_int (C11) and replace its reads/writes in
watchdog_thread_fn and sw_shutdown with atomic loads/stores using appropriate
memory orders (e.g., atomic_load_explicit(..., memory_order_acquire) and
atomic_store_explicit(..., memory_order_release)), or alternatively ensure every
access (including the outer while(!g_watchdog_stop) check and the in-loop if) is
performed while holding g_watchdog_lock and use pthread_cond_signal in
sw_shutdown — reference g_watchdog_stop, watchdog_thread_fn, sw_shutdown, and
g_watchdog_lock when making the change.
- Around line 1255-1265: scheduler_loop currently uses a function-local static
spin_iters initialized lazily (static int spin_iters = -1) which races when
multiple scheduler threads run scheduler_loop via scheduler_main; fix by
performing one-time initialization of spin_iters before threads start or by
guarding the lazy init with a pthread_once (or C11 atomic once) helper so only
one thread computes and stores the value derived from getenv("SW_SPIN_US") and
sets spin_iters; update references to spin_iters in scheduler_loop to rely on
the safely-initialized value and remove the unsynchronized if (spin_iters < 0)
block or replace it with a pthread_once-invoked initializer function.
---
Duplicate comments:
In `@docs/notes/KNOWN_ISSUES.md`:
- Around line 58-64: The "Mutual tail recursion is not TCO'd; deep chains
overflow silently" entry contains crash wording ("raw SIGSEGV") that conflicts
with the "Open" section's framing; either move this entry into a new "Crash /
Data-loss" subsection under "Open" or reword the paragraph to present it as a
limitation and implementation detail (e.g., state that mutual tail-calls consume
a C stack frame per hop and can overflow the 128KB fiber stack at depth
~10⁴–10⁵, and note that the overflow currently manifests as undefined behavior
documented in REVIEW_FABLE_2026-06.md with sigaltstack details) so the severity
language matches the rest of the document.
---
Nitpick comments:
In `@Makefile`:
- Around line 510-539: The tsan-gate target currently only greps for "WARNING:
ThreadSanitizer" and exit codes but does not verify successful completion;
update the tsan-gate recipe (target tsan-gate, the invocations producing
$(BIN_DIR)/tsan_msg and $(BIN_DIR)/tsan_storm and the out1/out2/out3 checks) to
also assert each program emitted the PROBE_OK marker in their captured output
(out1/out2/out3) similar to lsan-gate: after running each test capture its
stdout/stderr into outN, check for PROBE_OK (e.g., grep -q "PROBE_OK") and set
rc=1 if missing, in addition to the existing ThreadSanitizer checks, so failures
that produce no TSan warnings but didn't complete are detected.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3c322608-9466-49f7-8581-23c226c66a0f
📒 Files selected for processing (7)
Makefiledocs/PRODUCTION_ROADMAP.mddocs/notes/KNOWN_ISSUES.mdsrc/swarmrt_native.csrc/swarmrt_native.htests/stress/run_stress.shtests/stress/tsan.supp
✅ Files skipped from review due to trivial changes (2)
- tests/stress/tsan.supp
- docs/PRODUCTION_ROADMAP.md
…n + nesting/node guards (Phase 3 fuzz) make fuzz drove a single mutated input to ~16GB and OOM-killed swc. Three parser DoS holes, all reachable from a malformed source file: 1. The OUTER receive-clause loop did not gate on !p->err (par_case's clause loop already did). A stray token in pattern position made a clause fail to parse; the cursor then stopped advancing (par_primary/par_expect short-circuit on err) while sitting on a token that is not RBRACE/AFTER/EOF, so the loop spun forever appending empty clauses — unbounded mknode + realloc. Repro: tests/fuzz/corpus/parse/receive_stray_token.sw (a stray 'D' before a receive pattern). Added . 2. Unbounded expression-nesting recursion: ((((... blew the C stack (ASAN stack-overflow in par_primary). par_primary is split into a guarded wrapper + par_primary_inner; SW_PARSE_MAX_DEPTH (128) plus the real stack-headroom probe (interp_stack_near_limit, parser frames are huge — tok_t carries an 8KB text buffer copied per frame) bound it. f-string interpolations inherit the parent's depth (a fresh sub-parser was depth-0 and recursed unbounded through nested f-strings). 3. Per-parse AST-node budget (SW_PARSE_MAX_NODES, 4M ≈ a 100k+-line program): every mknode + node_dup is counted, par_init resets, and par_primary aborts with a clean 'program too large' BEFORE its err short-circuit — the O(1) backstop for any future non-advancing or unbounded loop, regardless of where it lives. Plus the fuzz harness leaked the AST each iteration (detect_leaks=0 hid it): sw_lang_free_ast() is now public (node_free was file-static) and the harness reclaims per parse. Verified: make fuzz (parse+marshal+http, 20k mutations each) clean; the catch-driver runs 200k mutations of the corpus with a 3s/parse alarm and reports no slow parse; full battery green (suite+conformance, stress, gc-slope, gc-stress, lsan, tsan, phases 2-10, scheduler matrix S=1/8); all 28 doc programs still compile. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…at-death; TSan gate covers phase 2/5 TSan-driven shared-state hardening. All changes verified perf-neutral and TSan-clean (make tsan-gate now covers msg / msg+spin / 80k storm / phase 2 / phase 5; full battery green). Runtime: - sw_process.state is now _Atomic sw_proc_state_t. The 30 hot writes are explicit atomic_store relaxed (free codegen — plain mov on x86/arm64); reads stay implicit (free on x86, seq_cst loads are plain mov). Same size/alignment, so the asm-pinned ctx offset is unchanged (the _Static_asserts pass). Closes the scheduler-swap-in write vs sw_monitor / sw_send_after cross-thread reads that TSan flagged. Measured perf unchanged: spawn 4.2 us/cycle, ping-pong 2.6 us/rt. - Timer list: fire_timers' unlocked head-peek (`if (!tl->head) return`) now reads under tl->lock — it raced the lock-protected insert in sw_send_after. Uncontended in the common no-timer case, so free in practice. - Monitor-at-death: sw_monitor's already-dead fast path reads exit_reason/panic_msg AND registers the monitor all under link_lock, and process_exit finalizes exit_reason under the same lock. Closes the data race (monitoring a process at its exact death instant) and makes DOWN delivery exactly-once (register-before-exit-locks => delivered by process_exit; already-EXITING-when-we-lock => we deliver, don't register). Test harness (fix, don't suppress): - static volatile int flags in test_phase2/4/5 -> _Atomic int (volatile is not synchronization — the same class of bug as the runtime fix); __sync_fetch_and_add -> C11 atomic_fetch_add. tsan.supp: documents the two by-design warn-only readers (watchdog scanner, sw_stats debug printer). Remaining 2.4 worklist (documented in PRODUCTION_ROADMAP.md): the registry by-name read path (sw_register) and the dynamic-supervisor child list (dynsup_entry / sup_start_child) still race — they surfaced once the test-timing shift perturbed the interleaving. phase 4 joins the tsan-gate once those close. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…closed; runtime TSan-clean (phase 2/4/5) The supervisor crash-restart races that surfaced once the test-timing shift perturbed the interleaving are fixed; the runtime is now TSan-clean on GenServer/Supervisor/Agent/DynSup/StateMachine/ProcessGroup plus message-passing and the spawn storm. make tsan-gate now gates phase 2/4/5 alongside msg/msg+spin/storm. Fixes (all verified perf-neutral — ping-pong 2.8us, spawn 5.6us): - exit_reason -> _Atomic: user/runtime code writes sw_self()->exit_reason = N directly (the documented "set reason then return" idiom, exercised by b3_crasher) while sw_monitor and process_exit read it. - reg_entry -> _Atomic: a child's process_exit reads it while sw_register (from sup_start_child on another scheduler) writes it under the registry rwlock. - kill_flag -> _Atomic (was volatile): scheduler-loop read vs the cross-thread exit-signal write. (x86 seq_cst loads are plain mov, and every write is a cold path, so free.) - registry_remove_proc hashes entry->name UNDER the registry wrlock — it was computed before the lock, racing a concurrent reuse that rewrote the entry's name storage. - USE-AFTER-FREE closed: process_destroy frees panic_msg under link_lock and NULLs it there; sw_monitor's already-dead path strdup-copies it under the same lock before delivering. A late monitor on a process mid-teardown previously read a panic_msg the teardown was freeing. All _Atomic fields keep their size/alignment, so the asm-pinned ctx offset is unchanged (the _Static_asserts pass). Phase 2 status: 2.1 (LSan) / 2.2 (sched matrix) / 2.3 (TSan) / 2.4 (atomics) all DONE. Remaining: 2.5 alloc-failure injection, 2.6 soak. Verified: full battery green — test-sw + conformance, stress both variants, gc-slope, gc-stress, lsan-gate, tsan-gate (incl. phase 2/4/5), phases 2-10 functional, all 28 doc programs compile. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…ault); 120 fail-points, 0 memory errors Closes the Phase 2.5 exit criterion: an injected malloc/region failure on any path either returns an error or terminates only the affected process, and a partial ownership transfer (spawn region adopted but child died) never leaks or double-frees. - sw_alloc_fault_tick() (swarmrt_native.c, #ifdef SW_ALLOC_FAULT only): atomic countdown armed by SW_FAIL_ALLOC_AT=N, fires once. Wired into val_alloc's global-heap fallback and every varena chunk_new (the value + region choke points where the GC ownership-transfer logic lives). Zero cost in shipping builds — entirely behind the ifdef. - tests/gc/alloc_fault.sw: spawn fan-out + message round-trips + supervised child + ETS churn — the ownership-transfer sites. - make alloc-fault (Linux/ASAN): sweeps N=1..120, each run must end clean or with a loud OOM/spawn panic; FAILS only on an ASAN report (use-after-free / double-free / heap overflow). Result: 120 fail-points, 100 clean degradations (region alloc fails -> spawn falls back to a global-heap arg copy or errors cleanly), 20 loud-OOM aborts on the value path, 0 memory errors. The documented spawn-region-adopted-but-child-died hazard does not corrupt memory under injected failure. CI: tsan-gate (was local-only) + alloc-fault added as blocking legs to linux-quickstart.yml alongside the advisory lsan-gate. Phase 2 now 2.1-2.5 done; only 2.6 (soak) remains. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…e 2 COMPLETE The last Phase 2 item. tests/soak/soak.sw runs a production-shaped mix — actor fan-out with ~8KB message round-trips + supervisor crash/restart + ETS put/replace/delete + one-shot & interval timers + a long tail loop — in a tight loop for SOAK_SECONDS, and tests/soak/run_soak.sh samples RSS and asserts clean exit (PROBE_OK, no watchdog/crash on stderr) + peak RSS under SOAK_RSS_BUDGET_MB. Proves the WHOLE mix stays flat under sustained churn, not just the per-owner slope probes. Local results (this host): 20s -> 2364 rounds, peak RSS 53 MB 180s -> 19978 rounds, peak RSS 58 MB Arena base is ~48 MB, so ~5-10 MB working set over the entire mix, flat across a 9x duration / 8x round increase — bounded. make soak runs the 60s CI smoke (advisory/blocking leg added to linux-quickstart.yml). The full 24h soak is the SAME binary on a dedicated host: SOAK_SECONDS=86400 SOAK_RSS_BUDGET_MB=512 ./tests/soak/run_soak.sh That sustained run is the one sign-off that needs infra rather than code; the harness is ready and documented in PRODUCTION_ROADMAP.md 2.6. Phase 2 (runtime hardening) is now complete: 2.1 LSan, 2.2 scheduler matrix, 2.3 TSan, 2.4 atomics/race-free, 2.5 alloc-failure, 2.6 soak. https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
…a map-free leak
make fuzz-json (the agent-facing input boundary: every LLM tool-call
response / parsed HTTP body flows through sw_lang_json_decode) found two
real heap-buffer-overflows on its first run, both OOB reads on malformed
JSON:
1. _jd_parse number path: when a value was expected but the char was
unparseable, it force-advanced (*pp)++ to guarantee progress WITHOUT
checking for the NUL terminator — at end-of-input (e.g. `{"a":` then
EOF) it stepped past `\0`, and the next _jd_skip_ws read past the
buffer. Now it never steps past NUL; the enclosing array/object loops
gate on **pp and terminate.
2. _jd_parse_string escape path: a string whose last byte is a backslash
(`\` then NUL) consumed the backslash, then the switch's default case
copied the NUL and (*pp)++ stepped past the terminator. Now a
trailing backslash at EOF stops the loop.
Both are OOB READS on untrusted input — exactly the class Phase-3 fuzzing
exists to catch. The recursion-depth guards (SW_JD_MAX_DEPTH etc.) were
already correct; these were pointer-advance-past-NUL bugs.
Also fixed a latent leak surfaced while making the fuzz harness free its
output: sw_val_free's MAP case freed the key/val arrays but not the
element nodes (the tuple/list case already recursed). Now it recurses —
safe (all callers free fully-owned trees; none alias map sub-values) and
it plugs a real runtime leak anywhere maps are sw_val_free'd.
- tests/fuzz/fuzz_json.c + corpus (incl. a 5000-deep array): 20k
mutations, clean. Wired into make fuzz and the CI fuzz leg (parser /
JSON / distribution-unmarshal / HTTP, all under ASAN+UBSAN).
Verified: all 4 fuzz targets clean, test-sw + conformance, gc-stress,
lsan-gate, 28 doc programs — green.
https://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
What this changes
This PR fixes three bugs found in the Round-7 full-stack audit (REVIEW_FABLE_2026-06.md):
Case-arm self-shadow codegen UB (F1, P0): When a case arm re-binds the subject's own name — idiomatic Erlang like
case n { n when n > 0 -> ... }— the codegen emittedsw_val_t *n = n;, a C self-initialization that read indeterminate memory. This caused wrong guard results on calm stacks and SIGSEGV inside_op_cmpon dirty ones. The interpreter was always correct, making this also a compiled/interpreter divergence. Fix: snapshot the subject into a fresh compiler temp before any arm binds (one pointer copy percase). Tuple, cons, and map sub-binds are covered by the same change.Compiled tail loops had no yield points (F2, P0): Generated code never called
sw_check_reds()/sw_yield(), so a self-tail-call loop was unpreemptible. On a single scheduler, an 80K spawn storm kept all children queued behind the running parent, with every child's 128 KB stack live simultaneously (~160K VMAs needed, over Linux's defaultvm.max_map_countof 65530). This causedmmapandcallocto fail, leading to SIGSEGV at nil. Result on this host: single-sched stress 0/50 before the fix. Why CI never caught it: Ubuntu 24.04 defaultsvm.max_map_countto ~1M. Fix: emitif (sw_check_reds()) sw_yield();at the self-tail-call backedge. One TLS decrement per turn, yields every 2000 turns. Result: single-sched stress 50/50 (was 0/50); fullmake stresspasses both variants.Allocation failure SIGSEGV'd instead of failing loud (F3, P1):
val_alloc/val_strdupfell back tocalloc/strdupwith no NULL check, and everysw_val_*constructor writes through the result unchecked. Under OOM or VMA exhaustion, the death was a bare segfault with no diagnosis. Fix: check for NULL and panic with the allocation size and likely cause (e.g.,vm.max_map_count/ live process count) before abort.Related issue
n/a — findings from Round-7 external audit (REVIEW_FABLE_2026-06.md)
Checklist
make test-swpasses (addedtests/sw/test_case_shadow_scrutinee.swwith 7 assertions covering all shadow patterns)make stresscompletes all runs (single-sched variant now 50/50, was 0/50; full variant passes both)Notes for reviewers
F1 (case-shadow UB): The fix snapshots
subj0(the subject expression) into a fresh tempsubjbefore any arm pattern binds. This is a single pointer copy percasestatement, zero overhead. Thewithstatement desugars tocase, so it is covered by the same change.F2 (tail-loop preemption): The reduction check
if (sw_check_reds()) sw_yield();is emitted only at self-tail-call backedges (the language's only unbounded loop construct). Both calls are no-ops outside ahttps://claude.ai/code/session_01G7WBQwXWk9ggdbnoMDtuTW
Summary by CodeRabbit
Bug Fixes
New Features
Documentation
Tests / CI