diff --git a/bench/pause-window/PROBE-multi-branch-anomaly.md b/bench/pause-window/PROBE-multi-branch-anomaly.md index f726ab8..707dd63 100644 --- a/bench/pause-window/PROBE-multi-branch-anomaly.md +++ b/bench/pause-window/PROBE-multi-branch-anomaly.md @@ -310,6 +310,208 @@ loop: that produced the data above. Writes /tmp/futex-trace-*.txt plus /tmp/futex-trace-*.txt.maps for uaddr cross-reference. +## Follow-up: perf flamegraph with DWARF Firecracker (2026-05-23) + +Acting on the previous "build FC with DWARF + perf record" next-step. + +### Setup +- Cloned `firecracker-microvm/firecracker@v1.12.0` +- Patched `Cargo.toml` `[profile.release]`: `lto = false` + `debug = "full"` +- Built with `RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release --bin firecracker` +- Result: 50 MB binary with `.debug_info`, `.debug_line`, `.eh_frame_hdr`, frame pointers (`+0x871 vmm::vstate::vm::ArchVm::snapshot_memory_to_file` resolves perfectly) +- Swapped into `/usr/local/bin/firecracker` (backup at `firecracker.release.bak`) +- Lowered `kernel.perf_event_paranoid` 4 → 1 and `kernel.kptr_restrict` 1 → 0 + (paranoid=4 was silently producing 0-sample perf.data files) + +### perf capture +`perf record -F 99 -a -g --call-graph fp -- sleep 10` while the +warmed source did 2 slow BRANCHes back-to-back inside the window. + +``` +warmup BRANCH 1: pause_ms=547 +warmup BRANCH 2-5: 225-787 +warmup BRANCH 6: pause_ms=2151 ← in slow regime +profiled BRANCH #1: pause_ms=2279 +profiled BRANCH #2: pause_ms=1530 + +Captured 2479 samples / 2.2 MB perf.data +``` + +### Result — third interpretation flip + +``` +Samples by process (top 10): + 1176 swapper ← CPU idle (47 % of samples) + 97 plymouthd + 78 bash + 73 sshd + 72 tokio-rt-worker + 69 runc:[2:INIT] + 66 kswapd0 + 55 avahi-daemon + 53 redis-server + 46 kworker/u80:1+f + 1 firecracker ← !!! + +firecracker on-CPU leaf function: + 1 vmm::vstate::vm::::snapshot_memory_to_file +``` + +**Only ONE sample landed in firecracker** during 3.8 s of pause window. +That sample was in `snapshot_memory_to_file+0x871` (called from +`vmm::persist::create_snapshot+0x2c5`), which is the actual memory.bin +write path. But ~10 ms of on-CPU time can't explain 3.8 s of pause. + +CPUs were dominated by `swapper` (idle) and unrelated processes +(sshd / plymouth / redis / etc. running their own work on other +cores). The pause time isn't burned in **user-space CPU at all**. + +### Synthesis of all four probes + +| Pass | Hypothesis | Verdict | +|---|---|---| +| 1 [#128](https://github.com/deeplethe/forkd/pull/128) — strace -c | User-space CPU bottleneck | Too strong — syscall fraction is small but doesn't imply user CPU dominates | +| 2 [#140](https://github.com/deeplethe/forkd/pull/140) — bpftrace + /proc/stack | 94 % off-CPU + futex waiters present | Right shape, wrong fixable lever | +| 3 [#143](https://github.com/deeplethe/forkd/pull/143) — futex args.uaddr | Futexes are passive waiters, not contention | Eliminates futex as a fix target | +| 4 [this] — perf -a -g + DWARF FC | **Pause time is in-kernel sleep with FC parked; CPU is mostly idle** | Final picture | + +### Where the time really goes + +FC issues the snapshot syscall chain (`KVM_GET_DIRTY_LOG` ioctl(s) +plus the file writes inside `snapshot_memory_to_file`), then **blocks +in the kernel**. While blocked: + +- The CPU FC was running on goes idle (samples land in `swapper`) +- Other unrelated processes get to run on other CPUs +- 3 sibling FC threads (the futex waiters from #143) sleep awaiting + the snapshot worker's completion signal +- vCPU thread sleeps in `kvm_vcpu_halt` (paused VM, expected) + +Why does this kernel sleep get **longer** on BRANCH 3+? The kernel +operation FC is waiting on grows in cost. Most likely candidates: + +1. **`KVM_GET_DIRTY_LOG` walk grows with VM uptime.** KVM tracks + dirty pages in a per-memslot bitmap. The bitmap clears on every + `KVM_GET_DIRTY_LOG` call but the kernel-side walk could touch + structures that grow as guest pages get accessed over time. +2. **page-cache → block-IO writeback contention.** Each BRANCH's + `memory.bin` write goes through the page cache. The N+1th + BRANCH's write may compete with N's still-being-flushed pages. + Background flusher (`kswapd0` / `kworker/u80:1+f` both visible + in samples) running concurrently is suggestive. +3. **VMA / mmap teardown.** Each diff BRANCH adds a memory mapping. + After many BRANCHes, FC's process VMA list grows; some kernel + paths walk it linearly. + +### What flamegraph would have shown (and didn't) + +We'd hoped the flamegraph would point at a hot Rust function. It +doesn't, because FC isn't burning CPU — it's blocked. **On-CPU** +flamegraphing is the wrong tool here. We need an **off-CPU** +flamegraph (Brendan Gregg style — sample stacks at every +`sched_switch` event with duration weighting). + +### Follow-up: off-CPU probe (same day, 2026-05-23) + +Built an off-CPU bpftrace probe and ran it during the slow regime. +Used `tracepoint:sched:sched_switch` + `sched:sched_wakeup` paired by +TID, capturing `kstack(perf, 16)` at sleep time. + +**Result — only the vCPU stack appears.** Out of all FC threads +(`firecracker`, `fc_api`, `fc_vcpu 0`, `fc_vcpu 1`, `kvm-nx-lpage-re`), +the only one going through sched_switch was `fc_vcpu`: + +``` +@offcpu_us[ + __schedule ← schedule ← kvm_vcpu_block ← kvm_vcpu_halt + ← vcpu_run ← kvm_arch_vcpu_ioctl_run ← kvm_vcpu_ioctl + ← __x64_sys_ioctl ← x64_sys_call ← do_syscall_64 +]: 19260812 +@offcpu_count[]: 32 +``` + +19.26 s of off-CPU time across 32 sleep cycles in vCPU 0+1. **No +other thread shows up at all** — not `firecracker` (the main thread), +not `fc_api`. They never `sched_switch` during the 8-12 s window. + +Confirmed with a minimal probe (`tracepoint:sched:sched_switch / pid +== $FC_PID / { @[comm] = count(); }`) over 8 s with 5 warmup BRANCHes: + +``` +@switches_per_thread[fc_vcpu 0]: 6 +@switches_per_thread[fc_vcpu 1]: 20 +``` + +Only vCPU threads switch off-CPU. **Main thread + fc_api stay on the +CPU the entire window** (no sched_switch event) — *but they also +don't show up in perf on-CPU samples* (we have 1 FC sample in 2479). + +### The paradox + +``` +- main thread isn't off-CPU (no sched_switch off) +- main thread isn't on-CPU (no perf samples) +- pause_ms grows linearly with BRANCH count anyway +``` + +A thread can be in this state only if: + +1. **In a long uninterruptible kernel syscall** that runs to + completion without preemption AND is somehow undersampled by + perf. On Alder Lake hybrid (i7-12700, this dev box has P-cores + + E-cores), perf hw event sampling can be E-core-blind by default — + `cpu_atom/cycles/P` events need to be explicitly requested. Our + perf record used default events which biased to P-core. +2. **Or perf sampling has a TOCTOU / IPI delivery issue** at the + specific kernel state FC is in. Less likely. + +### Concrete next step (deferred) + +Re-run perf with explicit hybrid sampling: + +```bash +perf record -F 99 -a -g --call-graph fp \ + -e cpu_core/cycles/P -e cpu_atom/cycles/P -- sleep 10 +``` + +And confirm/refute the P-core/E-core sampling theory by +`taskset --cpu-list 0,1 forkd-controller serve ...` (force daemon & +its FC children to P-cores 0,1) before re-profiling. + +If sampling is still empty: try `funccount` style probe on +suspected kernel hot paths — `kprobe:kvm_vm_ioctl`, +`kprobe:filemap_write_and_wait_range`, `kprobe:__filemap_fdatawrite_range` +— to spot which kernel path the main thread is camping in. + +The full off-CPU step is documented but not done. **Below is the +original sketch retained for reference.** + +### Original next step (off-CPU pair probe) + +Off-CPU flamegraph via bpftrace: + +```bpftrace +kprobe:finish_task_switch +/ args->prev->pid == $FC_PID / +{ + @start[tid] = nsecs; +} + +kprobe:try_to_wake_up +/ @start[args->p->pid] != 0 / +{ + @offcpu[ustack(perf), kstack] = sum(nsecs - @start[args->p->pid]); + delete(@start[args->p->pid]); +} +``` + +This sums up "how long was FC's worker thread asleep, broken down by +the (user-stack, kernel-stack) pair when it went to sleep." The +hottest pair tells us exactly which kernel function FC is parked on, +and which userspace caller put it there. + +Estimated effort: 30 min once we get the bpftrace tracepoint right. + ## Files - `bench/pause-window/probe-multi-branch-strace.sh` — the original @@ -320,5 +522,8 @@ loop: poll loop (200 Hz); did not pinpoint a single syscall (consistent with the off-CPU finding). - `bench/pause-window/probe-futex-trace.sh` — bpftrace futex aggregator - (see "Follow-up: futex tracing" above for results). + ("Follow-up: futex tracing" above). +- `bench/pause-window/probe-perf-flamegraph.sh` — perf record -a -g + + DWARF FC; flipped the picture to "in-kernel sleep dominates" + (this section). - This document. diff --git a/bench/pause-window/probe-offcpu.sh b/bench/pause-window/probe-offcpu.sh new file mode 100644 index 0000000..5bff55d --- /dev/null +++ b/bench/pause-window/probe-offcpu.sh @@ -0,0 +1,111 @@ +#!/usr/bin/env bash +# probe-offcpu.sh — off-CPU profile FC during slow BRANCH. +# +# Hooks the scheduler to record (kernel-stack, sleep-duration) pairs +# for FC's threads. The hottest stack is the kernel function FC is +# parked on — the actual bottleneck for the multi-BRANCH anomaly +# (#146). On-CPU probes consistently miss it because FC is off-CPU +# ~99.7 % of the slow window (PROBE-multi-branch-anomaly.md, "Follow-up: +# perf flamegraph" section). +set -euo pipefail + +FORKD_URL=${FORKD_URL:-http://127.0.0.1:8889} +FORKD_TOKEN=${FORKD_TOKEN:-$(cat "${FORKD_TOKEN_FILE:-/etc/forkd/token}" 2>/dev/null || echo "")} +TAG=${TAG:-coding-agent-fork-prewarm-v1} +WARMUP_BRANCHES=${WARMUP_BRANCHES:-6} +GAP_SECS=${GAP_SECS:-3} +OUT="/tmp/fc-offcpu-$(date +%s).txt" +auth=(-H "Authorization: Bearer $FORKD_TOKEN") + +echo "[probe] output → $OUT" >&2 + +# Spawn source +spawn=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"snapshot_tag\":\"$TAG\",\"n\":1,\"per_child_netns\":true}" \ + "$FORKD_URL/v1/sandboxes") +sb_id=$(echo "$spawn" | jq -r '.[0].id') +fc_pid=$(echo "$spawn" | jq -r '.[0].pid') +echo "[probe] sandbox=$sb_id fc_pid=$fc_pid" >&2 +sleep 2 + +# Warmup into slow regime +for i in $(seq 1 "$WARMUP_BRANCHES"); do + sleep "$GAP_SECS" + resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") + echo "[probe] warmup BRANCH $i: pause_ms=$(echo "$resp" | jq -r .pause_ms)" >&2 +done + +# bpftrace off-CPU probe. +# +# kprobe:finish_task_switch fires when a task is about to start running. +# arg0 is the previously-running task (prev_task_struct*); arg0->pid is +# its TID, arg0->tgid its PID. We filter to FC's tgid. +# +# At this point the CURRENT context is the task that's just resumed — +# i.e. NOT FC. We can read FC's kstack via kstack(perf, K) only if we +# walk prev->stack, which bpftrace doesn't expose directly. Workaround: +# capture stacks at the OUTGOING side via sched_switch tracepoint +# (current() is still prev there). +echo "[probe] starting bpftrace off-CPU probe (12s window)" >&2 +sudo bpftrace -e " +tracepoint:sched:sched_switch +/ pid == $fc_pid / +{ + // current task is prev (going to sleep). Capture its stacks + + // start time keyed by its TID (args->prev_pid). + @sleep_start[args->prev_pid] = nsecs; + @sleep_ustack[args->prev_pid] = ustack(perf, 16); + @sleep_kstack[args->prev_pid] = kstack(perf, 16); +} + +tracepoint:sched:sched_wakeup +/ args->pid != 0 && @sleep_start[args->pid] != 0 / +{ + \$dur_us = (nsecs - @sleep_start[args->pid]) / 1000; + @offcpu_us[@sleep_kstack[args->pid]] = sum(\$dur_us); + @offcpu_count[@sleep_kstack[args->pid]] = count(); + delete(@sleep_start[args->pid]); + delete(@sleep_ustack[args->pid]); + delete(@sleep_kstack[args->pid]); +} + +interval:s:12 { exit(); } +" > "$OUT" 2>&1 & +bp_pid=$! +sleep 0.5 + +# Fire 2 slow BRANCHes inside the probe window +sleep "$GAP_SECS" +echo "[probe] firing profiled BRANCH #1" >&2 +resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") +echo "[probe] profiled #1: pause_ms=$(echo "$resp" | jq -r .pause_ms)" >&2 + +sleep 1 +echo "[probe] firing profiled BRANCH #2" >&2 +resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") +echo "[probe] profiled #2: pause_ms=$(echo "$resp" | jq -r .pause_ms)" >&2 + +wait "$bp_pid" 2>/dev/null || true + +# Cleanup +curl -fsS -X DELETE "${auth[@]}" "$FORKD_URL/v1/sandboxes/$sb_id" > /dev/null || true + +echo "" >&2 +echo "[probe] done. Top off-CPU kernel stacks (longest sleep total):" >&2 +echo "" >&2 +# bpftrace map output: @offcpu_us[\n stack \n]: N\n. +# Just dump the whole map ordered. +awk ' +/^@offcpu_us\[/,/^[[:space:]]*[0-9]+$/ { + print +} +' "$OUT" >&2 || true + +echo "" >&2 +echo "[probe] raw output: $OUT" >&2 diff --git a/bench/pause-window/probe-perf-flamegraph.sh b/bench/pause-window/probe-perf-flamegraph.sh new file mode 100644 index 0000000..d1c9cc1 --- /dev/null +++ b/bench/pause-window/probe-perf-flamegraph.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# probe-perf-flamegraph.sh — perf record + flamegraph the FC snapshot +# worker during a slow BRANCH. Identifies the per-snapshot growing +# loop hypothesized in PROBE-multi-branch-anomaly.md. +# +# Requires: +# - FC binary at /usr/local/bin/firecracker built with DWARF + +# force-frame-pointers (see PROBE-multi-branch-anomaly.md for +# the build recipe) +# - perf >= 5.x with --call-graph dwarf +# - FlameGraph repo at $FLAMEGRAPH_DIR +set -euo pipefail + +FORKD_URL=${FORKD_URL:-http://127.0.0.1:8889} +FORKD_TOKEN=${FORKD_TOKEN:-$(cat "${FORKD_TOKEN_FILE:-/etc/forkd/token}" 2>/dev/null || echo "")} +TAG=${TAG:-coding-agent-fork-prewarm-v1} +WARMUP_BRANCHES=${WARMUP_BRANCHES:-6} +GAP_SECS=${GAP_SECS:-3} +FLAMEGRAPH_DIR=${FLAMEGRAPH_DIR:-/home/yangdongxu/work/FlameGraph} +OUT_BASE="/tmp/fc-perf-$(date +%s)" +OUT_DATA="$OUT_BASE.data" +OUT_FOLDED="$OUT_BASE.folded" +OUT_SVG="$OUT_BASE.svg" +auth=(-H "Authorization: Bearer $FORKD_TOKEN") + +echo "[probe] outputs: $OUT_DATA / $OUT_FOLDED / $OUT_SVG" >&2 + +# Spawn source +spawn=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"snapshot_tag\":\"$TAG\",\"n\":1,\"per_child_netns\":true}" \ + "$FORKD_URL/v1/sandboxes") +sb_id=$(echo "$spawn" | jq -r '.[0].id') +fc_pid=$(echo "$spawn" | jq -r '.[0].pid') +echo "[probe] sandbox=$sb_id fc_pid=$fc_pid" >&2 +sleep 2 + +# Warmup into the slow regime +for i in $(seq 1 "$WARMUP_BRANCHES"); do + sleep "$GAP_SECS" + resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") + pause=$(echo "$resp" | jq -r '.pause_ms') + echo "[probe] warmup BRANCH $i: pause_ms=$pause" >&2 +done + +# Start perf record with DWARF call-graph +echo "[probe] starting perf record on pid $fc_pid (DWARF unwinding, 10s window)" >&2 +# -a captures all CPUs system-wide; we filter to FC's pid in post-processing. +# Required because per-pid `-p` consistently misses the brief on-CPU +# windows during a slow BRANCH (FC is off-CPU ~94% of pause). +sudo perf record -F 99 -a -g --call-graph fp \ + -o "$OUT_DATA" -- sleep 10 & +perf_pid=$! +sleep 0.5 + +# Fire 1-2 slow BRANCHes inside the perf window +sleep "$GAP_SECS" +echo "[probe] firing profiled BRANCH #1" >&2 +resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") +echo "[probe] profiled #1: pause_ms=$(echo "$resp" | jq -r .pause_ms)" >&2 + +sleep 1 +echo "[probe] firing profiled BRANCH #2" >&2 +resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") +echo "[probe] profiled #2: pause_ms=$(echo "$resp" | jq -r .pause_ms)" >&2 + +# Wait for perf to finish its sleep window +wait "$perf_pid" 2>/dev/null || true + +# Cleanup +curl -fsS -X DELETE "${auth[@]}" "$FORKD_URL/v1/sandboxes/$sb_id" > /dev/null || true + +# Convert perf.data → folded stacks → SVG +echo "" >&2 +echo "[probe] converting perf.data → folded stacks → SVG" >&2 +sudo chown "$USER:$USER" "$OUT_DATA" +# Filter to FC's pid since we captured system-wide. +sudo perf script -i "$OUT_DATA" --pid "$fc_pid" 2>/dev/null \ + | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" > "$OUT_FOLDED" +"$FLAMEGRAPH_DIR/flamegraph.pl" --title "FC snapshot worker (BRANCH slow regime)" \ + "$OUT_FOLDED" > "$OUT_SVG" + +echo "" >&2 +echo "===== top folded stacks (CPU samples; bigger = hotter) =====" >&2 +sort -t' ' -k2 -n -r "$OUT_FOLDED" | head -15 >&2 + +echo "" >&2 +echo "[probe] flamegraph at $OUT_SVG" >&2 +echo "[probe] folded data at $OUT_FOLDED" >&2 +echo "[probe] raw perf data at $OUT_DATA" >&2