fix(engine): treat ffmpegStreamingTimeout as per-frame inactivity, not total render time#901
Conversation
…t total render time The streaming encoder's safety timer was set once at spawn and fired SIGTERM unconditionally at `ffmpegStreamingTimeout` ms — turning a "FFmpeg is hung" guard into a hard cap on total render duration. Slow-but-progressing captures (CI runner under load, large compositions, slower compositor paths after #838's always-clip change) regularly exceeded the 600s default and were killed mid-encode. The symptom surfaced as: Streaming encode failed: FFmpeg exited with code 255 video:NNNkB audio:0kB ... Exiting normally, received signal 15. libx264 had encoded most frames cleanly; SIGTERM arrived during the encode, libx264 printed its end-of-encode stats, and Node observed a non-zero exit. The `audio:0kB` in stderr is incidental — the streaming encoder is video-only; audio is muxed later in `assembleStage`. Convert the timer to a heartbeat: every successful `writeFrame` resets it. Reset on `accepted === false` as well — backpressure means FFmpeg is consuming slower than we're producing, but data is still flowing into stdin. Only true hangs (Chrome dead, capture loop stuck, no frames at all for the timeout window) now trip SIGTERM. Total wall-clock render time is unbounded, which matches operator intent: a renderer that's making progress should be allowed to finish. Renames are intentionally avoided — `ffmpegStreamingTimeout` keeps its name and `600_000` default; only the semantics changed. The config doc spells out the new behavior so downstream consumers know what 600s now means. Added unit test: - 9× writeFrame at 900ms intervals (under the 1000ms threshold) — encoder stays alive through 8.1s of "slow but progressing" capture - Stall past the threshold — SIGTERM fires
jrusso1020
left a comment
There was a problem hiding this comment.
LGTM — exact right fix for a misnamed safety guard that became a render-time hard cap.
Audited
The bug framing is precise: ffmpegStreamingTimeout was named like a hang-detection but implemented as a total-elapsed-time cap. Slow-but-progressing renders that crossed 600s got SIGTERM mid-encode, libx264 printed its end-of-encode stats, Node observed exit 255 — which surfaced as the cryptic "Streaming encode failed: FFmpeg exited with code 255" + audio:0kB (audio is muxed later by assembleStage, so the missing audio in the streaming encoder's stderr was an incidental distraction).
The bisect to #838 lines up — hf#838's always-clip change made the SDR capture path slightly slower per frame, accumulating to >600s total on style-13-prod. That's exactly the regression class the timeout was supposed to allow, but the old semantics killed.
Fix shape — heartbeat reset
let timer: NodeJS.Timeout | null = null;
const resetTimer = () => {
if (timer) clearTimeout(timer);
timer = setTimeout(() => {
if (exitStatus === "running") {
ffmpeg.kill("SIGTERM");
}
}, streamingTimeout);
};
resetTimer(); // arm at spawnwriteFrame calls resetTimer() after every ffmpeg.stdin.write(copy). The if (timer) clearTimeout(timer) guard handles the "we're closing" case where close() nulled the timer. ✓
Resetting on accepted === false (backpressure) is the right semantic call. The inline comment captures it:
"backpressure means FFmpeg is consuming slower than we're producing, but data is still flowing into its stdin buffer"
Node's stdin.write is non-blocking; accepted=false signals "buffer high-water-mark hit" not "consumer dead." Real hangs (Chrome dies, capture loop stalls) stop writeFrame calls entirely — the heartbeat sees the gap and fires.
Naming intentionally preserved
ffmpegStreamingTimeout keeps its name and 600_000 default. Only the semantics changed. The config doc is updated to spell out the new behavior:
"The timer resets on every successful
writeFramecall, so this caps the duration of a single 'no frame arrived' gap (capture hang, dead Chrome), not the total render time."
That's the right call — renaming would have been a downstream-breaking change (env vars, config dumps, log lines reference the name) for a semantic that's now actually correct. The docstring carries the corrected mental model. ✓
Test pins both directions
// Forward progress under threshold — encoder alive
for (let i = 0; i < 9; i++) {
encoder.writeFrame(Buffer.from([i]));
vi.advanceTimersByTime(900);
}
expect(proc.kill).not.toHaveBeenCalled();
// Stall past threshold — SIGTERM fires
vi.advanceTimersByTime(1100);
expect(proc.kill).toHaveBeenCalledWith("SIGTERM");Asserts 8.1s of "slow but progressing" doesn't fire (which the OLD timeout would have killed at ~1s) AND a true stall DOES fire. The comment in the test even notes the old-timeout comparison — good archeology for future readers. ✓
Non-blocking note
Edge case on accepted=false-as-progress: if FFmpeg becomes completely hung mid-render while Chrome keeps producing frames, stdin.write queues into Node's internal buffer indefinitely (returning false), resetTimer() keeps firing, heartbeat never trips. Failure mode is OOM instead of SIGTERM — worse for diagnostics than the original timeout was.
In practice this is bounded because the typical render loop awaits Chrome between frames, and Chrome's own backpressure would slow writeFrame calls if FFmpeg stalls. But a future refinement could track whether writes are actually draining (e.g., listen for stdin.drain events and timestamp the last successful drain) rather than just being issued. Refinement for a follow-up; the current PR correctly solves the reported failure mode.
Body claim verification
- "Bisects to #838 widening the SDR capture path at dpr=1" — concrete causal claim; I reviewed hf#838 myself and the always-clip change does add per-frame overhead. ✓
- "
audio:0kBis incidental — streamingEncoder is video-only; audio is muxed later in assembleStage" — exactly the kind of stderr-distraction footnote that saves future debuggers an hour. Worth the body real estate. - "Renames are intentionally avoided" — verified, default + name unchanged.
CI
Test, Typecheck, Lint, Build, Format, CodeQL (neutral), Test: runtime contract, Smoke: global install, preview-regression, player-perf all green. CLI smoke (required), Tests/Render on windows-latest, regression-shards all in-progress. Miguel already APPROVED. mergeable_state: blocked is reviewer-gate.
— Review by Rames Jusso (pr-review)
Per review on #901: resetting the heartbeat on every writeFrame call — including ones that returned false — re-introduces a worse failure mode than the bug it fixed. If FFmpeg hangs while Chrome keeps capturing, every writeFrame returns false (Node buffers because the kernel pipe isn't being drained), the heartbeat resets anyway, and Node's writable stream grows without bound until OOM. The capture loop doesn't `await drain`, so there's no backpressure path to throttle Chrome. Tighten the reset to `accepted === true` only. A `true` return means the write went straight to the kernel pipe — proof FFmpeg actually consumed something. In steady state with a slower-but-alive FFmpeg, writes alternate between true and false as the buffer drains and refills; the `true`s are enough to keep the heartbeat ticking. A truly hung FFmpeg yields only `false`s, so the timer eventually fires and SIGTERMs the process. Test: simulate stalled FFmpeg by overriding stdin.write to return false on every call, pump 9 frames at 900ms intervals (would have kept the prior implementation alive indefinitely), and assert SIGTERM fires inside the 1000ms timeout window. The earlier "slow-but-progressing" test (writes return true) continues to pass.
|
I've addressed the non-blocking issues |
jrusso1020
left a comment
There was a problem hiding this comment.
Re-approving on 111be1e3. Heartbeat-on-accepted-only is the tighter semantic; new test pins the failure mode.
Audited
const accepted = ffmpeg.stdin.write(copy);
// ... comment ...
if (accepted) resetTimer();
return accepted;Mechanism:
accepted === true— Node wrote directly to the OS pipe; FFmpeg drained → genuine forward progress → reset ✓accepted === false— Node buffered in user-space (high-water mark hit); FFmpeg didn't drain yet → don't reset → hung FFmpeg trips SIGTERM withinffmpegStreamingTimeouteven when Chrome keeps producing
Terence's inline comment captures the steady-state reasoning:
"In steady state with a slower-but-alive FFmpeg, writes alternate between true and false as the buffer drains and refills; the trues are enough to keep the heartbeat ticking."
That's exactly the right model. Node's writable-stream high-water-mark contract: write() returns true when bytes go through to the kernel pipe directly, false when they buffer. A healthy-but-slower FFmpeg produces a mix; a hung FFmpeg produces all-false. The heartbeat distinguishes the two correctly.
New test pins the previously-missing failure mode
proc.stdin.write = (_chunk: Buffer) => false;
for (let i = 0; i < 9; i++) {
encoder.writeFrame(Buffer.from([i]));
vi.advanceTimersByTime(900);
}
expect(proc.kill).toHaveBeenCalledWith("SIGTERM");9 writes at 900ms intervals, all returning false. The initial resetTimer() at spawn armed the 1000ms timer; no subsequent writes reset it (every write was buffered, not accepted); SIGTERM fires when the timer elapses. Comment also documents the producer-still-pumping-but-ffmpeg-stalled framing — exactly the "OOM-instead-of-SIGTERM" pathology I'd flagged.
Combined with the existing "forward-progress under threshold doesn't kill" test, the streaming encoder now has tight pins on:
- Forward progress with accepted writes → encoder stays alive (existing test)
- Sustained backpressure (all writes buffered) → SIGTERM fires (new test)
- True stall (no writes at all) → SIGTERM fires (existing test)
Edge case I considered
What about steady-state where every-other write is accepted? Each true resets the 600s timer; the false writes in between don't matter as long as a true arrives within the window. For 600s of pure-false to occur, FFmpeg would need to be hung for 10 minutes with no drains. That's the failure mode we want to catch. ✓
The fix is bound only by the heartbeat window — a real intermittent slow consumer that briefly fails to drain for >600s would still get killed. For the default 600s that's a hard pathology, not normal operation.
CI
All required checks completed green or in-progress (only regression-shards still running). mergeable_state: blocked reviewer-gate.
— Re-review by Rames Jusso (pr-review)
Merge activity
|
Summary
streamingEncoder.ts's safety timer from a total-render hard cap to a per-frame inactivity timeoutaccepted === truewrites — buffered writes don't count as consumer progressffmpegStreamingTimeoutconfig doc to reflect the new semanticsThe bug
The timer was set once at spawn and fired SIGTERM unconditionally at
ffmpegStreamingTimeoutms — turning a "FFmpeg is hung" guard into a hard cap on total render duration. Slow-but-progressing captures (CI runner under load, large compositions, slower compositor paths after #838's always-clip change) regularly exceeded the 600s default and were killed mid-encode. The symptom surfaced as:libx264 had encoded most frames cleanly; SIGTERM arrived during the encode, libx264 printed its end-of-encode stats, and Node observed a non-zero exit. The
audio:0kBin stderr is incidental —streamingEncoderis video-only; audio is muxed later inassembleStage.Downstream reproduction:
style-13-prodfails deterministically inheygen-com/hyperframes-internalCI after bumping@hyperframes/producerfrom 0.6.7 → 0.6.10. Bisects to #838 widening the SDR capture path at dpr=1 — same composition shape, slower per-frame, total render now crosses 600s.The fix
Convert the timer to a heartbeat: each
writeFramethat goes through to the kernel pipe (i.e.stdin.writereturnstrue) resets it. Only true hangs (no successful frame write for the timeout window) trip SIGTERM now; "slow but progressing" renders are unbounded.Crucially, the heartbeat does not reset on
accepted === false. Afalsereturn means Node had to buffer the write because FFmpeg hasn't drained the pipe yet — that's not proof of consumer progress, just proof we produced. Without this distinction, a hung FFmpeg with a live Chrome would queue frames into Node's writable buffer indefinitely (no backpressure path back to the capture loop) and grow until OOM. In steady state with a slow-but-alive FFmpeg, writes alternate betweentrueandfalseas the buffer drains and refills; thetrues are enough to keep the heartbeat ticking.Renames are intentionally avoided —
ffmpegStreamingTimeoutkeeps its name and600_000default; only the semantics changed. The config doc spells out the new behavior so downstream consumers know what 600s now means.Test plan
accepted=true): 9×writeFrameat 900ms intervals (under the 1000ms threshold) — encoder stays alive through 8.1s. Stall past the threshold — SIGTERM fires.accepted=false): overridestdin.writeto return false; pump 9×writeFrameat 900ms intervals. SIGTERM still fires inside the 1000ms window — buffered writes don't keep the heartbeat alive.streamingEncoder.test.tsstill passoxlint) + format (oxfmt --check) clean🤖 Generated with Claude Code