Skip to content

fix(engine): treat ffmpegStreamingTimeout as per-frame inactivity, not total render time#901

Merged
terencecho merged 2 commits into
mainfrom
fix/streaming-encode-inactivity-timeout
May 16, 2026
Merged

fix(engine): treat ffmpegStreamingTimeout as per-frame inactivity, not total render time#901
terencecho merged 2 commits into
mainfrom
fix/streaming-encode-inactivity-timeout

Conversation

@terencecho
Copy link
Copy Markdown
Contributor

@terencecho terencecho commented May 16, 2026

Summary

  • Convert streamingEncoder.ts's safety timer from a total-render hard cap to a per-frame inactivity timeout
  • Reset the timer only on accepted === true writes — buffered writes don't count as consumer progress
  • Update the ffmpegStreamingTimeout config doc to reflect the new semantics

The bug

The timer was set once at spawn and fired SIGTERM unconditionally at ffmpegStreamingTimeout ms — turning a "FFmpeg is hung" guard into a hard cap on total render duration. Slow-but-progressing captures (CI runner under load, large compositions, slower compositor paths after #838's always-clip change) regularly exceeded the 600s default and were killed mid-encode. The symptom surfaced as:

Streaming encode failed: FFmpeg exited with code 255
video:NNNkB audio:0kB ...
[libx264 @ ...] frame I:3  Avg QP:12.91  size: 73263
[libx264 @ ...] frame P:431 Avg QP:14.72  size: 31633
...
[libx264 @ ...] kb/s:7661.05
Exiting normally, received signal 15.

libx264 had encoded most frames cleanly; SIGTERM arrived during the encode, libx264 printed its end-of-encode stats, and Node observed a non-zero exit. The audio:0kB in stderr is incidental — streamingEncoder is video-only; audio is muxed later in assembleStage.

Downstream reproduction: style-13-prod fails deterministically in heygen-com/hyperframes-internal CI after bumping @hyperframes/producer from 0.6.7 → 0.6.10. Bisects to #838 widening the SDR capture path at dpr=1 — same composition shape, slower per-frame, total render now crosses 600s.

The fix

Convert the timer to a heartbeat: each writeFrame that goes through to the kernel pipe (i.e. stdin.write returns true) resets it. Only true hangs (no successful frame write for the timeout window) trip SIGTERM now; "slow but progressing" renders are unbounded.

Crucially, the heartbeat does not reset on accepted === false. A false return means Node had to buffer the write because FFmpeg hasn't drained the pipe yet — that's not proof of consumer progress, just proof we produced. Without this distinction, a hung FFmpeg with a live Chrome would queue frames into Node's writable buffer indefinitely (no backpressure path back to the capture loop) and grow until OOM. In steady state with a slow-but-alive FFmpeg, writes alternate between true and false as the buffer drains and refills; the trues are enough to keep the heartbeat ticking.

Renames are intentionally avoided — ffmpegStreamingTimeout keeps its name and 600_000 default; only the semantics changed. The config doc spells out the new behavior so downstream consumers know what 600s now means.

Test plan

  • Slow-but-progressing capture (accepted=true): 9× writeFrame at 900ms intervals (under the 1000ms threshold) — encoder stays alive through 8.1s. Stall past the threshold — SIGTERM fires.
  • Stalled FFmpeg with live producer (accepted=false): override stdin.write to return false; pump 9× writeFrame at 900ms intervals. SIGTERM still fires inside the 1000ms window — buffered writes don't keep the heartbeat alive.
  • Existing 33 tests in streamingEncoder.test.ts still pass
  • Lint (oxlint) + format (oxfmt --check) clean
  • CI regression suite

🤖 Generated with Claude Code

…t total render time

The streaming encoder's safety timer was set once at spawn and fired SIGTERM
unconditionally at `ffmpegStreamingTimeout` ms — turning a "FFmpeg is hung"
guard into a hard cap on total render duration. Slow-but-progressing captures
(CI runner under load, large compositions, slower compositor paths after #838's
always-clip change) regularly exceeded the 600s default and were killed
mid-encode. The symptom surfaced as:

    Streaming encode failed: FFmpeg exited with code 255
    video:NNNkB audio:0kB ...
    Exiting normally, received signal 15.

libx264 had encoded most frames cleanly; SIGTERM arrived during the encode,
libx264 printed its end-of-encode stats, and Node observed a non-zero exit.
The `audio:0kB` in stderr is incidental — the streaming encoder is video-only;
audio is muxed later in `assembleStage`.

Convert the timer to a heartbeat: every successful `writeFrame` resets it.
Reset on `accepted === false` as well — backpressure means FFmpeg is consuming
slower than we're producing, but data is still flowing into stdin. Only true
hangs (Chrome dead, capture loop stuck, no frames at all for the timeout
window) now trip SIGTERM. Total wall-clock render time is unbounded, which
matches operator intent: a renderer that's making progress should be allowed
to finish.

Renames are intentionally avoided — `ffmpegStreamingTimeout` keeps its name
and `600_000` default; only the semantics changed. The config doc spells out
the new behavior so downstream consumers know what 600s now means.

Added unit test:
  - 9× writeFrame at 900ms intervals (under the 1000ms threshold) — encoder
    stays alive through 8.1s of "slow but progressing" capture
  - Stall past the threshold — SIGTERM fires
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — exact right fix for a misnamed safety guard that became a render-time hard cap.

Audited

The bug framing is precise: ffmpegStreamingTimeout was named like a hang-detection but implemented as a total-elapsed-time cap. Slow-but-progressing renders that crossed 600s got SIGTERM mid-encode, libx264 printed its end-of-encode stats, Node observed exit 255 — which surfaced as the cryptic "Streaming encode failed: FFmpeg exited with code 255" + audio:0kB (audio is muxed later by assembleStage, so the missing audio in the streaming encoder's stderr was an incidental distraction).

The bisect to #838 lines up — hf#838's always-clip change made the SDR capture path slightly slower per frame, accumulating to >600s total on style-13-prod. That's exactly the regression class the timeout was supposed to allow, but the old semantics killed.

Fix shape — heartbeat reset

let timer: NodeJS.Timeout | null = null;
const resetTimer = () => {
  if (timer) clearTimeout(timer);
  timer = setTimeout(() => {
    if (exitStatus === "running") {
      ffmpeg.kill("SIGTERM");
    }
  }, streamingTimeout);
};
resetTimer();  // arm at spawn

writeFrame calls resetTimer() after every ffmpeg.stdin.write(copy). The if (timer) clearTimeout(timer) guard handles the "we're closing" case where close() nulled the timer. ✓

Resetting on accepted === false (backpressure) is the right semantic call. The inline comment captures it:

"backpressure means FFmpeg is consuming slower than we're producing, but data is still flowing into its stdin buffer"

Node's stdin.write is non-blocking; accepted=false signals "buffer high-water-mark hit" not "consumer dead." Real hangs (Chrome dies, capture loop stalls) stop writeFrame calls entirely — the heartbeat sees the gap and fires.

Naming intentionally preserved

ffmpegStreamingTimeout keeps its name and 600_000 default. Only the semantics changed. The config doc is updated to spell out the new behavior:

"The timer resets on every successful writeFrame call, so this caps the duration of a single 'no frame arrived' gap (capture hang, dead Chrome), not the total render time."

That's the right call — renaming would have been a downstream-breaking change (env vars, config dumps, log lines reference the name) for a semantic that's now actually correct. The docstring carries the corrected mental model. ✓

Test pins both directions

// Forward progress under threshold — encoder alive
for (let i = 0; i < 9; i++) {
  encoder.writeFrame(Buffer.from([i]));
  vi.advanceTimersByTime(900);
}
expect(proc.kill).not.toHaveBeenCalled();

// Stall past threshold — SIGTERM fires
vi.advanceTimersByTime(1100);
expect(proc.kill).toHaveBeenCalledWith("SIGTERM");

Asserts 8.1s of "slow but progressing" doesn't fire (which the OLD timeout would have killed at ~1s) AND a true stall DOES fire. The comment in the test even notes the old-timeout comparison — good archeology for future readers. ✓

Non-blocking note

Edge case on accepted=false-as-progress: if FFmpeg becomes completely hung mid-render while Chrome keeps producing frames, stdin.write queues into Node's internal buffer indefinitely (returning false), resetTimer() keeps firing, heartbeat never trips. Failure mode is OOM instead of SIGTERM — worse for diagnostics than the original timeout was.

In practice this is bounded because the typical render loop awaits Chrome between frames, and Chrome's own backpressure would slow writeFrame calls if FFmpeg stalls. But a future refinement could track whether writes are actually draining (e.g., listen for stdin.drain events and timestamp the last successful drain) rather than just being issued. Refinement for a follow-up; the current PR correctly solves the reported failure mode.

Body claim verification

  • "Bisects to #838 widening the SDR capture path at dpr=1" — concrete causal claim; I reviewed hf#838 myself and the always-clip change does add per-frame overhead. ✓
  • "audio:0kB is incidental — streamingEncoder is video-only; audio is muxed later in assembleStage" — exactly the kind of stderr-distraction footnote that saves future debuggers an hour. Worth the body real estate.
  • "Renames are intentionally avoided" — verified, default + name unchanged.

CI

Test, Typecheck, Lint, Build, Format, CodeQL (neutral), Test: runtime contract, Smoke: global install, preview-regression, player-perf all green. CLI smoke (required), Tests/Render on windows-latest, regression-shards all in-progress. Miguel already APPROVED. mergeable_state: blocked is reviewer-gate.

Review by Rames Jusso (pr-review)

Per review on #901: resetting the heartbeat on every writeFrame call —
including ones that returned false — re-introduces a worse failure mode
than the bug it fixed. If FFmpeg hangs while Chrome keeps capturing, every
writeFrame returns false (Node buffers because the kernel pipe isn't being
drained), the heartbeat resets anyway, and Node's writable stream grows
without bound until OOM. The capture loop doesn't `await drain`, so there's
no backpressure path to throttle Chrome.

Tighten the reset to `accepted === true` only. A `true` return means the
write went straight to the kernel pipe — proof FFmpeg actually consumed
something. In steady state with a slower-but-alive FFmpeg, writes alternate
between true and false as the buffer drains and refills; the `true`s are
enough to keep the heartbeat ticking. A truly hung FFmpeg yields only
`false`s, so the timer eventually fires and SIGTERMs the process.

Test: simulate stalled FFmpeg by overriding stdin.write to return false on
every call, pump 9 frames at 900ms intervals (would have kept the prior
implementation alive indefinitely), and assert SIGTERM fires inside the
1000ms timeout window. The earlier "slow-but-progressing" test
(writes return true) continues to pass.
@terencecho
Copy link
Copy Markdown
Contributor Author

I've addressed the non-blocking issues

Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving on 111be1e3. Heartbeat-on-accepted-only is the tighter semantic; new test pins the failure mode.

Audited

const accepted = ffmpeg.stdin.write(copy);
// ... comment ...
if (accepted) resetTimer();
return accepted;

Mechanism:

  • accepted === true — Node wrote directly to the OS pipe; FFmpeg drained → genuine forward progress → reset ✓
  • accepted === false — Node buffered in user-space (high-water mark hit); FFmpeg didn't drain yet → don't reset → hung FFmpeg trips SIGTERM within ffmpegStreamingTimeout even when Chrome keeps producing

Terence's inline comment captures the steady-state reasoning:

"In steady state with a slower-but-alive FFmpeg, writes alternate between true and false as the buffer drains and refills; the trues are enough to keep the heartbeat ticking."

That's exactly the right model. Node's writable-stream high-water-mark contract: write() returns true when bytes go through to the kernel pipe directly, false when they buffer. A healthy-but-slower FFmpeg produces a mix; a hung FFmpeg produces all-false. The heartbeat distinguishes the two correctly.

New test pins the previously-missing failure mode

proc.stdin.write = (_chunk: Buffer) => false;
for (let i = 0; i < 9; i++) {
  encoder.writeFrame(Buffer.from([i]));
  vi.advanceTimersByTime(900);
}
expect(proc.kill).toHaveBeenCalledWith("SIGTERM");

9 writes at 900ms intervals, all returning false. The initial resetTimer() at spawn armed the 1000ms timer; no subsequent writes reset it (every write was buffered, not accepted); SIGTERM fires when the timer elapses. Comment also documents the producer-still-pumping-but-ffmpeg-stalled framing — exactly the "OOM-instead-of-SIGTERM" pathology I'd flagged.

Combined with the existing "forward-progress under threshold doesn't kill" test, the streaming encoder now has tight pins on:

  1. Forward progress with accepted writes → encoder stays alive (existing test)
  2. Sustained backpressure (all writes buffered) → SIGTERM fires (new test)
  3. True stall (no writes at all) → SIGTERM fires (existing test)

Edge case I considered

What about steady-state where every-other write is accepted? Each true resets the 600s timer; the false writes in between don't matter as long as a true arrives within the window. For 600s of pure-false to occur, FFmpeg would need to be hung for 10 minutes with no drains. That's the failure mode we want to catch. ✓

The fix is bound only by the heartbeat window — a real intermittent slow consumer that briefly fails to drain for >600s would still get killed. For the default 600s that's a hard pathology, not normal operation.

CI

All required checks completed green or in-progress (only regression-shards still running). mergeable_state: blocked reviewer-gate.

Re-review by Rames Jusso (pr-review)

@terencecho terencecho merged commit efc16a9 into main May 16, 2026
55 checks passed
Copy link
Copy Markdown
Contributor Author

Merge activity

@terencecho terencecho deleted the fix/streaming-encode-inactivity-timeout branch May 16, 2026 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants