
fix(engine): byte-budget the frame data uri cache to bound memory at 4k #662

Open

jrusso1020 wants to merge 2 commits into 05-07-feat_cli_add_--resolution_flag_to_hyperframes_init_for_4k_scaffolding from 05-07-fix_engine_byte-budget_the_frame_data_uri_cache_to_bound_memory_at_4k

Conversation

Collaborator

@jrusso1020 jrusso1020 commented May 7, 2026

What

Replaces the entry-count-only LRU in videoFrameInjector with a two-bound LRU that evicts on either entry count OR byte budget. Adds frameDataUriCacheBytesLimitMb config (default 1500 MB) plus a PRODUCER_FRAME_DATA_URI_CACHE_BYTES_MB env var.

Why

Today the cache evicts purely on entry count (frameDataUriCacheLimit: 256). The base64 data URI of each cached frame scales with the source frame size:

| Resolution | PNG frame size | Data URI size | 256 entries × URI |
| --- | --- | --- | --- |
| 1080p | ~6 MB | ~8 MB | ~2 GB |
| 4K UHD | ~25 MB | ~33 MB | ~8.4 GB per worker |

At 4K with a multi-worker render, the cache alone OOMs commodity boxes long before the entry cap fires. With supersampling or HDR PNG frames it gets worse. The byte budget keeps steady-state memory bounded regardless of resolution while leaving 1080p behavior essentially unchanged (the old ceiling of 256 × 8 MB ≈ 2 GB is now capped by the 1.5 GB budget — actually slightly tighter, by design).

This is PR 3 of the 4K stack. Stacked on #661.

How

  • EngineConfig.frameDataUriCacheBytesLimitMb: number added to packages/engine/src/config.ts. Default 1500. Min clamp 64. Env override PRODUCER_FRAME_DATA_URI_CACHE_BYTES_MB.
  • createFrameSourceCache(entryLimit, bytesLimit, frameSrcResolver?) now tracks per-entry byte size in a parallel Map. After each remember(), evicts oldest entry while cache.size > entryLimit || totalBytes > bytesLimit. Both bounds are upper bounds — eviction is greedy until both are satisfied.
  • createVideoFrameInjector plumbs the new config field through. Producer's renderOrchestrator passes cfg.frameDataUriCacheBytesLimitMb to the injector.
  • __testing export exposes createFrameSourceCache for unit testing without spinning up Chrome.
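The eviction mechanics described in these bullets can be sketched roughly as follows. This is a hedged reconstruction from the PR description, not the actual diff: the real `createFrameSourceCache` also takes a `frameSrcResolver` and reads frames from disk, and internals here (the `sizes` Map name, the `stats()` shape) are assumptions.

```typescript
// Sketch of a two-bound (entry count + byte budget) LRU keyed on insertion order.
// Names follow the PR description; details are assumptions, not the actual diff.
type FrameSourceCache = {
  remember(key: string, dataUri: string): string;
  get(key: string): string | undefined;
  stats(): { entries: number; totalBytes: number };
};

function createFrameSourceCache(entryLimit: number, bytesLimit: number): FrameSourceCache {
  const cache = new Map<string, string>();
  const sizes = new Map<string, number>(); // parallel byte accounting, as in this PR
  let totalBytes = 0;

  function evictOldest(): void {
    const oldestKey = cache.keys().next().value as string | undefined;
    if (oldestKey === undefined) return;
    totalBytes -= sizes.get(oldestKey) ?? 0;
    cache.delete(oldestKey);
    sizes.delete(oldestKey);
  }

  return {
    remember(key, dataUri) {
      if (cache.has(key)) {
        // reset-on-rewrite: subtract the previous entry's bytes before re-inserting
        totalBytes -= sizes.get(key) ?? 0;
        cache.delete(key);
      }
      cache.set(key, dataUri);
      sizes.set(key, dataUri.length);
      totalBytes += dataUri.length;
      // Greedy eviction until BOTH bounds are satisfied (or the cache is empty).
      while ((cache.size > entryLimit || totalBytes > bytesLimit) && cache.size > 0) {
        evictOldest();
      }
      return dataUri; // caller gets the value even if it was immediately evicted
    },
    get(key) {
      return cache.get(key);
    },
    stats() {
      return { entries: cache.size, totalBytes };
    },
  };
}
```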

Test plan

  • Unit tests added/updated — 4 new tests in videoFrameInjector.test.ts:
    • Evicts oldest entry when entry count exceeds limit
    • Evicts oldest entry when byte budget is exceeded (asserts bytes <= limit invariant)
    • frameSrcResolver short-circuit doesn't pollute the cache; falls through to file read when resolver returns null
    • Re-reading the same frame is a cache hit
  • Manual testing performed:
    • bun run --cwd packages/engine test — 541/541 pass
    • bunx vitest run packages/producer/src/services/renderOrchestrator.test.ts — 42/43 pass; the 1 failing test (rejects a maliciously crafted key…) also fails on clean main and is unrelated to this change
  • Documentation updated — config knobs are documented in inline JSDoc on EngineConfig; no user-facing CLI flag added in this PR


Collaborator

@vanceingalls vanceingalls left a comment


Verdict: request changes. The two-bound LRU is the right fix, and the math holds up at 4K. But there are two real issues that should land before this ships, plus a missing observability hook that I'd push hard on for any memory-budget knob headed to prod.

Blockers

  • blocker — packages/engine/src/services/videoFrameInjector.ts:62-78 (and the call site at lines 124-130) — The 64 MB minimum on frameDataUriCacheBytesLimitMb is not sufficient to guarantee a single 4K frame is cacheable. A 4K PNG with a busy/noisy frame can easily exceed 33 MB raw, and base64 inflation is 4/3 plus a ~22-byte prefix — up to ~45-50 MB per data URI in the upper tail. With the 64 MB floor and a single 50 MB data URI, the post-insert eviction loop runs while (totalBytes > bytesLimit && cache.size > 0) and evicts the entry it just inserted. The caller still gets the data URI through the returned promise, but every subsequent get() re-reads the file and re-base64s it — turning the cache into a CPU hot path under exactly the conditions it was designed to handle.

    Fix one of: (a) when inserting an entry that exceeds the budget alone, log + skip caching it (if (dataUri.length > bytesLimit) return dataUri before cache.set), or (b) raise the floor to a value that fits a worst-case 4K PNG (256 MB feels safer). I'd take (a) — caching a single oversized entry that immediately evicts itself is pure overhead. Either way, add a test that pins the behavior at bytesLimit = 32 * 1024 with a 64 KB entry (current code: entry is inserted, then evicted, stats().entries === 0).

    #665's JSDoc acknowledges this edge case ("the post-insert eviction loop will drop the entry we just inserted") but neither PR fixes it. Acknowledging a footgun in a comment is not the same as guarding against it.
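Fix (a) could be as small as a guard at the top of the caching step. A hypothetical sketch — rememberWithGuard and its shape are illustrative, not code from this PR:

```typescript
// Hypothetical guard: never cache a value that alone busts the byte budget.
// Caching it would only trigger an immediate self-eviction in the post-insert loop.
function rememberWithGuard(
  cache: Map<string, string>,
  bytesLimit: number,
  key: string,
  dataUri: string,
): string {
  if (dataUri.length > bytesLimit) {
    // Optionally log here so operators can see the budget is too tight for one frame.
    return dataUri; // skip caching; the caller still gets the encoded frame
  }
  cache.set(key, dataUri);
  return dataUri;
}
```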

Important

  • important — packages/engine/src/services/videoFrameInjector.ts:76 — No observability on eviction. For a memory budget that the team is going to tune in prod (PRODUCER_FRAME_DATA_URI_CACHE_BYTES_MB), there's no way from logs/metrics to tell whether the budget is too tight (lots of evictions → cache thrash) or too loose (memory pressure but evictions never fire). Add a counter — at minimum log every Nth eviction with { totalBytes, entries, evictedKey } at debug, or expose a stats()-derived eviction count so renderOrchestrator can include it in RenderPerfSummary. Without this, the next "4K renders are slow" investigation has no telemetry to anchor on.
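    A minimal sketch of the sampled-eviction-logging idea; makeEvictionTracker, LOG_EVERY_N, and the log shape are all hypothetical names, not existing code:

```typescript
// Sketch: count evictions and log only every Nth one to keep the output quiet,
// while exposing the running count for a perf summary.
const LOG_EVERY_N = 100; // assumed sampling rate

function makeEvictionTracker(log: (msg: string) => void) {
  let evictionCount = 0;
  return {
    onEvict(evictedKey: string, totalBytes: number, entries: number): void {
      evictionCount += 1;
      if (evictionCount % LOG_EVERY_N === 0) {
        log(
          `frame cache eviction #${evictionCount}: ` +
            JSON.stringify({ totalBytes, entries, evictedKey }),
        );
      }
    },
    count: () => evictionCount,
  };
}
```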

  • important — packages/engine/src/services/videoFrameInjector.ts:74-78 — while (cache.size > entryLimit || totalBytes > bytesLimit) is correct, but evictOldest() at lines 64-69 reads cache.get(oldestKey)?.length ?? 0 to decrement — this is the post-#665 shape. In this PR (#662) it's still using the parallel sizes Map. The parallel Map is fine (correct, simple) but #665 also needs to land cleanly on top — and #665's "drop the parallel Map" change relies on the invariant that cache.get(key).length returns the byte size we accounted for at insertion. That's true for raw strings, but if anyone ever wraps the value (e.g. caches a Buffer or an object), the simplification silently breaks. Worth a load-bearing comment in this PR's remember() that says "value MUST be a string whose .length equals the bytes-accounted-at-insert" so the assumption survives the next refactor.

  • important — packages/engine/src/services/videoFrameInjector.test.ts:36-46 — The "evicts oldest entry when entry count exceeds limit" test asserts entries === 2 after inserting 3 entries with limit 2, but doesn't assert which entry was evicted. The contract is "oldest" (i.e. LRU on insert order, since Map.keys().next() returns insertion order). Add: insert a, b, c with limit 2 → call cache.get(a) afterwards → that should be a miss (re-read from disk produces a fresh data URI). Today the test passes even if the cache evicted the wrong entry.
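    The suggested assertion, sketched against a minimal insertion-ordered stand-in (the real test would exercise createFrameSourceCache from the __testing export; insertWithLimit is a hypothetical helper):

```typescript
// Sketch of the stronger assertion: with entryLimit 2, inserting a, b, c must
// evict exactly "a", because Map.keys() yields insertion order and the first
// key is the oldest. Stand-in cache, not the real implementation.
function insertWithLimit(keys: string[], entryLimit: number): { kept: string[]; evicted: string[] } {
  const cache = new Map<string, true>();
  const evicted: string[] = [];
  for (const key of keys) {
    cache.set(key, true);
    while (cache.size > entryLimit) {
      const oldest = cache.keys().next().value as string;
      cache.delete(oldest);
      evicted.push(oldest);
    }
  }
  return { kept: [...cache.keys()], evicted };
}
```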

  • important — Memory math sanity check (good news, mostly):

    • 4K PNG raw ≈ 25-33 MB (matches PR table)
    • Base64 data URI ≈ 33-44 MB per frame
    • Default 1500 MB ÷ 33 MB = ~45 frames cached at 4K = 1.5s @ 30fps. For sequential render this is fine; for any code path that does seek-back / look-ahead across more than ~1.5s, the cache is effectively useless at 4K. Question for James: have you traced whether anything in the renderOrchestrator does non-sequential frame access? A perf regression at 4K from cache thrash isn't visible from wc -l of the diff.
    • 1080p with 256 entries × 8 MB = 2 GB old behavior, but 1500 MB budget caps it at ~187 entries (~6s @ 30fps). This is a behavior regression for 1080p — the PR description acknowledges it ("actually slightly tighter, by design") but it should be tested. Add a 1080p regression test that proves entries < 256 at the default budget. If anyone tunes the budget down further, 1080p users want to know what the steady-state cache size is.
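    The capacity arithmetic in these two bullets, written out (the per-frame sizes are the approximate figures quoted in this thread, not measurements):

```typescript
// Back-of-envelope cache capacity at the default 1500 MB budget.
const MB = 1024 ** 2;
const budgetBytes = 1500 * MB;  // default frameDataUriCacheBytesLimitMb
const uri4k = 33 * MB;          // ~33 MB base64 data URI per 4K PNG frame (approx.)
const uri1080p = 8 * MB;        // ~8 MB per 1080p frame (approx.)

const frames4k = Math.floor(budgetBytes / uri4k);                      // ~45 frames
const frames1080p = Math.min(256, Math.floor(budgetBytes / uri1080p)); // budget beats the entry cap
const lookbackSeconds4k = frames4k / 30;                               // at 30 fps

console.log({ frames4k, frames1080p, lookbackSeconds4k });
```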

Nits

  • nit — packages/engine/src/config.ts:80-87 — JSDoc says "1080p with ~6 MB per JPEG frame" but the data URI math elsewhere uses PNG (25 MB raw → 33 MB encoded). Pick one and stick with it; the 1080p case is JPEG (needsAlpha ? "png" : "jpeg" in renderOrchestrator), the 4K case can be either. JSDoc could just say "data URI size scales with frame size" and skip the per-format numbers.

  • nit — packages/engine/src/services/videoFrameInjector.test.ts:57 — Comment says "1 KB raw frame → ~1.4 KB base64" — the multiplier is 4/3 ≈ 1.33×, not 1.4×. Tiny precision; fine to leave.

  • nit — packages/engine/src/services/videoFrameInjector.ts:117 — bytesLimitMb * 1024 * 1024 is fine, but bytesLimitMb * 1024 ** 2 reads slightly cleaner.

Cross-PR

This PR composes correctly with #663 — the orchestrator's deviceScaleFactor path is what makes 4K data URIs actually appear, and this is the cache that holds them. Fine to land independently but the blocker should be fixed first.

Praise

The two-bound LRU is the right shape; the byte budget is the bound that actually matters for memory safety. Per-entry size tracking, eviction-on-insert (vs. eviction-on-overflow at the next call), and reset-on-rewrite (if (cache.has(key)) totalBytes -= prev) are all correct. The test that the cache short-circuits via frameSrcResolver without polluting itself is a nice touch.

— Vai

Collaborator

@miguel-heygen miguel-heygen left a comment


Approving this layer. I did not find a blocker in the byte-budgeted frame data URI cache diff.

What I checked:

  • the cache now has both an entry-count bound and byte-budget bound
  • replacing a key subtracts the previous cached string length before re-inserting
  • the config option is wired through producer -> engine frame injection
  • the top-of-stack focused verification that included this code passed after bootstrapping generated runtime artifacts (videoFrameInjector.test.ts, producer render-orchestrator focused tests, core/studio route tests, CLI docker args/init tests)

Non-blocking follow-ups I would still consider useful:

  • add PRODUCER_FRAME_DATA_URI_CACHE_BYTES_MB coverage in config.test.ts for default/env/min clamp
  • assert the specific oldest entry is evicted, not just that the cache has fewer entries after byte-budget pressure

Note: the stack still has blockers in lower/higher layers, and this PR currently still shows an aggregate CHANGES_REQUESTED state from existing review state. My approval is for this PR diff itself.

I rechecked the live head before posting: 03fd975a4a8b879e40333be95e5ec1d7dea37504.

Collaborator

@vanceingalls vanceingalls left a comment


Retracting my prior REQUEST_CHANGES on this PR — Magi's analysis is correct.

I re-traced createFrameSourceCache after Magi's pushback. The critical flow when a single frame exceeds bytesLimit:

  1. remember() inserts the entry (cache.set + sizes.set + totalBytes += size).
  2. The while ((cache.size > entryLimit || totalBytes > bytesLimit) && cache.size > 0) { evictOldest() } loop runs until either the cache fits or empties.
  3. If the just-inserted entry alone exceeds bytesLimit, the loop evicts everything including the new entry — but remember() then return dataUri, so the caller still gets the data URI.

The outcome on a 4K frame larger than the configured budget is cache thrash (re-read + re-encode on every get), not OOM or render failure. That's a perf concern, not a correctness blocker.

Magi also correctly pointed out that the default budget is 1500 MB, not 64 MB — 64 MB is the minimum floor that only fires when someone explicitly sets PRODUCER_FRAME_DATA_URI_CACHE_BYTES_MB to a low value via env var. At default, 4K frames (~33 MB encoded) cache comfortably (~45 frames at default).

Updated verdict: comment. The if (dataUri.length > bytesLimit) return dataUri skip-cache guard I suggested would still be a small win (saves the Map insert + evict cycle on each 4K frame call when an env-var override sets the budget too low), but it's a follow-up nit, not a blocker. My earlier characterization that this would OOM or break rendering was wrong — it degrades to "cache miss every time," which is the bug-tier I should have called originally.

Two of my prior important items still stand on their own merits:

  • Eviction observability — a Datadog counter or log line on eviction fires gives you "is the budget right?" data from prod, especially valuable for users who set env-var overrides.
  • 1080p regression test — pin the tighter behavior so future tweaks don't quietly regress the cache hit rate at 1080p.

Apologies for the retraction noise; still, better that than letting a stale REQUEST_CHANGES sit on top of code that degrades cleanly.

— Vai
