Skip to content

feat(container-runtime): enable batchId tracking by default with kill-switch and perf telemetry#27262

Open
dannimad wants to merge 10 commits intomainfrom
danielcar/stress-fix
Open

feat(container-runtime): enable batchId tracking by default with kill-switch and perf telemetry#27262
dannimad wants to merge 10 commits intomainfrom
danielcar/stress-fix

Conversation

@dannimad
Copy link
Copy Markdown
Contributor

@dannimad dannimad commented May 7, 2026

Same as #27216 with the addition of enabling batch id tracking only when groupedBatching is enabled as well.

Previous PR was breaking real service stress tests due to hitting an assert related to creating empty batches when grouped batching was not enabled.

This run https://dev.azure.com/fluidframework/internal/_build/results?buildId=398830 confirms issue is solved with added changes.

Copilot AI review requested due to automatic review settings May 7, 2026 21:07
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (509 lines, 11 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enables batchId tracking / duplicate-batch detection by default (with a kill-switch) and adds telemetry for detector window activity, while updating replay/snapshot tooling and tests to stay compatible with the new default-on behavior (including the grouped-batching constraint).

Changes:

  • Default-enable batchId tracking when FlushMode.TurnBased and grouped batching are enabled; add kill-switch and additional Offline Load validations.
  • Add perf telemetry counters to DuplicateBatchDetector and extend its unit tests.
  • Update snapshot/replay tooling and multiple test suites to normalize package versions and to handle the new .recentBatchInfo blob/back-compat behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/tools/replay-tool/src/index.ts Re-exports normalizePackageVersions for downstream snapshot validation usage.
packages/tools/replay-tool/src/helpers.ts Extracts packageVersion normalization into a reusable helper for snapshot comparisons.
packages/test/test-end-to-end-tests/src/test/fewerBatches.spec.ts Disables batchId tracking for a targeted scenario where the detector would short-circuit the intended assertion.
packages/test/test-end-to-end-tests/src/test/batching.spec.ts Switches to Offline Load opt-in to exercise the correct modern enablement path.
packages/test/snapshots/src/validateSnapshots.ts Adds logic to ignore .recentBatchInfo when validating old-format snapshots; reuses version normalization.
packages/test/functional-tests/src/test/containerRuntime.spec.ts Ensures test messages include clientSequenceNumber to match expected summary metadata structure.
packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts Updates telemetry assertions and adds coverage for per-summary-window counter resets.
packages/runtime/container-runtime/src/test/containerRuntime.spec.ts Updates expectations for default-on stamping/detection; adds coverage for Immediate mode + grouped batching disabled scenarios.
packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts Adds per-summary perf counters and emits them via telemetry context.
packages/runtime/container-runtime/src/containerRuntime.ts Implements default-on enablement gated by TurnBased+grouped batching, kill-switch, and stronger Offline Load validations.

Comment on lines +1918 to +1921
this.batchIdTrackingEnabled =
!disableBatchIdTracking &&
this._flushMode === FlushMode.TurnBased &&
this.groupedBatchingEnabled;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: Confirming this concern — independent analysis landed on the same finding. The PR's own added comment at containerRuntime.ts:1893 declares the intent: "Offline Load requires both prerequisites, so a consumer that opts into it without them gets an explicit UsageError rather than silent degradation." The constructor enforces that for the FlushMode and grouped-batching prereqs (the two existing if (offlineLoadRequested && ...) throw new UsageError(...) checks at lines ~1916), but not for the new kill-switch.

When a host sets enableOfflineFull=true and an operator flips DisableBatchIdTracking=true (the most likely path during an incident), batchIdTrackingEnabled becomes false, no DuplicateBatchDetector is constructed, batchIds stop being stamped on resubmits (the same flag also gates the metadata path at containerRuntime.ts:4995-4999), and any recentBatchInfo blob loaded from a prior snapshot is silently dropped on the next summary. Offline resubmission semantics break with no error or telemetry.

Suggested fix — add a third precondition check immediately after the existing two:

if (offlineLoadRequested && disableBatchIdTracking) {
    throw new UsageError("Offline Load requires batch ID tracking; remove Fluid.ContainerRuntime.DisableBatchIdTracking or disable Offline Load.");
}

Or make offlineLoadRequested override disableBatchIdTracking and emit a logged warning. Either choice restores the contract for all three preconditions. Add a unit test parallel to the two existing prereq-error tests in containerRuntime.spec.ts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enableOfflineFull will be removed in a follow up PR. For know we just wanted to remove the batch id tracking flag safely and eventually just have one feature flag for all offline features.

Comment thread packages/test/snapshots/src/validateSnapshots.ts
@anthony-murphy
Copy link
Copy Markdown
Contributor

Deep Review

Reviewed commit 2f3c6a8 on 2026-05-07.

Readiness: 5/10 — MAKING PROGRESS

Not ready for sign-off. One contained correctness concern: the new Fluid.ContainerRuntime.DisableBatchIdTracking kill-switch silently disables Offline Load even when a consumer explicitly opts in via Fluid.Container.enableOfflineFull, contradicting the "no silent degradation" intent codified in the comment this PR adds at containerRuntime.ts:1893. Same point was raised independently by copilot-pull-request-reviewer on the inline thread at line 1921 — see the reply there. Fix is one extra precondition check parallel to the two already there.

Path to Ready

  • Resolve inline threads

Context for Reviewers

For human reviewer
  • Needs human judgment — Throw vs. silent override for the kill-switch is a deliberate design call. The PR's freshly added comment chose throw for two of three prereqs; suggest @andre4i and @vladsud decide whether the kill-switch joins the throw side or is documented as the operator's escape hatch that supersedes Offline Load (in which case at minimum a logged telemetry event when overridden).
  • Needs human judgment — Whether the existing tests at containerRuntime.spec.ts:2477-2513 are adequate regression coverage for the 0xa00 failure that broke feat(container-runtime): enable batchId tracking by default with kill-switch and perf telemetry #27216 is a re-land judgment call best made by @kian-thompson (resubmit + batch-metadata owner).
  • Cannot be assessed by the pipeline — Whether dropping Fluid.ContainerRuntime.enableBatchIdTracking outright (no deprecation shim, no honoring of explicit =false) is acceptable for this rollout is an org back-compat policy call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants