feat(container-runtime): enable batchId tracking by default with kill-switch and perf telemetry#27262
feat(container-runtime): enable batchId tracking by default with kill-switch and perf telemetry#27262
Conversation
…pec.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (509 lines, 11 files), I've queued these reviewers:
How this works
|
There was a problem hiding this comment.
Pull request overview
Enables batchId tracking / duplicate-batch detection by default (with a kill-switch) and adds telemetry for detector window activity, while updating replay/snapshot tooling and tests to stay compatible with the new default-on behavior (including the grouped-batching constraint).
Changes:
- Default-enable batchId tracking when
FlushMode.TurnBasedand grouped batching are enabled; add kill-switch and additional Offline Load validations. - Add perf telemetry counters to
DuplicateBatchDetectorand extend its unit tests. - Update snapshot/replay tooling and multiple test suites to normalize package versions and to handle the new
.recentBatchInfoblob/back-compat behavior.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/tools/replay-tool/src/index.ts | Re-exports normalizePackageVersions for downstream snapshot validation usage. |
| packages/tools/replay-tool/src/helpers.ts | Extracts packageVersion normalization into a reusable helper for snapshot comparisons. |
| packages/test/test-end-to-end-tests/src/test/fewerBatches.spec.ts | Disables batchId tracking for a targeted scenario where the detector would short-circuit the intended assertion. |
| packages/test/test-end-to-end-tests/src/test/batching.spec.ts | Switches to Offline Load opt-in to exercise the correct modern enablement path. |
| packages/test/snapshots/src/validateSnapshots.ts | Adds logic to ignore .recentBatchInfo when validating old-format snapshots; reuses version normalization. |
| packages/test/functional-tests/src/test/containerRuntime.spec.ts | Ensures test messages include clientSequenceNumber to match expected summary metadata structure. |
| packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts | Updates telemetry assertions and adds coverage for per-summary-window counter resets. |
| packages/runtime/container-runtime/src/test/containerRuntime.spec.ts | Updates expectations for default-on stamping/detection; adds coverage for Immediate mode + grouped batching disabled scenarios. |
| packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts | Adds per-summary perf counters and emits them via telemetry context. |
| packages/runtime/container-runtime/src/containerRuntime.ts | Implements default-on enablement gated by TurnBased+grouped batching, kill-switch, and stronger Offline Load validations. |
| this.batchIdTrackingEnabled = | ||
| !disableBatchIdTracking && | ||
| this._flushMode === FlushMode.TurnBased && | ||
| this.groupedBatchingEnabled; |
There was a problem hiding this comment.
Deep Review: Confirming this concern — independent analysis landed on the same finding. The PR's own added comment at containerRuntime.ts:1893 declares the intent: "Offline Load requires both prerequisites, so a consumer that opts into it without them gets an explicit UsageError rather than silent degradation." The constructor enforces that for the FlushMode and grouped-batching prereqs (the two existing if (offlineLoadRequested && ...) throw new UsageError(...) checks at lines ~1916), but not for the new kill-switch.
When a host sets enableOfflineFull=true and an operator flips DisableBatchIdTracking=true (the most likely path during an incident), batchIdTrackingEnabled becomes false, no DuplicateBatchDetector is constructed, batchIds stop being stamped on resubmits (the same flag also gates the metadata path at containerRuntime.ts:4995-4999), and any recentBatchInfo blob loaded from a prior snapshot is silently dropped on the next summary. Offline resubmission semantics break with no error or telemetry.
Suggested fix — add a third precondition check immediately after the existing two:
if (offlineLoadRequested && disableBatchIdTracking) {
throw new UsageError("Offline Load requires batch ID tracking; remove Fluid.ContainerRuntime.DisableBatchIdTracking or disable Offline Load.");
}Or make offlineLoadRequested override disableBatchIdTracking and emit a logged warning. Either choice restores the contract for all three preconditions. Add a unit test parallel to the two existing prereq-error tests in containerRuntime.spec.ts.
There was a problem hiding this comment.
enableOfflineFull will be removed in a follow up PR. For know we just wanted to remove the batch id tracking flag safely and eventually just have one feature flag for all offline features.
Deep ReviewReviewed commit Readiness: 5/10 — MAKING PROGRESS Not ready for sign-off. One contained correctness concern: the new Path to Ready
Context for Reviewers
For human reviewer
|
Same as #27216 with the addition of enabling batch id tracking only when groupedBatching is enabled as well.
Previous PR was breaking real service stress tests due to hitting an assert related to creating empty batches when grouped batching was not enabled.
This run https://dev.azure.com/fluidframework/internal/_build/results?buildId=398830 confirms issue is solved with added changes.