feat(container-runtime): enable batchId tracking by default with kill-switch and perf telemetry#27216
Conversation
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (433 lines, 11 files), I've queued these reviewers:
How this works
|
There was a problem hiding this comment.
Pull request overview
This PR makes batchId tracking (and fork/duplicate-batch detection) default-on for ContainerRuntime in FlushMode.TurnBased, adds a kill-switch to disable it in the field, and extends summary-time perf telemetry emitted by DuplicateBatchDetector.
Changes:
- Enable batchId tracking by default in TurnBased mode, skip it in Immediate mode, and add kill-switch
Fluid.ContainerRuntime.DisableBatchIdTrackingplus a one-time enablement telemetry event. - Add per-summary-window perf counters (
peakRecentBatchCount,processedBatchCount) toDuplicateBatchDetectortelemetry and reset them each summary window. - Update/unit-test expectations and configurations to reflect default-on behavior and kill-switch behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/test/test-end-to-end-tests/src/test/batching.spec.ts | Updates test config to preserve “Offline Load + Immediate throws” behavior now that batchId tracking no longer throws in Immediate mode. |
| packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts | Extends tests to validate new per-summary perf telemetry and counter resets. |
| packages/runtime/container-runtime/src/test/containerRuntime.spec.ts | Updates runtime tests for default-on stamping/detection, adds Immediate-mode skip test, and keeps Offline Load Immediate-mode back-compat error test. |
| packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts | Adds perf counters, emits new telemetry fields, and resets per-summary window counters. |
| packages/runtime/container-runtime/src/containerRuntime.ts | Implements default enablement in TurnBased, kill-switch, and enablement telemetry event; allocates detector only when enabled. |
| this.closeFn(error); | ||
| throw error; | ||
| } | ||
|
|
There was a problem hiding this comment.
If Fluid.Container.enableOfflineFull is true while the kill-switch Fluid.ContainerRuntime.DisableBatchIdTracking is also true, the runtime will allow Offline Load in TurnBased mode but will disable batchIdTrackingEnabled (and thus disable batchId stamping + DuplicateBatchDetector). That seems to violate the comment that tracking is a prerequisite for Offline Load and could re-enable unsafe fork scenarios silently. Consider either (a) ignoring the kill-switch when Offline Load is explicitly requested, or (b) throwing a UsageError when both flags are set so the configuration can't be enabled in an unsupported state.
| if (offlineLoadRequested && disableBatchIdTracking) { | |
| const error = new UsageError( | |
| "Offline mode requires batchId tracking and cannot be used when Fluid.ContainerRuntime.DisableBatchIdTracking is enabled.", | |
| ); | |
| this.closeFn(error); | |
| throw error; | |
| } |
There was a problem hiding this comment.
offline load flag will also by removed in a follow up
There was a problem hiding this comment.
Deep Review: Following up on this thread now that batchId tracking is default-on. The asymmetry the original comment flagged is still live in the current diff: the offlineLoadRequested UsageError validates only flush mode, not the kill-switch, and the constructor comment immediately above still describes batchId tracking as the prerequisite for Offline Load. Pre-PR, enableOfflineFull=true unconditionally enabled tracking; post-PR, setting enableOfflineFull=true together with Fluid.ContainerRuntime.DisableBatchIdTracking=true in TurnBased silently strips the prerequisite — Offline Load proceeds with fork detection disabled, no error, no telemetry. The PR description says Offline Load is "back-compat preserved" but only the FlushMode portion currently is.
@dannimad you mentioned enableOfflineFull will be removed in a follow-up — that resolves this cleanly if it's imminent. A couple of cheap options either way:
- (a) Link the follow-up issue here and add a one-line comment in the constructor noting the asymmetry is intentionally transient, or
- (b) Add the symmetric guard now — extend the existing
UsageErrorto fire whenofflineLoadRequested && disableBatchIdTracking, or emit aBatchIdTrackingDisabledForOfflineLoadwarning telemetry so the misconfig is at least observable while the follow-up lands.
Either closes the loop on this thread.
| registry, | ||
| loaderProps: { | ||
| configProvider: configProvider({ | ||
| "Fluid.ContainerRuntime.enableBatchIdTracking": true, | ||
| "Fluid.Container.enableOfflineFull": true, | ||
| }), | ||
| }, |
There was a problem hiding this comment.
This suite now enables Fluid.Container.enableOfflineFull for all tests. That changes behavior beyond just batchId tracking (and forces callers to remember to pass disableOfflineLoad=true for Immediate mode). If the goal is only to keep the one Offline Load validation test, consider scoping enableOfflineFull to that specific test/setup instead of setting it globally for the whole describeCompat, to avoid running unrelated batching tests under the Offline Load configuration.
…pec.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Deep ReviewReviewed commit Readiness: 6/10 — 🔨 MAKING PROGRESS Design continues to converge — independent solution-space proposals matched the PR's shape (default-on Path to Ready
Context for Reviewers
For human reviewer
Review history (2 prior reviews)
|
Enables batchId tracking and the
DuplicateBatchDetectorby default inFlushMode.TurnBasedso that "parallel fork" duplicate batches (the same local state sequenced twice from two containers rehydrated from the same serialized pending state) are caught for all consumers, not only those who opted into Offline Load.Changes in
container-runtime:batchIdTrackingEnabledis nowtruewhenever the runtime is inFlushMode.TurnBasedand the kill-switch is not set. InFlushMode.Immediatethe detector is silently skipped (it has nothing meaningful to do without batches).Fluid.ContainerRuntime.DisableBatchIdTrackingdisables the feature without a code change if a regression is observed in the field.Fluid.ContainerRuntime.enableBatchIdTrackingopt-in flag. It is no longer needed — the feature is unconditionally on.Fluid.Container.enableOfflineFullremains, since Offline Load still requires the sameTurnBased-only constraint and continues to throwUsageErrorif combined withFlushMode.Immediate(back-compat preserved).