
feat(tree): internals for field batch format with specialized node shapes #27200

Open
justus-camp-microsoft wants to merge 4 commits into microsoft:main from
justus-camp-microsoft:new_experimental_codec

Conversation

@justus-camp-microsoft
Contributor

justus-camp-microsoft commented Apr 29, 2026

Description

Adds internals for a new experimental chunked-forest codec format (vTextExperimental, version key "text") targeted at compressing runs of nodes that differ only in a few properties — for example, styled text runs where many nodes share most of their structural skeleton.

The core addition is EncodedSpecializedNodeShape (f shape variant): a node shape that derives from another node shape by overlaying property-level overrides. A base node shape defines the structural skeleton; the specialization specifies overrides for those properties.

Decode-side support lives in SpecializedNodeDecoder, which walks the specialized f shapes back to a given c shape via normalizeToNodeShape, then merges via applySpecialization: base field order is preserved, overridden fields are replaced in place, and new fields are appended. Cycles and missing bases are caught with asserts.
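The merge described above can be sketched roughly as follows. This is a minimal sketch with hypothetical, simplified types; the real EncodedNodeShape and EncodedSpecializedNodeShape schemas carry more structure, and the real code resolves shape indices through a decoder context.

```typescript
// Hypothetical, simplified stand-ins for the codec's shape types.
type FieldEntry = [key: string, shapeIndex: number];

interface NodeShape {
  fields: FieldEntry[];
}

interface SpecializedNodeShape {
  base: number; // Index of the base shape (resolution is out of scope here).
  fields?: FieldEntry[]; // Property-level field overrides.
}

// Sketch of the merge: base field order is preserved, overridden fields
// are replaced in place, and new fields are appended.
function applySpecializationSketch(
  base: NodeShape,
  specialization: SpecializedNodeShape,
): NodeShape {
  const overrides = new Map(specialization.fields ?? []);
  const merged: FieldEntry[] = base.fields.map(
    ([key, shape]): FieldEntry => [key, overrides.get(key) ?? shape],
  );
  const baseKeys = new Set(base.fields.map(([key]) => key));
  for (const [key, shape] of specialization.fields ?? []) {
    if (!baseKeys.has(key)) {
      merged.push([key, shape]); // New fields go at the end, in override order.
    }
  }
  return { fields: merged };
}
```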

Existing types broadened for the new union:

  • EncodedChunkShapeV1OrV2 renamed to EncodedChunkShape (now V1 | V2 | VTextExperimental).
  • shapesV2 factored out of EncodedChunkShapeV2 so the new format can extend it.

This PR is decoder + format scaffolding only. Encoders are not yet emitting f shapes; that will follow in a separate change. The codec will be registered as part of introducing the encoder.

Reviewer Guidance

The review process is outlined on this wiki page.

Copilot AI review requested due to automatic review settings April 29, 2026 16:37
@justus-camp-microsoft justus-camp-microsoft requested a review from a team as a code owner April 29, 2026 16:37
@github-actions
Contributor

github-actions Bot commented Apr 29, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (1317 lines, 8 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Contributor

Copilot AI left a comment

Pull request overview

Adds scaffolding for an experimental “text” chunked-forest codec shape (f) that can derive node shapes from a base shape via property-level overrides, enabling better compression for runs of structurally similar nodes.

Changes:

  • Introduces EncodedSpecializedNodeShape / EncodedChunkShapeVTextExperimental schema and adds vTextExperimental: "text" to FieldBatchFormatVersion.
  • Adds decode-side support for f shapes via SpecializedNodeDecoder plus shape-chain resolution and specialization merging.
  • Broadens encoder/decoder shape typing to a unified EncodedChunkShape union and adds targeted unit tests for specialization behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

File summaries:

  • packages/dds/tree/src/test/feature-libraries/chunked-forest/codec/chunkDecoding.spec.ts: Adds extensive unit coverage for specialized-node decoding and dispatcher integration.
  • packages/dds/tree/src/feature-libraries/chunked-forest/codec/nodeEncoder.ts: Updates encoder shape generics to use the new EncodedChunkShape union type.
  • packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/versions.ts: Adds the vTextExperimental version and defines the EncodedChunkShape union including the new format.
  • packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/index.ts: Re-exports the new format schema/types and the updated chunk-shape union.
  • packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/formatVText.ts: Defines the experimental text codec schema, including the new f shape.
  • packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/formatV2.ts: Factors out shapesV2 for reuse by the experimental text schema.
  • packages/dds/tree/src/feature-libraries/chunked-forest/codec/compressedEncode.ts: Broadens encoder buffer/shape generics to EncodedChunkShape.
  • packages/dds/tree/src/feature-libraries/chunked-forest/codec/chunkDecoding.ts: Implements SpecializedNodeDecoder, specialization merge logic, and dispatcher support for f.

}

/**
* Walks a chain of specialized node shapes (`f`) back to the concrete `c` node shape they are
Contributor

From this description and the function name, it sounds like this returns the base shape.

From the implementation, this looks like it looks up the shape and normalizes away any specialized node shapes. I think the docs and likely the function name could use adjustment.

@CraigMacomber
Contributor

Note about the PR title: this is a new format for an existing codec, not a new codec (our code doesn't separate these well yet, but I've been trying to make it clearer). I'd describe this as "add internals for experimental field batch codec format with specialized node shapes" (or maybe shortened to "internals for field batch format with specialized node shapes" for the description).

It might make sense to include the addition of the new format to fieldBatchCodecBuilder as part of this (lack of that is why I called this "internals" above), and thus support actually encoding and decoding the new format (just not using f in the encoder yet). That would opt the new code into per-format test coverage for things like round tripping and JSON schema snapshotting. Since the format version is unstable (a string), that should be safe. I'm fine with it either way though.

@justus-camp-microsoft justus-camp-microsoft changed the title feat(tree): add experimental codec with specialized node shapes feat(tree): internals for field batch format with specialized node shapes Apr 30, 2026
@justus-camp-microsoft
Contributor Author

Note about the PR title: this is a new format for an existing codec, not a new codec (our code doesn't separate these well yet, but I've been trying to make it clearer). I'd describe this as "add internals for experimental field batch codec format with specialized node shapes" (or maybe shortened to "internals for field batch format with specialized node shapes" for the description).

It might make sense to include the addition of the new format to fieldBatchCodecBuilder as part of this (lack of that is why I called this "internals" above), and thus support actually encoding and decoding the new format (just not using f in the encoder yet). That would opt the new code into per-format test coverage for things like round tripping and JSON schema snapshotting. Since the format version is unstable (a string), that should be safe. I'm fine with it either way though.

I updated the title/description to denote that the codec isn't actually registered and that this is internals. I think it's easier to register the codec once we have both the encoder and decoder so I'll wait for the next change to do that if that's alright.

* Exported for testing.
*/
export function normalizeToNodeShape(
shapeIndex: number,
Contributor


For clarity, use our ShapeIndex type alias for number here and in the set below.

export function normalizeToNodeShape(
shapeIndex: number,
context: DecoderContext<EncodedChunkShape>,
visited: Set<number> = new Set(),
Contributor


Extending the API docs to cover parameters, and noting why this exists and how most callers don't need to provide it would be good.

Maybe a more specific name for it would help too: our codec does have logic to traverse and visit all shapes, which this might be confused with.
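To make the suggestion concrete, here is a minimal sketch of the chain resolution with a more specifically named visited set. The types and names are illustrative only and do not match the real normalizeToNodeShape signature, which takes a DecoderContext and a ShapeIndex.

```typescript
// Illustrative shape table: "f" entries specialize another shape by index,
// "c" entries are concrete node shapes. Names here are hypothetical.
type Shape =
  | { kind: "c"; name: string }
  | { kind: "f"; base: number };

type ConcreteShape = Extract<Shape, { kind: "c" }>;

// Follows the chain of "f" shapes down to the underlying "c" shape.
// The visited set exists only to detect cycles; callers normally omit it.
function normalizeToNodeShapeSketch(
  shapeIndex: number,
  shapes: readonly Shape[],
  visitedSpecializations: Set<number> = new Set(),
): ConcreteShape {
  if (visitedSpecializations.has(shapeIndex)) {
    throw new Error("Cycle in specialized node shapes");
  }
  visitedSpecializations.add(shapeIndex);
  const shape = shapes[shapeIndex];
  if (shape === undefined) {
    throw new Error("Missing base shape");
  }
  return shape.kind === "c"
    ? shape
    : normalizeToNodeShapeSketch(shape.base, shapes, visitedSpecializations);
}
```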

}

/**
* Resolves `shapeIndex` to a fully-merged {@link EncodedNodeShape}, normalizing away any
Contributor


I think this should either:

  1. document that it is only allowed to be used on EncodedNodeShape | SpecializedNodeShape
    OR
  2. work on any node shape, and return a union of all the types except SpecializedNodeShape
    OR
  3. Take in the input as type EncodedNodeShape | SpecializedNodeShape instead of a shape index.

* @remarks
* Exported for testing.
*/
export function applySpecialization(
Contributor


EncodedSpecializedNodeShape documents the merge rules: this should refer to/link that and that should likely link here as well.

Also I noticed an issue with those documented rules:

extraFields are inherited unless the specialization sets them as own properties — to inherit, the property must be omitted; setting it explicitly (even to false or undefined) is treated as an override.

These types are supposed to be JSON compatible since they get json serialized. JSON does not preserve undefined properties, so we should not have them be significant or it won't work with real encoded data (I'm not sure if this is a docs bug or a real format issue)
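A quick illustration of the round-trip concern (this is plain JSON/TypeScript behavior, not the codec itself): JSON.stringify drops own properties whose value is undefined, so after serialization a decoder cannot distinguish "explicitly set to undefined" from "omitted".

```typescript
// An explicitly-undefined property is an own property locally...
const specialization: { base: number; extraFields?: number | undefined } = {
  base: 0,
  extraFields: undefined,
};
const beforeRoundTrip = "extraFields" in specialization; // own property exists

// ...but JSON.stringify drops it, so after a round trip this object is
// indistinguishable from one that never mentioned extraFields at all.
const roundTripped: object = JSON.parse(JSON.stringify(specialization));
const afterRoundTrip = "extraFields" in roundTripped;
```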

Contributor

Also that doc calls these rules "Merge rules": given we have a bunch of logic about merges in this codebase which is about a different thing, calling them specialization rules or override rules would be more clear I think.

}
}
for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
if (!overriddenKeys.has(context.identifier<FieldKey>(keyEncoded))) {
Contributor


This could just use overrides instead of overriddenKeys, then overriddenKeys can be entirely removed.

mergedFields.push([keyEncoded, overrideShape]);
}
}
for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
Contributor


A code comment on these loops would be helpful. This one could be something like:

Suggested change:
- for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
+ // Add all fields for all overrides from spec that are new fields, in the order they are specified
+ for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {

*/
export function applySpecialization(
base: EncodedNodeShape,
spec: EncodedSpecializedNodeShape,
Contributor


"spec" is an ambiguous abbreviation (and we try to avoid abbreviations in the first place). Most people would assume this means specification, but here it might mean specialization.

Renaming this to "overrides" (and "overrides" below to "fieldOverrides") would be more clear.

context: DecoderContext<EncodedChunkShape>,
): EncodedNodeShape {
const overrides = new Map<FieldKey, number>();
for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
Contributor


This algorithm with its three loops seems like it could be simplified, and made a bit more intuitive, if you refactored it.

Copy the base fields array: const fields = [...base.fields];
Make a lookup for where to override: const indexFromKey = new Map(base.fields.map(...));
Make a pass over spec.fields, either updating items in fields using indexFromKey, or appending to the array.

I feel like that approach is more obviously a correct implementation of the stated merge policy.
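The suggested single-pass refactor could look roughly like this, using hypothetical simplified types (plain string keys and numeric shape indices) rather than the codec's encoded types:

```typescript
type FieldEntry = [key: string, shapeIndex: number];

// Single-pass version of the field merge: copy the base fields, build a
// key-to-index lookup, then update in place or append.
function mergeFieldsSinglePass(
  baseFields: readonly FieldEntry[],
  overrideFields: readonly FieldEntry[],
): FieldEntry[] {
  // Copy the base fields array so base order is preserved.
  const fields: FieldEntry[] = baseFields.map(
    ([key, shape]): FieldEntry => [key, shape],
  );
  // Lookup for where each base key lives, to support in-place override.
  const indexFromKey = new Map(
    baseFields.map(([key], index): [string, number] => [key, index]),
  );
  for (const [key, shapeIndex] of overrideFields) {
    const index = indexFromKey.get(key);
    if (index === undefined) {
      fields.push([key, shapeIndex]); // New field: append in override order.
    } else {
      fields[index] = [key, shapeIndex]; // Existing field: replace in place.
    }
  }
  return fields;
}
```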

type: base.type,
value: "value" in spec ? spec.value : base.value,
fields: mergedFields.length > 0 ? mergedFields : undefined,
extraFields: "extraFields" in spec ? spec.extraFields : base.extraFields,
Contributor


This is going to break due to JSON not persisting undefined fields if trying to remove optional fields in an override. I think you might need to store some other value (like 0, false or null: 0 is the shortest of those so maybe pick that) to encode that case.

The fact you missed this implies that there might be a gap in test coverage, either for this case and/or not testing with the JSON round tripping.

context: DecoderContext<EncodedChunkShape>,
) {
this.inner = new NodeDecoder(
applySpecialization(normalizeToNodeShape(shape.base, context), shape, context),
Contributor


can't you replace this with a single call to normalizeToNodeShape on shape? (Might be worth tweaking the signature for normalizeToNodeShape to enable that.)

