feat(tree): internals for field batch format with specialized node shapes #27200
justus-camp-microsoft wants to merge 4 commits into microsoft:main
Conversation
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (1317 lines, 8 files), I've queued these reviewers:
Pull request overview
Adds scaffolding for an experimental "text" chunked-forest codec shape (`f`) that can derive node shapes from a base shape via property-level overrides, enabling better compression for runs of structurally similar nodes.
Changes:
- Introduces the `EncodedSpecializedNodeShape`/`EncodedChunkShapeVTextExperimental` schema and adds `vTextExperimental: "text"` to `FieldBatchFormatVersion`.
- Adds decode-side support for `f` shapes via `SpecializedNodeDecoder`, plus shape-chain resolution and specialization merging.
- Broadens encoder/decoder shape typing to a unified `EncodedChunkShape` union and adds targeted unit tests for specialization behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/dds/tree/src/test/feature-libraries/chunked-forest/codec/chunkDecoding.spec.ts | Adds extensive unit coverage for specialized-node decoding and dispatcher integration. |
| packages/dds/tree/src/feature-libraries/chunked-forest/codec/nodeEncoder.ts | Updates encoder shape generics to use the new EncodedChunkShape union type. |
| packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/versions.ts | Adds vTextExperimental version, defines EncodedChunkShape union including the new format. |
| packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/index.ts | Re-exports the new format schema/types and updated chunk-shape union. |
| packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/formatVText.ts | Defines the experimental text codec schema, including the new f shape. |
| packages/dds/tree/src/feature-libraries/chunked-forest/codec/format/formatV2.ts | Factors out shapesV2 for reuse by the experimental text schema. |
| packages/dds/tree/src/feature-libraries/chunked-forest/codec/compressedEncode.ts | Broadens encoder buffer/shape generics to EncodedChunkShape. |
| packages/dds/tree/src/feature-libraries/chunked-forest/codec/chunkDecoding.ts | Implements SpecializedNodeDecoder, specialization merge logic, and dispatcher support for f. |
}

/**
 * Walks a chain of specialized node shapes (`f`) back to the concrete `c` node shape they are
From this description and the function name, it sounds like this returns the base shape.
From the implementation, it looks like it looks up the shape and normalizes out any specialized node shapes. I think the docs, and likely the function name, could use adjustment.
Note about the PR title: this is a new format for an existing codec, not a new codec (our code doesn't separate these well yet, but I've been trying to make it clearer). I'd describe this as "add internals for experimental field batch codec format with specialized node shapes" (or maybe shortened to "internals for field batch format with specialized node shapes" for the description). It might make sense to include the addition of the new format to fieldBatchCodecBuilder as part of this (lack of that is why I called this "internals" above), and thus support actually encoding and decoding the new format (just not using
I updated the title/description to denote that the codec isn't actually registered and that this is internals. I think it's easier to register the codec once we have both the encoder and decoder, so I'll wait for the next change to do that if that's alright.
 * Exported for testing.
 */
export function normalizeToNodeShape(
	shapeIndex: number,
For clarity, use our `ShapeIndex` type alias for `number` here and in the set below.
export function normalizeToNodeShape(
	shapeIndex: number,
	context: DecoderContext<EncodedChunkShape>,
	visited: Set<number> = new Set(),
Extending the API docs to cover the parameters, and to note why this one exists and that most callers don't need to provide it, would be good.
Maybe a more specific name for it would help too: our codec does have logic to traverse and visit all shapes, which this might be confused with.
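For reference, the role of a defaulted visited set in a chain walk like this might look like the following sketch. All names here (`resolveBase`, `Shape`, etc.) are hypothetical stand-ins, not the codec's real types; the point is that `visited` exists purely for cycle detection and most callers omit it.

```typescript
// Hypothetical stand-ins for the codec's shape types (illustrative only).
type ShapeIndex = number;
interface SpecializedShape { kind: "specialized"; base: ShapeIndex; }
interface NodeShape { kind: "node"; }
type Shape = SpecializedShape | NodeShape;

// Walk a chain of specialized shapes to its concrete node shape.
// `visited` exists purely for cycle detection; callers normally omit it.
function resolveBase(
	start: ShapeIndex,
	shapes: readonly Shape[],
	visited: Set<ShapeIndex> = new Set(),
): NodeShape {
	let index = start;
	for (;;) {
		if (visited.has(index)) {
			throw new Error(`cycle in specialized shape chain at index ${index}`);
		}
		visited.add(index);
		const shape = shapes[index];
		if (shape === undefined) {
			throw new Error(`missing base shape at index ${index}`);
		}
		if (shape.kind === "node") {
			return shape;
		}
		index = shape.base;
	}
}
```

A name tied to that single purpose (e.g. something mentioning the chain or cycle detection) would avoid the confusion with shape traversal elsewhere in the codec.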
}

/**
 * Resolves `shapeIndex` to a fully-merged {@link EncodedNodeShape}, normalizing away any
I think this should either:
- document that it is only allowed to be used on `EncodedNodeShape | SpecializedNodeShape`,
- OR work on any node shape and return a union of all the types except `SpecializedNodeShape`,
- OR take the input as type `EncodedNodeShape | SpecializedNodeShape` instead of a shape index.
 * @remarks
 * Exported for testing.
 */
export function applySpecialization(
`EncodedSpecializedNodeShape` documents the merge rules: this should refer to/link that, and that should likely link here as well.
Also, I noticed an issue with those documented rules:

> `extraFields` are inherited unless the specialization sets them as own properties — to inherit, the property must be omitted; setting it explicitly (even to `false` or `undefined`) is treated as an override.

These types are supposed to be JSON compatible since they get JSON serialized. JSON does not preserve undefined properties, so we should not have them be significant or it won't work with real encoded data (I'm not sure if this is a docs bug or a real format issue).
Also that doc calls these rules "Merge rules": given we have a bunch of logic about merges in this codebase which is about a different thing, calling them specialization rules or override rules would be more clear I think.
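As a concrete illustration of the quoted "own property" rule, here is a hedged sketch; `resolveExtraFields` and its shapes are made up for this example, not the PR's code.

```typescript
// Illustrates the override rule quoted above: a property participates in the
// specialization only if it is present as an own property on the override
// object. The helper and its shapes are hypothetical, not the PR's types.
interface ShapeLike {
	extraFields?: number;
}

function resolveExtraFields(base: ShapeLike, specialization: ShapeLike): number | undefined {
	// Presence of the key, not its value, decides whether it overrides.
	return "extraFields" in specialization ? specialization.extraFields : base.extraFields;
}

resolveExtraFields({ extraFields: 1 }, {}); // 1: omitted, so inherited
resolveExtraFields({ extraFields: 1 }, { extraFields: 2 }); // 2: overridden
resolveExtraFields({ extraFields: 1 }, { extraFields: undefined }); // undefined: explicit override
```

The third case is exactly the one that cannot survive JSON serialization, which is why the documented rule is a problem for real encoded data.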
		}
	}
	for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
		if (!overriddenKeys.has(context.identifier<FieldKey>(keyEncoded))) {
This could just use `overrides` instead of `overriddenKeys`; then `overriddenKeys` can be entirely removed.
			mergedFields.push([keyEncoded, overrideShape]);
		}
	}
	for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
A code comment on these loops would be helpful. This one could be something like:

	// Add all fields for all overrides from spec that are new fields, in the order they are specified
	for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
 */
export function applySpecialization(
	base: EncodedNodeShape,
	spec: EncodedSpecializedNodeShape,
"spec" is an ambiguous abbreviation (and we try to avoid abbreviations in the first place). Most people would assume this means specification, but here it might mean specialization.
Renaming this to "overrides" (and "overrides" below to "fieldOverrides") would be more clear.
	context: DecoderContext<EncodedChunkShape>,
): EncodedNodeShape {
	const overrides = new Map<FieldKey, number>();
	for (const [keyEncoded, shapeIndex] of spec.fields ?? []) {
This algorithm, with its three loops, seems like it could be simplified and made a bit more intuitive if you refactored it:
- Copy the base fields array: `const fields = [...base.fields];`
- Make a lookup for where to override: `const indexFromKey = new Map(base.fields.map(...));`
- Make a pass over `spec.fields`, either updating items in `fields` using `indexFromKey` or appending to the array.

I feel like that approach is more obviously a correct implementation of the stated merge policy.
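That suggestion might look roughly like the sketch below, using simplified stand-in types (`Field` as a `[key, shapeIndex]` pair); the real code works over encoded keys and a `DecoderContext`, so this is illustrative only.

```typescript
// Simplified stand-in for the encoded field list: [key, shapeIndex] pairs.
type Field = [key: string, shapeIndex: number];

// Single pass over the overrides: base order is preserved, overridden
// fields are replaced in place, and new fields are appended in the order
// the specialization lists them.
function mergeFields(baseFields: readonly Field[], overrideFields: readonly Field[]): Field[] {
	const fields: Field[] = baseFields.map(([key, shape]) => [key, shape]);
	const indexFromKey = new Map(baseFields.map(([key], index) => [key, index]));
	for (const [key, shapeIndex] of overrideFields) {
		const index = indexFromKey.get(key);
		if (index === undefined) {
			fields.push([key, shapeIndex]); // new field: append
		} else {
			fields[index] = [key, shapeIndex]; // existing field: override in place
		}
	}
	return fields;
}
```

One copy, one lookup, one pass: each override is either an in-place replacement or an append, which matches the documented policy directly.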
		type: base.type,
		value: "value" in spec ? spec.value : base.value,
		fields: mergedFields.length > 0 ? mergedFields : undefined,
		extraFields: "extraFields" in spec ? spec.extraFields : base.extraFields,
This is going to break, due to JSON not persisting undefined fields, if an override tries to remove optional fields. I think you might need to store some other value (like 0, false, or null; 0 is the shortest of those, so maybe pick that) to encode that case.
The fact that this was missed implies there might be a gap in test coverage, either for this case and/or from not testing with JSON round-tripping.
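A quick self-contained demonstration of the hazard (the `specialization` object here is a made-up stand-in, not the PR's encoded data):

```typescript
// JSON.stringify drops properties whose value is undefined, so an
// "own property" check gives different answers before and after a
// round trip. `specialization` is a made-up stand-in object.
const specialization = { base: 0, extraFields: undefined };

console.log("extraFields" in specialization); // true: explicit override intended
console.log(JSON.stringify(specialization)); // {"base":0}: the key is gone

const roundTripped = JSON.parse(JSON.stringify(specialization));
console.log("extraFields" in roundTripped); // false: the override silently became "inherit"
```

Any round-trip test over a specialization that explicitly clears an optional field would have caught this.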
		context: DecoderContext<EncodedChunkShape>,
	) {
		this.inner = new NodeDecoder(
			applySpecialization(normalizeToNodeShape(shape.base, context), shape, context),
Can't you replace this with a single call to `normalizeToNodeShape` on `shape`? (Might be worth tweaking the signature of `normalizeToNodeShape` to enable that.)
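One way that signature tweak could work is to have the chain walk apply each level's specialization as it unwinds, so callers resolve a shape index in a single call. A sketch with simplified hypothetical types (not the codec's real `EncodedChunkShape` machinery):

```typescript
// Simplified hypothetical shapes: a concrete node shape, and a specialized
// shape that points at a base plus field overrides (key -> shape index).
interface NodeShape { fields: ReadonlyMap<string, number>; }
interface SpecializedShape { base: number; fields: ReadonlyMap<string, number>; }
type Shape = NodeShape | SpecializedShape;

const isSpecialized = (shape: Shape): shape is SpecializedShape => "base" in shape;

// Resolve a shape index straight to a fully-merged node shape in one call.
function resolveShape(
	index: number,
	shapes: readonly Shape[],
	visited: Set<number> = new Set(),
): NodeShape {
	if (visited.has(index)) {
		throw new Error("cycle in specialized shape chain");
	}
	visited.add(index);
	const shape = shapes[index];
	if (shape === undefined) {
		throw new Error("missing shape");
	}
	if (!isSpecialized(shape)) {
		return shape;
	}
	// Resolve the base first, then overlay this level's overrides. Map keeps
	// the base's insertion order for overridden keys and appends new ones.
	const base = resolveShape(shape.base, shapes, visited);
	return { fields: new Map([...base.fields, ...shape.fields]) };
}
```

With something shaped like this, the call site above collapses to a single resolution call on the shape's index.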
Description
Adds internals for a new experimental chunked-forest codec format (`vTextExperimental`, version key `"text"`) targeted at compressing runs of nodes that differ only in a few properties — for example, styled text runs where many nodes share most of their structural skeleton.

The core addition is `EncodedSpecializedNodeShape` (the `f` shape variant): a node shape that derives from another node shape by overlaying property-level overrides. A base node shape defines the structural skeleton; the specialization specifies overrides for those properties.

Decode-side support lives in `SpecializedNodeDecoder`, which walks the specialized `f` shapes back to a given `c` shape via `resolveToNodeShape`, then merges via `applySpecialization`: base field order is preserved, overridden fields are replaced in place, and new fields are appended. Cycles and missing bases are caught with asserts.

Existing types are broadened for the new union: `EncodedChunkShapeV1OrV2` → `EncodedChunkShape` (now `V1 | V2 | VTextExperimental`), and `shapesV2` is factored out of `EncodedChunkShapeV2` so the new format can extend it.

This PR is decoder + format scaffolding only. Encoders are not yet emitting `f` shapes; that will follow in a separate change. The codec will be registered as part of introducing the encoder.

Reviewer Guidance
The review process is outlined on this wiki page.