GH-45947: [C++][Parquet] Variant encoding #50122
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
DecodeMetadata(encoded1.metadata.data(),
               static_cast<int64_t>(encoded1.metadata.size())));

// Build a new variant reusing the same metadata
I would add a test where we reuse existing metadata but make a mistake with the data types. For example, according to the metadata, we should write a string but we write an integer instead. I think there is currently no validation for this case in VariantBuilder—is that on purpose?
Either way, the final Variant will be malformed, so the round-trip should fail.
class ARROW_EXPORT VariantBuilder {
 public:
  VariantBuilder();
  explicit VariantBuilder(const VariantMetadata& existing_metadata);
It might also be useful to pass a value buffer to VariantBuilder to initialize buffer_. This way, it will be possible to continue building an existing Variant value.
/// builder.Int(30);
/// builder.FinishObject(start, fields);
/// ARROW_ASSIGN_OR_RAISE(auto result, builder.Finish());
class ARROW_EXPORT VariantBuilder {
This API is great for building new variants. Did you also consider adding an API that allows modifying existing Variant values? We would need to add a function to VariantBuilder similar to FindObjectField from the decoding PR, which would "move" the context of VariantBuilder to a specific place/field. Once called, you would then be able to override the existing value.
Rationale for this change
This is part of the GH-45937 umbrella (Add variant support to C++ Parquet). It adds the encoding (writing) side of the Variant binary format, building on the decoder from GH-45946 (PR #50121). The encoder is required for GH-45948 (variant shredding, PR #50232) and for any Parquet writer that needs to produce Variant columns.
As with the decoder, the implementation targets feature parity with the arrow-go parquet/variant.Builder, adapted to idiomatic C++ patterns. Divergences are deliberate and documented.
What changes are included in this PR?
Adds the VariantBuilder class in variant_internal.h/variant_builder.cc for encoding Variant binary values per the Variant Encoding Spec.
Builder API:
- Null(), Bool(), Int() (auto-sizes), Int8/16/32/64(), Float(), Double(), Date(), TimestampMicros/NTZ(), TimestampNanos/NTZ(), TimeNTZ(), Decimal4/8/16(), String() (auto short-string for ≤ 63 bytes), Binary(), UUID()
- Offset()/NextElement()/FinishArray() for arrays, NextField()/FinishObject() for objects
- Finish() produces encoded metadata + value buffers with sorted-flag detection
- Reset() clears buffer for builder reuse; dictionary preserved across Finish() calls
- Construction from an existing VariantMetadata for shared-dictionary workflows
Key design points:
- … noexcept movable)
- FinishObject() sorts fields in-place by key: the spec requires field IDs in lexicographic key order
- Duplicate keys are rejected (Status::Invalid): the spec says "An object may not contain duplicate keys"; configurable tolerance deferred to GH-45948 (variant shredding) with a TODO
- FinishArray() validates offsets are non-negative
- Finish() validates total dictionary size fits in 4-byte offsets
- … metadataMaxSizeLimit); C++ only enforces the spec's ~4GB 4-byte offset maximum
TODOs for GH-45948 (shredding, PR #50232):
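Returning to the key design points, the FinishObject() behaviour (sort fields by key, then reject duplicates) can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's code: FieldEntry and SortAndCheckUnique are hypothetical names, and the boolean return stands in for the Status::Invalid the builder is described as returning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the builder's per-field bookkeeping; the actual
// field entry type used by VariantBuilder is not shown in this thread.
struct FieldEntry {
  std::string key;    // field name as stored in the metadata dictionary
  uint32_t field_id;  // index of the key in the dictionary
  int64_t offset;     // start of the field's value in the value buffer
};

// Sorts fields in place by key (byte-wise lexicographic order) and reports
// whether all keys are distinct. A false return corresponds to the case the
// builder is described as rejecting with Status::Invalid.
inline bool SortAndCheckUnique(std::vector<FieldEntry>& fields) {
  std::sort(fields.begin(), fields.end(),
            [](const FieldEntry& a, const FieldEntry& b) { return a.key < b.key; });
  // After sorting, any duplicate keys are adjacent.
  auto dup = std::adjacent_find(
      fields.begin(), fields.end(),
      [](const FieldEntry& a, const FieldEntry& b) { return a.key == b.key; });
  return dup == fields.end();
}
```

Sorting first makes the duplicate check a single linear pass, which is one plausible reason to combine the two steps in FinishObject().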
Are these changes tested?
Yes. 238 total tests pass with BUILD_WARNING_LEVEL=CHECKIN (73 encoder + 165 decoder):
- … Int8/16/32/64 without auto-sizing (4 tests)
- is_large flag: 300-element array + 300-field object (2 tests)
- … Finish() calls (2 tests)
Are there any user-facing changes?
No breaking changes. This extends the public API added in GH-45946 (PR #50121) with the VariantBuilder class in the same arrow::extension::variant_internal namespace.
AI Disclosure: AI coding assistants were used during development for scaffolding, test generation, and review iteration. All code has been reviewed, debugged, and verified by the author who owns and understands the changes.
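For readers unfamiliar with the encoding, the two "auto" behaviours listed under Builder API (Int() auto-sizing and String() choosing the short-string form) can be sketched as below. This is an illustration under stated assumptions rather than the PR's implementation: SmallestIntWidth and StringHeader are hypothetical helper names, and the constants follow the Variant Encoding Spec's value header layout (basic type in the low 2 bits, with the remaining 6 bits holding either the primitive type ID or the short-string length).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: smallest signed integer width in bytes (1, 2, 4 or 8)
// that can hold v, i.e. the choice an auto-sizing Int() would make.
inline int SmallestIntWidth(int64_t v) {
  if (v >= std::numeric_limits<int8_t>::min() &&
      v <= std::numeric_limits<int8_t>::max()) return 1;
  if (v >= std::numeric_limits<int16_t>::min() &&
      v <= std::numeric_limits<int16_t>::max()) return 2;
  if (v >= std::numeric_limits<int32_t>::min() &&
      v <= std::numeric_limits<int32_t>::max()) return 4;
  return 8;
}

// Value header constants per the Variant Encoding Spec.
constexpr uint8_t kBasicTypePrimitive = 0;    // low 2 bits of the header
constexpr uint8_t kBasicTypeShortString = 1;  // low 2 bits of the header
constexpr uint8_t kPrimitiveTypeString = 16;  // primitive type ID for string
constexpr std::size_t kMaxShortStringSize = 63;  // length must fit in 6 bits

// Hypothetical helper: header byte for a string of the given length. Short
// strings carry their length in the upper 6 bits; longer strings use the
// primitive string type, whose 4-byte length follows the header.
inline uint8_t StringHeader(std::size_t length) {
  if (length <= kMaxShortStringSize) {
    return static_cast<uint8_t>((length << 2) | kBasicTypeShortString);
  }
  return static_cast<uint8_t>((kPrimitiveTypeString << 2) | kBasicTypePrimitive);
}
```

The explicit Int8/16/32/64() methods exist alongside Int() for callers that need a fixed width regardless of the value.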