GH-45947 : [C++][Parquet] Variant encoding #50122

Open
qzyu999 wants to merge 3 commits into apache:main from qzyu999:variant-encoding

@qzyu999 qzyu999 commented Jun 8, 2026

PR Stack (merge in order):

  1. GH-45946: [C++][Parquet] Variant decoding (#50121): merge first
  2. GH-45947 : [C++][Parquet] Variant encoding (#50122): this PR (YOU ARE HERE), depends on #50121
  3. GH-45948: [C++][Parquet] Variant shredding (#50232): depends on this PR

All three PRs are part of the GH-45937 umbrella (Add variant support to C++ Parquet).

Rationale for this change

This is part of the GH-45937 umbrella (Add variant support to C++ Parquet). It adds the encoding (writing) side of the Variant binary format, building on the decoder from GH-45946 (PR #50121). The encoder is required for GH-45948 (variant shredding, PR #50232) and for any Parquet writer that needs to produce Variant columns.

As with the decoder, the implementation targets feature parity with the arrow-go parquet/variant.Builder, adapted to idiomatic C++ patterns. Divergences are deliberate and documented.

What changes are included in this PR?

Adds VariantBuilder class in variant_internal.h / variant_builder.cc for encoding Variant binary values per the Variant Encoding Spec.

Builder API:

  • All 21 primitive types: Null(), Bool(), Int() (auto-sizes), Int8/16/32/64(), Float(), Double(), Date(), TimestampMicros/NTZ(), TimestampNanos/NTZ(), TimeNTZ(), Decimal4/8/16(), String() (auto short-string for ≤ 63 bytes), Binary(), UUID()
  • Container construction: Offset() / NextElement() / FinishArray() for arrays, NextField() / FinishObject() for objects
  • Finish() produces encoded metadata + value buffers with sorted-flag detection
  • Reset() clears buffer for builder reuse; dictionary preserved across Finish() calls
  • Constructor from existing VariantMetadata for shared-dictionary workflows
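The auto-sizing behaviour of Int() and String() listed above can be illustrated with a minimal standalone sketch. It does not use the PR's VariantBuilder; AppendInt, AppendString, and the constants are hypothetical names chosen for illustration. The header layout (basic type in the low 2 bits, primitive type ID or short-string length in the upper 6 bits) and the type IDs follow my reading of the Variant Encoding Spec, so treat them as assumptions to check against the spec.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helpers for illustration only; not the PR's API.
// Basic types and primitive type IDs as assumed from the Variant Encoding Spec.
constexpr uint8_t kBasicPrimitive = 0;
constexpr uint8_t kBasicShortString = 1;
constexpr uint8_t kInt8 = 3, kInt16 = 4, kInt32 = 5, kInt64 = 6, kString = 16;
constexpr std::size_t kMaxShortString = 63;  // length must fit in 6 bits

inline uint8_t PrimitiveHeader(uint8_t type_id) {
  return static_cast<uint8_t>((type_id << 2) | kBasicPrimitive);
}

// Append the low `width` bytes of `v` in little-endian order.
inline void AppendLE(std::vector<uint8_t>& out, uint64_t v, int width) {
  for (int i = 0; i < width; ++i) {
    out.push_back(static_cast<uint8_t>(v >> (8 * i)));
  }
}

// Encode `v` using the narrowest signed integer type that can hold it.
inline void AppendInt(std::vector<uint8_t>& out, int64_t v) {
  uint8_t type_id = kInt64;
  int width = 8;
  if (v >= std::numeric_limits<int8_t>::min() &&
      v <= std::numeric_limits<int8_t>::max()) {
    type_id = kInt8;
    width = 1;
  } else if (v >= std::numeric_limits<int16_t>::min() &&
             v <= std::numeric_limits<int16_t>::max()) {
    type_id = kInt16;
    width = 2;
  } else if (v >= std::numeric_limits<int32_t>::min() &&
             v <= std::numeric_limits<int32_t>::max()) {
    type_id = kInt32;
    width = 4;
  }
  out.push_back(PrimitiveHeader(type_id));
  AppendLE(out, static_cast<uint64_t>(v), width);
}

// Strings of up to 63 bytes use the one-byte short-string header; longer
// strings use the primitive string header followed by a 4-byte length.
inline void AppendString(std::vector<uint8_t>& out, std::string_view s) {
  if (s.size() <= kMaxShortString) {
    out.push_back(static_cast<uint8_t>((s.size() << 2) | kBasicShortString));
  } else {
    out.push_back(PrimitiveHeader(kString));
    AppendLE(out, s.size(), 4);  // assumes s.size() fits in 32 bits
  }
  out.insert(out.end(), s.begin(), s.end());
}
```

Under these assumptions a 63-byte string costs 64 bytes in total, while a 64-byte string costs 69 (header, 4-byte length, payload), which is the boundary the round-trip tests exercise.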

Key design points:

  • Move-only (non-copyable, noexcept movable)
  • FinishObject() sorts fields in place by key: the spec requires field IDs in lexicographic key order
  • Strict duplicate key rejection (Status::Invalid): the spec says "An object may not contain duplicate keys"; configurable tolerance is deferred to the variant shredding work (#45948) with a TODO
  • FinishArray() validates offsets are non-negative
  • Finish() validates total dictionary size fits in 4-byte offsets
  • Decimal scale validation (≤ 38) in encoder; decoder is lenient
  • Go enforces a 128MB metadata limit (metadataMaxSizeLimit); C++ only enforces the spec's ~4GB 4-byte offset maximum
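As a rough illustration of the sort-then-reject-duplicates step described for FinishObject(), here is a minimal standalone sketch. FieldEntry and SortAndCheckUnique are hypothetical names, and the bool return stands in for the Status::Invalid result mentioned above; the PR's actual entry type and error path may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical field record for illustration; not the PR's type.
struct FieldEntry {
  std::string key;    // field name as stored in the metadata dictionary
  uint32_t field_id;  // index of `key` in the dictionary
  uint32_t offset;    // offset of the field's value within the object
};

// Sorts `fields` in place by key and reports whether all keys are distinct.
// Returns false when two fields share a key, the case a strict encoder rejects.
inline bool SortAndCheckUnique(std::vector<FieldEntry>& fields) {
  std::sort(fields.begin(), fields.end(),
            [](const FieldEntry& a, const FieldEntry& b) { return a.key < b.key; });
  // After sorting, equal keys are adjacent, so one linear pass finds duplicates.
  auto dup = std::adjacent_find(
      fields.begin(), fields.end(),
      [](const FieldEntry& a, const FieldEntry& b) { return a.key == b.key; });
  return dup == fields.end();
}
```

Sorting first keeps duplicate detection to a single linear pass and avoids a separate hash set, which fits a builder that sorts anyway.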

TODOs for GH-45948 (shredding, PR #50232):

// TODO GH-45948: Add BuildWithoutMeta() - raw value bytes without metadata
// TODO GH-45948: Add UnsafeAppendEncoded() - append pre-encoded bytes
// TODO GH-45948: Add SetAllowDuplicates(bool) - last-value-wins semantics

Are these changes tested?

Yes. 238 total tests pass with BUILD_WARNING_LEVEL=CHECKIN (73 encoder + 165 decoder):

  • Primitive round-trips (14 tests including short/long string boundary at 63/64 bytes)
  • Int auto-sizing boundaries: Int8 → Int16 → Int32 → Int64 transitions (8 tests)
  • Direct int type methods: Int8/16/32/64 without auto-sizing (4 tests)
  • Array round-trips: empty, simple, nested (3 tests)
  • Object round-trips: empty, simple, nested, duplicate rejection, field sorting (5 tests)
  • Builder features: reset, from-existing-metadata, sorted/unsorted flag (4 tests)
  • Integration: complex nested object, large metadata (300 keys), offset-size computation, invalid start, negative offsets (5 tests)
  • Special floats: NaN, Inf for float and double (6 tests)
  • Large containers triggering is_large flag: 300-element array + 300-field object (2 tests)
  • Decoder utility round-trips through builder output: FindObjectField, GetArrayElement, GetObjectFieldAt, ValueSize (4 tests)
  • Builder reuse: dictionary preservation across multiple Finish() calls (2 tests)
  • Pre-existing buffer: FinishObject/FinishArray with start > 0 (2 tests)
  • Decimal scale validation: rejects scale > 38 (1 test)
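Several of the tests above (sorted/unsorted flag, offset-size computation, the is_large flag) depend on a few small layout rules. The sketch below shows one way those rules could look in standalone C++. It is not the PR's code: OffsetSize, EncodeMetadata, and IsLarge are hypothetical names, and the metadata layout (version in bits 0-3, sorted flag in bit 4, offset size minus one in bits 6-7, then dictionary size, offsets, and string bytes) is my reading of the Variant Encoding Spec.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Smallest byte width (1..4) that can represent `max_value`.
// Assumes the caller has already checked max_value <= UINT32_MAX.
inline int OffsetSize(uint64_t max_value) {
  if (max_value <= 0xFF) return 1;
  if (max_value <= 0xFFFF) return 2;
  if (max_value <= 0xFFFFFF) return 3;
  return 4;
}

// Append the low `width` bytes of `v` in little-endian order.
inline void PutLE(std::vector<uint8_t>& out, uint64_t v, int width) {
  for (int i = 0; i < width; ++i) {
    out.push_back(static_cast<uint8_t>(v >> (8 * i)));
  }
}

// Hypothetical metadata encoder: header, dictionary size, offsets, bytes.
inline std::vector<uint8_t> EncodeMetadata(const std::vector<std::string>& keys) {
  uint64_t total_bytes = 0;
  for (const auto& k : keys) total_bytes += k.size();
  const int offset_size =
      OffsetSize(std::max<uint64_t>(total_bytes, keys.size()));

  // The sorted flag is set only when keys are strictly increasing,
  // that is, sorted and free of duplicates.
  bool sorted = true;
  for (std::size_t i = 1; i < keys.size(); ++i) {
    if (!(keys[i - 1] < keys[i])) {
      sorted = false;
      break;
    }
  }

  std::vector<uint8_t> out;
  const int header = 1 /* version */ | (sorted ? 0x10 : 0) | ((offset_size - 1) << 6);
  out.push_back(static_cast<uint8_t>(header));
  PutLE(out, keys.size(), offset_size);
  uint64_t offset = 0;
  PutLE(out, offset, offset_size);
  for (const auto& k : keys) {
    offset += k.size();
    PutLE(out, offset, offset_size);
  }
  for (const auto& k : keys) out.insert(out.end(), k.begin(), k.end());
  return out;
}

// Arrays and objects store their element count in one byte unless is_large.
inline bool IsLarge(std::size_t num_elements) { return num_elements > 0xFF; }
```

With 300 short keys the total string length exceeds 255 bytes, so the offsets widen to two bytes, and a 300-element container needs the 4-byte element count, which is what the large-metadata and is_large tests are described as covering.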

Are there any user-facing changes?

No breaking changes. This extends the public API added in GH-45946 (PR #50121) with the VariantBuilder class in the same arrow::extension::variant_internal namespace.

AI Disclosure: AI coding assistants were used during development for scaffolding, test generation, and review iteration. All code has been reviewed, debugged, and verified by the author who owns and understands the changes.

github-actions Bot commented Jun 8, 2026

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

qzyu999 changed the title from "Variant encoding" to "GH-45947 : [C++][Parquet] Variant encoding" on Jun 8, 2026

@misiek1984 misiek1984 left a comment

Some initial comments.

DecodeMetadata(encoded1.metadata.data(),
static_cast<int64_t>(encoded1.metadata.size())));

// Build a new variant reusing the same metadata

I would add a test where we reuse existing metadata but make a mistake with the data types. For example, according to the metadata, we should write a string but we write an integer instead. I think there is currently no validation for this case in VariantBuilder—is that on purpose?

Either way, the final Variant will be malformed, so the round-trip should fail.

class ARROW_EXPORT VariantBuilder {
 public:
  VariantBuilder();
  explicit VariantBuilder(const VariantMetadata& existing_metadata);

It might also be useful to pass a value buffer to VariantBuilder to initialize buffer_. This way, it will be possible to continue building an existing Variant value.

/// builder.Int(30);
/// builder.FinishObject(start, fields);
/// ARROW_ASSIGN_OR_RAISE(auto result, builder.Finish());
class ARROW_EXPORT VariantBuilder {

@misiek1984 misiek1984 Jun 19, 2026

This API is great for building new variants. Did you also consider adding an API that allows modifying existing Variant values? We would need to add a function to VariantBuilder, similar to FindObjectField from the decoding PR, which would "move" the context of VariantBuilder to a specific place/field. Once called, you would then be able to override the existing value.
