Variant: Rust API to Create Variant Values #7424

Open
Tracked by #6736
alamb opened this issue Apr 18, 2025 · 4 comments
Labels: enhancement (any new improvement worthy of an entry in the changelog)

Comments

@alamb

alamb commented Apr 18, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Part of supporting the Variant type in Parquet and Arrow is programmatically
creating values in the binary format described in VariantEncoding.md. This
is important in the short term for writing tests, as well as for converting from
other types (specifically JSON).

Note this ticket covers the API to create such values, but not reading them
(see #7423) or reading/writing variant values to JSON.

Describe the solution you'd like

What I would like is a Rust API that can efficiently create such values. I
think it is also important to design an API that supports reusing the metadata.

Describe alternatives you've considered

What I suggest is a Builder-style API, modeled on the Arrow array builder APIs
such as StringBuilder that can efficiently create Variant values.

For example:

// Location to write metadata
// Should be anything that implements std::io::Write or a trait
let mut metadata_buffer = vec![];
// Create a builder for constructing variant values
// (mutable, because creating object builders below borrows it mutably)
let mut builder = VariantBuilder::new(&mut metadata_buffer);

Example of creating a Variant object value with primitive fields:

// Create the equivalent of {"foo": 1, "bar": 100}
let mut value_buffer = vec![];
let mut object_builder = builder.new_object(&mut value_buffer); // object_builder has reference to builder
object_builder.append_value("foo", 1);
object_builder.append_value("bar", 100);
object_builder.finish();
// value_buffer now contains a valid variant 🎉
// builder contains a metadata header with fields "foo" and "bar"

Example of creating a nested VariantValue:

Here is how we might create a nested Object:

// Create nested object: the equivalent of {"foo": {"bar": 100}}
// note we haven't finalized the metadata yet so we reuse it here
let mut value_buffer2 = vec![];
let mut object_builder2 = builder.new_object(&mut value_buffer2);
let mut foo_object_builder = object_builder2.append_object("foo"); // builder for "foo"
foo_object_builder.append_value("bar", 100);
foo_object_builder.finish();
object_builder2.finish();
// value_buffer2 contains a valid variant

Finish the builder to finalize the metadata

When the builder is finished, it finalizes / writes metadata as needed.

// complete writing the metadata
builder.finish();
// metadata_buffer contains valid variant metadata bytes
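
To check that this shape works with the borrow checker, here is a minimal, self-contained sketch. Everything in it is hypothetical: `ToyVariantBuilder`, `ToyObjectBuilder`, and `build_example` are illustrative names, only `i64` values are supported, and the byte layout is a toy format, not the encoding from VariantEncoding.md. It only shows an object builder holding a mutable reference to its parent so that several values share one field-name dictionary.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the proposed `VariantBuilder`.
/// The byte layout here is a toy format, NOT the real Variant encoding.
pub struct ToyVariantBuilder<'m> {
    metadata_buffer: &'m mut Vec<u8>,
    field_ids: HashMap<String, u32>,
    field_names: Vec<String>,
}

impl<'m> ToyVariantBuilder<'m> {
    pub fn new(metadata_buffer: &'m mut Vec<u8>) -> Self {
        Self { metadata_buffer, field_ids: HashMap::new(), field_names: Vec::new() }
    }

    /// Return the id for `name`, assigning the next free id on first use.
    fn field_id(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.field_ids.get(name) {
            return id;
        }
        let id = self.field_names.len() as u32;
        self.field_ids.insert(name.to_string(), id);
        self.field_names.push(name.to_string());
        id
    }

    /// Start an object whose bytes are written to `value_buffer`.
    pub fn new_object<'a>(
        &'a mut self,
        value_buffer: &'a mut Vec<u8>,
    ) -> ToyObjectBuilder<'a, 'm> {
        ToyObjectBuilder { parent: self, value_buffer, fields: Vec::new() }
    }

    /// Write the dictionary: a u32 count, then length-prefixed field names.
    pub fn finish(self) {
        let ToyVariantBuilder { metadata_buffer, field_names, .. } = self;
        metadata_buffer.extend_from_slice(&(field_names.len() as u32).to_le_bytes());
        for name in &field_names {
            metadata_buffer.extend_from_slice(&(name.len() as u32).to_le_bytes());
            metadata_buffer.extend_from_slice(name.as_bytes());
        }
    }
}

/// Hypothetical object builder; it borrows the parent to register field names.
pub struct ToyObjectBuilder<'a, 'm> {
    parent: &'a mut ToyVariantBuilder<'m>,
    value_buffer: &'a mut Vec<u8>,
    fields: Vec<(u32, i64)>,
}

impl<'a, 'm> ToyObjectBuilder<'a, 'm> {
    pub fn append_value(&mut self, name: &str, value: i64) {
        let id = self.parent.field_id(name);
        self.fields.push((id, value));
    }

    /// Write a u32 field count, then (u32 field id, i64 value) pairs.
    pub fn finish(self) {
        let ToyObjectBuilder { value_buffer, fields, .. } = self;
        value_buffer.extend_from_slice(&(fields.len() as u32).to_le_bytes());
        for (id, value) in fields {
            value_buffer.extend_from_slice(&id.to_le_bytes());
            value_buffer.extend_from_slice(&value.to_le_bytes());
        }
    }
}

/// Build {"foo": 1, "bar": 100} and {"bar": 200} against one shared dictionary.
pub fn build_example() -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    let mut metadata_buffer = vec![];
    let mut value_buffer = vec![];
    let mut value_buffer2 = vec![];

    let mut builder = ToyVariantBuilder::new(&mut metadata_buffer);
    let mut object_builder = builder.new_object(&mut value_buffer);
    object_builder.append_value("foo", 1);
    object_builder.append_value("bar", 100);
    object_builder.finish();

    // The second value reuses the id already assigned to "bar".
    let mut object_builder2 = builder.new_object(&mut value_buffer2);
    object_builder2.append_value("bar", 200);
    object_builder2.finish();

    builder.finish();
    (metadata_buffer, value_buffer, value_buffer2)
}
```

Because each object builder mutably borrows the parent, only one can be live at a time in this sketch. That seems acceptable for sequential value creation, but a real API may want a different ownership model.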

Considerations:

Reusing metadata

The metadata mostly contains a dictionary of field names, and so I believe an
important optimization will be reusing the same metadata to create multiple
values. For example the three following JSON values can use the same metadata
(with field names "foo" and "bar"):

{
  "foo": 1,
  "bar": 100
}
{
  "foo": 2,
  "bar": 200
}
{
  "foo": 3
}

Sorted dictionaries:

The metadata encoding spec permits writing sorted dictionaries in the metadata
header. However, when writing sorted dictionaries, once an object has been
created, it is in general not possible to add new metadata dictionary values
because the variant object value itself contains offsets to the dictionary, and thus inserting any new values into
the metadata would invalidate it.
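
A small sketch of the problem, assuming field ids are simply positions in a sorted list of names (the helper names `field_index`, `insert_sorted`, and `indexes_after_insert` are hypothetical and not part of any existing API):

```rust
/// Return the index of `name` in a sorted dictionary, if present.
fn field_index(sorted_names: &[String], name: &str) -> Option<usize> {
    sorted_names
        .binary_search_by(|probe| probe.as_str().cmp(name))
        .ok()
}

/// Insert `name` while keeping the dictionary sorted.
/// Returns the index where the name ends up.
fn insert_sorted(sorted_names: &mut Vec<String>, name: &str) -> usize {
    match sorted_names.binary_search_by(|probe| probe.as_str().cmp(name)) {
        Ok(existing) => existing,
        Err(slot) => {
            sorted_names.insert(slot, name.to_string());
            slot
        }
    }
}

/// Index of "foo" before and after adding "baz" to the dictionary.
fn indexes_after_insert() -> (usize, usize) {
    let mut names = vec!["bar".to_string(), "foo".to_string()];
    let before = field_index(&names, "foo").unwrap();
    // "baz" sorts between "bar" and "foo", so "foo" moves one slot later.
    insert_sorted(&mut names, "baz");
    let after = field_index(&names, "foo").unwrap();
    (before, after)
}
```

Any value written while "foo" had the `before` index would now point at "baz", which is why previously written values cannot survive an insertion into a sorted dictionary.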

One API that might work would be to supply pre-existing metadata to the builder,
reusing it when possible and creating new metadata when it isn't.

Additional context

@PinkCrow007

Thanks @alamb for the great writeup!
I’ve sketched out a rough implementation and had two related questions:

  • Reusing metadata: Do we want to support metadata reuse during the value creation phase within the same builder, or is the goal to allow sharing metadata across multiple Variant values after they are already created and in Arrow memory? (My current version only supports reuse during value creation within a single builder.)
  • Key ordering: Since the spec allows for sorted_strings in the metadata, should the VariantBuilder take an option to control key sorting? (This refers to sorting keys in the metadata, and is separate from object-level sorted dictionaries.)

@alamb

alamb commented Apr 28, 2025

> Thanks @alamb for the great writeup! I’ve sketched out a rough implementation and had two related questions:
>
>   • Reusing metadata: Do we want to support metadata reuse during the value creation phase within the same builder, or is the goal to allow sharing metadata across multiple Variant values after they are already created and in Arrow memory? (My current version only supports reuse during value creation within a single builder.)

I think we should start with reuse within a single builder.

>   • Key ordering: Since the spec allows for sorted_strings in the metadata, should the VariantBuilder take an option to control key sorting? (This refers to sorting keys in the metadata, and is separate from object-level sorted dictionaries.)

Yes, I think having an option on the builder makes the most sense. I haven't fully thought through the interplay between creating sorted metadata and trying to reuse the metadata. It seems like once the metadata fields have been added, we can't then add new field names without disturbing existing values (the metadata field indexes would have changed).

I think the design principles of the arrow-rs crate are to provide high performance primitives and reasonable defaults, and to allow the user to specify / control things at a lower level for performance.

@alamb

alamb commented May 9, 2025

@PinkCrow007

I’d be happy to take this one for now. Can you assign it to me @alamb?
