Variant: Rust API to Create Variant Values #7424
Comments
Thanks @alamb for the great writeup!
I think we should start with reuse within a single builder.
Yes, I think having an option on the builder makes the most sense. I haven't fully thought through the interplay between creating sorted metadata and trying to reuse the metadata. It seems that once the metadata fields have been added, we can't add new field names without disturbing existing values (the metadata field indexes would have changed). I think the design principles of the arrow-rs crate are to provide high-performance primitives and reasonable defaults, and to allow the user to specify / control things at a lower level.
Update: parquet-java has a similar builder here: https://github.com/apache/parquet-java/tree/master/parquet-variant/src (added a few days ago with @gene-db).
I’d be happy to take this one for now. Can you assign it to me @alamb?
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of supporting the Variant type in Parquet and Arrow is programmatically
creating values in the binary format described in VariantEncoding.md. This
is important in the short term for writing tests, as well as for converting from
other types (specifically JSON).
Note this ticket covers the API to create such values, but not reading them
(see #7423) or reading/writing variant values to JSON.
Describe the solution you'd like
What I would like is a Rust API that can efficiently create such values. I
think it is also important to design an API that supports reusing the metadata.
Describe alternatives you've considered
What I suggest is a Builder-style API, modeled on the Arrow array builder APIs
such as StringBuilder that can efficiently create Variant values.
For example:
Example creating a primitive Variant value:
Example of creating a nested VariantValue:
Here is how we might create an Object:
Finish the builder to finalize the metadata
When the builder is finished, it finalizes / writes metadata as needed.
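To make the builder idea concrete, here is a minimal sketch of what such an API could look like. Everything in it is an assumption made for illustration: the `VariantBuilder`, `Value`, `append_int8`, `append_object`, and `finish` names are hypothetical and are not the arrow-rs API, and values are kept as a plain Rust enum rather than encoded into the binary layout from VariantEncoding.md. It only shows the shape of the flow: append values, collect field names into a dictionary along the way, and hand back metadata plus values on `finish`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified value model. The real binary encoding is
/// defined in VariantEncoding.md and is not reproduced here.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int8(i8),
    String(String),
    /// Object fields are stored as (field id, value) pairs, where the id
    /// is an index into the metadata dictionary.
    Object(Vec<(u32, Value)>),
}

/// Hypothetical builder: collects field names into a dictionary while
/// values are appended, and returns both from `finish`.
#[derive(Default)]
struct VariantBuilder {
    names: Vec<String>,
    ids: HashMap<String, u32>,
    values: Vec<Value>,
}

impl VariantBuilder {
    fn new() -> Self {
        Self::default()
    }

    /// Return the dictionary id for `name`, adding the name if it is new.
    fn field_id(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }

    /// Append a primitive value.
    fn append_int8(&mut self, v: i8) {
        self.values.push(Value::Int8(v));
    }

    /// Append an object, replacing each field name with its dictionary id.
    fn append_object(&mut self, fields: &[(&str, Value)]) {
        let encoded: Vec<(u32, Value)> = fields
            .iter()
            .map(|(name, value)| (self.field_id(name), value.clone()))
            .collect();
        self.values.push(Value::Object(encoded));
    }

    /// Finish the builder, returning (metadata dictionary, values).
    fn finish(self) -> (Vec<String>, Vec<Value>) {
        (self.names, self.values)
    }
}

fn main() {
    let mut builder = VariantBuilder::new();
    builder.append_int8(1);
    builder.append_object(&[
        ("foo", Value::Int8(2)),
        ("bar", Value::String("baz".to_string())),
    ]);
    let (metadata, values) = builder.finish();
    println!("metadata = {metadata:?}");
    println!("values = {values:?}");
}
```

In this sketch the dictionary is append-only and unsorted, so ids handed out earlier stay valid as more names are added.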
Considerations:
Reusing metadata
The metadata mostly contains a dictionary of field names, and so I believe an
important optimization will be reusing the same metadata to create multiple
values. For example the three following JSON values can use the same metadata
(with field names "foo" and "bar"):
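As a hypothetical illustration (the three objects below are made-up values, and `encode_object` is an assumed helper, not part of any existing crate), the sketch below encodes several objects against one shared dictionary containing "foo" and "bar". Each object stores only dictionary indexes, so the field names are written once.

```rust
/// Look up each field name in a shared dictionary and replace it with its
/// index. Returns `None` if a name is missing from the dictionary.
fn encode_object(dictionary: &[&str], fields: &[(&str, i64)]) -> Option<Vec<(usize, i64)>> {
    fields
        .iter()
        .map(|(name, value)| {
            let id = dictionary.iter().position(|d| d == name)?;
            Some((id, *value))
        })
        .collect()
}

fn main() {
    // One metadata dictionary shared by all three values.
    let dictionary = ["foo", "bar"];

    // Made-up objects, roughly {"foo": 1, "bar": 100}, {"foo": 2, "bar": 200}
    // and {"foo": 3}.
    let objects: [&[(&str, i64)]; 3] = [
        &[("foo", 1), ("bar", 100)],
        &[("foo", 2), ("bar", 200)],
        &[("foo", 3)],
    ];

    for fields in objects {
        println!("{:?}", encode_object(&dictionary, fields));
    }
}
```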
Sorted dictionaries:
The metadata encoding spec permits writing sorted dictionaries in the metadata
header. However, when writing sorted dictionaries, once an object has been
created, it is in general not possible to add new metadata dictionary values
because the variant object value itself contains offsets to the dictionary, and thus inserting any new values into
the metadata would invalidate it.
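The invalidation problem can be seen in a small sketch. The helper names `sorted_dictionary` and `field_id` are assumptions for illustration only. An object that recorded the id of "foo" in a sorted dictionary ends up pointing at a different name once "baz" is inserted, because every name sorting after the insertion point shifts by one.

```rust
/// Build a sorted, de-duplicated dictionary of field names.
fn sorted_dictionary(names: &[&str]) -> Vec<String> {
    let mut dict: Vec<String> = names.iter().map(|n| n.to_string()).collect();
    dict.sort();
    dict.dedup();
    dict
}

/// Find the id (index) of `name` in a sorted dictionary via binary search.
fn field_id(dict: &[String], name: &str) -> Option<usize> {
    dict.binary_search_by(|d| d.as_str().cmp(name)).ok()
}

fn main() {
    // Dictionary at the time the object was written: ["bar", "foo"].
    let before = sorted_dictionary(&["foo", "bar"]);
    let stored_id = field_id(&before, "foo").expect("foo is in the dictionary");

    // Adding "baz" later yields ["bar", "baz", "foo"], shifting "foo".
    let after = sorted_dictionary(&["foo", "bar", "baz"]);

    // The id stored in the already-written object now resolves to "baz".
    println!("id {stored_id}: {} -> {}", before[stored_id], after[stored_id]);
}
```

An unsorted, append-only dictionary does not have this problem, since new names are only ever added at the end.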
One API that might work would be to supply pre-existing metadata to the builder,
reusing it when possible and creating new metadata when it isn't.
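One way this reuse-or-rebuild decision might be expressed is sketched below. The function `metadata_for` is hypothetical, and the fallback policy (building a fresh sorted dictionary from only the new object's field names) is an assumption; the issue does not specify it. `Cow::Borrowed` signals that the supplied metadata was reused as-is, and `Cow::Owned` signals that new metadata had to be created.

```rust
use std::borrow::Cow;

/// Reuse `existing` if it already contains every requested field name;
/// otherwise build a new sorted dictionary from the requested names.
fn metadata_for<'a>(existing: &'a [String], field_names: &[&str]) -> Cow<'a, [String]> {
    let all_known = field_names
        .iter()
        .all(|name| existing.iter().any(|e| e.as_str() == *name));

    if all_known {
        // Every name is present: existing field ids stay valid.
        Cow::Borrowed(existing)
    } else {
        // At least one unknown name: create fresh metadata instead of
        // mutating the existing dictionary.
        let mut fresh: Vec<String> = field_names.iter().map(|n| n.to_string()).collect();
        fresh.sort();
        fresh.dedup();
        Cow::Owned(fresh)
    }
}

fn main() {
    let existing = vec!["bar".to_string(), "foo".to_string()];

    let reused = metadata_for(&existing, &["foo"]);
    println!("reused: {}", matches!(reused, Cow::Borrowed(_)));

    let rebuilt = metadata_for(&existing, &["foo", "baz"]);
    println!("new metadata: {:?}", rebuilt);
}
```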
Additional context