Skip to content

Add API for Creating Variant Values #7452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

PinkCrow007
Copy link

Which issue does this PR close?

Variant: Rust API to Create Variant Values #7424

Rationale for this change

This PR implements a builder-style API for creating Variant values in Rust, following the Variant binary encoding specification. It supports reusing metadata within a single builder session, as discussed in the issue.

What changes are included in this PR?

  • Add Variants as a canonical Arrow Extension Type (built on top of Struct type).
  • Introduce VariantBuilder, ObjectBuilder, and ArrayBuilder for constructing Variant values.

Are there any user-facing changes?

This PR adds a new public API for programmatically creating Variant-encoded values.
No breaking changes.

CC: @alamb for visibility

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 29, 2025
@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

Thank you so much for this @PinkCrow007 -- I have seen it and plan to review it, but may not have a chance for another day or two. So exciting!@

Copy link
Member

@Weijun-H Weijun-H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! Thanks @PinkCrow007 👍

Comment on lines 261 to 291
/// Encodes a date value (days since epoch)
pub fn encode_date(value: i32, output: &mut Vec<u8>) {
// Use primitive + date type
let header = primitive_header(VariantPrimitiveType::Date as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}

/// Encodes a timestamp value (milliseconds since epoch)
pub fn encode_timestamp(value: i64, output: &mut Vec<u8>) {
// Use primitive + timestamp type
let header = primitive_header(VariantPrimitiveType::Timestamp as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}

/// Encodes a timestamp without timezone value (milliseconds since epoch)
pub fn encode_timestamp_ntz(value: i64, output: &mut Vec<u8>) {
// Use primitive + timestamp_ntz type
let header = primitive_header(VariantPrimitiveType::TimestampNTZ as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}

/// Encodes a time without timezone value (milliseconds)
pub fn encode_time_ntz(value: i64, output: &mut Vec<u8>) {
// Use primitive + time_ntz type
let header = primitive_header(VariantPrimitiveType::TimeNTZ as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are quite similar. To reduce duplication, could we create a more general encoder function, like encode_general(type_id: VariantPrimitiveType, value: i64, output: &mut Vec<u8>)? Or create a encoder trait for VariantPrimitiveType

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point

I also think they don't need to be pub (maybe we could start with pub(crate)) as the main API people would use is the builder I think

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Weijun-H and @alamb ! I've refactored the primitive encoding using an Encoder trait for VariantPrimitiveType and set visibility to pub(crate). How does this overall design look to you now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PinkCrow007 -- will check it out tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @PinkCrow007 -- this is very very nice work and I think it will form the basis of a great API to create variant values

I had some structural suggestions which I left, but the biggest suggestion is that I think the wonderful tests you have written would be improved significantly if we can change them to read back the Variant values that were written in the builders

Here is what I would like to propose to move forward:

  1. I'll make a new PR with the scaffolding for a parquet-variant crate
  2. I'll port some of the code for reading variant values there

Then I think we could add the builder code and tests you have in this PR and add them to the other crate.

But really this is great work

@@ -60,6 +60,8 @@ pub enum ArrowError {
DictionaryKeyOverflowError,
/// Error when the run end index in a REE array is bigger than the array length
RunEndIndexOverflowError,
/// Error during Variant operations in `arrow-variant`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add a new variant to this enum, it will be a "breaking API change" as then downstream projects would potentially have to update their code to handle new variants

We make releases with API changes every three months,
https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule

So in other words, it would be great to remove this change from the PR so we can merge it faster.

@@ -37,6 +37,8 @@ mod uuid;
pub use uuid::Uuid;
mod variable_shape_tensor;
pub use variable_shape_tensor::{VariableShapeTensor, VariableShapeTensorMetadata};
mod variant;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we postpone adding the canonical extension type classes until we get farther along in the process and are in a better position to write tests.

In other words I recommend removing the changes in arrow-schema/src/extension/ as well in this pR

Comment on lines +42 to +49
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true, optional = true }
arrow-data = { workspace = true }
arrow-schema = { workspace = true, features = ["canonical_extension_types"] }
serde = { version = "1.0", default-features = false }
serde_json = { version = "1.0", default-features = false, features = ["std"] }
indexmap = "2.0.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any of these dependencies are used so we can remove them

arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true, optional = true }
arrow-data = { workspace = true }
serde = { version = "1.0", default-features = false }
serde_json = { version = "1.0", default-features = false, features = ["std"] }

Comment on lines 801 to 815
// Verify metadata contains all keys
let keys = get_metadata_keys(&metadata_buffer);
assert_eq!(keys.len(), 11, "Should have 11 keys in metadata");
assert!(keys.contains(&"null".to_string()), "Missing 'null' key");
assert!(
keys.contains(&"bool_true".to_string()),
"Missing 'bool_true' key"
);
assert!(keys.contains(&"string".to_string()), "Missing 'string' key");

// Verify object has the correct number of entries
// First byte after header is the number of fields (if small object)
assert!(value_buffer.len() > 1, "Value buffer too small");
let num_fields = value_buffer[1];
assert_eq!(num_fields as usize, 11, "Object should have 11 fields");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than testing these "internal" fields I think the tests would be better if they tested that the resulting value is a readable Variant value. See my next comment below

assert!(!variant.value().is_empty());
}

// =========================================================================
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very impressive set of test cases 👌

Comment on lines 261 to 291
/// Encodes a date value (days since epoch)
pub fn encode_date(value: i32, output: &mut Vec<u8>) {
// Use primitive + date type
let header = primitive_header(VariantPrimitiveType::Date as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}

/// Encodes a timestamp value (milliseconds since epoch)
pub fn encode_timestamp(value: i64, output: &mut Vec<u8>) {
// Use primitive + timestamp type
let header = primitive_header(VariantPrimitiveType::Timestamp as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}

/// Encodes a timestamp without timezone value (milliseconds since epoch)
pub fn encode_timestamp_ntz(value: i64, output: &mut Vec<u8>) {
// Use primitive + timestamp_ntz type
let header = primitive_header(VariantPrimitiveType::TimestampNTZ as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}

/// Encodes a time without timezone value (milliseconds)
pub fn encode_time_ntz(value: i64, output: &mut Vec<u8>) {
// Use primitive + time_ntz type
let header = primitive_header(VariantPrimitiveType::TimeNTZ as u8);
output.push(header);
output.extend_from_slice(&value.to_le_bytes());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point

I also think they don't need to be pub (maybe we could start with pub(crate)) as the main API people would use is the builder I think

// specific language governing permissions and limitations
// under the License.

//! [`arrow-variant`] contains utilities for working with the [Arrow Variant][format] binary format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ 📖

Comment on lines 73 to 76
/// Builder API for creating variant values
pub mod builder;
/// Encoder module for converting values to Variant binary format
pub mod encoder;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should start with a minimal API surface area (only expose the Builder and Varaint types directly)

Suggested change
/// Builder API for creating variant values
pub mod builder;
/// Encoder module for converting values to Variant binary format
pub mod encoder;
/// Builder API for creating variant values
mod builder;
/// Encoder module for converting values to Variant binary format
mod encoder;

@alamb
Copy link
Contributor

alamb commented May 2, 2025

Thank yoU @Weijun-H for the review as well

@github-actions github-actions bot removed the arrow Changes to the arrow crate label May 8, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @PinkCrow007 -- this is (again) quite amazing. The code you have sketched out in this PR I think will form the basis for all variant processing going forward. Very impressive

What I suggest we should do now is start merging this PR piece by piece into the repo. I have created a PR here to add a skeleton structure we can work on

That PR also updates the datafusion-testing submodule so it contains binary examples so we can test interoperability with Spark

I think the next PR into the repo should add the Variant struct type along with some basic "read the existing binary values" type tests.

use arrow_schema::ArrowError;
use std::fmt;

/// A Variant value in the Arrow binary format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically Variant is part of the Parquet spec, not part of the Arrow spec 🤷

Suggested change
/// A Variant value in the Arrow binary format
/// A Variant value in the Parquet binary format


/// A Variant value in the Arrow binary format
#[derive(Debug, Clone, PartialEq)]
pub struct Variant<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start as it does not copy the values. However I think it may have a few issues:

  1. The lifetimes ('a) are the same for the Value and the Metadata which I think will make sharing metadata across multiple variants potentially tricky
  2. There is no way to use match effectively to switch on variant type. Instead, I think it needs methods like is_object or is_array

As written I think these structures follow a pattern that is more common on Java or C++ (which is fine, but if we are going to make a native Rust library I think it is worth following standard rust Idioms)

I wonder if you considered the structure contemplated here: #7423

Specifically this structure:

/// Variant value. May contain references to metadata and value
/// 'a is lifetime for metadata
/// 'b is lifetime for value
pub enum Variant<'a, 'b> {
  Variant::Null,
  Variant::Int8
  ...
  // strings are stored in the value and thus have references to that value
  Variant::String(&'b str),
  Variant::ShortString(&'b str),
  // Objects and Arrays need the metadata and values, so store both.
  Variant::Object(VariantObject<'a, 'b>),
  VariantArray(VariantArray<'a, 'b>)
}

/// Wrapper over Variant Metadata
pub struct VariantMetadata<'a> {
  metadata: &'a[u8],
  // perhaps access to header fields like dict length and is_sorted
}

/// Represents a Variant Object with references to the underlying metadata
/// and value fields
pub enum VariantObject<'a, 'b> {
  // pointer to metadata
  metadata: VariantMetadata<'a>,
  // pointer to value
  value: &'a [u8],
}

/// Converts the variant value to a serde_json::Value
pub fn as_value(&self) -> Result<serde_json::Value, ArrowError> {
let keys = crate::decoder::parse_metadata_keys(self.metadata)?;
crate::decoder::decode_value(self.value, &keys)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try and avoid using the JSON representation when decoding Variant values for such a low level API because:

  1. They are likely inefficient (e.g. to access a Variant value it needs to be converted from bytes --> Json --> Variant)
  2. They can't represent the types fully (e.g. there is no way to represent the different between Int and Float in JSON, all numbers are floats)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants