Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Implement Variant readers #12139

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Jan 31, 2025

This adds Parquet readers for Variant data, including for shredded fields.

The readers are built using a new Variant visitor that is plugged into the existing Parquet type visitor by returning a Variant visitor class. When a Variant visitor is present, it is called to produce a result to pass into the Parquet variant visitor method. This PR adds a Variant visitor that produces Parquet readers that handle metadata, serialized values, and shredded values.

Tests are in TestVariantReaders.

This does not support Variant arrays. I intend to follow up with support after this is reviewed.

@@ -47,7 +47,7 @@ public Set<Integer> struct(Types.StructType struct, List<Set<Integer>> fieldResu

@Override
public Set<Integer> field(Types.NestedField field, Set<Integer> fieldResult) {
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType()) {
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType() || field.type() instanceof Types.VariantType) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed for reading Parquet files without dropping variant columns.

@@ -209,59 +213,59 @@ public static VariantPrimitive<Void> ofNull() {
return new PrimitiveWrapper<>(PhysicalType.NULL, null);
}

static VariantPrimitive<Boolean> of(boolean value) {
public static VariantPrimitive<Boolean> of(boolean value) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to expose these to create test values in the Parquet package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be fine to expose it here, but it we could also add some builders in the test module if we want to not have as much exposure. So TestParquet depends on TestCore and core has public builders for Variants.

* @param name a String name
* @return true if the group contains a field with the given name
*/
public static boolean hasField(GroupType group, String name) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New helper methods for working with Parquet fields.

* }
* </pre>
*/
public R object(GroupType object, R valueResult, List<R> fieldResults) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value, object, and array visitor methods are all used for a value / typed_value pair, depending on the structure of typed value (a shredded value, shredded array, or shredded object).

if (field.isPrimitive()) {
return false;
} else if (expected.type() == org.apache.iceberg.types.Types.VariantType.get()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes were needed to project the sub-fields of a variant because it looks like a struct.

@@ -117,6 +120,14 @@ public GroupType map(MapType map, Type.Repetition repetition, int id, String nam
.named(AvroSchemaUtil.makeCompatibleName(name));
}

public Type variant(Type.Repetition repetition, int id, String originalName) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implements converting an Iceberg variant schema into a Parquet schema, without shredding.

@@ -54,101 +57,108 @@ public static <T> T visit(
} else {
// if not a primitive, the typeId must be a group
GroupType group = type.asGroupType();
OriginalType annotation = group.getOriginalType();
Copy link
Contributor Author

@rdblue rdblue Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code was refactored into methods for each case. Handling for LIST and MAP and struct fields did not change.

@@ -217,11 +238,19 @@ public T map(Types.MapType iMap, GroupType map, T key, T value) {
return null;
}

public T variant(Types.VariantType iVariant, T result) {
throw new UnsupportedOperationException("Not implemented for variant");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thrown to force implementations to update in order to handle variants.

@amogh-jahagirdar amogh-jahagirdar self-requested a review January 31, 2025 04:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants