-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: Implement Variant readers #12139
base: main
Are you sure you want to change the base?
Conversation
@@ -47,7 +47,7 @@ public Set<Integer> struct(Types.StructType struct, List<Set<Integer>> fieldResu | |||
|
|||
@Override | |||
public Set<Integer> field(Types.NestedField field, Set<Integer> fieldResult) { | |||
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType()) { | |||
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType() || field.type() instanceof Types.VariantType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed for reading Parquet files without dropping variant columns.
@@ -209,59 +213,59 @@ public static VariantPrimitive<Void> ofNull() { | |||
return new PrimitiveWrapper<>(PhysicalType.NULL, null); | |||
} | |||
|
|||
static VariantPrimitive<Boolean> of(boolean value) { | |||
public static VariantPrimitive<Boolean> of(boolean value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to expose these to create test values in the Parquet package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be fine to expose it here, but it we could also add some builders in the test module if we want to not have as much exposure. So TestParquet depends on TestCore and core has public builders for Variants.
* @param name a String name | ||
* @return true if the group contains a field with the given name | ||
*/ | ||
public static boolean hasField(GroupType group, String name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New helper methods for working with Parquet fields.
* } | ||
* </pre> | ||
*/ | ||
public R object(GroupType object, R valueResult, List<R> fieldResults) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value
, object
, and array
visitor methods are all used for a value
/ typed_value
pair, depending on the structure of typed value (a shredded value, shredded array, or shredded object).
if (field.isPrimitive()) { | ||
return false; | ||
} else if (expected.type() == org.apache.iceberg.types.Types.VariantType.get()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes were needed to project the sub-fields of a variant because it looks like a struct.
@@ -117,6 +120,14 @@ public GroupType map(MapType map, Type.Repetition repetition, int id, String nam | |||
.named(AvroSchemaUtil.makeCompatibleName(name)); | |||
} | |||
|
|||
public Type variant(Type.Repetition repetition, int id, String originalName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implements converting an Iceberg variant schema into a Parquet schema, without shredding.
@@ -54,101 +57,108 @@ public static <T> T visit( | |||
} else { | |||
// if not a primitive, the typeId must be a group | |||
GroupType group = type.asGroupType(); | |||
OriginalType annotation = group.getOriginalType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code was refactored into methods for each case. Handling for LIST and MAP and struct fields did not change.
@@ -217,11 +238,19 @@ public T map(Types.MapType iMap, GroupType map, T key, T value) { | |||
return null; | |||
} | |||
|
|||
public T variant(Types.VariantType iVariant, T result) { | |||
throw new UnsupportedOperationException("Not implemented for variant"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thrown to force implementations to update in order to handle variants.
This adds Parquet readers for Variant data, including for shredded fields.
The readers are built using a new Variant visitor that is plugged into the existing Parquet type visitor by returning a Variant visitor class. When a Variant visitor is present, it is called to produce a result to pass into the Parquet
variant
visitor method. This PR adds a Variant visitor that produces Parquet readers that handle metadata, serialized values, and shredded values.Tests are in
TestVariantReaders
.This does not support Variant arrays. I intend to follow up with support after this is reviewed.