Open
Labels
0 - Backlog (In queue waiting for assignment), bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code.)
Description
This issue tracks a possible misinterpretation by libcudf of Parquet list schemas stored in a legacy format. It is a follow-up to #13277.
It concerns backward-compatibility rules 3 and 4 in the Parquet LogicalTypes spec, which state:
3. If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
4. Otherwise, the repeated field's type is the element type with the repeated field's repetition.
Consider the following schema, from the Parquet file attached herewith:
<pyarrow._parquet.ParquetSchema object at 0x7fe1cc5849c0>
required group field_id=-1 spark_schema {
  required group field_id=-1 my_list (List) {
    repeated group field_id=-1 array {
      required int32 field_id=-1 item;
    }
  }
}
libcudf seems to interpret this as List<Int32>:
$ gtests/PARQUET_TEST --gtest_filter=ParquetReaderTest.Myth
...
cudf::list_view<int32_t>:
Length : 1
Offsets : 0, 2
0, 1
By my reading of the spec, this should be interpreted as a List<Struct<Int32>>. Apache Spark seems to concur:
scala> spark.read.parquet("pq_array.parquet").printSchema
root
 |-- my_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item: integer (nullable = true)
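To make the reading of rule 3 concrete, here is a minimal Python sketch that applies it to the schema above. `SchemaNode`, `rule3_applies`, and `describe` are hypothetical names made up for illustration; they are not pyarrow, Spark, or libcudf APIs. Only rule 3 as quoted is modeled; rules 1, 2, and 4 are not.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaNode:
    # Hypothetical stand-in for a Parquet schema node (not a pyarrow or libcudf type).
    name: str
    repetition: str                      # "required", "optional" or "repeated"
    physical_type: str = ""              # left empty for groups
    children: List["SchemaNode"] = field(default_factory=list)


def rule3_applies(list_name: str, repeated: SchemaNode) -> bool:
    # Rule 3: a repeated group with exactly one field, named "array" or
    # "<list name>_tuple", is itself the (required) element type.
    return (
        repeated.repetition == "repeated"
        and len(repeated.children) == 1
        and repeated.name in ("array", list_name + "_tuple")
    )


def describe(node: SchemaNode) -> str:
    # Render a leaf as its capitalized physical type and a group as Struct<...>.
    if not node.children:
        return node.physical_type.capitalize()
    return "Struct<" + ", ".join(describe(c) for c in node.children) + ">"


# Schema from this issue:
#   my_list (List) { repeated group array { required int32 item; } }
item = SchemaNode("item", "required", "int32")
array = SchemaNode("array", "repeated", children=[item])
my_list = SchemaNode("my_list", "required", children=[array])

if rule3_applies(my_list.name, my_list.children[0]):
    # The repeated group is the element, so the element is a struct.
    print("List<" + describe(my_list.children[0]) + ">")  # prints List<Struct<Int32>>
```

Under this reading, the repeated group `array` is the element itself, so the column comes out as a list of single-field structs rather than a list of `int32`, which matches the Spark output above.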