[BUG] Misinterpretation of Parquet List schema with single GROUP child named "array" #13313

This issue tracks a (possible) misinterpretation of Parquet list schemas stored in a legacy format. It is a follow-up to https://github.com/rapidsai/cudf/pull/13277.

This concerns backward-compatibility rules #3 and #4 in the [Parquet `LogicalType` spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules), which state:
```
3. If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
4. Otherwise, the repeated field's type is the element type with the repeated field's repetition.
```
Consider the following schema, taken from the [attached Parquet file](https://github.com/rapidsai/cudf/files/11427533/pq_array.zip):
```
 <pyarrow._parquet.ParquetSchema object at 0x7fe1cc5849c0>
required group field_id=-1 spark_schema {
  required group field_id=-1 my_list (List) {
    repeated group field_id=-1 array {
      required int32 field_id=-1 item;
    }
  }
}
```

`libcudf` seems to interpret this as `List<Int32>`:
```
$ gtests/PARQUET_TEST --gtest_filter=ParquetReaderTest.Myth
...
cudf::list_view<int32_t>:
Length : 1
Offsets : 0, 2
   0, 1
```
By my reading of the spec, this should be interpreted as a `List<Struct<Int32>>`. Apache Spark seems to concur:
```scala
scala> spark.read.parquet("pq_array.parquet").printSchema
root
 |-- my_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item: integer (nullable = true)
```
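
To make the reading above concrete, here is a minimal sketch of the backward-compatibility check applied to the schema in this issue. It is not `libcudf` or Spark code: `Field` and `repeated_field_is_element` are hypothetical names introduced only for illustration, and rules 1 and 2 are paraphrased from the same section of the spec (they are not quoted above).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    is_group: bool
    children: List["Field"] = field(default_factory=list)


def repeated_field_is_element(repeated: Field, list_name: str) -> bool:
    """Return True if the repeated field itself is the list's element type.

    Returning False means none of rules 1-3 matched (rule 4), i.e. the
    repeated group is only a wrapper and its single child is the element.
    """
    # Rule 1 (paraphrased): a repeated non-group field is the element.
    if not repeated.is_group:
        return True
    # Rule 2 (paraphrased): a repeated group with multiple fields is the element.
    if len(repeated.children) > 1:
        return True
    # Rule 3 (quoted above): a single-field group named "array" or
    # "<list name>_tuple" is the element.
    if repeated.name in ("array", f"{list_name}_tuple"):
        return True
    # Rule 4: otherwise.
    return False


# The schema from this issue: my_list (List) -> repeated group array { int32 item }
array_group = Field("array", is_group=True, children=[Field("item", is_group=False)])
print(repeated_field_is_element(array_group, "my_list"))
```

Under this sketch, rule 3 matches for the `array` group, so the element type is the group itself (a struct with the single field `item`), which corresponds to `List<Struct<Int32>>` rather than `List<Int32>`.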
