[BUG] Misinterpretation of Parquet List schema with single GROUP child named "array" #13313

@mythrocks

Description

This issue tracks a possible misinterpretation by libcudf of Parquet list schemas stored in a legacy format. It is a follow-up to #13277.

The problem is specific to backward-compatibility rules #3 and #4 in the Parquet LogicalTypes spec, which state:

3. If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
4. Otherwise, the repeated field's type is the element type with the repeated field's repetition.
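As a rough illustration of how these two rules could be applied, here is a minimal Python sketch. The `Field` class and `resolve_legacy_element` function are hypothetical names made up for this example and are not part of libcudf, pyarrow, or Spark. The sketch only handles a repeated group with exactly one child, and it assumes that rule 4 means the single child of the repeated group is the element (the standard three-level layout).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str                # "required", "optional" or "repeated"
    type: str = "group"            # primitive type name, or "group"
    children: List["Field"] = field(default_factory=list)


def resolve_legacy_element(list_name: str, repeated: Field) -> str:
    """Describe the element type of a LIST-annotated group named `list_name`
    whose repeated child is a group with exactly one field."""
    if repeated.type != "group" or len(repeated.children) != 1:
        # Rules 1 and 2 of the spec cover these cases; not modelled here.
        raise ValueError("only single-field repeated groups are handled")

    child = repeated.children[0]
    if repeated.name in ("array", f"{list_name}_tuple"):
        # Rule 3: the repeated group itself is the element type,
        # so the element is a struct wrapping the single child.
        return f"Struct<{child.name}: {child.type}>"

    # Rule 4 (assumed reading): the repeated group is only a wrapper,
    # and its single child is the element.
    return child.type


# A repeated group named "array" with one int32 child, under a list "my_list".
legacy = Field("array", "repeated",
               children=[Field("item", "required", "int32")])
print(f"List<{resolve_legacy_element('my_list', legacy)}>")

# The standard three-level layout, for comparison.
standard = Field("list", "repeated",
                 children=[Field("element", "optional", "int32")])
print(f"List<{resolve_legacy_element('my_list', standard)}>")
```

Under these assumptions the first call falls under rule 3 and yields a list of structs, while the second falls under rule 4 and yields a list of int32.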

Consider the following schema, from the Parquet file attached herewith:

 <pyarrow._parquet.ParquetSchema object at 0x7fe1cc5849c0>
required group field_id=-1 spark_schema {
  required group field_id=-1 my_list (List) {
    repeated group field_id=-1 array {
      required int32 field_id=-1 item;
    }
  }
}

libcudf seems to interpret this as List<Int32>:

$ gtests/PARQUET_TEST --gtest_filter=ParquetReaderTest.Myth
...
cudf::list_view<int32_t>:
Length : 1
Offsets : 0, 2
   0, 1

By my reading of the spec, this should be interpreted as a List<Struct<Int32>>. Apache Spark seems to concur:

scala> spark.read.parquet("pq_array.parquet").printSchema
root
 |-- my_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item: integer (nullable = true)

Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code.)
