[FEA] Add Parquet and ORC unit tests based on Apache sample files #13627

@GregoryKimball

During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:

  • Rare failure with page size estimator (PQ writer, Report, Fix)
  • Failure with >1GB tables (PQ writer, Report, Fix)
  • Failure with 10k nulls followed by >5 valid values (ORC Writer, Report, Fix)

After discussion with the team, we agreed on these additions to our testing suite to help prevent similar issues in the future:

  • Based on test files in parquet-testing/data, verify that "read" and "read-write-read" produce identical tables
  • Based on test files in orc/examples, verify that "read" and "read-write-read" produce identical tables
  • Based on test files in parquet-testing/data, verify that "read" and "read_with_Arrow-convert_to_cudf" produce identical tables
  • Based on test files in orc/examples, verify that "read" and "read_with_Arrow-convert_to_cudf" produce identical tables
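
The two kinds of checks listed above could be sketched roughly as follows. This is a minimal, library-agnostic illustration and not the libcudf test implementation: the helper names `roundtrip_matches` and `readers_agree` are hypothetical, and the reader, writer, and equality functions are passed in as parameters so that nothing about the actual test harness is assumed.

```python
import os
import tempfile
from typing import Any, Callable


def roundtrip_matches(
    path: str,
    read: Callable[[str], Any],
    write: Callable[[Any, str], None],
    equal: Callable[[Any, Any], bool] = lambda a, b: a == b,
    suffix: str = "",
) -> bool:
    """Compare "read" against "read-write-read" for one sample file.

    `read`, `write`, and `equal` are supplied by the caller (hypothetical
    hooks, not part of any library API).
    """
    first = read(path)  # "read"
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "roundtrip" + suffix)
        write(first, out_path)   # "read-write"
        second = read(out_path)  # "read-write-read"
    return equal(first, second)


def readers_agree(
    path: str,
    read_a: Callable[[str], Any],
    read_b: Callable[[str], Any],
    equal: Callable[[Any, Any], bool] = lambda a, b: a == b,
) -> bool:
    """Compare two independent readers on the same sample file,
    e.g. a native reader versus "read with Arrow, then convert"."""
    return equal(read_a(path), read_b(path))
```

With the cuDF Python API, for example, one might pass `read=cudf.read_parquet`, `write=lambda df, p: df.to_parquet(p)`, and `equal=lambda a, b: a.equals(b)`, and use `lambda p: cudf.DataFrame.from_arrow(pyarrow.parquet.read_table(p))` as the Arrow-based reader. That wiring is an assumption for illustration; the issue itself targets libcudf (C++) tests.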

Note: please also see #12739. For reader benchmarks, verify that the round-tripped table matches the starting table.

Labels

  • 0 - Backlog: In queue waiting for assignment
  • Spark: Functionality that helps Spark RAPIDS
  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • tests: Unit testing for project
