[FEA] Add Parquet and ORC unit tests based on Apache sample files

During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:
* Rare failure with page size estimator (Parquet writer, [Report](https://github.com/rapidsai/cudf/issues/13250), [Fix](https://github.com/rapidsai/cudf/pull/13364))
* Failure with >1GB tables (Parquet writer, [Report](https://github.com/rapidsai/cudf/issues/13414), [Fix](https://github.com/rapidsai/cudf/pull/13438))
* Failure with 10k nulls followed by >5 valid values (ORC writer, [Report](https://github.com/rapidsai/cudf/issues/13460), [Fix](https://github.com/rapidsai/cudf/pull/13466))

After discussion with the team, we agreed on the following additions to our test suite to help prevent similar issues in the future:
* Based on test files in [parquet-testing/data](https://github.com/apache/parquet-testing/tree/b2e7cc755159196e3a068c8594f7acbaecfdaaac/data), verify that "read" versus "read-write-read" result in identical tables
* Based on test files in [orc/examples](https://github.com/apache/orc/tree/main/examples), verify that "read" versus "read-write-read" result in identical tables
* Based on test files in [parquet-testing/data](https://github.com/apache/parquet-testing/tree/b2e7cc755159196e3a068c8594f7acbaecfdaaac/data), verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
* Based on test files in [orc/examples](https://github.com/apache/orc/tree/main/examples), verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
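
The two kinds of checks listed above could be sketched roughly as follows. This is only an illustration, not the proposed test implementation: the helper names `check_roundtrip` and `check_against_reference` are hypothetical, and the reader, writer, converter, and equality functions are passed in as callables so the same harness can serve both Parquet and ORC.

```python
import os
import tempfile
from typing import Any, Callable


def check_roundtrip(
    path: str,
    read: Callable[[str], Any],
    write: Callable[[Any, str], None],
    equal: Callable[[Any, Any], bool],
) -> bool:
    """Hypothetical helper: compare "read" against "read-write-read"."""
    expected = read(path)  # "read"
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "roundtrip_" + os.path.basename(path))
        write(expected, out_path)  # "read-write"
        actual = read(out_path)    # "read-write-read" (must load eagerly)
    return equal(expected, actual)


def check_against_reference(
    path: str,
    read: Callable[[str], Any],
    reference_read: Callable[[str], Any],
    convert: Callable[[Any], Any],
    equal: Callable[[Any, Any], bool],
) -> bool:
    """Hypothetical helper: compare "read" against a reference reader
    whose result is converted into the same table type."""
    expected = convert(reference_read(path))
    actual = read(path)
    return equal(expected, actual)
```

For Parquet in Python, one might pass `read=cudf.read_parquet`, `write=lambda df, p: df.to_parquet(p)`, `reference_read=pyarrow.parquet.read_table`, and `convert=cudf.DataFrame.from_arrow`; for ORC, `cudf.read_orc`, `df.to_orc(p)`, and `pyarrow.orc.read_table` would take their places. The choice of equality check (for example `DataFrame.equals` or one of cudf's testing helpers) is left open here, since some sample files may need type-aware or order-insensitive comparison.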

Note: please also see #12739 (for reader benchmarks, verify that the round-tripped table matches the starting table).

