Open
Labels
0 - Backlog, Spark, cuIO, feature request, libcudf, tests
Description
During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:
- Rare failure in the page size estimator (Parquet writer, Report, Fix)
- Failure with tables larger than 1 GB (Parquet writer, Report, Fix)
- Failure with 10k nulls followed by >5 valid values (ORC writer, Report, Fix)
After discussion with the team, we agreed on the following additions to our testing suite to help prevent similar issues in the future:
- Based on test files in parquet-testing/data, verify that "read" versus "read-write-read" result in identical tables
- Based on test files in orc/examples, verify that "read" versus "read-write-read" result in identical tables
- Based on test files in parquet-testing/data, verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
- Based on test files in orc/examples, verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
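The four checks above reduce to two comparisons per test file: "read" versus "read-write-read", and "read" versus "read with Arrow, then convert to cudf". A minimal Python sketch of that shape follows. The helper names (`roundtrip_matches`, `reference_matches`, `cudf_parquet_checks`) are hypothetical and not part of cudf, and the real tests may well live in the libcudf C++ suite instead. The cuDF wiring assumes `cudf` and `pyarrow` are installed and a GPU is available, and it uses `DataFrame.equals` as a simplified stand-in for whatever table-equality check the team settles on.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable

Reader = Callable[[str], Any]
Writer = Callable[[Any, str], None]
Equal = Callable[[Any, Any], bool]


def roundtrip_matches(src: str, read: Reader, write: Writer, equal: Equal) -> bool:
    """Compare read(src) against read(write(read(src)))."""
    expected = read(str(src))
    with tempfile.TemporaryDirectory() as tmp:
        out = str(Path(tmp) / "roundtrip.bin")
        write(expected, out)
        # The reader must load eagerly: the directory is deleted on exit.
        actual = read(out)
    return equal(expected, actual)


def reference_matches(
    src: str,
    read: Reader,
    read_reference: Reader,
    convert: Callable[[Any], Any],
    equal: Equal,
) -> bool:
    """Compare read(src) against convert(read_reference(src))."""
    return equal(read(str(src)), convert(read_reference(str(src))))


def cudf_parquet_checks(src: str) -> bool:
    """Hypothetical wiring for cuDF and PyArrow; requires a GPU."""
    import cudf
    import pyarrow.parquet as pq

    # Assumption: DataFrame.equals is strict enough for this comparison.
    same = lambda a, b: a.equals(b)
    return roundtrip_matches(
        src, cudf.read_parquet, lambda df, p: df.to_parquet(p), same
    ) and reference_matches(
        src, cudf.read_parquet, pq.read_table, cudf.DataFrame.from_arrow, same
    )
```

Keeping the reader, writer, and equality check as parameters means the same two helpers could cover ORC by swapping in `cudf.read_orc`, `DataFrame.to_orc`, and `pyarrow.orc.read_table`, and a test driver would simply loop over the files in parquet-testing/data or orc/examples.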
Note: please also see #12739, which proposes that reader benchmarks verify that the round-tripped table matches the starting table.