Handle compressed empty DataPage v2 #7389

Merged
merged 8 commits into from
Apr 21, 2025

Conversation

EnricoMi
Contributor

@EnricoMi EnricoMi commented Apr 7, 2025

Which issue does this PR close?

Rationale for this change

An empty bytes buffer cannot be decompressed. Spark's Parquet writer stores a DataPage v2 with only null values as an empty byte buffer, rather than compressed bytes that decompress to zero bytes.

The code currently tries to decompress a 0-byte buffer, which the decompressor rejects. This causes an error:

snappy: corrupt input (empty)

The issue is identical to this Apache Arrow issue: apache/arrow#22459
The fix is identical to the Apache Arrow fix: apache/arrow#45252

What changes are included in this PR?

Do not attempt to decompress the empty bytes buffer.
This is tested in a unit test and a small example Parquet file.
Requires apache/parquet-testing#74.
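
As a rough illustration of the change described above (not the actual code in this PR), the sketch below skips the decompressor whenever the values section is expected to be empty. `toy_decompress` and `decompress_values` are hypothetical names; the toy codec only mimics the behaviour reported above, namely that empty input is rejected rather than decoded to zero bytes.

```rust
/// Hypothetical stand-in for a real codec such as Snappy. Like the decoder
/// described in this PR, it refuses empty input instead of yielding 0 bytes.
fn toy_decompress(input: &[u8], output: &mut Vec<u8>) -> Result<usize, String> {
    if input.is_empty() {
        return Err("corrupt input (empty)".to_string());
    }
    // The "compression" is the identity function, which is enough for a sketch.
    output.extend_from_slice(input);
    Ok(input.len())
}

/// Decompress the values section of a page, calling the codec only when
/// some decompressed bytes are actually expected.
fn decompress_values(compressed: &[u8], expected_len: usize) -> Result<Vec<u8>, String> {
    let mut values = Vec::with_capacity(expected_len);
    if expected_len != 0 {
        let written = toy_decompress(compressed, &mut values)?;
        if written != expected_len {
            return Err(format!(
                "expected {} decompressed bytes, got {}",
                expected_len, written
            ));
        }
    }
    // With expected_len == 0 the codec is never invoked, so an empty
    // compressed buffer is accepted as an empty values section.
    Ok(values)
}
```

Under these assumptions, `decompress_values(&[], 0)` returns an empty vector, whereas passing an empty buffer straight to the codec fails.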

Are there any user-facing changes?

No.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 7, 2025
@EnricoMi EnricoMi force-pushed the datapage-v2-empty branch from 1b7ce45 to adf4c2f Compare April 7, 2025 10:20
&mut decompressed,
Some(uncompressed_size - offset),
)?;
if decompressed_size != 0 {
Contributor Author

I'd prefer

Suggested change
if decompressed_size != 0 {
if decompressed_size > 0 {

here, but apache/arrow#45252 uses !=.

Contributor

Should we add an extra check and return an error if offset > uncompressed_page_size?

Contributor

Yes please! Otherwise we may be susceptible to some sort of DoS / parquet bomb with malformed input.

Contributor

I'd prefer

FWIW, since decompressed_size is an unsigned integer, I think they are equivalent; you should do whatever you prefer / makes the most sense to you.
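
One possible shape for the extra check requested in this thread is sketched below. It assumes both sizes are already `usize`; `values_size` is a hypothetical helper, not a function in the parquet crate. Plain `uncompressed_page_size - offset` on unsigned integers panics in debug builds and wraps to a huge value in default release builds when `offset` is larger, so `checked_sub` turns that case into an error instead.

```rust
/// Compute the expected size of the decompressed values section,
/// rejecting pages whose levels claim more bytes than the whole page.
fn values_size(uncompressed_page_size: usize, offset: usize) -> Result<usize, String> {
    uncompressed_page_size.checked_sub(offset).ok_or_else(|| {
        format!(
            "levels length {} exceeds uncompressed page size {}",
            offset, uncompressed_page_size
        )
    })
}
```

Because the result is unsigned, testing it with `!= 0` or `> 0` selects exactly the same pages, as noted in the comment above.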

@EnricoMi
Contributor Author

EnricoMi commented Apr 8, 2025

@mapleFU @pitrou this ports your bugfix apache/arrow#45252 from Arrow C++ to Rust.

@EnricoMi EnricoMi changed the title Work with empty DataPage v2 (only null values) Handle compressed empty DataPage v2 Apr 8, 2025
Contributor

@alamb alamb left a comment

Thank you @EnricoMi and @adamreeve

This is looking good.

let uncompressed_size = page_header.uncompressed_page_size as usize;
let mut decompressed = Vec::with_capacity(uncompressed_size);
let compressed = &buffer.as_ref()[offset..];
let uncompressed_page_size = page_header.uncompressed_page_size as usize;
Contributor

If we are going to be messing with this code anyways, maybe we can do a checked conversion from signed to unsigned as well:

Suggested change
let uncompressed_page_size = page_header.uncompressed_page_size as usize;
let uncompressed_page_size = usize::try_from(page_header.uncompressed_page_size)?;

Contributor Author

@EnricoMi EnricoMi Apr 9, 2025

I have replaced as usize with usize::try_from in all unchecked places. There are 8 spots in this file with as u64, but I am not bold enough to change those as well.
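
To illustrate why the checked conversion matters, here is a small standalone sketch. The function names are hypothetical, and the page header field is assumed to be an `i32`, as the signed-to-unsigned remark above suggests. An `as` cast silently reinterprets a negative size as a very large one, while `usize::try_from` reports it as an error.

```rust
use std::num::TryFromIntError;

/// Unchecked conversion: a negative header value becomes a huge usize.
fn unchecked_size(raw: i32) -> usize {
    raw as usize
}

/// Checked conversion: a negative header value is surfaced as an error.
fn checked_size(raw: i32) -> Result<usize, TryFromIntError> {
    usize::try_from(raw)
}
```

A corrupt header carrying `-1` would therefore request an enormous allocation through `unchecked_size`, but would be rejected by `checked_size`.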

@@ -1321,6 +1320,107 @@ mod tests {
assert_eq!(page_count, 2);
}

#[test]
fn test_file_reader_empty_datapage_v2() {
let test_file = get_test_file("datapage_v2_empty_datapage.snappy.parquet");
Contributor

While we are adding tests, can you also add a test that page_v2_empty_compressed.parquet (added in apache/parquet-testing#71) works too? It doesn't seem to be used yet: https://github.com/search?q=repo%3Aapache%2Farrow-rs%20page_v2_empty_compressed.parquet&type=code

I know you say the current reader works ok with that file, but it would be nice to add additional automated coverage

Contributor Author

Added, this is a good test case.

assert!(is_expected_page);
page_count += 1;
}
assert_eq!(page_count, 1);
Contributor

I double checked that the added test fails as follows without the code changes in this PR

assertion `left == right` failed
  left: 0
 right: 1

Left:  0
Right: 1
<Click to see difference>

thread 'file::serialized_reader::tests::test_file_reader_empty_datapage_v2' panicked at parquet/src/file/serialized_reader.rs:1422:9:
assertion `left == right` failed
  left: 0
 right: 1
stack backtrace:

Contributor Author

Changed the while let Ok(Some(page)) = page_reader_0.get_next_page() { to while let Some(page) = page_reader_0.get_next_page().unwrap() { so it is obvious why expected pages are not seen (now the error surfaces).
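
The difference between the two loop forms can be shown with a toy reader. `ToyPageReader` and both counting functions are hypothetical and exist only for this sketch; pages are plain `u32` values. The strict variant uses `?` instead of the `unwrap()` used in the test so that the error can be inspected rather than causing a panic.

```rust
use std::collections::VecDeque;

/// Hypothetical page source: replays queued results, then reports Ok(None).
struct ToyPageReader {
    results: VecDeque<Result<Option<u32>, String>>,
}

impl ToyPageReader {
    fn new(results: Vec<Result<Option<u32>, String>>) -> Self {
        Self { results: results.into() }
    }

    fn get_next_page(&mut self) -> Result<Option<u32>, String> {
        self.results.pop_front().unwrap_or(Ok(None))
    }
}

/// `while let Ok(Some(..))` stops on an error exactly as it stops at the end
/// of the pages, so a read failure only shows up as a low page count.
fn count_pages_lenient(reader: &mut ToyPageReader) -> usize {
    let mut count = 0;
    while let Ok(Some(_page)) = reader.get_next_page() {
        count += 1;
    }
    count
}

/// Matching on `Some(..)` after propagating the Result surfaces the error.
fn count_pages_strict(reader: &mut ToyPageReader) -> Result<usize, String> {
    let mut count = 0;
    while let Some(_page) = reader.get_next_page()? {
        count += 1;
    }
    Ok(count)
}
```

With a reader whose first result is an error, the lenient loop returns 0 pages, which matches the `left: 0, right: 1` assertion failure quoted above, while the strict loop returns the underlying error.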

Contributor

@alamb alamb left a comment

Thank you @EnricoMi and @adamreeve

@EnricoMi
Contributor Author

Waiting for apache/parquet-testing#74 to be merged. Will then point the parquet-testing submodule back to the main branch.

@EnricoMi
Contributor Author

apache/parquet-testing#74 has been merged, moved parquet-testing to latest master.

@Dandandan Dandandan merged commit 6a6c631 into apache:main Apr 21, 2025
16 checks passed
@Dandandan
Contributor

Thanks @EnricoMi

@EnricoMi
Contributor Author

@Dandandan thanks! Will this be released as 56.0.0 and 55.0.1?

@EnricoMi EnricoMi deleted the datapage-v2-empty branch April 21, 2025 16:04
@alamb
Contributor

alamb commented Apr 28, 2025

@Dandandan thanks! Will this be released as 56.0.0 and 55.0.1?

It is scheduled to be released in 55.1.0 due in a few weeks

@EnricoMi
Contributor Author

Thanks!

Development

Successfully merging this pull request may close these issues.

Reading empty DataPageV2 fails with snappy: corrupt input (empty)
4 participants