-
Notifications
You must be signed in to change notification settings - Fork 930
Handle compressed empty DataPage v2 #7389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1b7ce45
to
adf4c2f
Compare
&mut decompressed, | ||
Some(uncompressed_size - offset), | ||
)?; | ||
if decompressed_size != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer
if decompressed_size != 0 { | |
if decompressed_size > 0 { |
here, but apache/arrow#45252 uses !=
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add an extra check and return an error if offset > uncompressed_page_size
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please! Otherwise we may be susceptible to some sort of DOS / parquet bomb with malformed input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer
FWIW since decompressed_size
is an unsigned integer I think they are equivalent you should do whatever you prefer / makes the most sense to you
@mapleFU @pitrou this ports your bugfix apache/arrow#45252 from Arrow C++ to Rust. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @EnricoMi and @adamreeve
This is looking good.
let uncompressed_size = page_header.uncompressed_page_size as usize; | ||
let mut decompressed = Vec::with_capacity(uncompressed_size); | ||
let compressed = &buffer.as_ref()[offset..]; | ||
let uncompressed_page_size = page_header.uncompressed_page_size as usize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to be messing with this code anyways, maybe we can do a checked conversion from signed to unsigned as well:
let uncompressed_page_size = page_header.uncompressed_page_size as usize; | |
let uncompressed_page_size = usize::try_from(page_header.uncompressed_page_size)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have replaced as usize
with usize::try_from
in all unchecked places. There are 8 spots in this file with as u64
, but I am not bolds enough to change those as well.
&mut decompressed, | ||
Some(uncompressed_size - offset), | ||
)?; | ||
if decompressed_size != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please! Otherwise we may be susceptible to some sort of DOS / parquet bomb with malformed input
&mut decompressed, | ||
Some(uncompressed_size - offset), | ||
)?; | ||
if decompressed_size != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer
FWIW since decompressed_size
is an unsigned integer I think they are equivalent you should do whatever you prefer / makes the most sense to you
@@ -1321,6 +1320,107 @@ mod tests { | |||
assert_eq!(page_count, 2); | |||
} | |||
|
|||
#[test] | |||
fn test_file_reader_empty_datapage_v2() { | |||
let test_file = get_test_file("datapage_v2_empty_datapage.snappy.parquet"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are adding tests, can you also add a test that page_v2_empty_compressed.parquet
(added in apache/parquet-testing#71) works too ? It doesn't seem to be used yet https://github.com/search?q=repo%3Aapache%2Farrow-rs%20page_v2_empty_compressed.parquet&type=code
I know you say the current reader works ok with that file, but it would be nice to add additional automated coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, this is a good test case.
assert!(is_expected_page); | ||
page_count += 1; | ||
} | ||
assert_eq!(page_count, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked that the added test fails as follows without the code changes in this PR
assertion `left == right` failed
left: 0
right: 1
Left: 0
Right: 1
<Click to see difference>
thread 'file::serialized_reader::tests::test_file_reader_empty_datapage_v2' panicked at parquet/src/file/serialized_reader.rs:1422:9:
assertion `left == right` failed
left: 0
right: 1
stack backtrace:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the while let Ok(Some(page)) = page_reader_0.get_next_page() {
to while let Some(page) = page_reader_0.get_next_page().unwrap() {
so it is obvious why expected pages are not seen (now the error surfaces).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @EnricoMi and @adamreeve
Waiting for apache/parquet-testing#74 to be merged. Will then point parquet-testing sub-module back to main branch. |
apache/parquet-testing#74 has been merged, moved |
Thanks @EnricoMi |
@Dandandan thanks! Will this be released as 56.0.0 and 55.0.1? |
It is scheduled to be released in 55.1.0 due in a few weeks |
Thanks! |
Which issue does this PR close?
snappy: corrupt input (empty)
#7388.Rationale for this change
An empty bytes buffer cannot be decompressed. Spark's Parquet writer stores a DataPage v2 with only
null
values as an empty byte buffer, rather than compressed bytes that decompress to zero bytes.The code currently tries to decompress a 0 bytes buffer, which is not allowed. This causes an error:
The issue is identical to this Apache Arrow issue: apache/arrow#22459
The fix is identical to Apache Arrow fix: apache/arrow#45252
What changes are included in this PR?
Do not attempt to decompress the empty bytes buffer.
This is tested in a unit test and a small example Parquet file.
Requires apache/parquet-testing#74.
Are there any user-facing changes?
No.