Change some panics to errors in parquet decoder #8602

rambleraptor · 2025-10-13T21:45:02Z

Rationale for this change

Our internal testing triggered some unexpected panics. We've added error checks for each of them so that they don't affect other users.

What changes are included in this PR?

Various error checks so that these cases return errors instead of panicking.

Are these changes tested?

Tests should continue to pass.

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Existing tests should cover these changes.

Are there any user-facing changes?

None.

etseidl · 2025-10-14T16:05:18Z

I guess we can lump this in with #7806

rambleraptor · 2025-10-14T18:25:20Z

I'm happy to split this up into some separate PRs. I know it's a lot of random things as-is.

etseidl · 2025-10-14T18:59:43Z

> I'm happy to split this up into some separate PRs. I know it's a lot of random things as-is.

The pedant in me wants to take you up on your offer, but there's not so much going on here that I think that's necessary. Maybe just change the title to something that sounds better 😉. ("Address panics found in external testing").

etseidl

Thanks @rambleraptor, these all look sensible to me.

Would it be possible to gin up some tests for at least some of them?

parquet/src/encodings/decoding.rs

parquet/src/schema/types.rs

parquet/tests/arrow_reader/bad_data.rs

scovich

Thanks for digging into this! Several comments.

scovich · 2025-10-14T19:09:00Z

parquet/src/column/reader.rs

+                    return Ok((end, buf.slice(i32_size..end)));
+                }
+            }
+            Err(general_err!("not enough data to read levels"))


This is definitely an improvement over the existing code, but it opens a question:

Given that we're reading bytes from a byte buffer, it seems like we must expect to hit this situation at least occasionally? And the correct response is to fetch more bytes, not fail? Is there some mechanism for handling that higher up in the call stack? Or is there some reason it should be impossible for this code to run off the end of the buffer?

Also -- it seems like read_num_bytes should do bounds checking internally and return Option<T>, so buffer overrun is obvious at the call site instead of a hidden panic footgun? The method has a half dozen other callers, and they all need to do manual bounds checking, in various ways and with varying degrees of safety. In particular, parquet/src/data_type.rs has two call sites that lack any visible bounds checks.

In this particular instance we're reading a buffer that should contain an entire page of data. If it doesn't, that likely points to a problem with the metadata.

Changes to read_num_bytes would likely need more careful consideration as I suspect it might be used in some performance critical sections.
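For reference, the bounds-checked variant suggested in this thread could look roughly like the sketch below. It is not the crate's `read_num_bytes`: `try_read_i32_le` and `split_levels` are hypothetical names, and the sketch assumes the levels section starts with a 4-byte little-endian length prefix. A short buffer or a corrupt length becomes an error at the call site rather than a panic.

```rust
use std::convert::{TryFrom, TryInto};
use std::mem::size_of;

/// Hypothetical bounds-checked reader (not the parquet crate's API).
/// Returns `None` when fewer than four bytes are available.
fn try_read_i32_le(buf: &[u8]) -> Option<i32> {
    let bytes: [u8; 4] = buf.get(..size_of::<i32>())?.try_into().ok()?;
    Some(i32::from_le_bytes(bytes))
}

fn not_enough() -> String {
    "not enough data to read levels".to_string()
}

/// Splits a length-prefixed levels section off the front of a page buffer.
/// Returns the offset just past the section and the level bytes.
fn split_levels(buf: &[u8]) -> Result<(usize, &[u8]), String> {
    let i32_size = size_of::<i32>();
    let len = try_read_i32_le(buf).ok_or_else(not_enough)?;
    // A negative length can only come from corrupt data.
    let len = usize::try_from(len).map_err(|_| not_enough())?;
    // checked_add guards against overflow on a huge declared length.
    let end = i32_size.checked_add(len).ok_or_else(not_enough)?;
    // `get` returns None instead of panicking when the range is out of bounds.
    let levels = buf.get(i32_size..end).ok_or_else(not_enough)?;
    Ok((end, levels))
}
```

Whether the extra checks are acceptable in performance-critical callers is the open question raised above; this sketch does not measure that.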

parquet/src/encodings/decoding.rs

parquet/src/encodings/rle.rs

parquet/src/file/reader.rs

scovich · 2025-10-14T19:33:57Z

parquet/src/schema/types.rs

+            } else if !is_root_node {
+                return Err(general_err!("Repetition level must be defined for non-root types"));
            }
            Ok((next_index, Arc::new(builder.build().unwrap())))


How do we know the unwrap is safe?

build never returns an Err 😉. But good point, could replace unwrap with ?.
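A minimal sketch of the `unwrap` to `?` swap discussed here, using hypothetical stand-in types (`Node`, `NodeBuilder`, `build_node`) rather than the real schema builder. The stand-in `build` can fail purely to show the propagation; per the reply above, the real one currently never returns an `Err`, so the change is defensive.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a schema type.
#[derive(Debug, PartialEq)]
struct Node {
    name: String,
}

/// Hypothetical builder whose `build` is fallible.
struct NodeBuilder {
    name: String,
}

impl NodeBuilder {
    fn build(self) -> Result<Node, String> {
        if self.name.is_empty() {
            return Err("name must not be empty".to_string());
        }
        Ok(Node { name: self.name })
    }
}

/// With `?`, a build failure is returned to the caller instead of panicking,
/// even if `build` later gains a real error path.
fn build_node(name: &str, next_index: usize) -> Result<(usize, Arc<Node>), String> {
    let builder = NodeBuilder { name: name.to_string() };
    Ok((next_index, Arc::new(builder.build()?)))
}
```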

parquet/src/schema/types.rs

github-actions bot added the parquet Changes to the parquet crate label Oct 13, 2025

Assorted panics we've found

4d193b3

rambleraptor force-pushed the fix-some-panics branch from 17d5287 to 4d193b3 Compare October 13, 2025 22:11

rambleraptor mentioned this pull request Oct 14, 2025

[DISCUSS] Remove panics #7806

Open

etseidl approved these changes Oct 14, 2025


scovich reviewed Oct 14, 2025

rambleraptor changed the title ~~Assorted panics we've found~~ Address panics found in external testing Oct 14, 2025

rambleraptor and others added 2 commits October 14, 2025 13:05

Update parquet/src/encodings/decoding.rs

5e6f678

Co-authored-by: Ed Seidl <[email protected]>

Update parquet/src/file/reader.rs

ba4eb84

Co-authored-by: Ryan Johnson <[email protected]>

alamb changed the title ~~Address panics found in external testing~~ Change some panics to errors in parquet decoder Oct 14, 2025

3 participants