Skip to content

Improve error messages if schema hint mismatches with parquet schema #7481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 7, 2025

Which issue does this PR close?

Rationale for this change

Per #7479 (comment) the error messages are pretty bad as they tell you what fields were mismatched but not what about them was different

What changes are included in this PR?

  1. Improve error messages
  2. Add some Debug impls for the reader builders so I could use unwrap_err

Are there any user-facing changes?

better errors, new Debug impls

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels May 7, 2025
@@ -1073,8 +1073,8 @@ mod tests {

let a = Int64Array::from(vec![1, 2, 3, 4, 5]);

let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)]);
assert!(batch.is_err());
let err = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)]).unwrap_err();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive by cleanup

pub struct RowFilter {
/// A list of [`ArrowPredicate`]
pub(crate) predicates: Vec<Box<dyn ArrowPredicate>>,
}

impl Debug for RowFilter {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
write!(f, "RowFilter {{ {} predicates: }}", self.predicates.len())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrowPredicate doesn't implement Debug

schema: supplied_schema,
fields: field_levels.levels.map(Arc::new),
})
let mut errors = Vec::new();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improving this message is the point of the PR

I also relaxed the check slightly so this will now allow the fields to differ in metadata where previously this would return an error. There is no test coverage for mismatched schemas

FYI @paleolimbot in case you have any wisdom to share here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...this would mean that extension types can be cast implicitly to their storage (or perhaps the opposite, depending on which field metadata takes precedence). It is probably safer to fail, but not the end of the world because those errors will show up later (an error matching a signature if the extension metadata is dropped, or an error parsing bytes if unexpected content was given an extension type by accident). A true "user defined type" solution for DataFusion would be a place to handle this properly in some future (field_common(field_a, field_b) -> Field, field_cast(array, array_field, common_field) -> ArrayRef, or something).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think relaxing the check means that a user could supply the reader a schema that had metadata that was not present in the file and the reader will then read RecordBatches that have that metadata

I agree field_cast is the longer term right thing to do in DataFusion

In arrow-rs I think that field "casting" is happening during reading of parquet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the only correctness issue would be if the supplied schema had conflicting extension metadata (e.g., unit: m vs unit: cm). I am not sure that the current Parquet reader ever produces extension metadata (does it read the ARROW:i_forget_the_exact_name key and deserialize the schema?), so perhaps not an issue as long as somebody remembers this when it does.

Copy link
Contributor Author

@alamb alamb May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can put the metadata check back in perhaps and we can relax it when necessary.

I mostly was being lazy to avoid writing a test for it .

@@ -3462,7 +3497,7 @@ mod tests {
Field::new("col2_valid", ArrowDataType::Int32, false),
Field::new("col3_invalid", ArrowDataType::Int32, false),
])),
"Arrow: incompatible arrow schema, the following fields could not be cast: [col1_invalid, col3_invalid]",
"Arrow: Incompatible supplied Arrow schema: data type mismatch for field col1_invalid: requested Int32 but found Int64, data type mismatch for field col3_invalid: requested Int32 but found Int64",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pretty good example of the before / after error messages. I would feel much better trying to debug the new message than the old

@alamb alamb changed the title Improve error messages if schema hint mismatches with parquet scheam Improve error messages if schema hint mismatches with parquet schema May 7, 2025
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still new to arrow-rs but I took a look through for things that were out of place and I didn't find any. The new error messages seem much better!

Schema metadata is probably not in scope here, but that is also occasionally different when merging things from multiple files. I believe Arrow C++ Datasets will just give you the metadata blindly from the first one (for better or worse).

field2.data_type()
));
}
if field1.is_nullable() != field2.is_nullable() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a non-nullable field be cast to a nullable one?

@alamb
Copy link
Contributor Author

alamb commented May 8, 2025

Thank you for your comments @paleolimbot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants