Improve error messages if schema hint mismatches with parquet schema #7481
base: main
Conversation
```diff
@@ -1073,8 +1073,8 @@ mod tests {
     let a = Int64Array::from(vec![1, 2, 3, 4, 5]);

-    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)]);
-    assert!(batch.is_err());
+    let err = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)]).unwrap_err();
```
Drive-by cleanup.
```rust
pub struct RowFilter {
    /// A list of [`ArrowPredicate`]
    pub(crate) predicates: Vec<Box<dyn ArrowPredicate>>,
}

impl Debug for RowFilter {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(f, "RowFilter {{ {} predicates: }}", self.predicates.len())
```
`ArrowPredicate` doesn't implement `Debug`, so the impl can't be derived and reports only the predicate count.
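For context, a self-contained sketch of the pattern (the empty trait stands in for the real `ArrowPredicate`): deriving would require `Box<dyn ArrowPredicate>: Debug`, so the manual impl summarizes instead.

```rust
use std::fmt::{self, Debug, Formatter};

// Stand-in for the real trait, which has no `Debug` supertrait.
trait ArrowPredicate {}

struct RowFilter {
    predicates: Vec<Box<dyn ArrowPredicate>>,
}

// `#[derive(Debug)]` would fail because `Box<dyn ArrowPredicate>` is not
// `Debug`, so we print the predicate count instead of each predicate.
impl Debug for RowFilter {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        write!(f, "RowFilter {{ {} predicates: }}", self.predicates.len())
    }
}

fn main() {
    let filter = RowFilter { predicates: Vec::new() };
    println!("{filter:?}"); // RowFilter { 0 predicates: }
}
```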
```rust
            schema: supplied_schema,
            fields: field_levels.levels.map(Arc::new),
        })
        let mut errors = Vec::new();
```
Improving this message is the point of the PR.
I also relaxed the check slightly, so this will now allow fields to differ in metadata where previously that would return an error. There is no test coverage for mismatched schemas.
FYI @paleolimbot in case you have any wisdom to share here
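The shape of the new check, as a minimal sketch (not the PR's exact code; `check_data_types` is an invented name): mismatches are collected into a `Vec` and joined into a single message, rather than failing on the first difference.

```rust
use arrow_schema::{ArrowError, Field};

// Illustrative helper: collect every mismatch rather than failing on the
// first, then report them all in one error.
fn check_data_types(requested: &[Field], found: &[Field]) -> Result<(), ArrowError> {
    let mut errors = Vec::new();
    for (req, fnd) in requested.iter().zip(found.iter()) {
        if req.data_type() != fnd.data_type() {
            errors.push(format!(
                "data type mismatch for field {}: requested {} but found {}",
                req.name(),
                req.data_type(),
                fnd.data_type()
            ));
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(ArrowError::SchemaError(format!(
            "Incompatible supplied Arrow schema: {}",
            errors.join(", ")
        )))
    }
}
```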
Hmm... this would mean that extension types can be cast implicitly to their storage (or perhaps the opposite, depending on which field metadata takes precedence). It is probably safer to fail, but it's not the end of the world, because those errors will show up later (an error matching a signature if the extension metadata is dropped, or an error parsing bytes if unexpected content was given an extension type by accident). A true "user defined type" solution for DataFusion would be a place to handle this properly in some future (`field_common(field_a, field_b) -> Field`, `field_cast(array, array_field, common_field) -> ArrayRef`, or something).
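As signatures, those hypothetical helpers might look roughly like this (purely illustrative; neither function exists in arrow-rs or DataFusion today):

```rust
use arrow_array::ArrayRef;
use arrow_schema::{ArrowError, Field};

/// Hypothetical: compute the field two inputs can agree on, reconciling
/// extension metadata instead of silently dropping one side.
fn field_common(field_a: &Field, field_b: &Field) -> Result<Field, ArrowError> {
    // would resolve data type, nullability, and metadata conflicts
    unimplemented!()
}

/// Hypothetical: cast an array described by `array_field` to `common_field`;
/// storage <-> extension conversions would live here.
fn field_cast(
    array: ArrayRef,
    array_field: &Field,
    common_field: &Field,
) -> Result<ArrayRef, ArrowError> {
    unimplemented!()
}
```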
I think relaxing the check means that a user could supply the reader a schema that had metadata that was not present in the file, and the reader will then read RecordBatches that have that metadata.
I agree `field_cast` is the right longer-term thing to do in DataFusion.
In arrow-rs I think that field "casting" is happening during reading of parquet.
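For concreteness, this is roughly how a schema hint reaches the reader and where the improved message surfaces (a sketch, assuming the `parquet` crate's `arrow` feature; the printed text is the new wording from this PR):

```rust
use std::sync::Arc;

use arrow_array::{Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::arrow::ArrowWriter;

fn main() {
    // Write an in-memory Parquet file with a single Int64 column.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )
    .unwrap();
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Supply a schema hint requesting Int32 instead: the builder rejects it.
    let hint = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let options = ArrowReaderOptions::new().with_schema(hint);
    let err = ParquetRecordBatchReaderBuilder::try_new_with_options(Bytes::from(buf), options)
        .unwrap_err();
    println!("{err}"); // e.g. "... data type mismatch for field a: requested Int32 but found Int64"
}
```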
Probably the only correctness issue would be if the supplied schema had conflicting extension metadata (e.g., `unit: m` vs `unit: cm`). I am not sure that the current Parquet reader ever produces extension metadata (does it read the `ARROW:i_forget_the_exact_name` key and deserialize the schema?), so perhaps not an issue as long as somebody remembers this when it does.
I can perhaps put the metadata check back in, and we can relax it when necessary.
I was mostly being lazy to avoid writing a test for it.
```diff
@@ -3462,7 +3497,7 @@ mod tests {
             Field::new("col2_valid", ArrowDataType::Int32, false),
             Field::new("col3_invalid", ArrowDataType::Int32, false),
         ])),
-        "Arrow: incompatible arrow schema, the following fields could not be cast: [col1_invalid, col3_invalid]",
+        "Arrow: Incompatible supplied Arrow schema: data type mismatch for field col1_invalid: requested Int32 but found Int64, data type mismatch for field col3_invalid: requested Int32 but found Int64",
```
This is a pretty good example of the before/after error messages. I would feel much better trying to debug the new message than the old one.
I'm still new to arrow-rs, but I took a look through for anything out of place and didn't find any. The new error messages seem much better!
Schema metadata is probably not in scope here, but that is also occasionally different when merging things from multiple files. I believe Arrow C++ Datasets will just give you the metadata blindly from the first one (for better or worse).
```rust
                field2.data_type()
            ));
        }
        if field1.is_nullable() != field2.is_nullable() {
```
Can a non-nullable field be cast to a nullable one?
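That direction does seem safe in principle: a field the file declares non-nullable can always be read as nullable, while the reverse cannot be guaranteed. A one-directional check might look like this (illustrative only, not what the PR does):

```rust
use arrow_schema::Field;

// A non-nullable file field can satisfy a nullable request, since allowing
// nulls never invalidates existing values; the reverse is unsound.
fn nullability_compatible(requested: &Field, found_in_file: &Field) -> bool {
    requested.is_nullable() || !found_in_file.is_nullable()
}
```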
Thank you for your comments @paleolimbot
Which issue does this PR close?
Rationale for this change
Per #7479 (comment), the error messages are pretty bad: they tell you which fields were mismatched but not how they differed.
What changes are included in this PR?
- Better error messages when a supplied schema hint mismatches the Parquet schema
- `Debug` impls for the reader builders so I could use `unwrap_err`
Are there any user-facing changes?
Better errors, new `Debug` impls.