Improve error messages if schema hint mismatches with parquet schema #7481
base: main
Conversation
```diff
@@ -1073,8 +1073,8 @@ mod tests {
     let a = Int64Array::from(vec![1, 2, 3, 4, 5]);

-    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)]);
-    assert!(batch.is_err());
+    let err = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)]).unwrap_err();
```
Drive-by cleanup.
```rust
pub struct RowFilter {
    /// A list of [`ArrowPredicate`]
    pub(crate) predicates: Vec<Box<dyn ArrowPredicate>>,
}

impl Debug for RowFilter {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(f, "RowFilter {{ {} predicates: }}", self.predicates.len())
```
`ArrowPredicate` doesn't implement `Debug`, so the impl can't be derived and reports only the predicate count.
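For context, a self-contained sketch of the pattern (the empty trait stands in for the real `ArrowPredicate`): deriving would require `Box<dyn ArrowPredicate>: Debug`, so the manual impl summarizes instead.

```rust
use std::fmt::{self, Debug, Formatter};

// Stand-in for the real trait, which has no `Debug` supertrait.
trait ArrowPredicate {}

struct RowFilter {
    predicates: Vec<Box<dyn ArrowPredicate>>,
}

// `#[derive(Debug)]` would fail because `Box<dyn ArrowPredicate>` is not
// `Debug`, so we print the predicate count instead of each predicate.
impl Debug for RowFilter {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        write!(f, "RowFilter {{ {} predicates: }}", self.predicates.len())
    }
}

fn main() {
    let filter = RowFilter { predicates: Vec::new() };
    println!("{filter:?}"); // RowFilter { 0 predicates: }
}
```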
```rust
            schema: supplied_schema,
            fields: field_levels.levels.map(Arc::new),
        })
        let mut errors = Vec::new();
```
Improving this message is the point of the PR.
I also relaxed the check slightly, so this will now allow fields to differ in metadata where previously that would return an error. There is no test coverage for mismatched schemas.
FYI @paleolimbot in case you have any wisdom to share here
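The shape of the new check, as a minimal sketch (not the PR's exact code; `check_data_types` is an invented name): mismatches are collected into a `Vec` and joined into a single message, rather than failing on the first difference.

```rust
use arrow_schema::{ArrowError, Field};

// Illustrative helper: collect every mismatch rather than failing on the
// first, then report them all in one error.
fn check_data_types(requested: &[Field], found: &[Field]) -> Result<(), ArrowError> {
    let mut errors = Vec::new();
    for (req, fnd) in requested.iter().zip(found.iter()) {
        if req.data_type() != fnd.data_type() {
            errors.push(format!(
                "data type mismatch for field {}: requested {} but found {}",
                req.name(),
                req.data_type(),
                fnd.data_type()
            ));
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(ArrowError::SchemaError(format!(
            "Incompatible supplied Arrow schema: {}",
            errors.join(", ")
        )))
    }
}
```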
Hmm... this would mean that extension types can be cast implicitly to their storage (or perhaps the opposite, depending on which field metadata takes precedence). It is probably safer to fail, but it's not the end of the world, because those errors will show up later (an error matching a signature if the extension metadata is dropped, or an error parsing bytes if unexpected content was given an extension type by accident). A true "user defined type" solution for DataFusion would be a place to handle this properly in some future (`field_common(field_a, field_b) -> Field`, `field_cast(array, array_field, common_field) -> ArrayRef`, or something).
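As signatures, those hypothetical helpers might look roughly like this (purely illustrative; neither function exists in arrow-rs or DataFusion today):

```rust
use arrow_array::ArrayRef;
use arrow_schema::{ArrowError, Field};

/// Hypothetical: compute the field two inputs can agree on, reconciling
/// extension metadata instead of silently dropping one side.
fn field_common(field_a: &Field, field_b: &Field) -> Result<Field, ArrowError> {
    // would resolve data type, nullability, and metadata conflicts
    unimplemented!()
}

/// Hypothetical: cast an array described by `array_field` to `common_field`;
/// storage <-> extension conversions would live here.
fn field_cast(
    array: ArrayRef,
    array_field: &Field,
    common_field: &Field,
) -> Result<ArrayRef, ArrowError> {
    unimplemented!()
}
```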
I think relaxing the check means that a user could supply the reader a schema that had metadata that was not present in the file, and the reader will then read RecordBatches that have that metadata.
I agree `field_cast` is the right longer-term thing to do in DataFusion.
In arrow-rs I think that field "casting" is happening during reading of parquet.
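For concreteness, this is roughly how a schema hint reaches the reader and where the improved message surfaces (a sketch, assuming the `parquet` crate's `arrow` feature; the printed text is the new wording from this PR):

```rust
use std::sync::Arc;

use arrow_array::{Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::arrow::ArrowWriter;

fn main() {
    // Write an in-memory Parquet file with a single Int64 column.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )
    .unwrap();
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Supply a schema hint requesting Int32 instead: the builder rejects it.
    let hint = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let options = ArrowReaderOptions::new().with_schema(hint);
    let err = ParquetRecordBatchReaderBuilder::try_new_with_options(Bytes::from(buf), options)
        .unwrap_err();
    println!("{err}"); // e.g. "... data type mismatch for field a: requested Int32 but found Int64"
}
```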
Probably the only correctness issue would be if the supplied schema had conflicting extension metadata (e.g., `unit: m` vs `unit: cm`). I am not sure that the current Parquet reader ever produces extension metadata (does it read the `ARROW:i_forget_the_exact_name` key and deserialize the schema?), so perhaps not an issue as long as somebody remembers this when it does.
I can perhaps put the metadata check back in, and we can relax it when necessary.
I was mostly being lazy to avoid writing a test for it.
```diff
@@ -3462,7 +3497,7 @@ mod tests {
             Field::new("col2_valid", ArrowDataType::Int32, false),
             Field::new("col3_invalid", ArrowDataType::Int32, false),
         ])),
-        "Arrow: incompatible arrow schema, the following fields could not be cast: [col1_invalid, col3_invalid]",
+        "Arrow: Incompatible supplied Arrow schema: data type mismatch for field col1_invalid: requested Int32 but found Int64, data type mismatch for field col3_invalid: requested Int32 but found Int64",
```
This is a pretty good example of the before/after error messages. I would feel much better trying to debug the new message than the old one.
I'm still new to arrow-rs, but I took a look through for anything out of place and didn't find any. The new error messages seem much better!
Schema metadata is probably not in scope here, but that is also occasionally different when merging things from multiple files. I believe Arrow C++ Datasets will just give you the metadata blindly from the first one (for better or worse).
```rust
                field2.data_type()
            ));
        }
        if field1.is_nullable() != field2.is_nullable() {
```
Can a non-nullable field be cast to a nullable one?
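That direction does seem safe in principle: a field the file declares non-nullable can always be read as nullable, while the reverse cannot be guaranteed. A one-directional check might look like this (illustrative only, not what the PR does):

```rust
use arrow_schema::Field;

// A non-nullable file field can satisfy a nullable request, since allowing
// nulls never invalidates existing values; the reverse is unsound.
fn nullability_compatible(requested: &Field, found_in_file: &Field) -> bool {
    requested.is_nullable() || !found_in_file.is_nullable()
}
```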
Thank you for your comments @paleolimbot
Which issue does this PR close?
Rationale for this change
Per #7479 (comment), the error messages are pretty bad: they tell you which fields were mismatched but not how they differed.
What changes are included in this PR?
- Better error messages when a supplied schema hint mismatches the Parquet schema
- `Debug` impls for the reader builders so I could use `unwrap_err`
Are there any user-facing changes?
Better errors, new `Debug` impls.