fix: coerce int96 resolution inside of list, struct, and map types #16058
base: main
Conversation
I'm going to punt on other nested types for this PR. If this approach is good, it should be straightforward to add other nested types.
Thanks, looks good, I will spend a little more time to make sure I understand this.
What seems strange to me about this PR is that DataFusion code is doing parquet-specific type coercion when it otherwise uses the Arrow types. What I would expect is that DataFusion says "I want the data from parquet back as …". I am actually surprised it doesn't already work today, which is why I was hoping we could get an example parquet file to see what was going wrong.
We only want to coerce Timestamp(nanos) fields that originate as int96, so we need to reference the underlying Parquet schema. We don't want to touch Timestamps that aren't int96, hence the test.
Maybe I'm misunderstanding, but that's essentially what's happening here. The API for requesting a different type from the Parquet crate is to provide an Arrow schema with the desired types. This function builds that schema, but only modifies fields that are Timestamp(nanos) and are stored in the file as int96. This is similar to the coercion functions that convert Utf8 fields to Utf8View.
It is working that way at the moment, but the type transformation just isn't digging into nested types to find nested int96 fields.
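To illustrate the shape of that API, here is a minimal sketch of the kind of Arrow-schema mapping described above. This is not the PR's actual code: the `is_int96` predicate is a hypothetical stand-in for however the real function consults the underlying Parquet schema, and `target` is an illustrative parameter for the desired resolution.

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema, TimeUnit};

/// Sketch: rewrite top-level Timestamp(Nanosecond) fields to the `target`
/// resolution, but only when the Parquet column is physically INT96
/// (signalled here by the `is_int96` predicate, a stand-in for a real
/// lookup against the file's Parquet schema).
fn coerce_int96_fields(
    schema: &Schema,
    is_int96: impl Fn(&str) -> bool,
    target: TimeUnit,
) -> Schema {
    let fields: Vec<Arc<Field>> = schema
        .fields()
        .iter()
        .map(|f| match f.data_type() {
            DataType::Timestamp(TimeUnit::Nanosecond, tz) if is_int96(f.name()) => {
                // Preserve the field's name, nullability, and metadata;
                // swap only the data type.
                Arc::new(
                    f.as_ref()
                        .clone()
                        .with_data_type(DataType::Timestamp(target, tz.clone())),
                )
            }
            _ => Arc::clone(f),
        })
        .collect();
    Schema::new(fields)
}
```

A mapping like this only sees top-level fields, which is exactly the limitation the PR addresses: the transformation must also descend into list, struct, and map types to find nested int96 columns.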
Tested this PR with @andygrove's Comet branch for DF 48 and confirmed that we no longer need to set …
What I am reacting to is the fact that DataFusion code is directly manipulating the Parquet schema classes / structs rather than, for example, setting some flag in the parquet read options and letting code in the parquet crate do it. I can see your point, however, about how this transformation needs to have access to the parquet schema 🤔
My understanding is that this PR is extending DataFusion's existing Parquet INT96 coercion to be recursive rather than only looking at the top-level types. It doesn't seem to be a change in overall approach. Changes LGTM.
Thanks @mbutrovich!
@alamb I'd like to go ahead and merge this one if there are no objections.
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Stack-based DFS inspired by arrow-rs' `Schema::normalize`: https://github.com/apache/arrow-rs/blob/1f15130414bdfc01c8989ec95702655bf553c5c5/arrow-schema/src/schema.rs#L464

We don't have a max depth argument here because you either complete the process or you fail; anything else results in a schema that might yield unexpected data. I'm open to discussion on whether we should include the max-depth logic with some reasonable cutoff, at which point we would have to return an error.
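As a rough illustration of that traversal, here is a minimal sketch of a stack-based DFS over nested Arrow fields. It assumes arrow-schema types and is not the PR's actual implementation; the `on_field` callback is a hypothetical hook for whatever per-field work (such as int96 detection) the caller wants to do.

```rust
use arrow_schema::{DataType, FieldRef};

/// Sketch: iterative (stack-based) DFS that visits every field nested
/// inside lists, structs, and maps. Because the traversal is driven by an
/// explicit stack rather than recursion, it runs to completion on any
/// finite schema without needing a max-depth argument.
fn visit_nested(root: &FieldRef, on_field: &mut impl FnMut(&FieldRef)) {
    let mut stack: Vec<&FieldRef> = vec![root];
    while let Some(field) = stack.pop() {
        on_field(field);
        match field.data_type() {
            // List and map types wrap a single child field.
            DataType::List(child)
            | DataType::LargeList(child)
            | DataType::Map(child, _) => stack.push(child),
            // Structs contribute all of their children.
            DataType::Struct(children) => stack.extend(children.iter()),
            _ => {}
        }
    }
}
```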
Are these changes tested?
New tests using Parquet schemas known to exercise the issue.
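For example, a Parquet schema exercising an int96 timestamp nested inside a list could be parsed like this in a test; the schema text and names below are illustrative, not the PR's exact fixtures.

```rust
use parquet::schema::parser::parse_message_type;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical schema: an int96 timestamp nested one level deep
    // inside a LIST, which top-level-only coercion would miss.
    let message = "
        message test_schema {
            optional group events (LIST) {
                repeated group list {
                    optional int96 element;
                }
            }
        }
    ";
    let parquet_schema = parse_message_type(message)?;
    assert_eq!(parquet_schema.name(), "test_schema");
    Ok(())
}
```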
Are there any user-facing changes?
No.