fix: coerce int96 resolution inside of list, struct, and map types #16058

Open · wants to merge 28 commits into main

Conversation

@mbutrovich (Contributor) commented May 15, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Stack-based DFS inspired by arrow-rs' Schema::normalize: https://github.com/apache/arrow-rs/blob/1f15130414bdfc01c8989ec95702655bf553c5c5/arrow-schema/src/schema.rs#L464

We don't have a max depth argument here because you either complete the process or you fail; anything else results in a schema that might yield unexpected data. I'm open to discussion on whether we should include the max depth logic with some sort of reasonable cutoff, at which point we would have to return an error.
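For illustration, here is a minimal sketch of the idea, written recursively rather than with an explicit stack for brevity. The function name coerce_int96_fields is invented for the example, and the real change also consults the Parquet schema so that only fields physically stored as int96 are rewritten:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, TimeUnit};

// Sketch only: rewrite Timestamp(Nanosecond) fields to a target resolution,
// descending through List, Struct, and Map. The PR itself does this with an
// explicit stack (DFS) and additionally filters on int96-backed columns.
fn coerce_int96_fields(field: &Field, target: TimeUnit) -> Field {
    let new_type = match field.data_type() {
        DataType::Timestamp(TimeUnit::Nanosecond, tz) => {
            DataType::Timestamp(target, tz.clone())
        }
        DataType::List(inner) => {
            DataType::List(Arc::new(coerce_int96_fields(inner, target)))
        }
        DataType::Struct(children) => DataType::Struct(
            children
                .iter()
                .map(|c| Arc::new(coerce_int96_fields(c, target)))
                .collect(),
        ),
        DataType::Map(entries, sorted) => {
            DataType::Map(Arc::new(coerce_int96_fields(entries, target)), *sorted)
        }
        other => other.clone(),
    };
    field.clone().with_data_type(new_type)
}
```

The stack-based version walks the same shape without recursion; arrow-rs' Schema::normalize takes an optional max depth at exactly that point, which is the cutoff question raised above.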

Are these changes tested?

New tests using Parquet schemas known to exercise the issue.

Are there any user-facing changes?

No.

@github-actions bot added the datasource (Changes to the datasource crate) label on May 15, 2025
@mbutrovich changed the title from "fix: coercing int96 does not work for nested types" to "fix: coercing int96 does not work for list and struct types" on May 15, 2025
@mbutrovich marked this pull request as ready for review on May 15, 2025 15:21

@mbutrovich (Contributor, Author) commented:

I'm going to punt on other nested types for this PR. If this approach is good, it should be straightforward to add other nested types.

@mbutrovich changed the title from "fix: coercing int96 does not work for list and struct types" to "fix: coerce int96 resolution inside of list and struct types" on May 15, 2025

@kazuyukitanimura (Contributor) left a comment:

Thanks, looks good. I will spend a little more time to make sure I understand this.

@alamb (Contributor) commented May 15, 2025

What seems strange to me about this PR is that DataFusion code is doing parquet-specific type coercion when it otherwise uses Arrow types.

What I would expect is that DataFusion says "I want the data from parquet back as List(TimestampMicros)" and then the parquet crate would handle the nested conversion in the parquet schema.

I am actually surprised it doesn't already work today, which is why I was hoping we could get an example parquet file to see what was going wrong.

@mbutrovich (Contributor, Author) commented May 15, 2025

What seems strange to me about this PR is that DataFusion code is doing parquet-specific type coercion when it otherwise uses Arrow types.

We only want to coerce Timestamp(nanos) that originate as int96, so we need to reference the underlying Parquet schema. We don't want to touch Timestamps that aren't int96, hence the test coerce_int96_to_resolution_with_mixed_timestamps. As I mention in the test, it's not actually clear to me that any system would ever write a Parquet file like that. For example, in Spark you set a config that determines how all timestamps are written. However, I don't think there's anything in the Parquet spec that prevents mixing timestamp types in a single file, so I'd rather be conservative in the logic.
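For concreteness, a sketch (helper name invented) of what referencing the underlying Parquet schema involves: the check bottoms out in the physical type recorded in the file's schema descriptor.

```rust
use parquet::basic::Type as PhysicalType;
use parquet::schema::types::SchemaDescriptor;

// Illustrative only: a timestamp leaf qualifies for coercion only when the
// file physically stores it as INT96, regardless of its logical Arrow type.
fn leaf_is_int96(descr: &SchemaDescriptor, leaf_idx: usize) -> bool {
    descr.column(leaf_idx).physical_type() == PhysicalType::INT96
}
```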

What I would expect is that DataFusion says "I want the data from parquet back as List(TimestampMicros)" and then the parquet crate would handle the nested conversion in the parquet schema.

Maybe I'm misunderstanding, but that's essentially what's happening here. The way to request different types from the Parquet crate is to provide an Arrow schema with the desired types. This function builds that schema, but only modifies fields that are Timestamp(nanos) stored in the file as int96. This is similar to the coercion functions that convert Utf8 fields to Utf8View.
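As a rough sketch of that mechanism (not this PR's code; the function name is made up), supplying the desired schema to the parquet crate looks like:

```rust
use std::{fs::File, sync::Arc};
use arrow_schema::Schema;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Hand the parquet crate an Arrow schema with the desired types (e.g. int96
// columns mapped to microsecond timestamps); it converts while decoding.
fn read_with_coerced_schema(file: File, desired: Arc<Schema>) -> parquet::errors::Result<()> {
    let options = ArrowReaderOptions::new().with_schema(desired);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    for batch in reader {
        let _batch = batch?; // each RecordBatch arrives with the coerced types
    }
    Ok(())
}
```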

I am actually surprised it doesn't already work today, which is why I was hoping we could get an example parquet file to see what was going wrong.

It is working that way at the moment, but the type transformation just isn't digging into nested types to find nested int96 fields.

@mbutrovich (Contributor, Author) commented May 15, 2025

Tested this PR with @andygrove's Comet branch for DF 48 and confirmed that we no longer need to set generateArray and generateStruct to false in CometFuzzTestSuite's "Parquet temporal types written as INT96" test.
https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala#L227

@alamb (Contributor) commented May 15, 2025

Maybe I'm misunderstanding, but that's essentially what's happening here. The way to request different types from the Parquet crate is to provide an Arrow schema with the desired types. This function builds that schema, but only modifies fields that are Timestamp(nanos) stored in the file as int96. This is similar to the coercion functions that convert Utf8 fields to Utf8View.

What I am reacting to is the fact that DataFusion code is directly manipulating the Parquet schema classes / structs, rather than, for example, setting some flag in the parquet read options and letting code in the parquet crate do it.

I can see your point, however, about how this transformation needs to have access to the parquet schema 🤔

@mbutrovich changed the title from "fix: coerce int96 resolution inside of list and struct types" to "fix: coerce int96 resolution inside of list, struct, and map types" on May 15, 2025
@github-actions bot added the core (Core DataFusion crate) label on May 16, 2025

@andygrove (Member) commented:

It is working that way at the moment, but the type transformation just isn't digging into nested types to find nested int96 fields.

My understanding is that this PR is extending DataFusion's existing Parquet INT96 coercion to be recursive rather than only looking at the top-level types. It doesn't seem to be a change in overall approach.

Changes LGTM.

@andygrove (Member) left a comment:

Thanks @mbutrovich!

@andygrove (Member) commented:

@alamb I'd like to go ahead and merge this one if there are no objections

Labels: core (Core DataFusion crate), datasource (Changes to the datasource crate)

Successfully merging this pull request may close these issues:

Parquet: coerce_int96 does not work for int96 in nested types