fix: coerce int96 resolution inside of list, struct, and map types #16058

Open · wants to merge 28 commits into main

Conversation

@mbutrovich (Contributor) commented May 15, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Stack-based DFS inspired by arrow-rs' Schema::normalize: https://github.com/apache/arrow-rs/blob/1f15130414bdfc01c8989ec95702655bf553c5c5/arrow-schema/src/schema.rs#L464

We don't have a max depth argument here because you either complete the process or you fail; anything else results in a schema that might yield unexpected data. I'm open to discussion on whether we should include the max depth logic with some sort of reasonable cutoff, at which point we would have to return an error.
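For illustration, here is a minimal sketch of the idea, written recursively rather than with an explicit stack for brevity. The function name coerce_int96_fields is invented for the example, and the real change also consults the Parquet schema so that only fields physically stored as int96 are rewritten:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, TimeUnit};

// Sketch only: rewrite Timestamp(Nanosecond) fields to a target resolution,
// descending through List, Struct, and Map. The PR itself does this with an
// explicit stack (DFS) and additionally filters on int96-backed columns.
fn coerce_int96_fields(field: &Field, target: TimeUnit) -> Field {
    let new_type = match field.data_type() {
        DataType::Timestamp(TimeUnit::Nanosecond, tz) => {
            DataType::Timestamp(target, tz.clone())
        }
        DataType::List(inner) => {
            DataType::List(Arc::new(coerce_int96_fields(inner, target)))
        }
        DataType::Struct(children) => DataType::Struct(
            children
                .iter()
                .map(|c| Arc::new(coerce_int96_fields(c, target)))
                .collect(),
        ),
        DataType::Map(entries, sorted) => {
            DataType::Map(Arc::new(coerce_int96_fields(entries, target)), *sorted)
        }
        other => other.clone(),
    };
    field.clone().with_data_type(new_type)
}
```

The stack-based version walks the same shape without recursion; arrow-rs' Schema::normalize takes an optional max depth at exactly that point, which is the cutoff question raised above.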

Are these changes tested?

New tests using Parquet schemas known to exercise the issue.

Are there any user-facing changes?

No.

@github-actions bot added the datasource (Changes to the datasource crate) label on May 15, 2025
@mbutrovich changed the title from "fix: coercing int96 does not work for nested types" to "fix: coercing int96 does not work for list and struct types" on May 15, 2025
@mbutrovich marked this pull request as ready for review on May 15, 2025 15:21

@mbutrovich (Contributor, Author) commented:

I'm going to punt on other nested types for this PR. If this approach is good, it should be straightforward to add other nested types.

@mbutrovich changed the title from "fix: coercing int96 does not work for list and struct types" to "fix: coerce int96 resolution inside of list and struct types" on May 15, 2025

@kazuyukitanimura (Contributor) left a comment:

Thanks, looks good. I will spend a little more time to make sure I understand this.

@alamb (Contributor) commented May 15, 2025

What seems strange to me about this PR is that DataFusion code is doing parquet-specific type coercion when it otherwise uses Arrow types.

What I would expect is that DataFusion says "I want the data from parquet back as List(TimestampMicros)" and then the parquet crate would handle the nested conversion in the parquet schema.

I am actually surprised it doesn't already work today, which is why I was hoping we could get an example parquet file to see what was going wrong.

@mbutrovich (Contributor, Author) commented May 15, 2025

What seems strange to me about this PR is that DataFusion code is doing parquet-specific type coercion when it otherwise uses Arrow types.

We only want to coerce Timestamp(nanos) that originate as int96, so we need to reference the underlying Parquet schema. We don't want to touch Timestamps that aren't int96, hence the test coerce_int96_to_resolution_with_mixed_timestamps. As I mention in the test, it's not actually clear to me that any system would ever write a Parquet file like that. For example, in Spark you set a config that determines how all timestamps are written. However, I don't think there's anything in the Parquet spec that prevents mixing timestamp types in a single file, so I'd rather be conservative in the logic.
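For concreteness, a sketch (helper name invented) of what referencing the underlying Parquet schema involves: the check bottoms out in the physical type recorded in the file's schema descriptor.

```rust
use parquet::basic::Type as PhysicalType;
use parquet::schema::types::SchemaDescriptor;

// Illustrative only: a timestamp leaf qualifies for coercion only when the
// file physically stores it as INT96, regardless of its logical Arrow type.
fn leaf_is_int96(descr: &SchemaDescriptor, leaf_idx: usize) -> bool {
    descr.column(leaf_idx).physical_type() == PhysicalType::INT96
}
```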

What I would expect is that DataFusion says "I want the data from parquet back as List(TimestampMicros)" and then the parquet crate would handle the nested conversion in the parquet schema.

Maybe I'm misunderstanding, but that's essentially what's happening here. The way to request different types from the Parquet crate is to provide an Arrow schema with the desired types. This function builds that schema, but only modifies fields that are Timestamp(nanos) stored in the file as int96. This is similar to the coercion functions that convert Utf8 fields to Utf8View.
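As a rough sketch of that mechanism (not this PR's code; the function name is made up), supplying the desired schema to the parquet crate looks like:

```rust
use std::{fs::File, sync::Arc};
use arrow_schema::Schema;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Hand the parquet crate an Arrow schema with the desired types (e.g. int96
// columns mapped to microsecond timestamps); it converts while decoding.
fn read_with_coerced_schema(file: File, desired: Arc<Schema>) -> parquet::errors::Result<()> {
    let options = ArrowReaderOptions::new().with_schema(desired);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    for batch in reader {
        let _batch = batch?; // each RecordBatch arrives with the coerced types
    }
    Ok(())
}
```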

I am actually surprised it doesn't already work today, which is why I was hoping we could get an example parquet file to see what was going wrong.

It is working that way at the moment, but the type transformation just isn't digging into nested types to find nested int96 fields.

@mbutrovich (Contributor, Author) commented May 15, 2025

Tested this PR with @andygrove's Comet branch for DF 48 and confirmed that we no longer need to set generateArray and generateStruct to false in CometFuzzTestSuite's "Parquet temporal types written as INT96" test.
https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala#L227

@alamb (Contributor) commented May 15, 2025

Maybe I'm misunderstanding, but that's essentially what's happening here. The way to request different types from the Parquet crate is to provide an Arrow schema with the desired types. This function builds that schema, but only modifies fields that are Timestamp(nanos) stored in the file as int96. This is similar to the coercion functions that convert Utf8 fields to Utf8View.

What I am reacting to is the fact that DataFusion code is directly manipulating the Parquet schema classes / structs, rather than, for example, setting some flag in the parquet read options and letting code in the parquet crate do it.

I can see your point, however, about how this transformation needs to have access to the parquet schema 🤔

@mbutrovich changed the title from "fix: coerce int96 resolution inside of list and struct types" to "fix: coerce int96 resolution inside of list, struct, and map types" on May 15, 2025
@github-actions bot added the core (Core DataFusion crate) label on May 16, 2025

@andygrove (Member) commented:

It is working that way at the moment, but the type transformation just isn't digging into nested types to find nested int96 fields.

My understanding is that this PR is extending DataFusion's existing Parquet INT96 coercion to be recursive rather than only looking at the top-level types. It doesn't seem to be a change in overall approach.

Changes LGTM.

@andygrove (Member) left a comment:

Thanks @mbutrovich!

@andygrove (Member) commented:

@alamb I'd like to go ahead and merge this one if there are no objections

Labels: core (Core DataFusion crate), datasource (Changes to the datasource crate)

Successfully merging this pull request may close these issues:

Parquet: coerce_int96 does not work for int96 in nested types