feat: Iceberg scan based serializing FileScanTasks to iceberg-rust #2528

mbutrovich · 2025-10-06T02:37:21Z

This is mostly for discussion at the moment. I presented this effort at the 10/9/25 Iceberg-Rust community call, and the slides are here.

Rationale for this change

I was inspired by @RussellSpitzer's recent talk and wanted to revisit the abstraction layer at which Comet integrates with Iceberg. We have the iceberg_compat codepath for Iceberg integration, but this requires code changes in Iceberg Java to integrate with Parquet reader instantiation. Instead, this prototype works at the FileScanTask layer after planning. This prototype starts us toward fully-native Iceberg scans to match our Parquet logic with native_datafusion scans without any changes in upstream Iceberg Java code.

What changes are included in this PR?

New CometIcebergNativeScanExec node on the Scala side.
Use reflection to extract scan properties, mostly FileScanTasks, and serialize them to native code.
New IcebergScanExec on native side that uses FileScanTasks to perform reads in iceberg-rust.
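To make the JVM-to-native handoff concrete, here is a minimal Rust sketch of decoding serialized scan tasks on the native side. Everything in it is an assumption for illustration: `ScanTaskDescriptor`, its three fields, and the hand-rolled length-prefixed byte layout are hypothetical stand-ins, not the wire format or types this PR uses, and a real FileScanTask carries more than this (schema, projection, delete files, and so on).

```rust
/// Hypothetical, simplified stand-in for one serialized FileScanTask.
/// The field set and byte layout are illustrative assumptions only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScanTaskDescriptor {
    pub data_file_path: String,
    pub start: u64,
    pub length: u64,
}

impl ScanTaskDescriptor {
    /// Appends this task to `out` as:
    /// u32 path length (LE) | path bytes (UTF-8) | u64 start (LE) | u64 length (LE).
    pub fn encode(&self, out: &mut Vec<u8>) {
        let path = self.data_file_path.as_bytes();
        out.extend_from_slice(&(path.len() as u32).to_le_bytes());
        out.extend_from_slice(path);
        out.extend_from_slice(&self.start.to_le_bytes());
        out.extend_from_slice(&self.length.to_le_bytes());
    }

    /// Decodes one task from the front of `buf` and advances `buf` past it.
    /// Returns `None` if the buffer is truncated or the path is not valid UTF-8.
    pub fn decode(buf: &mut &[u8]) -> Option<Self> {
        let mut len_raw = [0u8; 4];
        len_raw.copy_from_slice(take(buf, 4)?);
        let path_len = u32::from_le_bytes(len_raw) as usize;
        let path = String::from_utf8(take(buf, path_len)?.to_vec()).ok()?;
        let start = take_u64(buf)?;
        let length = take_u64(buf)?;
        Some(Self { data_file_path: path, start, length })
    }
}

/// Splits `n` bytes off the front of the cursor, or returns `None` if too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return None;
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Some(head)
}

/// Reads a little-endian u64 from the front of the cursor.
fn take_u64(buf: &mut &[u8]) -> Option<u64> {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(take(buf, 8)?);
    Some(u64::from_le_bytes(raw))
}
```

A native operator in this style would call `decode` in a loop until the cursor is empty and then pass each task to the reader. A schema-driven format would be the more robust choice in practice; the manual layout just keeps the sketch dependency-free.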

How are these changes tested?

New CometIcebergNativeSuite.

Benefits over `iceberg_compat`?

No upstream code changes needed in Iceberg Java, no references to Comet needed in Iceberg anymore.
Better parallelism for file reading, more similar to native_datafusion.
No separate DataFusion runtime: these scans run in the same context as other operators (unlike iceberg_compat).
Better testing for iceberg-rust. I think I already found a shortcoming with row group pruning logic.
Tested with Iceberg 1.5, 1.7, 1.10.

Current Limitations/Concerns?

I lied about no upstream changes. I need one line changed in iceberg-rust and will open a PR there to make an API public. Currently this PR relies on my fork of iceberg-rust.
Need to try running Iceberg Java tests with this. I need to look at our current pipelines, since in theory we don’t want to apply the diff for iceberg_compat to Iceberg.
Need to explore/validate OpenDAL support for credential providers.
We'd need to try to keep iceberg-rust in sync with Comet's DataFusion dependency. I also had to bump my iceberg-rust fork to DataFusion 50.
We've already entangled Comet and Iceberg Java code, what would the deprecation of that code look like?
iceberg-rust uses RecordBatchTransformer instead of SchemaAdapter/PhysicalExprAdapter. Need to understand the compatibility gap there.
We don't have access to ArrowReaderOptions yet, which is needed for proper Spark-compatible INT96 handling. See https://github.com/apache/iceberg-rust/blob/dc349284a4204c1a56af47fb3177ace6f9e899a0/crates/iceberg/src/arrow/reader.rs#L1384.
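For context on the INT96 item, the sketch below shows the arithmetic behind reading a Parquet INT96 timestamp as microseconds, the resolution Spark uses for timestamps. It is a standalone illustration, not code from this PR or from iceberg-rust; the function name `int96_to_micros` is hypothetical. It assumes the usual INT96 layout: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Converts a raw 12-byte INT96 timestamp to microseconds since the Unix epoch.
/// Hypothetical helper for illustration; returns `None` on i64 overflow.
pub fn int96_to_micros(raw: [u8; 12]) -> Option<i64> {
    // First 8 bytes: nanoseconds elapsed within the day, little-endian.
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&raw[..8]);
    let nanos_of_day = i64::from_le_bytes(nanos);

    // Last 4 bytes: Julian day number, little-endian.
    let mut day = [0u8; 4];
    day.copy_from_slice(&raw[8..]);
    let julian_day = i64::from(i32::from_le_bytes(day));

    // Sub-microsecond precision is truncated by the integer division.
    (julian_day - JULIAN_DAY_OF_UNIX_EPOCH)
        .checked_mul(MICROS_PER_DAY)?
        .checked_add(nanos_of_day / 1_000)
}
```

The resolution matters because an i64 of nanoseconds only covers roughly the years 1677 to 2262, while an i64 of microseconds covers a far wider range, so a reader that decodes INT96 as nanoseconds can overflow on values that are valid as microseconds.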

codecov-commenter · 2025-10-06T02:54:32Z

Codecov Report

❌ Patch coverage is 75.38803% with 111 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.76%. Comparing base (f09f8af) to head (e19e201).
⚠️ Report is 632 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../scala/org/apache/comet/serde/QueryPlanSerde.scala | 73.25% | 67 Missing and 21 partials ⚠️ |
| ...e/spark/sql/comet/CometIcebergNativeScanExec.scala | 85.10% | 3 Missing and 11 partials ⚠️ |
| ...n/scala/org/apache/comet/rules/CometExecRule.scala | 53.84% | 3 Missing and 3 partials ⚠️ |
| ...n/scala/org/apache/comet/rules/CometScanRule.scala | 60.00% | 0 Missing and 2 partials ⚠️ |
| ...la/org/apache/comet/objectstore/NativeConfig.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2528      +/-   ##
============================================
+ Coverage     56.12%   59.76%   +3.63%     
- Complexity      976     1461     +485     
============================================
  Files           119      148      +29     
  Lines         11743    14175    +2432     
  Branches       2251     2438     +187     
============================================
+ Hits           6591     8471    +1880     
- Misses         4012     4444     +432     
- Partials       1140     1260     +120

comphead · 2025-10-06T15:22:35Z

It is promising!

## Which issue does this PR close?

- Part of #1749.

## What changes are included in this PR?

- Change `ArrowReaderBuilder::new` to be `pub` instead of `pub(crate)`.

## Are these changes tested?

- No new tests for this. Currently being used in DataFusion Comet: apache/datafusion-comet#2528

mbutrovich · 2025-10-22T12:50:51Z

Chipping away at Iceberg tests, running via:

  ENABLE_COMET=true ./gradlew -DsparkVersions=3.5 -DscalaVersion=2.13 -DflinkVersions= -DkafkaVersions= \
    :iceberg-spark:iceberg-spark-3.5_2.13:test \
    -Pquick=true -x javadoc

Yesterday:

Today with apache/iceberg-rust#1777 and other fixes:

mbutrovich · 2025-10-22T21:33:55Z

Today's progress:

Just one test suite remains to tackle.

mbutrovich added 3 commits October 5, 2025 21:53

CometNativeIcebergScan with iceberg-rust using FileScanTasks.

cded0ad

Clean up tests a little.

4f3004b

Remove old comment.

4afec43

mbutrovich added 6 commits October 6, 2025 06:58

Fix machete and missing suite CI failures.

fc97ce9

Fix unused variables.

cca4911

Spark 4.0 needs Iceberg 1.10, let's see if that works in CI.

93f466d

Remove errant println.

970b692

Remove old path() code path.

c44973b

Update old comment.

0f83fd4

mbutrovich added 2 commits October 6, 2025 11:49

Iceberg 1.5.x compatible reflection. Use 1.5.2 for Spark 3.4 and 3.5.

6cbbd09

Fix scalastyle issues.

6966a12

mbutrovich changed the title ~~feat: Iceberg scan based serializing FileScanTasks to iceberg-rust~~ feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025

mbutrovich force-pushed the iceberg-rust branch from 227332c to 6966a12 Compare October 6, 2025 20:03

mbutrovich changed the title ~~feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust~~ feat: Iceberg scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025

mbutrovich added 7 commits October 7, 2025 13:03

Merge branch 'main' into iceberg-rust

1153d71

# Conflicts: # native/Cargo.lock # spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

Remove unused import.

a0f4d63

Clean up docs a bit.

a9cebfd

Refactor and cleanup.

6b2175a

Refactor and cleanup.

3618407

Add IcebergFileStream based on DataFusion, add benchmark. Bump the Iceberg version back to 1.8.1 after hitting known segfaults with old versions.

8091a81

Fix CometReadBenchmark.

880599e

This was referenced Oct 15, 2025

feat(reader): Make ArrowReaderBuilder::new public apache/iceberg-rust#1748

Merged

ArrowReader enhancements for Apache DataFusion Comet apache/iceberg-rust#1749

Open

mbutrovich added 4 commits October 16, 2025 16:04

Merge branch 'main' into iceberg-rust

5127e1c

# Conflicts: # docs/source/user-guide/latest/configs.md # native/Cargo.lock # native/Cargo.toml # native/core/Cargo.toml

Fixes after bringing in upstream/main.

878c971

Basic complex type support.

e66799e

CometFuzzIceberg stuff.

4f2f3b8

mbutrovich mentioned this pull request Oct 21, 2025

tests: FuzzDataGenerator instead of Parquet-specific generator #2616

Merged

mbutrovich added 5 commits October 21, 2025 11:24

Merge branch 'main' into iceberg-rust

71df65c

# Conflicts: # native/Cargo.lock

format and fix conflicts.

3371cc1

Basic S3 test and properties support

1c40d43

Fix NPE.

40c9a07

Merge branch 'main' into iceberg-rust

19797f3

# Conflicts: # spark/src/main/scala/org/apache/comet/testing/FuzzDataGenerator.scala

mbutrovich mentioned this pull request Oct 22, 2025

feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) apache/iceberg-rust#1777

Open

mbutrovich added 2 commits October 21, 2025 22:18

Support migrated tables via apache/iceberg-rust#1777.

236b339

Update df50 commit based on field ID fix.

ce367cc

mbutrovich added 2 commits October 22, 2025 09:38

Bump df50 commit.

bd6c609

Support hive-partitioned Parquet files migrated to Iceberg tables with default values.

33fa891

mbutrovich mentioned this pull request Oct 22, 2025

fix(reader): Support both position and equality delete files on the same FileScanTask apache/iceberg-rust#1778

Open

mbutrovich added 3 commits October 22, 2025 16:53

Bump df50.

ca13cc6

Merge branch 'main' into iceberg-rust

b4e829f

Fix after merging main.

e19e201

3 participants
