Implement schema adapter support for FileSource and add integration tests #16148
Conversation
Thank you @kosiew -- I had a few comments -- mostly about the tests.
Please let me know what you think
#[cfg(feature = "csv")]
use datafusion_datasource_csv::CsvSource;

/// A schema adapter factory that transforms column names to uppercase
this is very cool
}

#[tokio::test]
async fn test_multi_source_schema_adapter_reuse() -> Result<()> {
it is not entirely clear to me what this test is verifying
The test checks that:
- The UppercaseAdapterFactory can be applied to different source types (ArrowSource, ParquetSource, CsvSource)
- After applying the factory, each source correctly reports having a schema adapter factory
- The factory reference is properly maintained across different source instances
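The reuse the test verifies can be sketched with a few stand-in types. This is an illustrative sketch, not DataFusion's actual API: the trait and source definitions below are trimmed stand-ins, and `UppercaseAdapterFactory` here is only a marker type.

```rust
use std::sync::Arc;

// Minimal stand-ins for the traits under discussion (illustrative only).
trait SchemaAdapterFactory: Send + Sync {}

struct UppercaseAdapterFactory;
impl SchemaAdapterFactory for UppercaseAdapterFactory {}

// Two toy "sources" that each store an optional factory, as the real
// ParquetSource/CsvSource do after this PR.
struct ParquetSource {
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}
struct CsvSource {
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl ParquetSource {
    fn with_schema_adapter_factory(mut self, f: Arc<dyn SchemaAdapterFactory>) -> Self {
        self.schema_adapter_factory = Some(f);
        self
    }
}
impl CsvSource {
    fn with_schema_adapter_factory(mut self, f: Arc<dyn SchemaAdapterFactory>) -> Self {
        self.schema_adapter_factory = Some(f);
        self
    }
}

// Returns true when both sources report the very same factory instance.
fn factory_is_shared() -> bool {
    let factory: Arc<dyn SchemaAdapterFactory> = Arc::new(UppercaseAdapterFactory);
    let parquet = ParquetSource { schema_adapter_factory: None }
        .with_schema_adapter_factory(Arc::clone(&factory));
    let csv = CsvSource { schema_adapter_factory: None }
        .with_schema_adapter_factory(Arc::clone(&factory));
    Arc::ptr_eq(parquet.schema_adapter_factory.as_ref().unwrap(), &factory)
        && Arc::ptr_eq(csv.schema_adapter_factory.as_ref().unwrap(), &factory)
}

fn main() {
    // One Arc-ed factory is attached to both source types and remains shared.
    assert!(factory_is_shared());
}
```

`Arc::ptr_eq` is what makes "the factory reference is properly maintained" checkable: it compares the actual shared allocation rather than factory behavior.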
@@ -0,0 +1,208 @@
// Licensed to the Apache Software Foundation (ASF) under one
I am not sure what additional coverage datafusion/core/tests/schema_adapter_factory_tests.rs adds on top of the integration test in datafusion/core/tests/integration_tests/schema_adapter_integration_tests.rs.
Also, if it does add additional coverage, can you please include it as part of the other core_integration tests?
Each new file in datafusion/core/tests results in a new binary, and each binary takes 10s of MB.
For example, I built this to check, and the binary is 57 MB on my machine (it is even more with the normal dev profile):
$ cargo test --profile=ci --test schema_adapter_factory_tests
...
Running tests/schema_adapter_factory_tests.rs (target/ci/deps/schema_adapter_factory_tests-b2997559eccc9857)
...
$ du -h target/ci/deps/schema_adapter_factory_tests-b2997559eccc9857
57M target/ci/deps/schema_adapter_factory_tests-b2997559eccc9857
Field::new("extra", DataType::Int64, true),
]);

// Create a TestSource
What is this test covering? I don't understand what additional coverage it is adding
test_schema_adapter validates:
- Creating and attaching a schema adapter factory to a file source
- Creating a schema adapter using the factory
- The schema adapter's ability to map column indices between a table schema and a file schema
- The schema adapter's ability to create a projection that selects only the columns from the file schema that are present in the table schema
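The last two behaviors (index mapping and projection) can be sketched without any DataFusion types. This is an illustrative re-implementation of the idea, not the actual SchemaAdapter API; the schemas are modeled as plain column-name slices and `map_and_project` is a hypothetical helper.

```rust
// Sketch of the two behaviors the test exercises: mapping each table column
// to its position in the file schema, and projecting only the file columns
// that also appear in the table schema.
fn map_and_project(table: &[&str], file: &[&str]) -> (Vec<Option<usize>>, Vec<usize>) {
    // For each table column, find its index in the file schema (None = missing).
    let mapping = table
        .iter()
        .map(|col| file.iter().position(|f| f == col))
        .collect();
    // Keep only the file-column indices whose names exist in the table schema.
    let projection = file
        .iter()
        .enumerate()
        .filter(|(_, f)| table.contains(f))
        .map(|(i, _)| i)
        .collect();
    (mapping, projection)
}

fn main() {
    // Table has (a, b, c); the file has (b, c) plus an "extra" column.
    let (mapping, projection) = map_and_project(&["a", "b", "c"], &["b", "c", "extra"]);
    assert_eq!(mapping, vec![None, Some(0), Some(1)]);
    assert_eq!(projection, vec![0, 1]);
}
```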
}

#[test]
fn test_test_source_schema_adapter_factory() {
same comment here -- I am not sure what extra coverage this is adding and it adds a new binary
Good catch!
The tests in this file are covered elsewhere already.
Will remove this file.
// Implementation of apply_schema_adapter for testing purposes
// This mimics the private function in the datafusion-parquet crate
fn apply_schema_adapter(
If we left the method on ParquetSource it wouldn't have to be replicated 🤔
I like your suggestion!
@@ -81,6 +83,8 @@ impl FileSource for MockSource {
fn file_type(&self) -> &str {
    "mock"
}

impl_schema_adapter_methods!();
I would personally suggest not adding this macro, as I think people will likely just have their IDE fill it out or let the compiler tell them what to do.
What I suggest we do is change the with_schema_adapter_factory() method's return signature to return Result, and provide default implementations in the trait that return Error(NotYetImplemented).
That way users won't need to change their implementations if they don't use schema adapters at all.
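The suggested shape might look roughly like the following. This is a hedged sketch, not the real trait: `SchemaAdapterFactory`, `FileSource`, and the `NotImplemented` error type are all simplified stand-ins for whatever DataFusion would actually use.

```rust
use std::sync::Arc;

// Stand-ins for the real DataFusion types (illustrative only).
trait SchemaAdapterFactory: Send + Sync {}
struct UppercaseAdapterFactory;
impl SchemaAdapterFactory for UppercaseAdapterFactory {}

// Placeholder for a "not yet implemented" error variant.
#[derive(Debug)]
struct NotImplemented;

// The proposal: fallible defaults on the trait, so implementors that never
// use schema adapters need not override anything and still compile.
trait FileSource {
    fn with_schema_adapter_factory(
        &self,
        _factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Result<Arc<dyn FileSource>, NotImplemented> {
        Err(NotImplemented)
    }
    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>> {
        None
    }
}

// An implementation that opts out: no extra code required.
struct MockSource;
impl FileSource for MockSource {}

fn main() {
    // Attaching a factory to a source that doesn't support adapters errors
    // at runtime instead of failing to compile.
    let result = MockSource.with_schema_adapter_factory(Arc::new(UppercaseAdapterFactory));
    assert!(result.is_err());
}
```

The trade-off discussed below follows directly from this shape: the "not implemented" case moves from compile time to runtime.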
Hey, really appreciate the suggestion! Turning these two methods into trait defaults is tempting, but we run into some frustrating object-safety and cloning issues:
- Fallible signature (Result)
Switching to
fn with_schema_adapter_factory(...) -> Result<Arc<dyn FileSource>, _>
has drawbacks:
It forces every callsite—even the 99% that never “fail”—to handle a Result. That’s a lot of boilerplate up front.
Pushing the “not implemented” case to runtime means we only discover missing overrides via panics or errors in production, instead of compile-time feedback.
- Trait-object vs. Clone
A default like
fn with_schema_adapter_factory(
&self,
_factory: Arc<dyn SchemaAdapterFactory>,
) -> Arc<dyn FileSource> {
Arc::new(self.clone())
}
can’t compile because:
self is a &Self, so self.clone() gives you another &Self, producing an Arc<&Self>, and &Self isn’t a FileSource.
To make it work you’d need Self: Sized + Clone on the default—but then that method isn’t even available on dyn FileSource, defeating trait-object use.
So, I am leaning towards keeping the macro because:
- Object-safe clone: by generating Arc::new(Self { …, ..self.clone() }) inside each impl, the macro leverages the concrete type’s Clone impl without polluting the trait itself.
- Single maintenance point: if we ever tweak the method signature, we update the macro once and every impl site gets fixed automatically.
- Compile-time assurance: missing impl_schema_adapter_methods!() on a type immediately fails to compile, alerting the author they need to opt in.
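The object-safe clone pattern described above can be sketched as follows. This is an illustrative guess at roughly what the macro generates inside each concrete impl; the trait, struct, and field names are stand-ins, not the macro's actual expansion.

```rust
use std::sync::Arc;

// Stand-ins for the real types (illustrative only).
trait SchemaAdapterFactory: Send + Sync {}
struct UppercaseAdapterFactory;
impl SchemaAdapterFactory for UppercaseAdapterFactory {}

trait FileSource {
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn FileSource>;
    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>>;
}

#[derive(Clone, Default)]
struct MockSource {
    // Stand-in for whatever other state a real source carries.
    batch_size: usize,
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl FileSource for MockSource {
    // Roughly the macro-generated body: clone the concrete type and swap in
    // the factory. Clone is used inside the impl on the concrete Self type,
    // so the FileSource trait itself stays object-safe.
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn FileSource> {
        Arc::new(Self {
            schema_adapter_factory: Some(factory),
            ..self.clone()
        })
    }
    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>> {
        self.schema_adapter_factory.clone()
    }
}

fn main() {
    let source = MockSource::default();
    let adapted = source.with_schema_adapter_factory(Arc::new(UppercaseAdapterFactory));
    assert!(adapted.schema_adapter_factory().is_some());
}
```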
It forces every callsite—even the 99% that never “fail”—to handle a Result. That’s a lot of boilerplate up front.
In my opinion the boilerplate would be relatively minimal (likely it would require ? in a few places).
Pushing the “not implemented” case to runtime means we only discover missing overrides via panics or errors in production, instead of compile-time feedback.
This argument makes sense to me. However, another drawback is that after this PR, DataSource implementations are required to handle the schema adapter, rather than being allowed to return a runtime error if they don't.
I think that is probably fine, but in the future I can still see a use case for a fallible Result.
If we implement the new methods for all structs in DataFusion, won't users who upgrade get a compile error because of the missing methods, and thus be forced to make them a no-op or unimplemented!()? That seems reasonable to me.
compile error because of the missing methods and thus be forced to make them a no-op or unimplemented!()?
Yes, that is my understanding as well
This is what this PR does, right? Or are you suggesting a change?
It does it via macros, right? I'm basically saying that instead of providing a macro that implements the functions for you, I would force users to implement the functions and (if necessary) provide helpers they can call from within their implementation.
This would be my preference too -- the macro is a nice way to reduce the boilerplate (and @kosiew has documented it super well), but I think it then adds a bit more cognitive load, and it would be better to have a little more duplication to be explicit.
Thanks for the discussion on this.
What do you think about implementing the schema adapter support via an opt-in trait, to avoid breaking changes to the FileSource trait?
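One possible shape for such an opt-in trait is sketched below. This is purely illustrative: the trait name `SchemaAdapterSupport`, the `ParquetLikeSource` type, and all signatures are hypothetical, chosen only to show how the existing FileSource trait could stay untouched.

```rust
use std::sync::Arc;

// Stand-ins for the real types (illustrative only).
trait SchemaAdapterFactory: Send + Sync {}
struct UppercaseAdapterFactory;
impl SchemaAdapterFactory for UppercaseAdapterFactory {}

// The existing trait stays exactly as it is today.
trait FileSource {
    fn file_type(&self) -> &str;
}

// Hypothetical extension trait: only sources that want schema evolution
// implement it; every other FileSource keeps compiling unchanged.
trait SchemaAdapterSupport: FileSource {
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn SchemaAdapterSupport>;
    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>>;
}

#[derive(Clone)]
struct ParquetLikeSource {
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl FileSource for ParquetLikeSource {
    fn file_type(&self) -> &str {
        "parquet"
    }
}

impl SchemaAdapterSupport for ParquetLikeSource {
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn SchemaAdapterSupport> {
        Arc::new(Self {
            schema_adapter_factory: Some(factory),
        })
    }
    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>> {
        self.schema_adapter_factory.clone()
    }
}

fn main() {
    let source = ParquetLikeSource { schema_adapter_factory: None };
    let adapted = source.with_schema_adapter_factory(Arc::new(UppercaseAdapterFactory));
    // Supertrait methods remain available through the extension trait object.
    assert_eq!(adapted.file_type(), "parquet");
    assert!(adapted.schema_adapter_factory().is_some());
}
```

The cost of this design is that callers holding only an `Arc<dyn FileSource>` cannot reach the adapter methods without a downcast, which is likely part of why the PR went with methods on FileSource itself.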
Here is my proposal for how to handle this API:
I think it is fairly simple and follows the existing pattern in this codebase
fyi @adriangb
Thank you @kosiew -- I think this PR looks good enough to merge to me. Thank you for your patience and thoroughness
It would be great to avoid a new file datafusion/core/tests/test_adapter_updated.rs
if possible but we can do that as a follow on too
//! Macros for the datafusion-datasource crate

/// Helper macro to generate schema adapter methods for FileSource implementations
this is super well documented. I am still not a huge fan of adding this new macro as I think it makes implementing DataSources that much more complicated, but I can see the rationale for not adding a default on #16148 (comment)
So let's go with this approach and see how it goes
@@ -0,0 +1,214 @@
// Licensed to the Apache Software Foundation (ASF) under one
Due to:
- The previously mentioned reason that more targets result in longer compile times
- It will be harder to find tests for the same feature if they are in different files

I think we should remove this new file datafusion/core/tests/test_adapter_updated.rs and put the test in datafusion/core/tests/integration_tests/schema_adapter_integration_tests.rs instead.
However, we can do that as a follow-on PR too.
Thank you @alamb for the review and feedback.
Which issue does this PR close?
This is part of a series of PRs re-implementing #15295 to close #14657 by adding schema-evolution support in DataFusion.

Rationale for this change
To enable customizable schema evolution during file scans, we introduce a SchemaAdapterFactory hook into all FileSource implementations. This allows users to adapt column mappings and perform transformations (e.g., renaming, casting, adding defaults) without forking core scan logic.

What changes are included in this PR?
Core API additions
- with_schema_adapter_factory and schema_adapter_factory methods added to the FileSource trait
- impl_schema_adapter_methods!() macro to reduce boilerplate in each FileSource implementation
- as_file_source helper to convert concrete sources into Arc<dyn FileSource>

Datasource crate updates
- Updated FileSource implementations to store and honor an optional schema_adapter_factory

Testing
- Unit tests: schema_adapter_factory_tests.rs, test_adapter_updated.rs, test_source_adapter_tests.rs. These cover factory wiring, column index mapping, schema transformation logic, and source behavior.
- Integration tests: schema_adapter_integration_tests.rs, apply_schema_adapter_tests.rs. These validate adapter behavior in real-world scenarios such as scanning Parquet files.

Are these changes tested?
Yes. This PR includes comprehensive new tests.

Are there any user-facing changes?
Yes:
- New methods on the FileSource trait
- New impl_schema_adapter_methods!() macro for downstream implementors

These changes are additive and backward-compatible. Developers implementing custom FileSource types must either use the macro or provide the new methods to support schema adapters.