Implement schema adapter support for FileSource and add integration tests #16148
Status: Merged
Commits (41, all authored by kosiew):
- b23cd58 Implement schema adapter factory support for file sources
- cd51b5a Add schema adapter factory support to file sources
- 2233456 Add SchemaAdapterFactory import to file source module
- 60ff7e6 Add schema_adapter_factory field to JsonOpener and JsonSource structs
- 4d6f8f7 Add missing import for as_file_source in source.rs
- 011ce03 Fix formatting in ArrowSource implementation by removing extra newlines
- 7fde396 Add integration and unit tests for schema adapter factory functionality
- dc9b8ac fix tests
- 08b3ce0 Refactor adapt method signature and improve test assertions for schem…
- aef5dd3 Simplify constructor in TestSource by removing redundant function def…
- f964947 Remove redundant import of SchemaAdapterFactory in util.rs
- d8720f0 fix tests: refactor schema_adapter_factory methods in TestSource for …
- 652fbaf feat: add macro for schema adapter methods in FileSource implementation
- fbd8c99 feat: use macro implement schema adapter methods for various FileSour…
- 7e9f070 refactor: clean up unused schema adapter factory methods in ParquetSo…
- 4c23e82 feat: add macro for generating schema adapter methods in FileSource i…
- e91eb1b refactor: re-export impl_schema_adapter_methods from crate root
- 9416efb refactor: update macro usage and documentation for schema adapter met…
- 5fb40df refactor: clean up import statements in datasource module
- 413ebe1 refactor: reorganize and clean up import statements in util.rs
- a3fc370 Merge branch 'main' into file-source-merge
- f11134a Resolve merge conflict
- c6ff4d5 Export macro with local inner macros for improved encapsulation
- cb27246 fix clippy error
- 613d115 fix doc tests
- d2027f1 fix CI error
- 727032b Add metrics initialization to TestSource constructor
- 148148c Add comment for test_multi_source_schema_adapter_reuse
- d3b1680 reduce test files, move non-redundant tests, consolidate in one file
- 79a56f6 test_schema_adapter - add comments
- 55dc418 remove redundant tests
- e8f8df4 Refactor schema adapter application to use ParquetSource method directly
- 6154b2d Refactor apply_schema_adapter usage to call method directly on Parque…
- 208b1cc remove macro
- fd6dd78 Revert "remove macro"
- ee07b69 FileSource - provide default implementations for schema_adapter_facto…
- 16eb25d Revert "FileSource - provide default implementations for schema_adapt…
- f890e8d Remove unused import of SchemaAdapterFactory from file_format.rs
- 35036ec Merge branch 'main' into file-source-with-adapter-factory
- 999e0cd Refactor imports in apply_schema_adapter_tests.rs for improved readab…
- befc171 Merge branch 'main' into file-source-with-adapter-factory
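Taken together, these commits add two methods to the `FileSource` abstraction: a builder-style setter and an accessor for the schema adapter factory. As a rough sketch of the shape of that API, inferred from the commit messages and the way the integration tests below call it (the exact trait signatures in the PR may differ):

```rust
use std::sync::Arc;

use datafusion_datasource::file::FileSource;
use datafusion_datasource::schema_adapter::SchemaAdapterFactory;

// Sketch only: the two methods the commit history describes, shown here as a
// standalone trait rather than as part of FileSource itself. The method names
// match the calls in the test file below; the signatures are assumptions.
pub trait SchemaAdapterSupport {
    /// Attach a schema adapter factory, returning the updated source.
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn FileSource>;

    /// Return the currently configured factory, if any.
    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>>;
}
```

Because the factory is held behind an `Arc`, one instance can be cloned cheaply and attached to several different source types, which is exactly what `test_multi_source_schema_adapter_reuse` below exercises.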
datafusion/core/tests/integration_tests/schema_adapter_integration_tests.rs (260 additions, 0 deletions)
```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! Integration test for schema adapter factory functionality

use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::record_batch::RecordBatch;
use datafusion::datasource::object_store::ObjectStoreUrl;
use datafusion::datasource::physical_plan::arrow_file::ArrowSource;
use datafusion::prelude::*;
use datafusion_common::Result;
use datafusion_datasource::file::FileSource;
use datafusion_datasource::file_scan_config::FileScanConfigBuilder;
use datafusion_datasource::schema_adapter::{SchemaAdapter, SchemaAdapterFactory};
use datafusion_datasource::source::DataSourceExec;
use datafusion_datasource::PartitionedFile;
use std::sync::Arc;
use tempfile::TempDir;

#[cfg(feature = "parquet")]
use datafusion_datasource_parquet::ParquetSource;
#[cfg(feature = "parquet")]
use parquet::arrow::ArrowWriter;
#[cfg(feature = "parquet")]
use parquet::file::properties::WriterProperties;

#[cfg(feature = "csv")]
use datafusion_datasource_csv::CsvSource;

/// A schema adapter factory that transforms column names to uppercase
#[derive(Debug)]
struct UppercaseAdapterFactory {}

impl SchemaAdapterFactory for UppercaseAdapterFactory {
    fn create(&self, schema: &Schema) -> Result<Box<dyn SchemaAdapter>> {
        Ok(Box::new(UppercaseAdapter {
            input_schema: Arc::new(schema.clone()),
        }))
    }
}

/// Schema adapter that transforms column names to uppercase
#[derive(Debug)]
struct UppercaseAdapter {
    input_schema: SchemaRef,
}

impl SchemaAdapter for UppercaseAdapter {
    fn adapt(&self, record_batch: RecordBatch) -> Result<RecordBatch> {
        // In a real adapter, we might transform the data too.
        // For this test, we just pass the batch through.
        Ok(record_batch)
    }

    fn output_schema(&self) -> SchemaRef {
        let fields = self
            .input_schema
            .fields()
            .iter()
            .map(|f| {
                Field::new(
                    f.name().to_uppercase().as_str(),
                    f.data_type().clone(),
                    f.is_nullable(),
                )
            })
            .collect();

        Arc::new(Schema::new(fields))
    }
}

#[cfg(feature = "parquet")]
#[tokio::test]
async fn test_parquet_integration_with_schema_adapter() -> Result<()> {
    // Create a temporary directory for our test file
    let tmp_dir = TempDir::new()?;
    let file_path = tmp_dir.path().join("test.parquet");
    let file_path_str = file_path.to_str().unwrap();

    // Create test data
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, true),
    ]));

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(arrow::array::Int32Array::from(vec![1, 2, 3])),
            Arc::new(arrow::array::StringArray::from(vec!["a", "b", "c"])),
        ],
    )?;

    // Write test parquet file
    let file = std::fs::File::create(file_path_str)?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;
    writer.write(&batch)?;
    writer.close()?;

    // Create a session context
    let ctx = SessionContext::new();

    // Create a ParquetSource with the adapter factory
    let source = ParquetSource::default()
        .with_schema_adapter_factory(Arc::new(UppercaseAdapterFactory {}));

    // Create a scan config
    let config = FileScanConfigBuilder::new(
        ObjectStoreUrl::parse(&format!("file://{}", file_path_str))?,
        schema.clone(),
    )
    .with_source(source)
    .build();

    // Create a data source executor
    let exec = DataSourceExec::from_data_source(config);

    // Collect results
    let task_ctx = ctx.task_ctx();
    let stream = exec.execute(0, task_ctx)?;
    let batches = datafusion::physical_plan::common::collect(stream).await?;

    // There should be one batch
    assert_eq!(batches.len(), 1);

    // Verify the schema has uppercase column names
    let result_schema = batches[0].schema();
    assert_eq!(result_schema.field(0).name(), "ID");
    assert_eq!(result_schema.field(1).name(), "NAME");

    Ok(())
}
```

(Inline review on the next test: a reviewer noted "it is not entirely clear to me what this test is verifying"; the author replied that the test checks the points now listed in the comment block at the top of the function.)

```rust
#[tokio::test]
async fn test_multi_source_schema_adapter_reuse() -> Result<()> {
    // This test verifies that the same schema adapter factory can be reused
    // across different file source types. This is important for ensuring that:
    // 1. The schema adapter factory interface works uniformly across all source types
    // 2. The factory can be shared and cloned efficiently using Arc
    // 3. Various data source implementations correctly implement the schema adapter factory pattern

    // Create a test factory
    let factory = Arc::new(UppercaseAdapterFactory {});

    // Apply the same adapter to different source types
    let arrow_source =
        ArrowSource::default().with_schema_adapter_factory(factory.clone());

    #[cfg(feature = "parquet")]
    let parquet_source =
        ParquetSource::default().with_schema_adapter_factory(factory.clone());

    #[cfg(feature = "csv")]
    let csv_source = CsvSource::default().with_schema_adapter_factory(factory.clone());

    // Verify adapters were properly set
    assert!(arrow_source.schema_adapter_factory().is_some());

    #[cfg(feature = "parquet")]
    assert!(parquet_source.schema_adapter_factory().is_some());

    #[cfg(feature = "csv")]
    assert!(csv_source.schema_adapter_factory().is_some());

    Ok(())
}

// Helper function to test From<T> for Arc<dyn FileSource> implementations
fn test_from_impl<T: Into<Arc<dyn FileSource>> + Default>(expected_file_type: &str) {
    let source = T::default();
    let file_source: Arc<dyn FileSource> = source.into();
    assert_eq!(file_source.file_type(), expected_file_type);
}

#[test]
fn test_from_implementations() {
    // Test From implementation for various sources
    test_from_impl::<ArrowSource>("arrow");

    #[cfg(feature = "parquet")]
    test_from_impl::<ParquetSource>("parquet");

    #[cfg(feature = "csv")]
    test_from_impl::<CsvSource>("csv");

    #[cfg(feature = "json")]
    test_from_impl::<datafusion_datasource_json::JsonSource>("json");
}

/// A simple test schema adapter factory that doesn't modify the schema
#[derive(Debug)]
struct TestSchemaAdapterFactory {}

impl SchemaAdapterFactory for TestSchemaAdapterFactory {
    fn create(&self, schema: &Schema) -> Result<Box<dyn SchemaAdapter>> {
        Ok(Box::new(TestSchemaAdapter {
            input_schema: Arc::new(schema.clone()),
        }))
    }
}

/// A test schema adapter that passes through data unmodified
#[derive(Debug)]
struct TestSchemaAdapter {
    input_schema: SchemaRef,
}

impl SchemaAdapter for TestSchemaAdapter {
    fn adapt(&self, record_batch: RecordBatch) -> Result<RecordBatch> {
        // Just pass through the batch unmodified
        Ok(record_batch)
    }

    fn output_schema(&self) -> SchemaRef {
        self.input_schema.clone()
    }
}

#[cfg(feature = "parquet")]
#[test]
fn test_schema_adapter_preservation() {
    // Create a test schema
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, true),
    ]));

    // Create source with schema adapter factory
    let source = ParquetSource::default();
    let factory = Arc::new(TestSchemaAdapterFactory {});
    let file_source = source.with_schema_adapter_factory(factory);

    // Create a FileScanConfig with the source
    let config_builder =
        FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(), schema.clone())
            .with_source(file_source.clone())
            // Add a file to make it valid
            .with_file(PartitionedFile::new("test.parquet", 100));

    let config = config_builder.build();

    // Verify the schema adapter factory is present in the file source
    assert!(config.source().schema_adapter_factory().is_some());
}
```
Review comment: "this is very cool"