
Add ListingOptions::output_partitioning and FileScanConfig::output_partitioning for pre-defined file partitioning #22657

Merged
alamb merged 9 commits into apache:main from gene-bordegaray:gene.bordegaray/2026/05/file-scan-output-partitioning on Jun 23, 2026

@gene-bordegaray

@gene-bordegaray gene-bordegaray commented May 30, 2026

Contributor

Which issue does this PR close?

Rationale for this change

This follows up on #22607 by replacing range-partitioning sqllogictest boilerplate with a general file/listing scan API for declared output partitioning.

Related: #21992, #22607, #22607 (comment)

What changes are included in this PR?

  • Add declared output_partitioning to file scan and listing table configuration.
  • Preserve declared partition counts during listing-table file grouping.
  • Serialize scan output_partitioning through physical plan proto.
  • Refactor range_partitioning.slt to use a CSV ListingTable instead of a custom test-only TableProvider / DataSource.

Contract:

  • Declared partitioning expressions are written against the full table schema before scan projection. For example, Range([range_key@0], [(10), (20)], 3) remains valid if the scan projects range_key and falls back to UnknownPartitioning(3) if range_key is not projected.
  • Listing tables create one file group per declared output partition (which can exceed target_partitions). It is up to the user to plan their partitioning. For example, a 4-partition range declaration creates four scan file groups, adding empty trailing groups when fewer files are present.
  • File group index is part of the contract: file group i must contain rows for declared output partition i. DataFusion does not validate row placement, matching other user-declared properties such as sortedness.
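The contract above can be sketched with plain Rust, independent of DataFusion. Both helpers below are hypothetical and are not part of the API added in this PR: `range_partition_index` assumes that a value equal to a split point belongs to the upper partition, and `group_files` assumes one file per group in path order and does not model the case of more files than declared partitions.

```rust
/// Hypothetical helper: index of the range partition that `value` falls in,
/// given ascending split points. Sending a value equal to a split point to
/// the upper partition is an assumption of this sketch.
fn range_partition_index(split_points: &[i64], value: i64) -> usize {
    // Counts the split points that are <= value.
    split_points.partition_point(|s| *s <= value)
}

/// Hypothetical helper: builds one file group per declared partition, taking
/// files in path order and adding empty trailing groups when fewer files are
/// present. Returns `None` when there are more files than declared
/// partitions, a case this sketch does not model.
fn group_files(
    mut files: Vec<String>,
    declared_partitions: usize,
) -> Option<Vec<Vec<String>>> {
    if files.len() > declared_partitions {
        return None;
    }
    files.sort();
    // File i becomes group i; the caller is responsible for file i actually
    // holding the rows of declared partition i.
    let mut groups: Vec<Vec<String>> =
        files.into_iter().map(|f| vec![f]).collect();
    groups.resize(declared_partitions, Vec::new());
    Some(groups)
}
```

Under these assumptions, split points 10 and 20 send keys 5, 10, and 25 to partitions 0, 1, and 2, and two files declared with four partitions yield four groups, the last two empty.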

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. This adds public API for declaring file/listing scan output partitioning. No breaking API changes.

@github-actions github-actions Bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) catalog Related to the catalog crate proto Related to proto crate datasource Changes to the datasource crate labels May 30, 2026
@github-actions

github-actions Bot commented May 30, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 102.144s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.037s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [ 103.091s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.035s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.619s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 207.893s] datafusion
    Building datafusion-catalog-listing v54.0.0 (current)
       Built [  46.257s] (current)
     Parsing datafusion-catalog-listing v54.0.0 (current)
      Parsed [   0.011s] (current)
    Building datafusion-catalog-listing v54.0.0 (baseline)
       Built [  45.759s] (baseline)
     Parsing datafusion-catalog-listing v54.0.0 (baseline)
      Parsed [   0.011s] (baseline)
    Checking datafusion-catalog-listing v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.081s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ListingOptions.output_partitioning in /home/runner/work/datafusion/datafusion/datafusion/catalog-listing/src/options.rs:95

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  93.106s] datafusion-catalog-listing
    Building datafusion-datasource v54.0.0 (current)
       Built [  36.969s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  36.515s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.032s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.250s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  74.836s] datafusion-datasource
    Building datafusion-physical-expr v54.0.0 (current)
       Built [  29.077s] (current)
     Parsing datafusion-physical-expr v54.0.0 (current)
      Parsed [   0.049s] (current)
    Building datafusion-physical-expr v54.0.0 (baseline)
       Built [  29.999s] (baseline)
     Parsing datafusion-physical-expr v54.0.0 (baseline)
      Parsed [   0.049s] (baseline)
    Checking datafusion-physical-expr v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.346s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  60.362s] datafusion-physical-expr
    Building datafusion-proto v54.0.0 (current)
       Built [  58.994s] (current)
     Parsing datafusion-proto v54.0.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-proto v54.0.0 (baseline)
       Built [  57.421s] (baseline)
     Parsing datafusion-proto v54.0.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-proto v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.255s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 118.270s] datafusion-proto
    Building datafusion-proto-models v54.0.0 (current)
       Built [  23.600s] (current)
     Parsing datafusion-proto-models v54.0.0 (current)
      Parsed [   0.126s] (current)
    Building datafusion-proto-models v54.0.0 (baseline)
       Built [  23.554s] (baseline)
     Parsing datafusion-proto-models v54.0.0 (baseline)
      Parsed [   0.129s] (baseline)
    Checking datafusion-proto-models v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.673s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field FileScanExecConf.output_partitioning in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/prost.rs:1820
  field FileScanExecConf.output_partitioning in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/prost.rs:1820

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  50.041s] datafusion-proto-models
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 173.975s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 173.265s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.023s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.087s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 349.881s] datafusion-sqllogictest
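
The `constructible_struct_adds_field` failures in the log above can be illustrated with a minimal sketch. The `Options` struct here is a hypothetical stand-in, not the real `ListingOptions` or `FileScanExecConf`; it only shows why a new public field breaks downstream struct literals and how struct-update syntax or builder methods avoid that.

```rust
/// Hypothetical stand-in for a public struct whose fields are all public,
/// so downstream crates can build it with a struct literal.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Options {
    pub file_extension: String,
    /// Newly added public field.
    pub output_partitioning: Option<usize>,
}

impl Options {
    /// Builder-style setter: callers that use it are unaffected by new fields.
    pub fn with_output_partitioning(mut self, partitions: usize) -> Self {
        self.output_partitioning = Some(partitions);
        self
    }
}

// A downstream literal written before the field existed, such as
// `Options { file_extension: ".csv".to_string() }`, stops compiling
// (error E0063, missing field) once `output_partitioning` is added.
// Struct-update syntax keeps compiling when fields are added:
pub fn downstream_options() -> Options {
    Options {
        file_extension: ".csv".to_string(),
        ..Default::default()
    }
}
```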

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 30, 2026
@gene-bordegaray gene-bordegaray marked this pull request as ready for review June 2, 2026 12:33
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch from 4be99d8 to e3e6a51 Compare June 2, 2026 13:34
@gene-bordegaray gene-bordegaray changed the title [WIP] Add declared file scan output partitioning Add declared file scan output partitioning Jun 2, 2026
@gene-bordegaray

Contributor Author

cc: @alamb @stuhood @gabotechs @NGA-TRAN: this is the follow-up for adding the output_partitioning APIs

@gabotechs

Contributor

Nice! Will take a look soon.

@stuhood stuhood left a comment

Contributor

Thanks!

Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment thread datafusion/datasource/src/file_scan_config/mod.rs

@gabotechs gabotechs left a comment

Contributor

Nice job here! This seems to be going in the right direction. I left some comments; let me know what you think.

Comment thread datafusion/catalog-listing/src/options.rs Outdated
Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment thread datafusion/catalog-listing/src/options.rs
Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment thread datafusion/catalog-listing/src/table.rs Outdated
Comment thread datafusion/core/src/datasource/listing/table.rs Outdated
Comment thread datafusion/datasource/src/file_scan_config/mod.rs
@gene-bordegaray

Contributor Author

Talked with @gabotechs offline. TL;DR: this should most likely be expressed with the logical representation of partitioning, not the physical one, just as output_ordering is. I will therefore create a PR to support a Range variant in the logical enum before moving forward with this.

cc: @stuhood

@alamb alamb marked this pull request as draft June 4, 2026 14:20
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch from e3e6a51 to bc1e2fc Compare June 5, 2026 20:37
@github-actions github-actions Bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates substrait Changes to the substrait crate common Related to common crate labels Jun 5, 2026
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch from bc1e2fc to b22530a Compare June 6, 2026 13:17
pull Bot pushed a commit to buraksenn/datafusion that referenced this pull request Jun 10, 2026
## Which issue does this PR close?

- Closes apache#22778.
- Related: apache#21992, apache#22395.
- Needed by apache#22657.

## Rationale for this change

Declared scan output partitioning should use logical partitioning
metadata, not physical partitioning types. This adds logical range
partitioning so range-partitioned sources can declare their layout at
the logical layer.

## What changes are included in this PR?

- Add logical `Partitioning::Range` and `RangePartitioning`.
- Move `SplitPoint` and shared split-point validation to
`datafusion-common`.
- Wire logical range partitioning through expression traversal,
rewrites, and display.
- Keep planning, logical proto, and Substrait support explicitly
unsupported for now.

## Are these changes tested?

Yes. Unit tests were added.

## Are there any user-facing changes?

Yes. This adds public logical range partitioning API. No breaking API
changes.

AdamGS pushed a commit to AdamGS/arrow-datafusion that referenced this pull request Jun 11, 2026
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch from b22530a to 05d7512 Compare June 12, 2026 08:08
@github-actions github-actions Bot removed logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates labels Jun 12, 2026
pull Bot pushed a commit to buraksenn/datafusion that referenced this pull request Jun 17, 2026
…tions` (apache#22969)

## Which issue does this PR close?

- Closes #.

## Rationale for this change

Something that was spotted during the review of:
- apache#22657

`ListingOptions::target_partitions` and `ListingOptions::collect_stat`
duplicate `SessionConfig`'s `execution.target_partitions` and
`execution.collect_statistics`.

After some investigation, I think they only live on `ListingOptions` for
historical reasons: when the struct was added (apache#1010 5 years ago),
`TableProvider::scan` had no access to the session, so the values had to
be copied onto the table at build time. Once apache#2660 passed `SessionState`
into `scan`, the fields became redundant (and had already drifted —
`scan` read them from the session config while `list_files_for_scan`
read the stale copy). This PR makes `SessionConfig` the single source of
truth.

## What changes are included in this PR?

- Remove `target_partitions`/`collect_stat` fields, their builders, and
`with_session_config_options` from `ListingOptions`.
- `ListingTable` now reads both values from the session config at scan
time.
- Reserve proto tags 8/9 in `ListingTableScanNode` and drop the related
(de)serialization.
- Update benchmarks, factory, and test call sites.

## Are these changes tested?

Yes, by existing tests

## Are there any user-facing changes?

Yes, breaking: the removed fields/builders require configuring
`SessionConfig` instead, and the two proto fields no longer round-trip.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch 2 times, most recently from 74da268 to 758278d Compare June 21, 2026 08:28
@alamb

alamb commented Jun 22, 2026

Contributor

Looks like this PR has a conflict (and for some reason the CI isn't running). Maybe we can resolve / fix that and I will try and review it

@gene-bordegaray

Contributor Author

> Looks like this PR has a conflict (and for some reason the CI isn't running). Maybe we can resolve / fix that and I will try and review it

@alamb working on it right now 🙇

@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch from 758278d to ba15301 Compare June 22, 2026 14:50
@gene-bordegaray

Contributor Author

ok this should be good for another look @alamb 🙇

@alamb alamb changed the title Add declared file scan output partitioning Add FileScanConfig::output_partitioning for pre-defined file partitioning Jun 22, 2026
@alamb alamb changed the title Add FileScanConfig::output_partitioning for pre-defined file partitioning Add ListingOptions::output_partitioning and FileScanConfig::output_partitioning for pre-defined file partitioning Jun 22, 2026

@alamb alamb left a comment

Contributor

I think the API looks good to me -- thank you @gene-bordegaray and @gabotechs

I had several code / testing comments that would be nice to address but are not required to merge in my opinion

Comment thread datafusion/datasource/src/file_scan_config/mod.rs
/// Declarations are limited to partitioning that can be represented by
/// assigning whole files to file groups.
///
/// Files are assigned to groups in path order. DataFusion does not validate

Contributor

I think an example or two would help make this documentation clearer.

e.g. an example partitioning and 3 files -- how are the files assigned to partitions?

Contributor Author

Added two examples and made the docs clearer. Let me know if that covers the cases you had in mind.

Comment thread datafusion/catalog-listing/src/options.rs
Comment thread datafusion/catalog-listing/src/table.rs Outdated
}
}

fn create_physical_output_partitioning(

Contributor

These functions seem generic -- could they be made a method on LogicalPartitioning to make them more discoverable?

Or perhaps datafusion/physical-expr/src/physical_expr.rs, something like

  pub fn create_physical_partitioning(
      partitioning: &datafusion_expr::Partitioning,
      input_dfschema: &DFSchema,
      execution_props: &ExecutionProps,
  ) -> Result<Partitioning>

As they are similar "logical to physical" conversion functions
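
As an illustration only, a logical-to-physical conversion of this kind might look like the sketch below. The types and the function are simplified, hypothetical stand-ins, not DataFusion's actual `Partitioning` types or the signature suggested above: the declared partitioning names a column of the full table schema, and the conversion resolves it against the projected columns, falling back to an unknown partitioning with the same partition count when the column is not projected.

```rust
/// Hypothetical logical declaration, written against the full table schema.
#[derive(Debug, Clone, PartialEq)]
enum DeclaredPartitioning {
    /// Range partitioning on a named column with ascending split points.
    Range { column: String, split_points: Vec<i64> },
}

/// Hypothetical physical result, expressed against the projected schema.
#[derive(Debug, Clone, PartialEq)]
enum ScanPartitioning {
    Range { column_index: usize, split_points: Vec<i64>, partitions: usize },
    /// Only the partition count is known.
    Unknown(usize),
}

/// Resolves the declared column against the projected column names.
/// N split points always describe N + 1 partitions.
fn to_scan_partitioning(
    declared: &DeclaredPartitioning,
    projected_columns: &[&str],
) -> ScanPartitioning {
    match declared {
        DeclaredPartitioning::Range { column, split_points } => {
            let partitions = split_points.len() + 1;
            match projected_columns.iter().position(|c| *c == column.as_str()) {
                Some(column_index) => ScanPartitioning::Range {
                    column_index,
                    split_points: split_points.clone(),
                    partitions,
                },
                // The partitioning column was projected away.
                None => ScanPartitioning::Unknown(partitions),
            }
        }
    }
}
```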

Comment thread datafusion/catalog-listing/src/table.rs Outdated
false,
)
}
let (mut file_groups, grouped_by_partition) = if has_declared_partitioning {

Contributor

Is there some way to refactor this function so it doesn't have a bunch of `if has_declared_partitioning` checks? They make it hard to understand what is going on. Maybe we can have two versions of the function or something like that?

Contributor Author

I split this into two functions with some helpers; I think that makes the logic more linear 👍

Comment thread datafusion/core/src/datasource/listing/table.rs
Comment thread datafusion/core/src/datasource/listing/table.rs Outdated
/// cannot be projected, or its partition count differs from `file_groups`,
/// this returns `UnknownPartitioning`.
///
/// Tradeoffs

Contributor

These tradeoffs still seem relevant -- why remove them?

Contributor Author

It seemed like long docs that I thought would be better placed on the fields themselves; that's why I added: https://github.com/apache/datafusion/pull/22657/changes/BASE..ba15301de16ce79e7e84c134961c75e9ca4653b8#diff-a07222d670257887f5118197c485861c96635e2da6c2bf0007d2c21dda7df82aR206

It is shorter but conveys the same idea. I can revert this and modify it if that's preferred.

Contributor Author

Actually, if we are deprecating partitioned_by_file_group, I will keep this and omit that new addition 👍

@github-actions github-actions Bot added the physical-expr Changes to the physical-expr crates label Jun 22, 2026
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch 3 times, most recently from b6ec867 to 5ea3b79 Compare June 23, 2026 00:48
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/file-scan-output-partitioning branch from 5ea3b79 to 03e860f Compare June 23, 2026 01:01

@gabotechs gabotechs left a comment

Contributor

This looks good, great work @gene-bordegaray!

@alamb alamb added this pull request to the merge queue Jun 23, 2026
Merged via the queue into apache:main with commit 63ad991 Jun 23, 2026
39 checks passed

Labels

auto detected api change Auto detected API change catalog Related to the catalog crate core Core DataFusion crate datasource Changes to the datasource crate physical-expr Changes to the physical-expr crates proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support declared output partitioning for file/listing scans

4 participants