Add ListingOptions::output_partitioning and FileScanConfig::output_partitioning for pre-defined file partitioning#22657
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
cc: @alamb @stuhood @gabotechs @NGA-TRAN follow up for adding |
|
Nice! will take a look soon |
gabotechs
left a comment
Nice job here! this seems to be going in the right direction, left some comments, let me know what you think
|
Talked with @gabotechs offline. TLDR the representation of this should most likely be behind the logical representation of partitioning, not the physical one just as cc: @stuhood |
## Which issue does this PR close?

- Closes apache#22778.
- Related: apache#21992, apache#22395.
- Needed by apache#22657.

## Rationale for this change

Declared scan output partitioning should use logical partitioning metadata, not physical partitioning types. This adds logical range partitioning so range-partitioned sources can declare their layout at the logical layer.

## What changes are included in this PR?

- Add logical `Partitioning::Range` and `RangePartitioning`.
- Move `SplitPoint` and shared split-point validation to `datafusion-common`.
- Wire logical range partitioning through expression traversal, rewrites, and display.
- Keep planning, logical proto, and Substrait support explicitly unsupported for now.

## Are these changes tested?

Yes. Unit tests added

## Are there any user-facing changes?

Yes. This adds public logical range partitioning API.
No breaking API changes.
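To make the split-point idea concrete, here is a minimal sketch of what validating a range declaration and looking up a partition could look like. It is not the code from this commit: `RangeDecl`, `validate`, and `partition_for` are hypothetical names, the key is reduced to a single `i64`, and the rules (strictly increasing split points, N split points for N + 1 partitions, each split point acting as the inclusive lower bound of the next partition) are assumptions of the sketch.

```rust
/// Hypothetical stand-in for a range declaration over one integer key.
/// This is not the DataFusion `RangePartitioning` type.
#[derive(Debug, PartialEq)]
pub struct RangeDecl {
    pub split_points: Vec<i64>,
    pub partition_count: usize,
}

/// Validates a declaration under two assumptions of this sketch:
/// N split points describe N + 1 partitions, and split points are
/// strictly increasing.
pub fn validate(split_points: Vec<i64>, partition_count: usize) -> Result<RangeDecl, String> {
    if partition_count != split_points.len() + 1 {
        return Err(format!(
            "{} split points describe {} partitions, but {} were declared",
            split_points.len(),
            split_points.len() + 1,
            partition_count
        ));
    }
    if split_points.windows(2).any(|pair| pair[0] >= pair[1]) {
        return Err("split points must be strictly increasing".to_string());
    }
    Ok(RangeDecl { split_points, partition_count })
}

/// Returns the partition index for `key`, assuming each split point is the
/// inclusive lower bound of the partition to its right.
pub fn partition_for(decl: &RangeDecl, key: i64) -> usize {
    // Counts how many split points are <= key (binary search on a sorted slice).
    decl.split_points.partition_point(|split| *split <= key)
}
```

With split points `[10, 20]` and three partitions, keys below 10 land in partition 0, keys from 10 up to but excluding 20 in partition 1, and everything else in partition 2, under the boundary assumption stated above.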
…tions` (apache#22969)

## Which issue does this PR close?

- Closes #.

## Rationale for this change

Something that was spotted during the review of:

- apache#22657

`ListingOptions::target_partitions` and `ListingOptions::collect_stat` duplicate `SessionConfig`'s `execution.target_partitions` and `execution.collect_statistics`. After some investigation, I think they only live on `ListingOptions` for historical reasons: when the struct was added (apache#1010 5 years ago), `TableProvider::scan` had no access to the session, so the values had to be copied onto the table at build time. Once apache#2660 passed `SessionState` into `scan`, the fields became redundant (and had already drifted: `scan` read them from the session config while `list_files_for_scan` read the stale copy). This PR makes `SessionConfig` the single source of truth.

## What changes are included in this PR?

- Remove `target_partitions`/`collect_stat` fields, their builders, and `with_session_config_options` from `ListingOptions`.
- `ListingTable` now reads both values from the session config at scan time.
- Reserve proto tags 8/9 in `ListingTableScanNode` and drop the related (de)serialization.
- Update benchmarks, factory, and test call sites.

## Are these changes tested?

Yes, by existing tests

## Are there any user-facing changes?

Yes, breaking: the removed fields/builders require configuring `SessionConfig` instead, and the two proto fields no longer round-trip.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
Looks like this PR has a conflict (and for some reason the CI isn't running). Maybe we can resolve / fix that and I will try and review it |
@alamb working on it right now 🙇 |
|
ok this should be good for another look @alamb 🙇 |
alamb
left a comment
I think the API looks good to me -- thank you @gene-bordegaray and @gabotechs
I had several code / testing comments that would be nice to address but are not required to merge in my opinion
/// Declarations are limited to partitioning that can be represented by
/// assigning whole files to file groups.
///
/// Files are assigned to groups in path order. DataFusion does not validate
I think an example or two would help make this documentation clearer.
e.g. an example partitioning and 3 files -- how are the files assigned to partitions
added two examples and made it clearer. Let me know if that covers the cases you had in mind
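For readers following along, the "3 files" case asked about above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `assign_files` is a hypothetical helper, and it assumes one file per group in path order, with empty trailing groups when there are fewer files than declared partitions. The case of more files than partitions is deliberately not modelled.

```rust
/// Hypothetical helper: sorts files by path, puts file `i` into group `i`,
/// and pads with empty trailing groups up to `partition_count`.
/// Returns `None` when there are more files than groups, because this
/// sketch does not model that case.
pub fn assign_files(mut paths: Vec<String>, partition_count: usize) -> Option<Vec<Vec<String>>> {
    if paths.len() > partition_count {
        return None;
    }
    // "Path order" is taken here to mean plain lexicographic order.
    paths.sort();
    let mut groups: Vec<Vec<String>> = paths.into_iter().map(|path| vec![path]).collect();
    // Add empty trailing groups so the group count matches the declaration.
    groups.resize_with(partition_count, Vec::new);
    Some(groups)
}
```

Under these assumptions, three files `a.csv`, `b.csv`, `c.csv` and a 4-partition declaration give four groups: one file each in groups 0 to 2 and an empty group 3.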
}
}

fn create_physical_output_partitioning(
these functions seem like they are generic -- could they be made as a method on LogicalPartitioning to make them more discoverable?
Or perhaps datafusion/physical-expr/src/physical_expr.rs, something like

    pub fn create_physical_partitioning(
        partitioning: &datafusion_expr::Partitioning,
        input_dfschema: &DFSchema,
        execution_props: &ExecutionProps,
    ) -> Result<Partitioning>

As they are similar "logical to physical" conversion functions
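To illustrate the shape of such a "logical to physical" conversion without depending on DataFusion itself, here is a rough standalone sketch. `LogicalPartitioning`, `PhysicalPartitioning`, and the slice-of-names schema are hypothetical stand-ins invented for this example; the real types, variants, and signature may differ.

```rust
/// Hypothetical logical form: columns are referenced by name.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalPartitioning {
    RoundRobin(usize),
    Hash(Vec<String>, usize),
}

/// Hypothetical physical form: columns are resolved to schema indices.
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalPartitioning {
    RoundRobin(usize),
    Hash(Vec<usize>, usize),
}

/// Resolves column names against `schema` and fails if any name is missing.
pub fn create_physical_partitioning(
    partitioning: &LogicalPartitioning,
    schema: &[&str],
) -> Result<PhysicalPartitioning, String> {
    match partitioning {
        LogicalPartitioning::RoundRobin(n) => Ok(PhysicalPartitioning::RoundRobin(*n)),
        LogicalPartitioning::Hash(columns, n) => {
            let indices = columns
                .iter()
                .map(|name| {
                    schema
                        .iter()
                        .position(|field| *field == name.as_str())
                        .ok_or_else(|| format!("column {} not found in schema", name))
                })
                .collect::<Result<Vec<usize>, String>>()?;
            Ok(PhysicalPartitioning::Hash(indices, *n))
        }
    }
}
```

Keeping the conversion in one free function (or one method) means every caller resolves names against the schema the same way, which is the discoverability point raised above.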
false,
)
}
let (mut file_groups, grouped_by_partition) = if has_declared_partitioning {
is there some way to refactor this function so it doesn't have a bunch of if has_declared_partitioning checks? it makes it hard to understand what is going on. Maybe we can have two versions of the function or something like that?
I split this into two functions with some helpers, which I think makes the logic more linear 👍
/// cannot be projected, or its partition count differs from `file_groups`,
/// this returns `UnknownPartitioning`.
///
/// Tradeoffs
these tradeoffs still seem relevant -- why remove it?
These seemed like long docs that I thought would be better on the fields themselves, which is why I added: https://github.com/apache/datafusion/pull/22657/changes/BASE..ba15301de16ce79e7e84c134961c75e9ca4653b8#diff-a07222d670257887f5118197c485861c96635e2da6c2bf0007d2c21dda7df82aR206
It is shorter but still the same idea. I can revert this and modify it if that's preferred.
actually if we are deprecating partitioned_by_file_group I will keep this and omit that new addition 👍
gabotechs
left a comment
This looks good, great work @gene-bordegaray!
Which issue does this PR close?
Rationale for this change
This follows up on #22607 by replacing range-partitioning sqllogictest boilerplate with a general file/listing scan API for declared output partitioning.
Related: #21992, #22607, #22607 (comment)
What changes are included in this PR?
- […] `output_partitioning` to file scan and listing table configuration.
- […] `output_partitioning` through physical plan proto.
- […] `range_partitioning.slt` to use a CSV `ListingTable` instead of a custom test-only `TableProvider`/`DataSource`.

Contract:
- `Range([range_key@0], [(10), (20)], 3)` remains valid if the scan projects `range_key` and falls back to `UnknownPartitioning(3)` if `range_key` is not projected.
- […] `target_partitions`). It is up to the user to plan their partitioning. For example, a 4-partition range declaration creates four scan file groups, adding empty trailing groups when fewer files are present.
- […] `i` must contain rows for declared output partition `i`. DataFusion does not validate row placement, matching other user-declared properties such as sortedness.
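The projection rule in the contract can be sketched as below. This is a simplified model and not the PR's implementation: `DeclaredRange`, `ScanPartitioning`, and `project_declaration` are hypothetical names, keys are plain `i64` values, and the projection is just a list of column names.

```rust
/// Hypothetical declaration: a range on a named column of the file schema.
#[derive(Debug, Clone, PartialEq)]
pub struct DeclaredRange {
    pub key: String,
    pub split_points: Vec<i64>,
    pub partition_count: usize,
}

/// Hypothetical scan output partitioning, expressed against projected columns.
#[derive(Debug, Clone, PartialEq)]
pub enum ScanPartitioning {
    Range { key_index: usize, split_points: Vec<i64>, partition_count: usize },
    Unknown(usize),
}

/// Keeps the range declaration when the key column survives projection,
/// otherwise falls back to an unknown partitioning with the same count.
pub fn project_declaration(declared: &DeclaredRange, projected: &[&str]) -> ScanPartitioning {
    match projected.iter().position(|column| *column == declared.key.as_str()) {
        Some(key_index) => ScanPartitioning::Range {
            key_index,
            split_points: declared.split_points.clone(),
            partition_count: declared.partition_count,
        },
        None => ScanPartitioning::Unknown(declared.partition_count),
    }
}
```

Note that the partition count is preserved in both branches: dropping the key column loses the range information but not the number of scan partitions, which mirrors the `UnknownPartitioning(3)` fallback described in the contract.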
Yes.
Are there any user-facing changes?
Yes. This adds public API for declaring file/listing scan output partitioning. No breaking API changes.