Skip to content

Conversation

@dqkqd
Copy link
Contributor

@dqkqd dqkqd commented Oct 22, 2025

Which issue does this PR close?

Rationale for this change

DataFrame::drop_columns only considers one field for each name,
it fails to drop columns from dataframe containing duplicated names
from different relations. Such as mark columns created by multiples
Join::LeftMark.

DataFrame::select_columns has the same issue, it fails to select columns
with the same name from different relations.

What changes are included in this PR?

Allow DataFrame::drop_columns and DataFrame::select_columns
work with duplicated names from different relations.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the core Core DataFusion crate label Oct 22, 2025
@dqkqd dqkqd changed the title fix: DataFrame::drop_columns for qualified duplicated field names fix: select_columns and drop_columns for qualified duplicated field names Oct 23, 2025
"
);

let select_res = join_res.select_columns(&["c1"])?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line returned SchemaError without the code changes.

    Error: SchemaError(AmbiguousReference { field: Column { relation: None, name: "c1" } }, Some(""))

Comment on lines +627 to +641
assert_snapshot!(
batches_to_sort_string(&drop_res.clone().collect().await.unwrap()),
@r"
+----+
| c1 |
+----+
| b |
| c |
| d |
+----+
"
);

Ok(())
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This failed without the code changes

    running 1 test
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Snapshot Summary ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Snapshot: drop_columns_duplicated_names_from_different_qualifiers-2
    Source: datafusion/core/tests/dataframe/mod.rs:627
    ────────────────────────────────────────────────────────────────────────────────
    Expression: batches_to_sort_string(&drop_res.clone().collect().await.unwrap())
    ────────────────────────────────────────────────────────────────────────────────
    -old snapshot
    +new results
    ────────────┬───────────────────────────────────────────────────────────────────
        1       │-+----+
        2       │-| c1 |
        3       │-+----+
        4       │-| b  |
        5       │-| c  |
        6       │-| d  |
        7       │-+----+
              1 │++----+-------+-------+
              2 │+| c1 | mark  | mark  |
              3 │++----+-------+-------+
              4 │+| b  | true  | false |
              5 │+| c  | false | false |
              6 │+| d  | false | true  |
              7 │++----+-------+-------+
    ────────────┴───────────────────────────────────────────────────────────────────
    To update snapshots run `cargo insta review`
    Stopped on the first failure. Run `cargo insta test` to run all snapshots.
    test dataframe::drop_columns_duplicated_names_from_different_qualifiers ... FAILED

    failures:

    failures:
        dataframe::drop_columns_duplicated_names_from_different_qualifiers

    test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 769 filtered out; finished in 0.17s

@dqkqd dqkqd marked this pull request as ready for review October 23, 2025 20:49
Copy link

@albertlockett albertlockett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for taking this @dqkqd

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dqkqd and @albertlockett 🙏

@alamb alamb changed the title fix: select_columns and drop_columns for qualified duplicated field names fix: DataFrame::select_columns and DataFrame::drop_columns for qualified duplicated field names Oct 29, 2025
@alamb alamb added this pull request to the merge queue Oct 30, 2025
Merged via the queue into apache:main with commit 9435513 Oct 30, 2025
28 checks passed
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
…alified duplicated field names (apache#18236)

## Which issue does this PR close?

- Closes apache#18212.

## Rationale for this change

`DataFrame::drop_columns` only considers one field for each `name`,
it fails to drop columns from dataframe containing duplicated names
from different relations. Such as `mark` columns created by multiples 
`Join::LeftMark`.

`DataFrame::select_columns` has the same issue, it fails to select
columns
with the same name from different relations.

## What changes are included in this PR?

Allow `DataFrame::drop_columns` and `DataFrame::select_columns`
work with duplicated names from different relations.

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to drop the mark column added by JoinType::LeftMark if there are multiple joins

3 participants