Skip to content

Move implementation of upsert from Table to Transaction #1817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 13, 2025

Conversation

koenvo
Copy link
Contributor

@koenvo koenvo commented Mar 19, 2025

Rationale for this change

Previously, the upsert functionality was implemented at the table level, which meant it always initiated a new Transaction. This change moves the upsert implementation to the Transaction level while keeping table.upsert(...) as an entry point.

With this refactor, end users now have the flexibility to call upsert in two ways:

  • table.upsert(...) – which still starts a new transaction.
  • transaction.upsert(...) – allowing upserts within an existing transaction.

Are these changes tested?

Using existing tests.

Are there any user-facing changes?

Yes. This change enables users to perform upserts within an existing transaction using transaction.upsert(...), in addition to the existing table.upsert(...) method.

@koenvo koenvo marked this pull request as ready for review March 19, 2025 13:32
@koenvo koenvo changed the title Move actual implementation of upsert from Table to Transaction Move implementation of upsert from Table to Transaction Mar 19, 2025
@mattmartin14
Copy link
Contributor

I think since the transaction wrapper has been moved out, there should be a unit test added to do partial upsert and then throw an error and ensure the rollback occurs and we are not left in a state where a partial upsert succeeded.

Example:

  • start an upsert
  • let the update succeed
  • force an error on the insert component
  • rollback the transaction and make sure the update did not persist

Just my thoughts 😃. Thanks,
Matt

@koenvo
Copy link
Contributor Author

koenvo commented Mar 22, 2025

Agree! I will work on the test.

With "update" you mean "delete", right?

@mattmartin14
Copy link
Contributor

Agree! I will work on the test.

With "update" you mean "delete", right?

Hey sorry; just saw this; when i mean update, i mean it invokes an "overwrite" operation, which i believe is what delete's also trigger under the covers. 😀

@koenvo
Copy link
Contributor Author

koenvo commented Mar 27, 2025

There is a nice edgecase here..

tbl = catalog.create_table(identifier, schema=schema)

# Define exact schema: required int32 and required string
arrow_schema = pa.schema([
    pa.field("id", pa.int32(), nullable=False),
    pa.field("name", pa.string(), nullable=False),
])

tbl.append(pa.Table.from_pylist([{"id": 1, "name": "Alice"}], schema=arrow_schema))

df = pa.Table.from_pylist([{"id": 2, "name": "Bob"}, {"id": 1, "name": "Alicia"}], schema=arrow_schema)

with tbl.transaction() as txn:
    txn.upsert(df, join_cols=["id"])

    # This will re-insert Bob, instead of reading the uncommitted changes and ignore Bob
    txn.upsert(df, join_cols=["id"])

@Fokko should it be possible to read uncommitted changes?

@Fokko
Copy link
Contributor

Fokko commented Apr 17, 2025

@Fokko should it be possible to read uncommitted changes?

Yes, it should. If you do a subsequent upsert with the same data, it should be a no-op. This should be the case today, otherwise, #1903 will fix this.

@kevinjqliu
Copy link
Contributor

now that #1903 is merged, could you rebase this PR?

@koenvo koenvo force-pushed the feat/move-upsert-to-transaction branch from 56888d3 to f336c0b Compare May 13, 2025 07:40
matched_predicate = upsert_util.create_match_filter(df, join_cols)

# We must use Transaction.table_metadata for the scan. This includes all uncommitted - but relevant - changes.
matched_iceberg_table = DataScan(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most important change. Required #1903

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for pointing out, and also covering this in the tests 👍

txn.delete(delete_filter="id = 1")
txn.append(df)

# This should read the uncommitted changes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test uncommitted changes are read

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @koenvo

matched_predicate = upsert_util.create_match_filter(df, join_cols)

# We must use Transaction.table_metadata for the scan. This includes all uncommitted - but relevant - changes.
matched_iceberg_table = DataScan(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for pointing out, and also covering this in the tests 👍

@Fokko Fokko merged commit d9f3a07 into apache:main May 13, 2025
10 checks passed
@koenvo koenvo deleted the feat/move-upsert-to-transaction branch May 13, 2025 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants