feat(api): implement `upsert()` using `MERGE INTO` #11624

deepyaman · 2025-09-16T21:47:07Z

Description of changes

Implement Backend.upsert() using sqlglot.expressions.merge() under the hood. Upsert support is very important, especially for data engineering use cases.

This starts with a basic implementation that supports only one join column. I think it could be expanded to support a list of columns without much effort.

MERGE INTO support is limited. DuckDB only added support for MERGE statements earlier today in 1.4.0, and many other backends don't support it. However, it seems like the more standard/correct approach for supporting upserts, and it doesn't require merge keys to be defined ahead of time on tables.
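For readers unfamiliar with the statement shape, here is a minimal sketch of the single-join-column MERGE INTO described above. It builds the SQL with plain string formatting rather than `sqlglot.expressions.merge()`, and `build_merge_sql` and the table names are hypothetical, so treat it as an illustration of the statement, not the PR's implementation. It does no identifier quoting or escaping.

```python
def build_merge_sql(target: str, source: str, on: str, columns: list[str]) -> str:
    """Build a single-key MERGE INTO statement (illustrative only, no quoting)."""
    # Every column except the join key is overwritten when a row matches.
    update_cols = [c for c in columns if c != on]
    parts = [
        f"MERGE INTO {target} USING {source}",
        f"ON {target}.{on} = {source}.{on}",
    ]
    if update_cols:
        assignments = ", ".join(f"{c} = {source}.{c}" for c in update_cols)
        parts.append(f"WHEN MATCHED THEN UPDATE SET {assignments}")
    # Rows with no match in the target are inserted as-is.
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"{source}.{c}" for c in columns)
    parts.append(
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
    return " ".join(parts)


print(build_merge_sql("employees", "staged", "first_name", ["first_name", "salary"]))
```

Because the match condition is written into the statement itself, no primary key or unique constraint has to exist on the target table.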

Backends that work:

DuckDB (from 1.4.0)
Flink
Oracle ~~(currently using a hack to work around "AS" getting added to MERGE statement)~~
MS SQL (currently throwing a ; onto the end of ~~every~~ statement 😅)
Postgres

Should work, need help to test:

Databricks
Snowflake
BigQuery

Backends that don't work:

PySpark ("MERGE INTO TABLE is not supported temporarily.")
Clickhouse
DataFusion
SQLite (supports the nonstandard UPSERT statement)
Impala ("The MERGE statement is only supported for Iceberg tables.")
MySQL
Polars
RisingWave
Athena ("MERGE INTO is transactional and is supported only for Apache Iceberg tables in Athena engine version 3.")
Trino ("connector does not support modifying table rows")

Issues closed

Resolves Add Backend.upsert #5391

deepyaman · 2025-09-18T00:54:13Z

@cpcloud Requesting review for initial feedback while I try to improve backend support; I assume I could always mark a lot of them notyet if the approach is correct but some translations just need work.

ibis/backends/oracle/__init__.py

ibis/backends/sql/__init__.py

deepyaman · 2025-09-18T21:42:37Z

ibis/backends/mssql/__init__.py

            query = query.sql(self.dialect)

+        if "MERGE" in query:
+            query = f"{query};"


I'm honestly not sure how to better do this; I wasn't able to figure out how to add a semicolon to an expression in SQLGlot.

This isn't ideal, but I think it's fine with me. The "cleaner" way would be to override the _build_upsert_from_table method, but the amount of boilerplate for that feels not worth it.

Or actually, is it not possible to just always stick on a ; at the end of _build_upsert_from_table() for every backend? Or does that break on some backends?

> Or actually, is it not possible to just always stick on a ; at the end of _build_upsert_from_table() for every backend?

I don't actually know how to stick a semicolon on the end of a SQLGlot expression. 😅 If _build_upsert_from_table() was handling the conversion to SQL, that would have been simple enough.

Ah, I understand.

huh, is this a bug in the upstream mssql engine? Can you confirm that only merge statements require trailing semicolons, but other sql queries/statements do not?
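One possible tightening of the `"MERGE" in query` check from the diff above, sketched under assumptions: `terminate_merge` is a hypothetical helper operating on the already-rendered SQL string, not code from the PR. It only appends a semicolon when the statement starts with the MERGE keyword, so a query that merely mentions "MERGE" in an identifier or literal is left alone.

```python
import re

# Matches a statement whose first keyword is MERGE (case-insensitive).
_STARTS_WITH_MERGE = re.compile(r"^\s*MERGE\b", re.IGNORECASE)


def terminate_merge(sql: str) -> str:
    """Append ';' to a MERGE statement that is not already terminated."""
    stripped = sql.rstrip()
    if _STARTS_WITH_MERGE.match(stripped) and not stripped.endswith(";"):
        return f"{stripped};"
    return sql


print(terminate_merge("MERGE INTO t USING s ON t.i = s.i WHEN MATCHED THEN DELETE"))
print(terminate_merge("SELECT merge_count FROM t"))
```

A MERGE preceded by a `WITH` clause would not match this pattern, so the sketch trades coverage for fewer false positives.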

ibis/backends/sql/__init__.py

deepyaman · 2025-09-21T03:51:38Z

@cpcloud I've temporarily cherry-picked the changes from #11636 and the updates to xfail/xpass two DuckDB tests in order to get DuckDB and Oracle tests working (need the newer DuckDB and SQLGlot releases). All of these changes are in the last 3 commits:

With that, all of the functionality to implement Backend.upsert() is working and ready for review. The remaining issues all also exist in #11637, so I won't duplicate solving them. Whenever you do merge #11637, I should be able to back out the last 3 commits and rebase this on top.

deepyaman · 2025-10-07T22:53:35Z

> @cpcloud I've temporarily cherry-picked the changes from #11636 and the updates to xfail/xpass two DuckDB tests in order to get DuckDB and Oracle tests working (need the newer DuckDB and SQLGlot releases). All of these changes are in the last 3 commits:
>
> 7fddcb9
>
> 0ca5379
>
> adc70e7
>
> With that, all of the functionality to implement Backend.upsert() is working and ready for review. The remaining issues all also exist in #11637, so I won't duplicate solving them. Whenever you do merge #11637, I should be able to back out the last 3 commits and rebase this on top.

@cpcloud FYI I've done this and everything is passing! Should be ready to go.

ibis/backends/tests/test_client.py

ibis/backends/sql/__init__.py

NickCrews · 2025-10-13T18:44:35Z

ibis/backends/tests/test_client.py

+    from_table = con.table(employee_data_3_temp_table)
+    df1 = temporary.execute().set_index("first_name")
+
+    con.upsert(employee_data_1_temp_table, obj=from_table, on="first_name")


Can you add an xfail test for on="bogus_column"?

Is this necessary? Seems like an unconventional use of xfail.

I think it would make more sense to test if we were explicitly testing whether the on value was in the set of available columns, but I don't see a similar test for something like join predicates, so I feel like it should be fine to leave it to the backend.

If this was the confusion, I guess I really meant `with pytest.raises():`, not an actual `pytest.mark.xfail`.

But, I think I'm ok with not testing for this behavior if you don't want to (leaving it as undefined behavior). As long as we test all the explicitly-supported behaviors I'm happy.
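The distinction drawn in this thread, asserting that a call raises rather than marking a test as an expected failure, can be sketched as follows. The standard library's `unittest` is swapped in for pytest here so the snippet has no third-party dependencies; `assertRaises` plays the role of `pytest.raises`. `check_join_column` is a hypothetical validator, not something Ibis provides, since the thread settles on leaving this behavior to the backend.

```python
import io
import unittest


def check_join_column(on: str, source_columns: list[str]) -> None:
    """Hypothetical pre-check: reject a join column the source does not have."""
    if on not in source_columns:
        raise ValueError(f"join column {on!r} not found in {source_columns}")


class TestBogusJoinColumn(unittest.TestCase):
    def test_bogus_column_raises(self):
        # The test passes only if the expected exception is raised,
        # which is different from marking the whole test as expected to fail.
        with self.assertRaises(ValueError):
            check_join_column("bogus_column", ["first_name", "salary"])


def run_suite() -> unittest.TestResult:
    """Run the test case quietly and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBogusJoinColumn)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


result = run_suite()
print(result.wasSuccessful())
```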

ibis/backends/__init__.py

ibis/backends/oracle/__init__.py

deepyaman · 2025-10-16T03:12:32Z

@NickCrews Addressed all your comments (except didn't add an xfail test, but replied to your suggestion). Please feel free to take another look!

NickCrews · 2025-10-16T19:22:37Z

ibis/backends/sql/__init__.py

+        compiler = self.compiler
+        quoted = compiler.quoted
+
+        columns = self._get_columns_to_insert(


Assuming an existing table with columns {i: int64, s: string, f: float64}, can you add tests for upserting objects (using condition i=i) with the following schemas:

{i: int64, s: string, f: float64} (works)

{s: !string, f: float32, i: uint8} (different order and flavors, but dtypes still compatible, works)

{i: int64} (success, but nothing is updated)

{i: int64, s: string} (only s is updated)

{s: string} (error, i not present)

{i: int64, b: boolean} (error, b is not in dest table)

Then, depending on how the "should we fallback to positional ordering" decision goes, we should add tests for that too. But I think all the above tests should still be valid regardless of what we decide there.
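The name-based cases listed above can be sketched as a small column check. This is an illustration under assumptions: `plan_upsert` is a hypothetical helper, it matches columns by name only, and it ignores dtype compatibility (so the float32/uint8 variation is treated the same as the exact-match case).

```python
def plan_upsert(target_cols: list[str], source_cols: list[str], on: str) -> list[str]:
    """Return the non-key columns an upsert would update, matching by name."""
    if on not in source_cols:
        raise ValueError(f"join column {on!r} is missing from the source")
    if on not in target_cols:
        raise ValueError(f"join column {on!r} is missing from the target")
    extra = [c for c in source_cols if c not in target_cols]
    if extra:
        raise ValueError(f"source columns not in target: {extra}")
    # Column order in the source does not matter; only names do.
    return [c for c in source_cols if c != on]


target = ["i", "s", "f"]
print(plan_upsert(target, ["s", "f", "i"], on="i"))
print(plan_upsert(target, ["i"], on="i"))
```

An empty result corresponds to the "success, but nothing is updated" case; the two error cases raise `ValueError`.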

NickCrews · 2025-10-16T19:24:44Z

Thanks @deepyaman, those tweaks all look great. I came up with several more test cases I'd love to see. Sorry for the continually moving target.

NickCrews · 2025-10-16T19:28:43Z

ibis/backends/sql/__init__.py

        target_cols = self.get_schema(target, catalog=catalog, database=db).keys()

-        columns = (
+        return (


Oof, I see what you mean about inheriting this logic from insert. I think it is essential that they use the same logic.

But, I think this is a footgun waiting to happen. I think we should make a breaking change to .insert() and require that the source columns be a subset of the target columns. What do you think of this change, @cpcloud?

I don't think there is anywhere else in ibis where we rely on positional ordering of columns, is there? We should keep that in mind: if we make this change here, then I think we should wipe out all other instances of relying on positional ordering.

Asking Copilot "Is there anywhere in this codebase where we rely on the position of columns in tables/relations? Or do we always match by column name?" gives:

Looking at the codebase, Ibis primarily relies on column names rather than positions, but there are some specific cases where positional ordering is used:

Column Name-Based Operations (Majority)

Most operations use column names:

Table.join() matches columns by name (e.g., "movieId")

Table.select() references columns by name

Schema.name_at_position() exists but is used to look up names by position, then uses names

Field operations match by column name

Position-Based Operations (Specific Cases)

However, there are cases where position matters:

Positional table access: Table.__getitem__ supports t[0] for first column, t[1] for second, etc.

Positional joins: There's a "positional" join kind mentioned in JoinKind that joins tables by row position rather than column values. See test_positional_join.

Schema comparison: Schema.equals() explicitly states that "The order of fields in the schema is taken into account when computing equality."

Column insertion during insert(): In SQLBackend._build_insert_from_table, columns are matched by position when source columns are not a subset of target columns.

Info operations: Table.info() includes a pos field tracking column position.

So the answer is: primarily name-based, but position is significant for schema equality, positional joins, and some insertion scenarios.
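To make the footgun concrete, here is a simplified sketch of the column-selection behavior as described in this thread (source columns are used when they are a subset of the target's, otherwise the target's columns are used and values line up by position), next to the stricter alternative proposed above. Both function names are hypothetical and this is a paraphrase, not the actual Ibis code.

```python
def columns_for_insert(source_cols: list[str], target_cols: list[str]) -> list[str]:
    """Current behavior as described: fall back to positional matching."""
    if set(source_cols) <= set(target_cols):
        return list(source_cols)  # matched by name
    return list(target_cols)  # matched by position


def strict_columns_for_insert(source_cols: list[str], target_cols: list[str]) -> list[str]:
    """Proposed behavior: require the source columns to be a subset of the target's."""
    unknown = [c for c in source_cols if c not in target_cols]
    if unknown:
        raise ValueError(f"source columns not in target: {unknown}")
    return list(source_cols)


target = ["id", "name"]
# A renamed source column silently switches to positional matching,
# so the source's first column ("full_name") would land in "id".
print(columns_for_insert(["full_name", "id"], target))
```

With the strict variant, the same call raises instead of silently writing values into the wrong columns.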

github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Sep 16, 2025

deepyaman mentioned this pull request Sep 16, 2025

feat(deps): support duckdb 1.4.0 #11622

Merged

deepyaman added the feature Features or general enhancements label Sep 16, 2025

github-actions bot added the oracle The Oracle backend label Sep 17, 2025

deepyaman force-pushed the feat/api/backend-upsert branch from 5a27b34 to 21994eb Compare September 17, 2025 23:14

deepyaman mentioned this pull request Sep 18, 2025

Merge SQL generation injects AS into USING clause tobymao/sqlglot#5910

Closed

deepyaman requested a review from cpcloud September 18, 2025 00:52

github-actions bot added the mssql The Microsoft SQL Server backend label Sep 18, 2025

deepyaman force-pushed the feat/api/backend-upsert branch from 46b0e4a to 8e47b79 Compare September 18, 2025 04:54

deepyaman force-pushed the feat/api/backend-upsert branch from 2e1403d to b40103e Compare September 19, 2025 12:57

deepyaman force-pushed the feat/api/backend-upsert branch from 25061ad to d052b43 Compare September 20, 2025 16:45

github-actions bot added the dependencies Issues or PRs related to dependencies label Sep 20, 2025

deepyaman mentioned this pull request Sep 20, 2025

chore(deps): upgrade DuckDB and SQLGlot dependency #11636

Closed

deepyaman force-pushed the feat/api/backend-upsert branch 4 times, most recently from d2b5922 to 8b0d6fe Compare September 21, 2025 02:57

github-actions bot added the bigquery The BigQuery backend label Sep 21, 2025

deepyaman mentioned this pull request Sep 21, 2025

Support upsert operations for SQL datasets kedro-org/kedro#5090

Open

deepyaman force-pushed the feat/api/backend-upsert branch from adc70e7 to ffc8125 Compare October 7, 2025 07:22