Skip to content

Conversation

deepyaman
Copy link
Contributor

@deepyaman deepyaman commented Sep 16, 2025

Description of changes

Implement Backend.upsert() using sqlglot.expressions.merge() under the hood. Upsert support is very important, especially for data engineering use cases.

Starting with a most basic implementation, including only supporting one join column. I think this could be expanded to support a list without much effort.

MERGE INTO support is limited. DuckDB only added support for MERGE statements earlier today in 1.4.0, and many other backends don't support it. However, it seems like the more standard/correct approach for supporting upserts, and it doesn't require merge keys defined ahead of time on tables.

Backends that work:

  • DuckDB (from 1.4.0)
  • Flink
  • Oracle (currently using a hack to work around "AS" getting added to MERGE statement)
  • MS SQL (currently throwing a ; onto the end of every statement 😅)
  • Postgres

Should work, need help to test:

  • Databricks
  • Snowflake
  • BigQuery

Backends that don't work:

  • PySpark ("MERGE INTO TABLE is not supported temporarily.")
  • Clickhouse
  • DataFusion
  • SQLite (supports the nonstandard UPSERT statement)
  • Impala ("The MERGE statement is only supported for Iceberg tables.")
  • MySQL
  • Polars
  • RisingWave
  • Athena ("MERGE INTO is transactional and is supported only for Apache Iceberg tables in Athena engine version 3.")
  • Trino ("connector does not support modifying table rows")

Issues closed

@github-actions github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Sep 16, 2025
@deepyaman deepyaman added the feature Features or general enhancements label Sep 16, 2025
@github-actions github-actions bot added the oracle The Oracle backend label Sep 17, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 5a27b34 to 21994eb Compare September 17, 2025 23:14
@deepyaman deepyaman requested a review from cpcloud September 18, 2025 00:52
@deepyaman
Copy link
Contributor Author

@cpcloud Requesting review for initial feedback while I try to improve backend support; I assume I could always mark a lot of them notyet if the approach is correct but some translations just need work.

@github-actions github-actions bot added the mssql The Microsoft SQL Server backend label Sep 18, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 46b0e4a to 8e47b79 Compare September 18, 2025 04:54
query = query.sql(self.dialect)

if "MERGE" in query:
query = f"{query};"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm honestly not sure how to better do this; I wasn't able to figure out how to add a semicolon to an expression in SQLGlot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't ideal but I think is fine with me. The "cleaner" way would be to override the _build_upsert_from_table method, but the amount of boilerplate for that feels not worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually, is it not possible to just always stick on a ; at the end of _build_upsert_from_table() for every backend? or does that break on some backends?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually, is it not possible to just always stick on a ; at the end of _build_upsert_from_table() for every backend?

I don't actually know how to stick a semicolon on the end of a SQLGlot expression. 😅 If _build_upsert_from_table() was handling the conversion to SQL, that would have been simple enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I understand.

huh, is this a bug in the upstream mssql engine? Can you confirm that only merge statements require trailing semicolons, but other sql queries/statements do not?

@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 2e1403d to b40103e Compare September 19, 2025 12:57
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 25061ad to d052b43 Compare September 20, 2025 16:45
@github-actions github-actions bot added the dependencies Issues or PRs related to dependencies label Sep 20, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch 4 times, most recently from d2b5922 to 8b0d6fe Compare September 21, 2025 02:57
@github-actions github-actions bot added the bigquery The BigQuery backend label Sep 21, 2025
@deepyaman
Copy link
Contributor Author

@cpcloud I'm temporarily cherry-picked the changes from #11636 and the updates to xfail/xpass two DuckDB tests in order to get DuckDB and Oracle tests working (need the newer DuckDB and SQLGlot releases). All of these changes are in the last 3 commits:

With that, all of the functionality to implement Backend.upsert() is working and ready for review. The remaining issues all also exist in #11637, so I won't duplicate solving them. Whenever you do merge #11637, I should be able to back out the last 3 commits and rebase this on top.

@deepyaman
Copy link
Contributor Author

@cpcloud I'm temporarily cherry-picked the changes from #11636 and the updates to xfail/xpass two DuckDB tests in order to get DuckDB and Oracle tests working (need the newer DuckDB and SQLGlot releases). All of these changes are in the last 3 commits:

With that, all of the functionality to implement Backend.upsert() is working and ready for review. The remaining issues all also exist in #11637, so I won't duplicate solving them. Whenever you do merge #11637, I should be able to back out the last 3 commits and rebase this on top.

@cpcloud FYI I've done this and everything is passing! Should be ready to go.

from_table = con.table(employee_data_3_temp_table)
df1 = temporary.execute().set_index("first_name")

con.upsert(employee_data_1_temp_table, obj=from_table, on="first_name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an xfail test for on="bogus_column"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? Seems like an unconventional use of xfail.

I think it would make more sense to test if we were explicitly testing whether the on value was in the set of available columns, but I don't see a similar test for something like join predicates, so I feel like it should be fine to leave it to the backend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was the confusion, I guess Ireally meant with pytest.raises(): not an actual pytest.mark.xfail.

But, I think I'm ok with not testing for this behavior if you don't want to (leaving it as undefined behavior). As long as we test all the explicitly-supported behaviors I'm happy.

@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch 3 times, most recently from 378bbb8 to 06ca69a Compare October 15, 2025 03:07
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 06ca69a to 712a8d7 Compare October 15, 2025 03:09
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 9224992 to a1c5406 Compare October 16, 2025 03:00
@deepyaman
Copy link
Contributor Author

@NickCrews Addressed all your comments (except didn't add an xfail test, but replied to your suggestion). Please feel free to take another look!

compiler = self.compiler
quoted = compiler.quoted

columns = self._get_columns_to_insert(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming an existing table with columns {i: int64, s: string, f: float64}, can you add a tests for upserting objects (using condition i=i) with schemas

  • {i: int64, s: string, f: float64} (works)
  • {s: !string, f: float32, i: uint8} (different order and flavors, but dtypes still compatible, works)
  • {i: int64} (success, but nothing is updated)
  • {i: int64, s: string} (only s is updated)
  • {s: string} (error, i not present)
  • {i:int64, b: boolean} (error, b is not in dest table)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, depending on how the "should we fallback to positional ordering" decision goes, we should add tests for that too. But I think all the above tests should still be valid regardless of what we decide there.

@NickCrews
Copy link
Contributor

Thanks @deepyaman, those tweaks all look great. I came up with several more test cases I'd love to see. Sorry for the continually moving target.

target_cols = self.get_schema(target, catalog=catalog, database=db).keys()

columns = (
return (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, I see what you mean how you are inheriting this logic from insert. I think it is essential that they use the same logic.

But, I think this is a footgun waiting to happen. I think we should make a breaking change to .insert() and require that source be a subset of target. What do you think of this change @cpcloud ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is anywhere else in ibis where we rely on positional ordering of columns, is there? I think we should keep that in mind, if we make this change here, then I think we should wipe out all other instances of relying on positional ordering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking copilot Is there anywhere in this codebase where we rely on the position of columns in tables/relations? Or do we always match by column name? gives:

Looking at the codebase, Ibis primarily relies on column names rather than positions, but there are some specific cases where positional ordering is used:

Column Name-Based Operations (Majority)

Most operations use column names:

  • Table.join() matches columns by name (e.g., "movieId")
  • Table.select() references columns by name
  • Schema.name_at_position() exists but is used to look up names by position, then uses names
  • Field operations match by column name

Position-Based Operations (Specific Cases)

However, there are cases where position matters:

  1. Positional table access: Table.__getitem__ supports t[0] for first column, t[1] for second, etc.

  2. Positional joins: There's a "positional" join kind mentioned in JoinKind that joins tables by row position rather than column values. See test_positional_join.

  3. Schema comparison: Schema.equals() explicitly states that "The order of fields in the schema is taken into account when computing equality."

  4. Column insertion during insert(): In SQLBackend._build_insert_from_table, columns are matched by position when source columns are not a subset of target columns.

  5. Info operations: Table.info() includes a pos field tracking column position.

So the answer is: primarily name-based, but position is significant for schema equality, positional joins, and some insertion scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bigquery The BigQuery backend dependencies Issues or PRs related to dependencies feature Features or general enhancements mssql The Microsoft SQL Server backend oracle The Oracle backend sql Backends that generate SQL tests Issues or PRs related to tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Backend.upsert

3 participants