feat(api): implement upsert() using MERGE INTO #11624
Conversation
Force-pushed from 5a27b34 to 21994eb
@cpcloud Requesting review for initial feedback while I try to improve backend support; I assume I could always mark a lot of them …
Force-pushed from 46b0e4a to 8e47b79
query = query.sql(self.dialect)

if "MERGE" in query:
    query = f"{query};"
I'm honestly not sure how to do this better; I wasn't able to figure out how to add a semicolon to an expression in SQLGlot.
This isn't ideal, but I think it's fine with me. The "cleaner" way would be to override the _build_upsert_from_table method, but the amount of boilerplate for that doesn't feel worth it.
Or actually, is it not possible to just always stick a ; on the end of _build_upsert_from_table() for every backend? Or does that break on some backends?
> Or actually, is it not possible to just always stick a ; on the end of _build_upsert_from_table() for every backend?

I don't actually know how to stick a semicolon on the end of a SQLGlot expression. 😅 If _build_upsert_from_table() was handling the conversion to SQL, that would have been simple enough.
Ah, I understand.
Huh, is this a bug in the upstream MSSQL engine? Can you confirm that only MERGE statements require trailing semicolons, but other SQL queries/statements do not?
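One possible tweak, just a sketch and untested against this branch: since query is still a sqlglot expression before .sql() is called, the check could be done on the expression type instead of substring-matching the rendered SQL, roughly like this:

```python
import sqlglot.expressions as sge


def render_sql(query: sge.Expression, dialect: str) -> str:
    # Sketch only: `query` and `dialect` correspond to `query` and
    # `self.dialect` in the snippet above.
    sql = query.sql(dialect)
    if isinstance(query, sge.Merge):
        # T-SQL requires MERGE statements to be terminated with a semicolon.
        sql += ";"
    return sql
```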
Force-pushed from 2e1403d to b40103e
Force-pushed from 25061ad to d052b43
Force-pushed from d2b5922 to 8b0d6fe
@cpcloud I've temporarily cherry-picked the changes from #11636 and the updates to …. With that, all of the functionality to implement …
Force-pushed from adc70e7 to ffc8125
@cpcloud FYI I've done this and everything is passing! Should be ready to go.
from_table = con.table(employee_data_3_temp_table)
df1 = temporary.execute().set_index("first_name")

con.upsert(employee_data_1_temp_table, obj=from_table, on="first_name")
Can you add an xfail test for on="bogus_column"?
Is this necessary? Seems like an unconventional use of xfail.
I think it would make more sense to test this if we were explicitly checking whether the on value was in the set of available columns, but I don't see a similar test for something like join predicates, so I feel like it should be fine to leave it to the backend.
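(If we did decide to validate it explicitly, the check would presumably be something tiny like the sketch below; the function name, error type, and message are just placeholders.)

```python
def _validate_upsert_key(obj, on: str) -> None:
    # Sketch only: `obj` is the source table expression and `on` is the
    # upsert key column name; raise early if the key isn't a source column.
    if on not in obj.columns:
        raise ValueError(f"upsert key {on!r} is not a column of the source table")
```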
If this was the confusion, I guess I really meant with pytest.raises(), not an actual pytest.mark.xfail.

But I think I'm OK with not testing for this behavior if you don't want to (leaving it as undefined behavior). As long as we test all the explicitly supported behaviors, I'm happy.
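In case it helps, this is roughly what I had in mind (a sketch only; the fixture names mirror the test snippet above, and the exact exception type is backend-dependent):

```python
import pytest


def test_upsert_invalid_on_column(con, employee_data_1_temp_table, employee_data_3_temp_table):
    from_table = con.table(employee_data_3_temp_table)
    # Sketch: only assert that *some* error is raised for a bogus key;
    # the concrete exception type varies by backend.
    with pytest.raises(Exception):
        con.upsert(employee_data_1_temp_table, obj=from_table, on="bogus_column")
```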
Force-pushed from 378bbb8 to 06ca69a
Force-pushed from 06ca69a to 712a8d7
Force-pushed from 9224992 to a1c5406
@NickCrews Addressed all your comments (except I didn't add an xfail test; I replied to your suggestion instead). Please feel free to take another look!
compiler = self.compiler
quoted = compiler.quoted

columns = self._get_columns_to_insert(
Assuming an existing table with columns {i: int64, s: string, f: float64}, can you add tests for upserting objects (using condition i=i) with schemas
- {i: int64, s: string, f: float64} (works)
- {s: !string, f: float32, i: uint8} (different order and flavors, but dtypes still compatible, works)
- {i: int64} (success, but nothing is updated)
- {i: int64, s: string} (only s is updated)
- {s: string} (error, i not present)
- {i:int64, b: boolean} (error, b is not in dest table)
Then, depending on how the "should we fall back to positional ordering" decision goes, we should add tests for that too. But I think all the above tests should still be valid regardless of what we decide there.
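To make those cases concrete, here is a rough sketch of how they could be parametrized (the target_table_name fixture is made up, the dtype "flavor" variations are simplified away, and the expected exception types will vary by backend):

```python
import pytest

import ibis


@pytest.mark.parametrize(
    ("source_data", "ok"),
    [
        ({"i": [1], "s": ["a"], "f": [1.0]}, True),   # same schema
        ({"s": ["a"], "f": [1.0], "i": [1]}, True),   # different column order
        ({"i": [1]}, True),                           # only the key: nothing updated
        ({"i": [1], "s": ["a"]}, True),               # only s updated
        ({"s": ["a"]}, False),                        # error: i not present
        ({"i": [1], "b": [True]}, False),             # error: b not in target table
    ],
)
def test_upsert_schema_compat(con, target_table_name, source_data, ok):
    # Sketch only: `target_table_name` is a hypothetical fixture naming a
    # table with schema {i: int64, s: string, f: float64}.
    source = ibis.memtable(source_data)
    if ok:
        con.upsert(target_table_name, obj=source, on="i")
    else:
        with pytest.raises(Exception):
            con.upsert(target_table_name, obj=source, on="i")
```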
Thanks @deepyaman, those tweaks all look great. I came up with several more test cases I'd love to see. Sorry for the continually moving target.
target_cols = self.get_schema(target, catalog=catalog, database=db).keys()

-    columns = (
+    return (
Oof, I see what you mean about how you are inheriting this logic from insert. I think it is essential that they use the same logic.

But I think this is a footgun waiting to happen. I think we should make a breaking change to .insert() and require that source be a subset of target. What do you think of this change, @cpcloud?
I don't think there is anywhere else in ibis where we rely on positional ordering of columns, is there? I think we should keep that in mind: if we make this change here, then we should wipe out all other instances of relying on positional ordering.
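For concreteness, the behavior being proposed would look roughly like this (a simplified sketch, not the actual _get_columns_to_insert on this branch):

```python
def columns_to_write(source_cols: list[str], target_cols: list[str]) -> list[str]:
    # Sketch of the proposed rule: match strictly by name and require the
    # source columns to be a subset of the target columns, instead of
    # silently falling back to positional ordering.
    missing = set(source_cols) - set(target_cols)
    if missing:
        raise ValueError(
            f"source columns {sorted(missing)} do not exist in the target table"
        )
    return list(source_cols)
```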
Asking Copilot "Is there anywhere in this codebase where we rely on the position of columns in tables/relations? Or do we always match by column name?" gives:
Looking at the codebase, Ibis primarily relies on column names rather than positions, but there are some specific cases where positional ordering is used:

Column Name-Based Operations (Majority)
Most operations use column names:
- Table.join() matches columns by name (e.g., "movieId")
- Table.select() references columns by name
- Schema.name_at_position() exists but is used to look up names by position, then uses names
- Field operations match by column name

Position-Based Operations (Specific Cases)
However, there are cases where position matters:
- Positional table access: Table.__getitem__ supports t[0] for the first column, t[1] for the second, etc.
- Positional joins: there's a "positional" join kind mentioned in JoinKind that joins tables by row position rather than column values. See test_positional_join.
- Schema comparison: Schema.equals() explicitly states that "The order of fields in the schema is taken into account when computing equality."
- Column insertion during insert(): in SQLBackend._build_insert_from_table, columns are matched by position when source columns are not a subset of target columns.
- Info operations: Table.info() includes a pos field tracking column position.

So the answer is: primarily name-based, but position is significant for schema equality, positional joins, and some insertion scenarios.
Description of changes

Implement Backend.upsert() using sqlglot.expressions.merge() under the hood. Upsert support is very important, especially for data engineering use cases.

Starting with the most basic implementation, only supporting one join column. I think this could be expanded to support a list without much effort.

MERGE INTO support is limited. DuckDB only added support for MERGE statements earlier today in 1.4.0, and many other backends don't support it. However, it seems like the more standard/correct approach for supporting upserts, and it doesn't require merge keys defined ahead of time on tables.

Backends that work:
- … (currently using a hack to work around "AS" getting added to the MERGE statement)
- … (currently using a hack to stick a ; onto the end of every statement 😅)

Should work, need help to test:
- …

Backends that don't work:
- … ("MERGE statement is only supported for Iceberg tables.")
- … ("MERGE INTO is transactional and is supported only for Apache Iceberg tables in Athena engine version 3.")
Issues closed
- Backend.upsert #5391