Feature: MERGE/Upsert Support #1534

Open
wants to merge 11 commits into
base: main


mattmartin14

Hi,

This is my first PR to the pyiceberg project, so I hope it is well received, and I’m open to any feedback. The pyiceberg community has long asked for the ability to perform an “upsert” on an Iceberg table, a.k.a. the MERGE command. There is a lot to unpack in this PR, so I’ll start with some high-level items and then go into more detail.

For basic terminology, an upsert/merge is where you take a new dataset and merge it into the target Iceberg table by updating the target rows that have changed and inserting the rows that are new. This is all performed atomically as one transaction, meaning the update and the insert either succeed together or fail together; this ensures the table is not left in an unknown state if one action succeeds but the other errors out.
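To make the semantics concrete, here is a minimal pure-Python sketch of the basic upsert pattern described above. It is not the code in this PR: the `upsert` function, the list-of-dicts table representation, and the `rows_updated`/`rows_inserted` counters are hypothetical names chosen for illustration. The all-or-nothing behavior is modeled by building a new table and never mutating the input.

```python
def upsert(target, source, key):
    """Return (new_table, stats) with `source` merged into `target`.

    Rows are matched on `key`. Matching rows that differ are updated,
    unmatched source rows are inserted, identical rows are left alone.
    `target` itself is never mutated, so an exception part-way through
    leaves the original table untouched (all-or-nothing).
    """
    merged = {row[key]: dict(row) for row in target}  # work on a copy
    updated = inserted = 0
    for row in source:
        k = row[key]
        if k not in merged:
            merged[k] = dict(row)   # new key -> insert
            inserted += 1
        elif merged[k] != row:
            merged[k] = dict(row)   # existing key, changed values -> update
            updated += 1
    stats = {"rows_updated": updated, "rows_inserted": inserted}
    return list(merged.values()), stats


target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
source = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}, {"id": 1, "name": "a"}]
new_table, stats = upsert(target, source, "id")
print(stats)
```

Here row 2 is updated, row 3 is inserted, and row 1 is left alone because it is unchanged.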

Over the last decade, the ANSI SQL MERGE command has evolved a lot to handle more than just updating existing rows and inserting new rows. This PR aims to cover just the BASIC upsert pattern; it has been architected so that more of the ANSI MERGE command features, such as “WHEN NOT MATCHED BY SOURCE THEN DELETE”, can be added over time (via the merge_options parameter) if others would like to contribute.

To process an upsert on an Iceberg table efficiently, an engine is required to detect which rows have changed and which rows are new. The engine I have chosen for this new feature is datafusion, a Rust-based, high-performance engine for processing Arrow data. I realize this introduces a new dependency for the pyiceberg project, which I've flagged as optional in the pyproject.toml file. My rationale for choosing datafusion is that it is the same library used by the iceberg-rust project. Thus, down the road, migrating this merge command to the more modern iceberg-rust implementation should take minimal effort.
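The detection step above boils down to two joins. The sketch below illustrates that logic with Python's built-in `sqlite3` module purely as a stand-in SQL engine; it does not use datafusion and is not the code in this PR. The table and column names (`target`, `source`, `order_id`, `status`) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (order_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE source (order_id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO target VALUES (1, 'open'), (2, 'open'), (3, 'closed');
    INSERT INTO source VALUES (2, 'closed'), (3, 'closed'), (4, 'open');
""")

# Keys present in both tables whose non-key columns differ -> updates.
# (A real implementation would also need NULL-safe comparisons.)
rows_to_update = conn.execute("""
    SELECT s.order_id, s.status
    FROM source AS s
    JOIN target AS t ON s.order_id = t.order_id
    WHERE s.status <> t.status
    ORDER BY s.order_id
""").fetchall()

# Source keys with no match in the target -> inserts.
rows_to_insert = conn.execute("""
    SELECT s.order_id, s.status
    FROM source AS s
    LEFT JOIN target AS t ON s.order_id = t.order_id
    WHERE t.order_id IS NULL
    ORDER BY s.order_id
""").fetchall()
conn.close()

print(rows_to_update, rows_to_insert)
```

Order 3 appears in both tables with identical values, so it lands in neither result set and would not be rewritten.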

As far as test coverage is concerned, my unit tests cover upserts on both single key and composite key Iceberg tables (a composite key meaning the table has more than one key field it needs to join on). On the performance side, single key tables scale just fine. One of my unit tests does a 10k-row update/insert, and on my local workstation (Mac M2 Pro) I have also run a test of 500k rows of updates/inserts on a single key join, which scaled just fine.

Where I am hitting a wall on performance, and where I need others’ help, is composite keys. The composite key code builds an overwrite filter, and once that filter gets too long (in my testing, more than 200 rows), the visitor “OR” function in pyiceberg hits a recursion depth error. I took a hard look at the visitor code in the expressions folder, but I was hesitant to change any of those functions because I don't have a clear understanding of how they work. If you want to see this scaling concern, simply update the parameters on my test_merge_scenario_3_composite_key function to generate a target table with 1000 rows and a source table with 2000 rows, and you will see the error surface. I don't think it's smart or practical to raise Python's default recursion limit, because we would still hit the wall at some point unless the visitor "OR" function is reworked to avoid recursion.
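The shape of the problem can be reproduced without pyiceberg. The sketch below assumes the filter is a binary tree of OR nodes built by folding one predicate per row into a left-deep chain; `KeyEquals`, `Or`, `chain_or`, `balanced_or`, and `count_leaves_recursive` are hypothetical stand-ins, not pyiceberg classes or functions. It shows that a naive recursive visitor overflows on the chain, while the same predicates arranged as a balanced tree stay shallow. Whether a balanced tree would be enough for pyiceberg's real visitors is untested here.

```python
import sys
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class KeyEquals:
    """Hypothetical leaf: all composite-key columns equal these values."""
    values: tuple


@dataclass(frozen=True)
class Or:
    """Hypothetical binary OR node."""
    left: object
    right: object


def chain_or(preds):
    """Left-deep chain Or(Or(Or(p0, p1), p2), ...): depth grows linearly."""
    return reduce(Or, preds)


def balanced_or(preds):
    """Balanced tree: halve the list at each level, depth grows ~log2(n)."""
    if len(preds) == 1:
        return preds[0]
    mid = len(preds) // 2
    return Or(balanced_or(preds[:mid]), balanced_or(preds[mid:]))


def depth(expr):
    """Tree depth via an explicit stack, so measuring cannot overflow."""
    deepest, stack = 0, [(expr, 1)]
    while stack:
        node, d = stack.pop()
        deepest = max(deepest, d)
        if isinstance(node, Or):
            stack.append((node.left, d + 1))
            stack.append((node.right, d + 1))
    return deepest


def count_leaves_recursive(expr):
    """Naive recursive visitor: one Python frame per level of the tree."""
    if isinstance(expr, Or):
        return count_leaves_recursive(expr.left) + count_leaves_recursive(expr.right)
    return 1


# Use more rows than the interpreter's recursion limit allows frames.
n = sys.getrecursionlimit() * 2
preds = [KeyEquals((i, f"region-{i}")) for i in range(n)]
chained, balanced = chain_or(preds), balanced_or(preds)

try:
    count_leaves_recursive(chained)
    chained_ok = True
except RecursionError:
    chained_ok = False

print("chain depth:", depth(chained), "visited ok:", chained_ok)
print("balanced depth:", depth(balanced), "leaves:", count_leaves_recursive(balanced))
```

Both trees describe the same filter; only the nesting differs, which is why raising the recursion limit only moves the wall while changing the tree shape or the traversal removes it.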

Another idea I've kicked around to mitigate this is to have the merge_rows code temporarily modify the source dataframe and the target Iceberg table to build a single concatenated key from all the composite key columns to join on, e.g. "col1_col2_col[n]...". But that would require rewriting the entire target Iceberg table with a full overwrite, which defeats the purpose of an upsert: running incremental updates on a table without having to overwrite the whole thing.

I think this PR is a good first step toward finally realizing MERGE/upsert support for pyiceberg and gets a foot in the door. Again, I welcome others’ feedback on this topic and look forward to partnering with you all on this journey.

Thanks,
Matt Martin

@kevinjqliu kevinjqliu self-requested a review January 17, 2025 18:08
@mattmartin14
Author

@kevinjqliu - I'm doing this work on behalf of my company, and when I ran my tests I used a standard Python virtual environment (venv); I haven't quite figured out yet how to get poetry to work behind my company's firewall. So I'm not sure if those are errors I can address or if someone else can pitch in here.

@Fokko Fokko self-requested a review January 17, 2025 19:20
@bitsondatadev
Contributor

bitsondatadev commented Jan 17, 2025

> @kevinjqliu - I'm doing this work on behalf of my company, and when I ran my tests I used a standard Python virtual environment (venv); I haven't quite figured out yet how to get poetry to work behind my company's firewall. So I'm not sure if those are errors I can address or if someone else can pitch in here.

@mattmartin14, what's going on man!? Thanks for working on this and most impressively thanks for the comprehensive description.

Out of curiosity, did you discuss your approach with anyone before putting this together? This is good, but here are a few flags for OSS contributions to reduce the upfront back and forth:

  1. My main concern is that you're introducing datafusion to get this feature out, and adding dependencies is generally avoided until absolutely necessary (i.e. the new dependency serves many use cases or won't cause a lot of maintenance). If you haven't, I recommend informally discussing this with @Fokko or @kevinjqliu to see if datafusion is already on the roadmap or can be considered. Likely the conversation can finish with them, but they may suggest asking on the mailing list to document and discuss it with more folks. There may already be ways of doing this using the existing library without adding another dependency.

  2. As you've noticed the build is breaking with the poetry issue and it sounds like this is happening on a work laptop. I recommend working with your company to allow you to work on OSS code outside of a company device as I'm familiar with how locked down their system is and it will be difficult to contribute behind the corporate VPN. I did this for contributions where I submitted the code to my repository on my OSS account (we weren't required to have company accounts so this worked out anyways). This would require a review to ensure sensitive information wasn't leaked. Once I submitted the code to my branch, I was authorized to work on that code on my personal laptop (generally off hours to get things polished for merging).

  3. For Apache projects, you have to add a license header to each file you add, and these files are missing one:

     !????? /home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/merge_rows_util.py
     !????? /home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/test.py
     !????? /home/runner/work/iceberg-python/iceberg-python/tests/table/test_merge_rows.py
    
  4. There are bits of dead code scattered about; I'm not sure if that's for discussion or a todo. I would either remove it or comment on it when this PR is ready for review.

  5. > Where I am hitting a wall on performance

    Aside from total disregard for it, performance should be pretty low on the list of concerns. This was also a new paradigm for me when I started submitting to open source. You're not writing code for a corporate team-sized set of eyes; you're writing it for many people to evaluate. I always recommend doing the bare minimum unless the performant version adds only ten or twenty lines of code. Focus on correctness and edge cases first. Second, focus on minimalism and clarity. There are 500+ lines of code that people have to get their heads around. That can be justified, but it can often be broken up to lower the burden on those reviewing the code and to lower the chances you'll sneak a bug in unintentionally.
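For point 3 in the list above, a quick local check can catch missing headers before pushing. This is a minimal sketch, not the check the project's CI runs: the `missing_license_header` helper and its parameters are hypothetical, and it only looks for the first sentence of the standard ASF source header near the top of each file.

```python
import tempfile
from pathlib import Path

# First sentence of the standard ASF source header.
ASF_MARKER = "Licensed to the Apache Software Foundation (ASF)"


def missing_license_header(root, pattern="*.py", head_lines=5):
    """Return names of files under `root` whose first lines lack the marker."""
    missing = []
    for path in sorted(Path(root).rglob(pattern)):
        with path.open(encoding="utf-8") as f:
            head = "".join(line for _, line in zip(range(head_lines), f))
        if ASF_MARKER not in head:
            missing.append(path.name)
    return missing


# Demo on a throwaway directory with one licensed and one unlicensed file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "licensed.py").write_text(
        f"# {ASF_MARKER} under one\n# or more contributor license agreements.\nx = 1\n",
        encoding="utf-8",
    )
    Path(tmp, "unlicensed.py").write_text("x = 1\n", encoding="utf-8")
    result = missing_license_header(tmp)

print(result)
```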

Contributing to OSS has a different focus than internal code, so hopefully these help. This does look well thought out in terms of implementation, but performance should come on the second or third pass, in favor of having a code history that everyone in the community can wrap their heads around.

I'd suggest getting these addressed before Fokko and Kevin scan it. I'll be happy to take a quick glance once the tests are running and there's some consensus around datafusion.

PR number one yeah!

@mattmartin14
Author

Thanks @bitsondatadev for all this great feedback. I'll get working on your suggestions and push an update next week and will address all your concerns.

@kevinjqliu
Contributor

Thanks @mattmartin14 for the PR! And thanks @bitsondatadev for the tips on working in OSS. I certainly had to learn a lot of these over the years.

A couple things I think we can address first.

  1. Support for MERGE INTO / Upsert

This has been a much-anticipated and frequently requested feature in the community. Issue #402 has been tracking it with many eyes on it. I think we still need to figure out the best approach to support this feature.

Like you mentioned in the description, MERGE INTO is a query engine feature. Pyiceberg itself is a client library to support the Iceberg python ecosystem. Pyiceberg aims to provide the necessary Iceberg building blocks so that other engines/programs can interact with Iceberg tables easily.

As we’re building out more and more engine-like features, it becomes harder to support more complex and data-intensive workloads such as MERGE INTO. We have been able to use pyarrow for query processing, but it has its own limitations. For more compute-intensive workloads, such as the Bucket and Truncate transforms, we were able to leverage Rust (iceberg-rust) to handle the computation.

Looking at #402, I don’t see any concrete plans on how we can support MERGE INTO. I’ve added this as an agenda item for the monthly pyiceberg sync and will post the update. Please join us if you have time!

  2. Taking on Datafusion as a dependency

I’m very interested in exploring datafusion and ways we can leverage it for this project. As I mentioned above, we currently use pyarrow to handle most of the compute. It’ll be interesting to evaluate datafusion as an alternative. Datafusion has its own ecosystem of expression API, dataframe API, and runtime, all of which are good complements to pyiceberg. It has integrations with the Rust side as well, something I have started exploring in apache/iceberg-rust#865.

That said, I think we need a wider discussion and alignment on how to integrate with datafusion. It’s a good time to start thinking about it! I’ve added this as another discussion item on the monthly sync.

  3. Performance concerns

Compute-intensive workloads are generally a bottleneck in Python. I am excited for future pyiceberg <> iceberg-rust integration where we can leverage Rust to perform those computations.

> The composite key code builds an overwrite filter, and once that filter gets too lengthy (in my testing more than 200 rows), the visitor “OR” function in pyiceberg hits a recursion depth error.

This is an interesting observation, and I think I’ve seen someone else run into this issue before. We’d want to address this separately. We might want to explore using datafusion’s expression API to replace our own parser.

@mattmartin14
Author

@kevinjqliu @Fokko @bitsondatadev - the issues should be resolved. I got poetry working behind my company's firewall; I've also removed the dead code and added the license headers to each file. Please take a look.

@mattmartin14
Author

Also, I added datafusion to the poetry toml file and lock file, and it appears that you all need to resolve the conflict here, as it's not letting me.

@mattmartin14
Author

Also @kevinjqliu - to address your question on datafusion: when I looked into this feature, I explored these 3 options for an Arrow processing engine:

  1. Duckdb
  2. Datafusion
  3. Daft

I ultimately decided that datafusion would make the most sense, given what it had going for it:

  • It's already owned by the Apache Software Foundation, so licensing would be a non-issue
  • It's very lightweight and specifically designed to process and query Arrow tables
  • It's Rust-based, and if pyiceberg is ultimately going to be migrated to iceberg-rust one day, the integrations would be easier
  • The iceberg rust project is already building integrations for it, as seen here.

Hope this helps explain how I arrived at that conclusion. Using just native pyarrow to process the data would be a very large uphill battle, as we would effectively have to build our own data processing engine with it, e.g. hash joins, sorting, optimizations, etc. I figured it does not make sense to reinvent the wheel and instead chose to put an engine that is already out there (datafusion) to good use.

I took a look at the attachment you posted for any upcoming meetings for the pyiceberg sync, but did not see any 2025 meetings listed. I'd be glad to attend to discuss this further, if needed.

Thanks,
Matt

3 participants