-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REPLACE TABLE Support #281
Comments
@syun64 thanks for raising this. With the freshly merged write support, we can do: cat = load_catalog('default')
tbl = cat.load_table('default.some_table')
tbl.overwrite(df: pa.Table) The overwrite will create a new snapshot with the existing schema. This would only replace the data, and keep the metadata. Would that work for you? The next step would be to automatically evolve the schema as well (with flags for compatible/incompatible changes). |
@Fokko evolving schema would be nice, also would it be possible to evolve partitions? e.g. after a specific overwrite I want to have schema evolution and possibility to replace the table adding partitions to specific columns. |
Thank you for the great points @Fokko and @nicor88 . Just like @nicor88 mentioned, I think RTAS will be slightly different from overwrite in the sense that the schema, the partitioning scheme, sort order or any of the table properties can also be updated atomically with this operation. In short, the function needs to support updating any of the arguments that are currently supported on create_table function, in addition to overwriting the Iceberg table data with the input pyarrow table. I'm wondering if it would be better to have a separate function that achieves these goals in a single transaction?
|
Schema evolution is already possible today. What's missing there is the with table.update_schema() as update:
update.union_by_name_with(new_schema) This will promote the schema to the new schema if compatible. If you also want to allow incompatible changes, you can do: with table.update_schema(allow_incompatible_changes=True) as update:
update.union_by_name_with(new_schema) By making this explicit, you make sure that you don't break the table for the downstream consumers. If you want to add partition columns in a single transaction, you can do: with table.transaction() as tx:
with tx.update_schema() as update:
update.union_by_name_with(new_schema)
with tx.update_spec() as update:
update.add_field("dt_transaction", "transaction", DayTransform()) Note: The updating partitions is still underway (#245), but is expected to be included in 0.6.0
We could do something similar as |
Makes sense @Fokko . Thank you very much for taking the time to lay all these options where a user may have to handle compatible or incompatible schema updates. As you suggested, I think it would make sense to port over the Java code to support the replace operation for the case where we just want to replace the existing table with the new schema and data in a single transaction. I’ll read through the code and take a stab at the PR that will introduce the operation to PyIceberg. |
In order to reduce duplication of code, would it make sense to combine the job of TypeUtil.assignFreshIds with UnionByNameVisitor? They seem to be doing the same task of ensuring that the new schema reuses the field_id of a column that existed in the original schema. The only difference is that TypeUtil.assignFreshIds drops columns that are not in the target schema, where as UnionByNameVisitor unions the original and target schemas. If that sounds like a good idea, #284 will be a prerequisite to building this feature. |
@syun64 I started on #284 today. It re-uses the My suggestion is to create a name-mapping from an Arrow table, that we can use in |
Hi @Fokko - sounds like you beat me to it 😄 Please let me know if you need any additional heavy lifting on #284 . Happy to help as always. The reason I was curious if there's an opportunity to deduplicate code here, is because buildReplacement code in Java also takes Iceberg Schema as an input. It then compares the new updated schema against the existing schema to use the existing ID if the corresponding field name exists, or assign a new ID starting from the next increment from last_column_id of the table. On second thought, I'm wondering if it would actually make sense to extend functionality of the PyArrow Schema Visitor we are planning to implement for #278 and have the schema visitor take the last_column_id and the base Schema as the input so that we assign the existing field ID if it exists, and assign a new field ID that starts from last_column_id. What are your thoughts on this idea? |
There's a PR in progress that will introduce 'REPLACE TABLE' support, but I don't think we've come to a consensus yet on how we would want to support 'AS SELECT' semantics in PyIceberg, and if we want to introduce it at all. If we wanted to, I think we could update all
as an optional parameter and add a snapshot update with full table static overwrite to the new table metadata. Does that sound like a reasonable enhancement? |
@syun64 Yes, that sounds like a great suggestion to me. I think it makes a lot of sense to create a table including a snapshot. I'm not sure if For most catalogs, this will be quite straightforward. For the REST catalog, there is one subtle implementation detail that we want to take into account, and that we need to leverage the staged creation. Details can be found here: https://github.com/apache/iceberg/blob/bb53c3d4e0e27ac6706803c2371793ad2476ae04/open-api/rest-catalog-open-api.yaml#L480-L492 |
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. |
Feature Request / Improvement
SparkCatalog supports "REPLACE TABLE ... AS SELECT" as an atomic operation when using a SparkCatalog.
Similarly, we should introduce support for an atomic RTAS operation in PyIceberg.
Atomic table replacement creates a new snapshot with the results of the SELECT query, but keeps table history.
https://iceberg.apache.org/docs/latest/spark-ddl/#replace-table--as-select
The text was updated successfully, but these errors were encountered: