Introduce assign_fresh_ids flag and allow skipping fresh assignment of IDs on Table creation
#1304
Sorry for the delay @sungwy. This looks good, I left two small comments. Thanks for adding all the tests 👍
@@ -122,32 +122,13 @@ schema = Schema(
    ),
)

from pyiceberg.partitioning import PartitionSpec, PartitionField
Love it, thanks for cleaning this up!
finally got a chance to look over this, sorry for the delay
if assign_fresh_ids:
    fresh_schema = assign_fresh_schema_ids(schema)
    partition_spec = assign_fresh_partition_spec_ids(partition_spec, schema, fresh_schema)
    sort_order = assign_fresh_sort_order_ids(sort_order, schema, fresh_schema)
    schema = fresh_schema
this is where assign_fresh_ids is ultimately used
and this function is called by _create_staged_table and create_table
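For readers following along, here is a rough, self-contained sketch of what the fresh-ID step above amounts to. It does not use PyIceberg: Field, PartitionRef, fresh_fields_sketch and fresh_partitions_sketch are hypothetical stand-ins. It also assumes flat schemas, sequential numbering from 1, and remapping through field names, which may differ from the library's actual behavior.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Field:
    """Toy stand-in for a schema field (hypothetical, not a PyIceberg class)."""
    field_id: int
    name: str


@dataclass(frozen=True)
class PartitionRef:
    """Toy stand-in for a partition field that points at a source field by ID."""
    source_id: int
    name: str


def fresh_fields_sketch(fields: List[Field]) -> List[Field]:
    # Assumption: renumber fields 1..N in declaration order.
    return [replace(f, field_id=i) for i, f in enumerate(fields, start=1)]


def fresh_partitions_sketch(
    partitions: List[PartitionRef], old: List[Field], fresh: List[Field]
) -> List[PartitionRef]:
    # Resolve each old source ID to a field name, then to the new ID,
    # mirroring the (spec, schema, fresh_schema) argument shape above.
    old_id_to_name = {f.field_id: f.name for f in old}
    name_to_new_id = {f.name: f.field_id for f in fresh}
    return [
        replace(p, source_id=name_to_new_id[old_id_to_name[p.source_id]])
        for p in partitions
    ]


fields = [Field(10, "id"), Field(20, "ts")]
partitions = [PartitionRef(source_id=20, name="ts_day")]

fresh_fields = fresh_fields_sketch(fields)
fresh_partitions = fresh_partitions_sketch(partitions, fields, fresh_fields)
```

The point of the sketch is that the schema, partition spec, and sort order have to be renumbered together: once field IDs change, anything that references a field by ID must be re-pointed in the same step.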
both functions take schema: Union[Schema, "pa.Schema"] as input.
- If pa.Schema is given, we want to convert and assign IDs (this is currently done by setting the assign_fresh_ids flag to True)
- If Schema is given, currently the default is to assign the schema IDs, assign_fresh_ids: bool = True.
My proposal is to not include assign_fresh_ids as a flag in functions other than new_table_metadata.
So when _create_staged_table and create_table are given
- a pa.Schema, convert to Schema and set assign_fresh_ids to True in new_table_metadata
- a Schema, assume the user created the Schema with the correct IDs (possibly verify some correctness characteristics such as uniqueness) and use the schema as is
If a user wants to reassign IDs for a Schema, this can be done outside the create_table functions, and we can even provide a helper function to do so.
I feel like this approach can help separate the responsibility of schema ID assignment from the create_table methods.
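To make the proposal concrete, here is a minimal sketch of the type-based dispatch, written without PyIceberg or PyArrow so it runs on its own. All names are hypothetical: NamesOnlySchema stands in for pa.Schema (no field IDs), IdSchema stands in for Schema, and resolve_schema and reassign_ids are illustrative helpers, not existing library functions.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Field:
    """Toy schema field (hypothetical)."""
    field_id: int
    name: str


@dataclass(frozen=True)
class IdSchema:
    """Stand-in for a Schema whose fields already carry IDs."""
    fields: Tuple[Field, ...]


@dataclass(frozen=True)
class NamesOnlySchema:
    """Stand-in for a pa.Schema: column names only, no field IDs."""
    names: Tuple[str, ...]


def reassign_ids(schema: IdSchema) -> IdSchema:
    # The optional helper from the proposal: renumber 1..N on request.
    fresh = tuple(replace(f, field_id=i) for i, f in enumerate(schema.fields, start=1))
    return IdSchema(fresh)


def resolve_schema(schema: Union[IdSchema, NamesOnlySchema]) -> IdSchema:
    if isinstance(schema, NamesOnlySchema):
        # No IDs to preserve: convert with placeholder IDs, then assign fresh ones.
        converted = IdSchema(tuple(Field(-1, n) for n in schema.names))
        return reassign_ids(converted)
    # IDs supplied by the user: check uniqueness, then use the schema as is.
    ids = [f.field_id for f in schema.fields]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate field IDs in schema")
    return schema


from_names = resolve_schema(NamesOnlySchema(("id", "ts")))
kept = resolve_schema(IdSchema((Field(10, "id"), Field(20, "ts"))))
```

With this shape, a caller who wants new IDs for an existing schema calls reassign_ids explicitly before creating the table, so the create path never needs a flag.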
LMK if this makes sense or if I'm missing something!
Hi @kevinjqliu thank you for the review! Yes, I agree that the code path would be simpler if we didn't expose assign_fresh_ids as a parameter for the API. However, I think some concerns were raised about not surfacing it as an argument and instead having two code paths based strictly on the input parameter. #1284
I will add this to the agenda for the PyIceberg Sync on Tuesday and see if that will help the community reach a consensus.
Implements: #1284