Validate overwrite filter #582
Conversation
Hi Adrian, thanks for working on this and the very comprehensive write-up. My first question is: what is the main goal of this PR? Let me elaborate with more context. The Spark syntax originated in the Hive era and assumes that there is a single partition spec on the table. With Iceberg, the partitioning can evolve over time, and therefore older partitioning strategies can also be present:

```sql
CREATE TABLE prod.my_app.logs (
    env STRING,
    logline STRING,
    created_at TIMESTAMP_NTZ -- Nobody likes timezones
)
USING iceberg
PARTITIONED BY (months(created_at))

-- Later on, when there is more data, switch to daily partitioning
ALTER TABLE prod.my_app.logs
REPLACE PARTITION FIELD dt_month WITH days(created_at)

-- Or, when we want to split the logs per environment
ALTER TABLE prod.my_app.logs
ADD PARTITION FIELD env
```

There is a fair chance that we have to rewrite actual data. When the spec is updated, a full rewrite of the table is not done directly; otherwise it would be very costly to update the partition spec on big tables. If some of the data is still partitioned monthly and you update data using the new (daily) partitioning, it will read in the Parquet files that match the filter. Let's say that you do:
I think the top one is a dynamic overwrite, since it does not explicitly call out the partitions that it will overwrite. In that case we should compute all the affected partition values by applying the current partition spec to the dataframe. Based on that, we can first delete the affected partitions and then append the data:

```python
def overwrite_dynamic(df: pa.Table) -> None: ...
```

The second one looks like a static overwrite:

```python
def overwrite(df: pa.Table, expr: Union[BooleanExpression, str]) -> None:
    ...

overwrite(df, "level = 'INFO'")
```

I agree that this is a bit odd:
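To make the dynamic-overwrite idea concrete, here is a minimal pure-Python sketch of computing the affected partition values from the incoming data. The helper name and the rows-as-dicts representation are illustrative only, not PyIceberg API:

```python
def affected_partitions(rows: list[dict], partition_cols: list[str]) -> list[tuple]:
    # Distinct partition tuples present in the incoming data: these are the
    # partitions a dynamic overwrite would first delete and then append into.
    seen: list[tuple] = []
    for row in rows:
        key = tuple(row[c] for c in partition_cols)
        if key not in seen:
            seen.append(key)
    return seen

rows = [
    {"level": "INFO", "logline": "ok"},
    {"level": "INFO", "logline": "still ok"},
    {"level": "ERROR", "logline": "boom"},
]
# affected_partitions(rows, ["level"]) -> [("INFO",), ("ERROR",)]
```

With an actual partition spec, each column value would first be run through the spec's transform before building the tuple; identity transforms make the two equivalent.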
Open questions:
```diff
@@ -1776,7 +1776,10 @@ def write_parquet(task: WriteTask) -> DataFile:
     fo = io.new_output(file_path)
     with fo.create(overwrite=True) as fos:
         with pq.ParquetWriter(fos, schema=arrow_file_schema, **parquet_writer_kwargs) as writer:
-            writer.write(pa.Table.from_batches(task.record_batches), row_group_size=row_group_size)
+            arrow_table = pa.Table.from_batches(task.record_batches)
+            # align the columns accordingly in case input arrow table has columns in order different from iceberg table
```
Can you provide an example of when this would happen? This only handles top-level columns.
Hi @Fokko @adrianqin, I think the goal of this PR is to distinguish the semantics of a 'static overwrite' onto a partitioned table from those of a 'delete' + 'append'. As we discussed in the last PyIceberg sync, if we believe that a partitioned-table overwrite should compare the values provided in the overwrite_filter against the values in the arrow table, then I think we'd need these validations. We would first need to run them on the predicate expression provided in the overwrite_filter, so that we can then compare the values in the arrow table. In the community sync, we discussed whether the community was in favor of drawing a distinction between the two behaviors. For example, where dt and level are both partition columns:
A static overwrite, on the other hand, would eagerly validate the predicate expression against the table schema and the values in the arrow table, and throw instead. Please let me know what you think - I think figuring out how a static overwrite semantically differs from delete + append (or not) is crucial to understanding how we want to shape this API.
This is what I tried to explain in the comment earlier above: sometimes you have to do rewrites because there was a different partitioning strategy before under which some rows still match. I'm adding that currently in #569. A table can have older manifests that are still written using an older partition spec.
I missed this part. We can add it, but I would say that it is up to the user. To simplify, this means doing an additional check (pseudocode):

```python
def overwrite(df: pa.Table, overwrite_filter: Union[str, BooleanExpression]) -> None:
    row_filter = _parse_row_filter(overwrite_filter)  # Turns the str into a boolean expression
    pa_row_filter = expression_to_pyarrow(row_filter)
    num_invalid_rows = len(df) - len(df.filter(pa_row_filter))
    if num_invalid_rows > 0:
        raise ValueError(f"Found {num_invalid_rows} rows that don't match the overwrite predicate")
```
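A runnable pure-Python analogue of this check, with a plain predicate function standing in for the expression-to-pyarrow conversion (the helper name and the rows-as-dicts representation are hypothetical, for illustration):

```python
from typing import Callable

def validate_rows_match_filter(
    rows: list[dict], predicate: Callable[[dict], bool], filter_text: str
) -> None:
    # Count rows that fall outside the overwrite predicate and reject the
    # write if any exist, mirroring the pseudocode above.
    num_invalid_rows = sum(1 for row in rows if not predicate(row))
    if num_invalid_rows > 0:
        raise ValueError(
            f"Found {num_invalid_rows} rows that don't match the overwrite predicate {filter_text!r}"
        )

rows = [{"level": "INFO"}, {"level": "DEBUG"}]
# validate_rows_match_filter(rows, lambda r: r["level"] == "INFO", "level = 'INFO'")
# raises ValueError, because the DEBUG row falls outside the predicate
```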
A delete + append will do the static overwrite, and most of the validation will be done by the delete itself. Closing this PR; I will open a static overwrite PR when DELETE and MERGE_APPEND are ready, with the following two flags:
Background
When we are doing a static overwrite, we can choose to overwrite the full table or only some partitions of it.
The Spark SQL counterpart for a full-table static overwrite is:
And the Spark SQL counterpart for overwriting a specified partition is:
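As a rough sketch of the two cases (table, source, and column names assumed for illustration):

```sql
-- full-table static overwrite
INSERT OVERWRITE demo.db.sample
SELECT * FROM updates;

-- static overwrite of one specified partition
INSERT OVERWRITE demo.db.sample
PARTITION (dt = '2017-01-01')
SELECT level, message FROM updates;
```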
Goal
When we overwrite the table, we can provide an expression as overwrite_filter to cover these two cases. This PR validates that the filter expression conforms to certain rules: it has to be on partition columns that do not use hidden partitioning, and the fields in the filter have to agree with the input arrow table, so that the dataframe does not include values that the filter does not specify.
Rules and Test Cases
Rule: The expression may only use IsNull or EqualTo as building blocks, concatenated by And.
Tests:
test__validate_static_overwrite_filter_expr_type (parametrize 1-8)
Rule: The building-block predicates (IsNull and EqualTo) should not have conflicting values.
Tests:
test__validate_static_overwrite_filter_expr_type (parametrize 9-11)
Rule: The terms (fields) should refer to existing fields in the Iceberg schema, and the literal in a predicate (if any) should match the Iceberg field type. Together these mean the expression can be bound to the table schema successfully.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_on_non_schema_fields_in_filter
test__bind_and_validate_static_overwrite_filter_predicate_fails_to_bind_due_to_incompatible_predicate_value
Rule: If the expression specifies a field which is required in the Iceberg schema, that field should not appear as IsNull in the expression.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_to_bind_due_to_non_nullable
Rule: The fields in the expression should be partition columns.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_on_non_part_fields_in_filter
Rule: The Iceberg table fields specified in the expression may not use hidden partitioning; non-specified fields may.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_on_non_identity_transorm_filter
test__bind_and_validate_static_overwrite_filter_predicate_succeeds_on_an_identity_transform_field_although_table_has_other_hidden_partition_fields
Rule: The partition column values in the dataframe should conform to the filter. (To be implemented in the static overwrite function once partition keys are extracted.)
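The first two rules can be sketched as a small stand-alone validator. The dataclasses below are simplified stand-ins for PyIceberg's expression classes, not the real API:

```python
from dataclasses import dataclass
from typing import Union

# Simplified stand-ins for PyIceberg's expression nodes (illustrative only).
@dataclass(frozen=True)
class EqualTo:
    term: str
    literal: object

@dataclass(frozen=True)
class IsNull:
    term: str

@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"

Expr = Union[EqualTo, IsNull, And]

def collect_predicates(expr: Expr) -> list:
    # Rule 1: only EqualTo/IsNull leaves are allowed, joined by And.
    if isinstance(expr, And):
        return collect_predicates(expr.left) + collect_predicates(expr.right)
    if isinstance(expr, (EqualTo, IsNull)):
        return [expr]
    raise ValueError(f"Unsupported predicate for static overwrite: {expr!r}")

def check_no_conflicts(preds: list) -> dict:
    # Rule 2: a field may not be pinned to two different values
    # (IsNull pins the field to None); equal duplicates are deduplicated.
    pinned: dict = {}
    for pred in preds:
        value = pred.literal if isinstance(pred, EqualTo) else None
        if pred.term in pinned and pinned[pred.term] != value:
            raise ValueError(f"Conflicting values for field {pred.term}")
        pinned[pred.term] = value
    return pinned

filter_expr = And(EqualTo("dt", "2017-01-01"), EqualTo("level", "INFO"))
pinned = check_no_conflicts(collect_predicates(filter_expr))
# pinned == {"dt": "2017-01-01", "level": "INFO"}
```

The returned mapping of pinned field values is exactly what the later dataframe check (the last rule) would compare against the partition columns of the input arrow table.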
Rule Necessity Justification using Spark Counterparts
To better understand these rules, let us look at their Spark static-overwrite crash counterparts, using the following setup:
This gives the following table schema:
with the following data:
Now let us check the rules.
Rule 1. The expression may only use IsNull or EqualTo as building blocks, concatenated by And.
For example:
or
"foo = 'hello' AND (baz IS NULL AND boo = 'hello')"
Spark counterpart example:
gives:
Other predicates such as 'in' and '!=', and other expressions such as 'Or', give similar errors.
**Rule 2. The building-block predicates (IsNull and EqualTo) should not have conflicting values.**
This means filters such as "foo = 'hello' AND foo = 'world'" or "foo = 'hello' AND foo IS NULL" are not allowed. However, "foo = 'hello' AND foo = 'hello'" is allowed and shall be deduplicated.
Spark counterpart example:
gives
Rule 3. The terms (fields) should refer to existing fields in the Iceberg schema, and the literal in a predicate (if any) should match the Iceberg field type. Together these mean the expression can be bound to the table schema successfully.
Spark counterpart example:
gives:
Rule 4. If the expression specifies a field which is required in the Iceberg schema, that field should not appear as IsNull in the expression.
Spark counterpart example:
gives:
Rule 5. The fields in the expression should be partition columns.
Spark counterpart example:
gives:
Rule 6. The Iceberg table fields specified in the expression may not use hidden partitioning; non-specified fields may.
Spark counterpart example:
gives:
However, if we specify the other partition column with identity transform, it works: