Add Data Files from Parquet Files to UnPartitioned Table #506
Conversation
Updates from offline discussions:
I just realized that this approach won't work if we want to add files from Hive tables, because Hive-style partitioning results in parquet files that do not actually have the partition data in them. The partition columns are inferred from the directory structure. But I think the suggested approach should be favored over file path inference when possible. @Fokko, I'd love to get your opinion on the following:
These two modes cover some of the options that were discussed in the initial discussion of the add_files migration procedure.
So both of the approaches have pros and cons. One thing I would like to avoid is having to rely on Hive directly; this will make sure that we can generalize it to also import generic Parquet files. One problematic thing is that with Iceberg hidden partitioning we actually have the source-id that points to the field where the data is being kept. If the Hive partitioning is just arbitrary, e.g.:

INSERT INTO transactions PARTITION (year = '2023') AS SELECT name, amount FROM some_other_table

then there is no relation between the partition and any column in the table. In Iceberg you would expect something like:

INSERT INTO transactions PARTITION (year = '2023') AS SELECT name, amount, created_at FROM some_other_table

where the partitioning is derived from the created_at column. I would also expect the user to pre-create the partition spec prior to the import, because inferring it is tricky.
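For illustration, a minimal sketch of pre-creating the table with an explicit partition spec in pyiceberg — the catalog name, table identifier, and field IDs here are assumptions for the example, not part of this PR:

```python
# Sketch only: pre-create the Iceberg table with hidden partitioning so the
# partition field has a source-id pointing at an actual column (created_at).
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import YearTransform
from pyiceberg.types import DoubleType, NestedField, StringType, TimestamptzType

schema = Schema(
    NestedField(field_id=1, name="name", field_type=StringType(), required=False),
    NestedField(field_id=2, name="amount", field_type=DoubleType(), required=False),
    NestedField(field_id=3, name="created_at", field_type=TimestamptzType(), required=False),
)

# The "year" partition field is derived from created_at (source_id=3).
spec = PartitionSpec(
    PartitionField(source_id=3, field_id=1000, transform=YearTransform(), name="year")
)

catalog = load_catalog("default")  # assumes a catalog named "default" is configured
catalog.create_table("db.transactions", schema=schema, partition_spec=spec)
```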
Thank you for the context @Fokko . What I meant by partition inference is the act of inferring the partition values instead of the Partition Spec itself. So this function only runs after the Iceberg Table has been created with its expected PartitionSpec. But because Hive tables have the partition values in the file paths instead of in the actual data files, I'm proposing that we have the two modes of partition value inference: one from the file paths, and the other based on the upper and lower bound values from the parquet metadata.
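To make the second mode concrete, here is a rough sketch of inferring a single partition value from the parquet column statistics — the helper name is made up for the example, and it assumes pyarrow statistics are present in the files:

```python
# Sketch only: infer a partition value for one column from Parquet row-group
# statistics, requiring the column to be constant within the file.
import pyarrow.parquet as pq


def infer_partition_value(file_path: str, partition_column: str):
    metadata = pq.ParquetFile(file_path).metadata
    values = set()
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for idx in range(row_group.num_columns):
            column = row_group.column(idx)
            if column.path_in_schema != partition_column:
                continue
            stats = column.statistics
            if stats is None or not stats.has_min_max:
                raise ValueError(f"No statistics for {partition_column} in {file_path}")
            values.update({stats.min, stats.max})
    if len(values) != 1:
        raise ValueError(f"{partition_column} is not constant within {file_path}")
    return values.pop()
```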
@syun64 I'm all for it if it works, but I see a lot of issues with inferring it from the Hive path.
Looking good @syun64. Could you also update the docs? We could also defer the partitioning into a separate PR, up to you 👍
pyiceberg/table/__init__.py
Outdated
if any(not isinstance(field.transform, IdentityTransform) for field in self.metadata.spec().fields):
    raise NotImplementedError("Cannot add_files to a table with Transform Partitions")
We can be more permissive. It isn't a problem if the table's current partitioning has something other than an IdentityTransform; the issue is that we cannot add DataFiles that use this partitioning (until we find a clever way of checking this).
pyiceberg/table/__init__.py
Outdated
if any(not isinstance(field.transform, IdentityTransform) for field in self.metadata.spec().fields):
    raise NotImplementedError("Cannot add_files to a table with Transform Partitions")

if self.name_mapping() is None:
Technically you don't have to add a name-mapping if the field-IDs are set
@Fokko Yeah I think you are right!
When field IDs are in the files, and the name_mapping is also present, the field_ids take precedence over the name_mapping in schema resolution. So the name_mapping here would essentially be meaningless in that case.
I'm on the fence between moving forward with your suggestion (create the name_mapping if there are no field_ids) and always asserting that the parquet files we want to add have no field IDs. That's because the field_ids we actually use in our Iceberg-generated parquet files are the Iceberg Table's internal notion of field IDs. Whenever a new table gets created, new field IDs are assigned, and Iceberg keeps track of these field IDs internally so that the same field can be tracked consistently across column renames.
When we add_files, we are introducing files that have been produced by an external process that isn't aware of Iceberg's internal field metadata. In that sense, I feel that allowing files that already have field_ids to be added could result in unexpected errors for the user that are difficult to diagnose.
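As a rough illustration of the stricter option (rejecting files that already carry field IDs), a check along these lines could be done with pyarrow — the helper name is hypothetical, not the PR's actual implementation:

```python
# Sketch only: detect whether a Parquet file's schema carries embedded field IDs,
# which pyarrow surfaces in the Arrow field metadata under "PARQUET:field_id".
import pyarrow.parquet as pq


def has_field_ids(file_path: str) -> bool:
    schema = pq.read_schema(file_path)
    return any(
        field.metadata is not None and b"PARQUET:field_id" in field.metadata
        for field in schema
    )
```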
My main concern is that the Parquet file and the mapping don't match. For example, there are more fields in the parquet file than in the mapping. I think it is good to add checks there.
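A rough sketch of the kind of check being suggested, assuming we compare the file's column names against the names covered by the mapping — the helper and argument names are illustrative only:

```python
# Sketch only: fail fast if the Parquet file has columns that the name mapping
# (or table schema) does not cover.
import pyarrow.parquet as pq


def check_names_covered(file_path: str, mapped_names: set) -> None:
    file_names = set(pq.read_schema(file_path).names)
    extra = file_names - mapped_names
    if extra:
        raise ValueError(f"Fields in {file_path} are not covered by the name mapping: {sorted(extra)}")
```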
I've added this check here @Fokko, let me know if that makes sense to you.
Yeah. I don't personally need migration procedures to add files from Hive tables, but I am aware of various teams and community members that want this sort of feature to migrate to Iceberg from Hive without having to rewrite all of their files. I do think that partition inference from the partition path is more complicated and has more gotchas that need to be discussed at length than the more accurate approach based on the parquet metadata. I will pull that feature out and put together a follow-up PR that only introduces file addition to partitioned tables using the lower and upper bounds of the partition column in the parquet metadata.
Makefile
Outdated
@@ -42,7 +42,7 @@ test-integration:
	docker-compose -f dev/docker-compose-integration.yml up -d
	sleep 10
	docker-compose -f dev/docker-compose-integration.yml exec -T spark-iceberg ipython ./provision.py
-	poetry run pytest tests/ -v -m integration ${PYTEST_ARGS}
+	poetry run pytest tests/integration/test_add_files.py -v -m integration ${PYTEST_ARGS}
This was committed by accident?
-	poetry run pytest tests/integration/test_add_files.py -v -m integration ${PYTEST_ARGS}
+	poetry run pytest tests/ -v -m integration ${PYTEST_ARGS}
I always do 😅
tests/integration/test_add_files.py
Outdated
df = spark.table(identifier)
assert df.count() == 6, "Expected 6 rows"
assert len(df.columns) == 4, "Expected 4 columns"
df.show()
I think this was for testing, can we remove this one? .show() is a Spark action, meaning it will run the pipeline.
@syun64 Can you also add this to the docs? :)
Thanks @syun64! Adding 2 quick comments:
@@ -292,6 +292,39 @@ The nested lists indicate the different Arrow buffers, where the first write res

<!-- prettier-ignore-end -->

### Add Files

Expert Iceberg users may choose to commit existing parquet files to the Iceberg table as data files, without rewriting them.
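For the docs, a minimal usage sketch of the new procedure might look like this — the catalog name, table identifier, and file paths are assumptions for the example:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
tbl = catalog.load_table("db.my_unpartitioned_table")

# Commit existing parquet files to the table as data files, without rewriting them.
tbl.add_files(file_paths=["s3://bucket/data/file-1.parquet", "s3://bucket/data/file-2.parquet"])
```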
Shall we mention in the doc that this procedure currently only works for unpartitioned tables?
Maybe! We've already discussed the different approaches for supporting adds to partitioned tables extensively, so I'm optimistic we'll get it in before the next release. I'll put it up shortly after this is merged.
Sounds great! Thanks!
Co-authored-by: Honah J. <[email protected]>
Thanks @syun64 for the great work and @Fokko for reviewing!
PyIceberg's version of add_files Spark migration procedure.
Some early ideas on its implementation:
EDIT: Supporting addition of parquet files as data files to partitioned tables will be introduced in a separate PR. Options have been discussed in the comments on this PR, and we are breaking it up to make code reviews easier.