Write support #41
Conversation
For V1 and V2 there are some differences that are hard to enforce without this:

- `1: snapshot_id` is required for V1, optional for V2.
- `105: block_size_in_bytes` needs to be written for V1, but omitted for V2 (this leverages the `write-default`).
- `3: sequence_number` and `4: file_sequence_number` can be omitted for V1.

Everything that we read, we map to V2. However, when writing we also want to be compliant with the V1 spec, and this is where the writer tree comes in, since we construct a tree for either V1 or V2.
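An illustrative toy sketch of the version split (field ids follow the Iceberg spec; this is not the PR's actual writer-tree code):

```python
# Illustrative only: which manifest-entry fields get written depends on the
# table's format version.
def manifest_entry_field_ids(format_version: int) -> list[int]:
    if format_version == 1:
        # V1: 1: snapshot_id is required, 105: block_size_in_bytes is written
        # inside 2: data_file, and the sequence-number fields are omitted.
        return [0, 1, 2]
    # V2: 1: snapshot_id becomes optional, 3: sequence_number and
    # 4: file_sequence_number are added, and block_size_in_bytes is dropped.
    return [0, 1, 3, 4, 2]
```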
Very relevant! I'm looking forward to it, thank you!
When reading the table with `tbl.scan().to_arrow()`, you can see that `Groningen` is now also part of the table:
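A minimal sketch of that read, assuming the `tbl` and city column from the example above:

```python
# Sketch only: scan the table back into Arrow and inspect the city column.
arrow_table = tbl.scan().to_arrow()
print(arrow_table["city"])  # "Groningen" should now be included
```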
While working on this, I also checked the field-ids:
parq 00000-0-27345354-67b8-4861-95ca-c2de9dc8d3fe.parquet --schema
# Schema
<pyarrow._parquet.ParquetSchema object at 0x11eca2e00>
required group field_id=-1 schema {
  optional binary field_id=1 city (String);
  optional double field_id=2 lat;
  optional double field_id=3 long;
}
schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
Isn't `required=False` the default?
No, the default is the stricter `True`. I've set it to `False` because PyArrow produces nullable fields by default.
pyiceberg/table/__init__.py
Outdated
if len(self.sort_order().fields) > 0:
    raise ValueError("Cannot write to tables with a sort-order")

snapshot_id = self.new_snapshot_id()
Minor: this can be handled inside of `_MergeAppend`, since it has the table.
pyiceberg/table/__init__.py
Outdated
snapshot_id = self.new_snapshot_id()

data_files = _dataframe_to_data_files(self, df=df)
merge = _MergeAppend(operation=Operation.APPEND, table=self, snapshot_id=snapshot_id)
Is this really a "merge append" if the operation may be overwrite? You might consider using `_MergingCommit` or `_MergingSnapshotProducer` (if you want to follow the Java convention).
    for entry in manifest.fetch_manifest_entry(self._table.io, discard_deleted=True)
]

list_of_entries = executor.map(_get_entries, previous_snapshot.manifests(self._table.io))
It may be a good idea to defensively use only data manifests here instead of all manifests.
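A minimal sketch of that defensive filter, assuming `ManifestContent` from `pyiceberg.manifest` and the surrounding names from this PR:

```python
from pyiceberg.manifest import ManifestContent

# Sketch only: restrict the fan-out to data manifests before fetching entries.
data_manifests = [
    manifest
    for manifest in previous_snapshot.manifests(self._table.io)
    if manifest.content == ManifestContent.DATA
]
list_of_entries = executor.map(_get_entries, data_manifests)
```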
status=ManifestEntryStatus.DELETED,
snapshot_id=entry.snapshot_id,
data_sequence_number=entry.data_sequence_number,
file_sequence_number=entry.file_sequence_number,
Looks good.
pyiceberg/table/__init__.py
Outdated
raise ValueError(f"Not implemented for: {self._operation}")

def _manifests(self) -> List[ManifestFile]:
    manifests = []
Minor: since this starts out empty, it looks like it holds just the newly created manifests. It may be a good idea to name it `new_manifests`.
pyiceberg/table/__init__.py
Outdated
summary=Summary(operation=self._operation, **self._summary()),
previous_summary=previous_snapshot.summary if previous_snapshot is not None else None,
truncate_full_table=self._operation == Operation.OVERWRITE,
)
I think this block could be moved to `_summary` so that it produces the correct summary without the need to modify it afterward. That seems a bit cleaner to me, rather than having a two-step process split across methods.
Yes, that's a great point 👍
if self._operation == Operation.APPEND and previous_snapshot is not None:
    # In case we want to append, just add the existing manifests
    writer.add_manifests(previous_snapshot.manifests(io=self._table.io))
writer.add_manifests(new_manifests)
Similar to the note above, I think it would be cleaner to have `_manifests` produce the complete set of manifests, not just the replacement ones. That method already relies on `_deleted_entries` to produce deletes, so it may as well also be responsible for checking whether to include the existing manifests.

Another option is to make `_manifests` produce just the manifests for the appended files and handle deletes separately, but it looks like your approach here is to create just one manifest with both deletes and appends.
Great suggestion. I've moved all the logic to `_manifests()`.
pyiceberg/table/__init__.py
Outdated
)

for delete_entry in deleted_entries:
    writer.add_entry(delete_entry)
I think this approach works fine, but I want to point out that there are drawbacks to writing the deletes in the same manifest:
- A reader has to load all of the deletes, even though the files aren't useful. If they are in a separate manifest, readers can filter out manifests that have no EXISTING or ADDED data files.
- Manifests with no data files can be removed in future append commits.
- This write is single-threaded. In the Java implementation, we produce a manifest of deleted data files for each existing manifest. That allows us to parallelize the operation.
Here's the logic we use to drop manifests that aren't needed on the Java side when producing the new list of manifests:
// only keep manifests that have live data files or that were written by this commit
Predicate<ManifestFile> shouldKeep =
    manifest ->
        manifest.hasAddedFiles()
            || manifest.hasExistingFiles()
            || manifest.snapshotId() == snapshotId();
I missed this one, thanks for suggesting it! 👍 I've split out the ADDED, EXISTING, and DELETED entries into separate manifests that are written in parallel.
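An illustrative sketch of that split, assuming `ManifestEntryStatus` from `pyiceberg.manifest`; `write_manifest_for_status` is a hypothetical helper, not the PR's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

from pyiceberg.manifest import ManifestEntryStatus, ManifestFile

def write_manifest_for_status(status: ManifestEntryStatus) -> ManifestFile:
    # Hypothetical: write the entries with this status into their own manifest.
    ...

# Sketch only: one manifest per entry status, written in parallel.
statuses = [ManifestEntryStatus.ADDED, ManifestEntryStatus.EXISTING, ManifestEntryStatus.DELETED]
with ThreadPoolExecutor() as pool:
    new_manifests = list(pool.map(write_manifest_for_status, statuses))
```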
@@ -175,6 +175,104 @@ static_table = StaticTable.from_metadata(

The static-table is considered read-only.

## Write support

With PyIceberg 0.6.0, write support is added through Arrow. Let's consider an Arrow Table:
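A minimal sketch of that flow, assuming the city/lat/long schema used elsewhere in this PR; the catalog name and table identifier are placeholders, not part of the original docs:

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

# Sketch only: build a small Arrow table and append it to an existing Iceberg table.
df = pa.Table.from_pylist([
    {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
    {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
])

catalog = load_catalog("default")            # placeholder catalog name
tbl = catalog.load_table("default.cities")   # placeholder table identifier
tbl.append(df)
```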
Thanks for this example! Made it really easy to test out.
The example works great cut & pasted into a REPL. I also tested modifications to the dataframe schema passed to `append`, and it does the right thing. I get a schema error for a few cases:

- Missing column `long`
- Type mismatch: `string` instead of `double`
- Extra column `country`

Looks like Arrow requires that the schema matches, which is great.

It would be nice to allow some type promotion in the future. I'm not sure whether Arrow would automatically write floats into double columns, for example. I would also like to make sure we have better error messages, not just "ValueError: Table schema does not match schema used to create file: ...". Those will be good follow-ups.
Yes, I think this ties into the work that @syun64 is doing, where we have to make sure that we map the fields correctly; then I think we can add options to massage the Arrow schema into the Iceberg one (which should be leading).

We can create a `visitorWithPartner` that will check whether the promotions are possible. One that comes to mind directly is checking whether there are any nulls: Arrow marks the schemas as nullable by default, even when there are no nulls.
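A minimal sketch of that nullability check, assuming pyarrow; the function name is illustrative:

```python
import pyarrow as pa

def can_treat_as_required(table: pa.Table, column: str) -> bool:
    # A column that Arrow marks as nullable but that contains no nulls can
    # still be written into an Iceberg field with required=True.
    return table[column].null_count == 0
```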
@Fokko, this works great and I don't see any blockers, so I've approved it. I think there are a few things to consider in terms of how we want to do this moving forward (whether to use separate manifests, for example), but we can get this in and iterate from there. It also looks like this is pretty close to being able to run the overwrite filter, too! Great work.
}

TABLE_SCHEMA = Schema(
    NestedField(field_id=1, name="bool", field_type=BooleanType(), required=False),
@Fokko I have tested the write. If any field in the table schema is made required, for example `NestedField(field_id=13, name="fixed", field_type=FixedType(16), required=True)`, it always fails with `ValueError: Table schema does not match schema used to create file:`

    1102 if not table.schema.equals(self.schema, check_metadata=False):
    1103     msg = ('Table schema does not match schema used to create file: '
    1104            '\ntable:\n{!s} vs. \nfile:\n{!s}'
    1105            .format(table.schema, self.schema))
    -> 1106 raise ValueError(msg)
PyArrow fields are nullable by default, which matches all the nested fields in `TABLE_SCHEMA`. If you want to test against non-nullable fields, then `arrow_table_with_null`, or whatever other pyarrow table you are instantiating, should have `nullable=False` for the field that has `required=True`.
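A minimal sketch of such a non-nullable field, assuming pyarrow:

```python
import pyarrow as pa

# Sketch only: declare the Arrow field as non-nullable so it lines up with an
# Iceberg field that has required=True.
arrow_schema = pa.schema([
    pa.field("fixed", pa.binary(16), nullable=False),
])
```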
@sebpretzer thanks for the clarification. Tested it, and it works as expected.

Is there an ETA for write functionality in a released version?
Check the attached milestone for progress. When those issues are resolved it will be ready for release.
Hi @mkleinbort-ic, we've just started voting on the first release candidate that incorporates this change.
Experimental branch to implement writing. Many of the changes here will be split out into small, manageable PRs.
Resolves #181
Resolves #23